First of all, I am aware that what I am trying to do might be outside the C standard.
I'd like to know if it is possible to make a uint4_t/int4_t or uint128_t/int128_t type in C.
I know I could do this using bitshifts and complex functions, but can I do it without those?
You can use bitfields within a structure to get fields narrower than a uint8_t, but the base datatype they're stored in will not be any smaller.
struct SmallInt
{
unsigned int a : 4;
};
will give you a structure with a member called a that is 4 bits wide.
Individual storage units (bytes) are no less than CHAR_BIT bits wide [1]; even if you create a struct with a single 4-bit bitfield, the associated object will always take up a full storage unit.
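To make the two points above concrete, here is a minimal sketch. The helper names add_u4 and small_int_size are made up for illustration; the struct is the one from the answer. It shows that a 4-bit unsigned field wraps modulo 16 on store, and that the struct still occupies at least one whole byte.

```c
#include <stddef.h>

struct SmallInt {
    unsigned int a : 4;   /* 4-bit unsigned field: holds 0..15 */
};

/* Hypothetical helper: adds two values through a 4-bit field.
   Every store into the field is reduced modulo 16. */
unsigned add_u4(unsigned x, unsigned y)
{
    struct SmallInt s;
    s.a = x;      /* stored modulo 16 */
    s.a += y;     /* the sum is truncated back to 4 bits on assignment */
    return s.a;
}

/* Hypothetical helper: the struct's size in bytes.
   It is never smaller than one storage unit, and is often sizeof(unsigned int). */
size_t small_int_size(void)
{
    return sizeof(struct SmallInt);
}
```

For example, add_u4(15, 1) yields 0 and add_u4(9, 9) yields 2, while small_int_size() is at least 1; the exact size is implementation-defined.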
There are multiple precision libraries such as GMP that allow you to work with values that can't fit into 32 or 64 bits. Might want to check them out.
[1] 8 bits minimum, but may be wider.
In practice, if you want very wide numbers (which standard C11 does not provide), you probably want to use an external arbitrary-precision arithmetic library (a.k.a. bignums). I recommend using GMPlib.
In some cases, for tiny ranges of numbers, you might use bitfields inside struct to have tiny integers. Practically speaking, they can be costly (the compiler would emit shift and bitmask instructions to deal with them).
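As a rough illustration of the shift-and-mask work involved, here is a sketch that packs two 4-bit values into one uint8_t by hand. The function names nibble_get and nibble_set are assumptions for this example, and a compiler's actual bitfield code may look different.

```c
#include <stdint.h>

/* Hypothetical helper: read nibble idx (0 = low 4 bits, 1 = high 4 bits). */
uint8_t nibble_get(uint8_t packed, unsigned idx)
{
    return (uint8_t)((packed >> (idx * 4u)) & 0x0Fu);
}

/* Hypothetical helper: return packed with nibble idx replaced by value.
   Only the low 4 bits of value are kept. */
uint8_t nibble_set(uint8_t packed, unsigned idx, unsigned value)
{
    unsigned shift = idx * 4u;
    unsigned cleared = packed & ~(0x0Fu << shift);   /* zero the target nibble */
    return (uint8_t)(cleared | ((value & 0x0Fu) << shift));
}
```

With this layout, nibble_set(0xFF, 0, 0x3) gives 0xF3 and nibble_get(0xA5, 1) gives 0xA.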
See also this answer mentioning __int128_t as an extension in some compilers.
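A small sketch of how such an extension might be used, assuming a GCC- or Clang-style compiler that defines __SIZEOF_INT128__ when unsigned __int128 is available. The function name mul_hi64 is made up for this example. Where the extension is missing, the sketch falls back to the shift-based approach using 32-bit halves.

```c
#include <stdint.h>

/* Hypothetical helper: high 64 bits of the full 128-bit product a * b. */
uint64_t mul_hi64(uint64_t a, uint64_t b)
{
#ifdef __SIZEOF_INT128__
    /* Compiler extension: not part of standard C. */
    unsigned __int128 p = (unsigned __int128)a * b;
    return (uint64_t)(p >> 64);
#else
    /* Portable fallback: schoolbook multiplication on 32-bit halves. */
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    /* Carry out of the middle 32-bit column. */
    uint64_t carry = ((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32;
    return p3 + (p1 >> 32) + (p2 >> 32) + carry;
#endif
}
```

For instance, mul_hi64(UINT64_MAX, UINT64_MAX) is UINT64_MAX - 1 on either path.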
I have a little VM for a programming language implemented in C. It supports being compiled under both 32-bit and 64-bit architectures as well as both C and C++.
I'm trying to make it compile cleanly with as many warnings enabled as possible. When I turn on CLANG_WARN_IMPLICIT_SIGN_CONVERSION, I get a cascade of new warnings.
I'd like to have a good strategy for when to use int versus either explicitly unsigned types, and/or explicitly sized ones. So far, I'm having trouble deciding what that strategy should be.
It's certainly true that mixing them—using mostly int for things like local variables and parameters and using narrower types for fields in structs—causes lots of implicit conversion problems.
I do like using more specifically sized types for struct fields because I like the idea of explicitly controlling memory usage for objects in the heap. Also, for hash tables, I rely on unsigned overflow when hashing, so it's nice if the hash table's size is stored as uint32_t.
But, if I try to use more specific types everywhere, I find myself in a maze of twisty casts everywhere.
What do other C projects do?
Just using int everywhere may seem tempting, since it minimizes the need for casting, but there are several potential pitfalls you should be aware of:
An int might be shorter than you expect. Even though, on most desktop platforms, an int is typically 32 bits, the C standard only guarantees a minimum length of 16 bits. Could your code ever need numbers larger than 2^15−1 = 32,767, even for temporary values? If so, don't use an int. (You may want to use a long instead; a long is guaranteed to be at least 32 bits.)
Even a long might not always be long enough. In particular, there is no guarantee that the length of an array (or of a string, which is a char array) fits in a long. Use size_t (or ptrdiff_t, if you need a signed difference) for those.
In particular, a size_t is defined to be large enough to hold any valid array index, whereas an int or even a long might not be. Thus, for example, when iterating over an array, your loop counter (and its initial / final values) should generally be a size_t, at least unless you know for sure that the array is short enough for a smaller type to work. (But be careful when iterating backwards: size_t is unsigned, so for(size_t i = n-1; i >= 0; i--) is an infinite loop! Using i != SIZE_MAX or i != (size_t) -1 should work, though; or use a do/while loop, but beware of the case n == 0!)
An int is signed. In particular, this means that int overflow is undefined behavior. If there's ever any risk that your values might legitimately overflow, don't use an int; use an unsigned int (or an unsigned long, or uintNN_t) instead.
Sometimes, you just need a fixed bit length. If you're interfacing with an ABI, or reading / writing a file format, that requires integers of a specific length, then that's the length you need to use. (Of course, in such situations, you may also need to worry about things like endianness, and so may sometimes have to resort to manually packing data byte-by-byte anyway.)
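For the backward-iteration pitfall with size_t mentioned above, one common idiom is to test and decrement in the loop condition, which also handles n == 0 without a special case. This is only a sketch; the function name find_last is made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: index of the last element equal to value,
   or n if there is none. */
size_t find_last(const int *a, size_t n, int value)
{
    /* i-- > 0 compares first, then decrements, so the body sees
       n-1, n-2, ..., 0 and the loop never runs when n == 0. */
    for (size_t i = n; i-- > 0; ) {
        if (a[i] == value) {
            return i;
        }
    }
    return n;
}
```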
All that said, there are also reasons to avoid using the fixed-length types all the time: not only is int32_t awkward to type all the time, but forcing the compiler to always use 32-bit integers is not always optimal, particularly on platforms where the native int size might be, say, 64 bits. You could use, say, C99 int_fast32_t, but that's even more awkward to type.
Thus, here are my personal suggestions for maximum safety and portability:
Define your own integer types for casual use in a common header file, something like this:
#include <limits.h>
typedef int i16;
typedef unsigned int u16;
#if UINT_MAX >= 4294967295U
typedef int i32;
typedef unsigned int u32;
#else
typedef long i32;
typedef unsigned long u32;
#endif
Use these types for anything where the exact size of the type doesn't matter, as long as they're big enough. The type names I've suggested are both short and self-documenting, so they should be easy to use in casts where needed, and minimize the risk of errors due to using a too-narrow type.
Conveniently, the u32 and u16 types defined as above are guaranteed to be at least as wide as unsigned int, and thus can be used safely without having to worry about them being promoted to int and causing undefined overflow behavior.
Use size_t for all array sizes and indexing, but be careful when casting between it and any other integer types. Optionally, if you don't like to type so many underscores, typedef a more convenient alias for it too.
For calculations that assume overflow at a specific number of bits, either use uintNN_t, or just use u16 / u32 as defined above and explicit bitmasking with &. If you choose to use uintNN_t, make sure to protect yourself against unexpected promotion to int; one way to do that is with a macro like:
#define u(x) (0U + (x))
which should let you safely write e.g.:
uint32_t a = foo(), b = bar();
uint32_t c = u(a) * u(b); /* this is always unsigned multiply */
For external ABIs that require a specific integer length, again define a specific type, e.g.:
typedef int32_t fooint32; /* foo ABI needs 32-bit ints */
Again, this type name is self-documenting, with regard to both its size and its purpose.
If the ABI might actually require, say, 16- or 64-bit ints instead, depending on the platform and/or compile-time options, you can change the type definition to match (and rename the type to just fooint) — but then you really do need to be careful whenever you cast anything to or from that type, because it might overflow unexpectedly.
If your code has its own structures or file formats that require specific bitlengths, consider defining custom types for those too, exactly as if it was an external ABI. Or you could just use uintNN_t instead, but you'll lose a little bit of self-documentation that way.
For all these types, don't forget to also define the corresponding _MIN and _MAX constants for easy bounds checking. This might sound like a lot of work, but it's really just a couple of lines in a single header file.
Finally, remember to be careful with integer math, especially overflows.
For example, keep in mind that the difference of two n-bit signed integers may not fit in an n-bit int. (It will fit into an n-bit unsigned int, if you know it's non-negative; but remember that you need to cast the inputs to an unsigned type before taking their difference to avoid undefined behavior!)
Similarly, to find the average of two integers (e.g. for a binary search), don't use avg = (lo + hi) / 2, but rather e.g. avg = lo + (hi + 0U - lo) / 2; the former will break if the sum overflows.
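The last two points can be sketched as follows. The function names distance and midpoint are made up for this example, and both assume the caller guarantees lo <= hi.

```c
#include <limits.h>   /* INT_MIN, INT_MAX, UINT_MAX for the examples below */

/* Hypothetical helper: hi - lo as an unsigned value; requires lo <= hi.
   Converting to unsigned first avoids signed overflow, e.g. for
   distance(INT_MIN, INT_MAX), which is UINT_MAX. */
unsigned distance(int lo, int hi)
{
    return (unsigned)hi - (unsigned)lo;
}

/* Hypothetical helper: overflow-safe midpoint; requires lo <= hi.
   The unsigned difference halved always fits back into an int. */
int midpoint(int lo, int hi)
{
    return lo + (int)((hi + 0U - lo) / 2);
}
```

Note that (lo + hi) / 2 would overflow for midpoint(INT_MAX - 2, INT_MAX), whereas the version above returns INT_MAX - 1.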
You seem to know what you are doing, judging from the linked source code, which I took a glance at.
You said it yourself - using "specific" types makes you have more casts. That's not an optimal route to take anyway. Use int as much as you can, for things that do not mandate a more specialized type.
The beauty of int is that it is abstracted over the types you speak of. It is optimal in all cases where you need not expose the construct to a system unaware of int. It is your own tool for abstracting the platform for your program(s). It may also yield you speed, size and alignment advantage, depending.
In all other cases, e.g. where you want to deliberately stay close to machine specifications, int can and sometimes should be abandoned. Typical cases include network protocols where the data goes on the wire, and interoperability facilities - bridges of sorts between C and other languages, kernel assembly routines accessing C structures. But don't forget that sometimes you would want to in fact use int even in these cases, as it follows platforms own "native" or preferred word size, and you might want to rely on that very property.
With platform types like uint32_t, a kernel might want to use these (although it may not have to) in its data structures if these are accessed from both C and assembler, as the latter doesn't typically know what int is supposed to be.
To sum up, use int as much as possible and resort to moving from more abstract types to "machine" types (bytes/octets, words, etc) in any situation which may require so.
As to size_t and other "usage-suggestive" types - as long as syntax follows semantics inherent to the type - say, using size_t for well, size values of all kinds - I would not contest. But I would not liberally apply it to anything just because it is guaranteed to be the largest type (regardless if it is actually true). That's an underwater stone you don't want to be stepping on later. Code has to be self-explanatory to the degree possible, I would say - having a size_t where none is naturally expected, would raise eyebrows, for a good reason. Use size_t for sizes. Use offset_t for offsets. Use [u]intN_t for octets, words, and such things. And so on.
This is about applying semantics inherent in a particular C type, to your source code, and about the implications on the running program.
Also, as others have illustrated, don't shy away from typedef, as it gives you the power to efficiently define your own types, an abstraction facility I personally value. A good program source code may not even expose a single int, nevertheless relying on int aliased behind a multitude of purpose-defined types. I am not going to cover typedef here, the other answers hopefully will.
Keep large numbers that are used to access members of arrays, or control buffers as size_t.
For an example of a project that makes use of size_t, refer to GNU's dd.c, line 155.
Here are a few things I do. Not sure they're for everyone but they work for me.
Never use int or unsigned int directly. There always seems to be a more appropriately named type for the job.
If a variable needs to be a specific width (e.g. for a hardware register or to match a protocol) use a width-specific type (e.g. uint32_t).
For array iterators, where I want to access array elements 0 thru n, this should also be unsigned (no reason to access any index less than 0) and I use one of the fast types (e.g. uint_fast16_t), selecting the type based on the minimum size required to access all array elements. For example, if I have a for loop that will iterate through 24 elements max, I'll use uint_fast8_t and let the compiler (or stdint.h, depending how pedantic we want to get) decide which is the fastest type for that operation.
Always use unsigned variables unless there is a specific reason for them to be signed.
If your unsigned variables and signed variables need to play together, use explicit casts and be aware of the consequences. (Luckily this will be minimized if you avoid using signed variables except where absolutely necessary.)
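As an example of "the consequences" when signed and unsigned values meet, here is a small sketch of a comparison. The function names are made up for illustration. The first function spells out what the implicit conversion in a < b does when a is an int and b is an unsigned int; the second checks the sign before casting.

```c
/* Hypothetical helper: what `a < b` effectively computes when a is int and
   b is unsigned. The int is converted to unsigned, so -1 becomes UINT_MAX. */
int converted_less(int a, unsigned b)
{
    return (unsigned)a < b;
}

/* Hypothetical helper: mathematically correct comparison.
   A negative a is smaller than any unsigned value; otherwise the cast
   to unsigned preserves the value. */
int safe_less(int a, unsigned b)
{
    return a < 0 || (unsigned)a < b;
}
```

So converted_less(-1, 1u) is 0 (false), while safe_less(-1, 1u) is 1 (true).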
If you disagree with any of those or have recommended alternatives please let me know in the comments! That's the life of a software developer... we keep learning or we become irrelevant.
Always.
Unless you have specific reasons for using a more specific type, for example because you're on a 16-bit platform and need integers greater than 32767, or you need to ensure proper byte order and signedness for data exchange over a network or in a file (and unless you're resource constrained, consider transferring data in "plain text," meaning ASCII or UTF-8 if you prefer).
My experience has shown that "just use 'int'" is a good maxim to live by and makes it possible to turn out working, easily maintained, correct code quickly every time. But your specific situation may differ, so take this advice with a bit of well-deserved scrutiny.
Most of the time, using int is not ideal. The main reason is that int is signed, and signed arithmetic can cause UB on overflow; signed integers can also be negative, which most integers never need to be. Prefer unsigned integers. Secondly, a data type conveys meaning: it is a (very limited) way to document the range of values a variable may hold. If you use int, you imply that you expect the variable to sometimes hold negative values, and that its values probably do not always fit into 8 bits but always fit into INT_MAX, which can be as low as 32767. Do not assume an int is 32 bits.
Always, think about the possible values of a variable and choose the type accordingly. I use the following rules:
Use unsigned integers except when you need to be able to handle negative numbers.
If you want to index an array from the start, use size_t except when there are good reasons not to. Almost never use int for it: an int can be too small, and there is a high chance of creating a UB bug that isn't found during testing because you never tested arrays large enough.
The same goes for array sizes and the sizes of other objects: prefer size_t.
If you need to index an array with a negative index, which you may need for image processing, prefer ptrdiff_t. But be aware that ptrdiff_t can be too small, although that is rare.
If you have arrays that never exceed a certain size, you may use uint_fastN_t, uintN_t, or uint_leastN_t types. This can make a lot of sense, especially on an 8-bit microcontroller.
Sometimes, unsigned int can be used instead of uint_fast16_t, similarly int for int_fast16_t.
To handle the value of a single byte (or character, but this is not a real character because of UTF-8 and Unicode sometimes using more than one code point per character), use int. int can store -1 if you need an indicator for an error or an unset value, and a character literal is of type int. (This is true for C; for C++ you may use a different strategy.) There is the extremely rare possibility that a machine uses sizeof(int)==1 && CHAR_MIN==0, where a byte cannot be handled with an int, but I have never seen such a machine.
It can make sense to define your own types for different purposes.
Use explicit cast where casts are needed. This way the code is well defined and has the least amount of unexpected behaviour.
After a certain size, a project needs a list/enum of the native integer data types. You can use macros with the _Generic expression from C11, which only needs to handle bool, signed char, short, int, long, long long and their unsigned counterparts, to get the underlying native type from a typedefed one. This way your parsers and similar parts only need to handle 11 integer types rather than 56 standard integer types (if I counted correctly) and a bunch of other non-standard types.
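A minimal sketch of the _Generic idea from the last rule, assuming a C11 compiler. The enum native_int and the macro NATIVE_INT_KIND are names invented for this example. Plain char and any extended or non-integer type fall through to the default branch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tag for the 11 native integer types, plus a catch-all. */
enum native_int {
    NI_BOOL,
    NI_SCHAR, NI_UCHAR,
    NI_SHORT, NI_USHORT,
    NI_INT,   NI_UINT,
    NI_LONG,  NI_ULONG,
    NI_LLONG, NI_ULLONG,
    NI_OTHER
};

/* Maps an expression to the tag of its underlying native type.
   A typedef such as int32_t resolves to whichever native type it aliases. */
#define NATIVE_INT_KIND(x) _Generic((x),                      \
    bool:      NI_BOOL,                                       \
    signed char: NI_SCHAR,  unsigned char:      NI_UCHAR,     \
    short:       NI_SHORT,  unsigned short:     NI_USHORT,    \
    int:         NI_INT,    unsigned int:       NI_UINT,      \
    long:        NI_LONG,   unsigned long:      NI_ULONG,     \
    long long:   NI_LLONG,  unsigned long long: NI_ULLONG,    \
    default:     NI_OTHER)
```

For example, NATIVE_INT_KIND((unsigned)0) is NI_UINT; which tag a typedef like size_t or int32_t maps to depends on the platform.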
Data types are system dependent in C, and their bit lengths may change between machines. I am aware of the <inttypes.h> header, which provides fixed-width integer data types. However, this header only guarantees that the provided data type has at least the specified number N of bits. (Wiki page)
But I need data types with exact bit lengths in my applications. For example, if the data type is uint16_t, it should be 16 bits, not at least 16 bits. Now my question is: Can I define new integer data types using the "unsigned char" and "char" data types (since they will be 8 bits on every machine) as the main building blocks? Can I implement the related arithmetic operations and overload arithmetic operators like "+"? Or are there other solutions already?
Edit: My exact problem is about implementations of cryptographic algorithms like DES which require fixed bits.
You can do so by using bit-fields in structures. You can set the number of bits to a desired length.
Example:
struct defineInt{
int a:16;
};
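If you need types with an exact number of bits, use the exact-width types such as uint16_t or int32_t: wherever they are defined, they are guaranteed to have exactly that number of bits. (An implementation may also offer less common widths such as int24_t, but the exact-width types are optional and exist only where the hardware has a matching type.) C provides types like int_least8_t separately, but what you need is uint16_t and you do not need to implement anything.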
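Since the question mentions DES, here is a sketch of how an exact-width type plus masking covers widths that have no type of their own, such as the 28-bit key halves DES uses. The function name rotl28 is made up for this example, and the compile-time check assumes a C11 compiler.

```c
#include <limits.h>
#include <stdint.h>

/* If uint16_t exists at all, it is exactly 16 bits; this check documents that. */
_Static_assert(sizeof(uint16_t) * CHAR_BIT == 16, "uint16_t must be 16 bits");

/* Hypothetical helper: rotate the low 28 bits of x left by n (1 <= n <= 27).
   The value lives in a uint32_t and is masked back to 28 bits. */
uint32_t rotl28(uint32_t x, unsigned n)
{
    const uint32_t mask28 = 0x0FFFFFFFu;
    x &= mask28;
    return (uint32_t)(((x << n) | (x >> (28u - n))) & mask28);
}
```

For example, rotl28(0x8000000u, 1) is 1: the top bit of the 28-bit value wraps around to bit 0.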
Nothing in the C spec guarantees a char is 8 bits, nice idea though.
Don't (re)use char. Define your own type.
For C++11, have a look at char16_t and family.
- char is for 8-bit code units,
- char16_t is for 16-bit code units, and
- char32_t is for 32-bit code units.
There already is an int type which guarantees it's "at least" N bits namely int_leastNN_t available in stdint.h.
Also you can "roll your own" by declaring a typedef which aliases a particular int type based on either the current architecture or the sizeof the particular int size.
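A rough sketch of the "roll your own" approach, using the limits from <limits.h> to pick the alias at compile time. The type name my_u16 is hypothetical, and the final check assumes a C11 compiler.

```c
#include <limits.h>

/* Pick whichever native unsigned type has a 16-bit range. */
#if USHRT_MAX == 0xFFFF
typedef unsigned short my_u16;   /* hypothetical name */
#elif UINT_MAX == 0xFFFF
typedef unsigned int my_u16;
#else
#error "no native unsigned type with a 16-bit range on this platform"
#endif

/* Guard against padding bits: the object must be exactly 16 bits too. */
_Static_assert(sizeof(my_u16) * CHAR_BIT == 16, "my_u16 must be 16 bits");
```

With this definition, converting 65536 to my_u16 gives 0, since unsigned conversion wraps modulo 2^16.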
In C integer and short integer variables are identical: both range from -32768 to 32767, and the required bytes of both are also identical, namely 2.
So why are two different types necessary?
Basic integer types in C language do not have strictly defined ranges. They only have minimum range requirements specified by the language standard. That means that your assertion about int and short having the same range is generally incorrect.
Even though the minimum range requirements for int and short are the same, in a typical modern implementation the range of int is usually greater than the range of short.
The standard only guarantees sizeof(short) <= sizeof(int) <= sizeof(long) as far as I remember. So both short and int can be the same but don't have to. 32 bit compilers usually have 2 bytes short and 4 bytes int.
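To see what a given implementation actually does, the limits in <limits.h> can be checked directly. The sketch below assumes a C11 compiler for _Static_assert; the function names are made up for illustration, and the printed values vary by platform.

```c
#include <limits.h>
#include <stdio.h>

/* Guaranteed by the standard: minimum ranges and range ordering. */
_Static_assert(SHRT_MAX >= 32767 && INT_MAX >= 32767, "minimum ranges");
_Static_assert(SHRT_MAX <= INT_MAX && INT_MAX <= LONG_MAX, "range ordering");

/* Hypothetical helper: 1 if int has a larger range than short here. */
int int_exceeds_short(void)
{
    return INT_MAX > SHRT_MAX;
}

/* Hypothetical helper: print sizes and maxima for this implementation. */
void print_int_ranges(void)
{
    printf("short: %zu bytes, max %d\n", sizeof(short), SHRT_MAX);
    printf("int:   %zu bytes, max %d\n", sizeof(int), INT_MAX);
    printf("long:  %zu bytes, max %ld\n", sizeof(long), LONG_MAX);
}
```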
The C++ standard says the following (the C standard has a very similar paragraph; this quote is from the n3337 version of the C++11 draft specification):
Section 3.9.1, point 2:
There are five standard signed integer types: “signed char”, “short int”, “int”, “long int”, and “long long int”. In this list, each type provides at least as much storage as those preceding it in the list. There may also be implementation-defined extended signed integer types. The standard and extended signed integer types are collectively called signed integer types. Plain ints have the natural size suggested by the architecture of the execution environment; the other signed integer types are provided to meet special needs.
Different architectures have different size "natural" integers, so a 16-bit architecture will naturally calculate a 16-bit value, whereas a 32- or 64-bit architecture will use either 32- or 64-bit ints. It's a choice for the compiler producer (or the definer of the ABI for a particular architecture, which tends to be a decision formed by a combination of the OS and the "main" compiler producer for that architecture).
In modern C and C++, there are types along the lines of int32_t that are guaranteed to be exactly 32 bits. This helps portability. If these types aren't sufficient (or the project is using a not so modern compiler), it is a good idea to NOT use int in a data structure or type that needs a particular precision/size, but to define a uint32 or int32 or something similar, that can be used in all places where the size matters.
In a lot of code, the size of a variable isn't critical, because the number is within such a range, that a few thousand is way more than you ever need - e.g. number of characters in a filename is defined by the OS, and I'm not aware of any OS where a filename/path is more than 4K characters - so a 16, 32 or 64 bit value that can go to at least 32K would be perfectly fine for counting that - it doesn't really matter what size it is - so here we SHOULD use int, not try to use a specific size. int should, in a compiler be a type that is "efficient", so should help to give good performance, where some architectures will run slower if you use short, and certainly 16-bit architectures will run slower using long.
The guaranteed minimum ranges of int and short are the same. However an implementation is free to define short with a smaller range than int (as long as it still meets the minimum), which means that it may be expected to take the same or smaller storage space than int (see note 1 below). The standard says of int that:
A ‘‘plain’’ int object has the natural size suggested by the
architecture of the execution environment.
Taken together, this means that (for values that fall into the range -32767 to 32767) portable code should prefer int in almost all cases. The exception would be where a very large number of values are being stored, such that the potentially smaller storage space occupied by short is a consideration.
1. Of course a pathological implementation is free to define a short that has a larger size in bytes than int, as long as it still has equal or lesser range - there is no good reason to do so, however.
They are identical on a 16-bit IBM-compatible PC. However, there is no guarantee that they will be identical on other hardware.
On a VAX system (VAX stands for Virtual Address eXtension), the two types are treated differently: a short integer occupies 2 bytes and an integer occupies 4 bytes.
That is why there are two distinct types, even though their properties are sometimes identical.
For general-purpose use on desktops and laptops, we use integer.
I want to declare a bitfield with the size specified using a colon (I can't remember what the syntax is called). I want to write this:
void myFunction()
{
unsigned int thing : 12;
...
}
But GCC says it's a syntax error (it thinks I'm trying to write a nested function). I have no problem doing this though:
struct thingStruct
{
unsigned int thing : 4;
};
and then putting one such struct on the stack
void myFunction()
{
struct thingStruct thing;
...
}
This leads me to believe that it's being prevented by syntax, not semantic issues.
So why won't the first example work? What am I missing?
The first example won't work because you can only declare bitfields inside structs. This is syntax, not semantics, as you said, but there it is. If you want a bitfield, use a struct.
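A minimal sketch of that workaround for the 12-bit case from the question. The names Thing12 and store12 are made up for illustration. The local variable is a struct, and the 12-bit field inside it wraps modulo 4096 on store.

```c
/* Hypothetical wrapper struct: the only place a bitfield may be declared. */
struct Thing12 {
    unsigned int value : 12;   /* holds 0..4095 */
};

/* Hypothetical helper: store v in a local 12-bit field and read it back. */
unsigned store12(unsigned v)
{
    struct Thing12 thing = { 0 };   /* an ordinary automatic variable */
    thing.value = v;                /* reduced modulo 4096 */
    return thing.value;
}
```

For example, store12(4095) returns 4095 and store12(4096) returns 0.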
Why would you want to do such a thing? A bit field of 12 would on all common architectures be padded to at least 16 or 32 bits.
If you want to ensure the width of an integer variable, use the types in inttypes.h, e.g. int16_t or int32_t.
As others have said, bitfields must be declared inside a struct (or union, but that's not really useful). Why? Here are two reasons.
Mainly, it's to make the compiler writer's job easier. Bitfields tend to require more machine instructions to extract the bits from the bytes. Only fields can be bitfields, and not variables or other objects, so the compiler writer doesn't have to worry about them if there is no . or -> operator involved.
But, you say, sometimes the language designers make the compiler writer's job harder in order to make the programmer's life easier. Well, there is not a lot of demand from programmers for bitfields outside structs. The reason is that programmers pretty much only bother with bitfields when they're going to cram several small integers inside a single data structure. Otherwise, they'd use a plain integral type.
Other languages have integer range types, e.g., you can specify that a variable ranges from 17 to 42. There isn't much call for this in C because C never requires that an implementation check for overflow. So C programmers just choose a type that's capable of representing the desired range; it's their job to check bounds anyway.
C89 (i.e., the version of the C language that you can find just about everywhere) offers a limited selection of types that have at least n bits. There's unsigned char for 8 bits, unsigned short for 16 bits and unsigned long for 32 bits (plus signed variants). C99 offers a wider selection of types called uint_least8_t, uint_least16_t, uint_least32_t and uint_least64_t. These types are guaranteed to be the smallest types with at least that many value bits. An implementation can provide types for other number of bits, such as uint_least12_t, but most don't. These types are defined in <stdint.h>, which is available on many C89 implementations even though it's not required by the standard.
Bitfields provide a consistent syntax to access certain implementation-dependent functionality. The most common purpose of that functionality is to place certain data items into bits in a certain way, relative to each other. If two items (bit-fields or not) are declared as consecutive items in a struct, they are guaranteed to be stored consecutively. No such guarantee exists with individual variables, regardless of storage class or scope. If a struct contains:
struct foo {
unsigned bar: 1;
unsigned boz: 1;
};
it is guaranteed that bar and boz will be stored consecutively (most likely in the same storage location, though I don't think that's actually guaranteed). By contrast, if 'bar' and 'boz' were single-bit automatic variables, there's no telling where they would be stored, so there'd be little benefit to having them as bitfields. If they did share space with some other variable, it would be hard to make sure that different functions reading and writing different bits in the same byte didn't interfere with each other.
Note that some embedded-systems compilers do expose a genuine 'bit' type, whose variables are packed eight to a byte. Such compilers generally have an area of memory which is allocated for storing nothing but bit variables, and the processors for which they generate code have atomic instructions to test, set, and clear individual bits. Since the memory locations holding the bits are only accessed using such instructions, there's no danger of conflicts.