I am a student currently learning the C programming language through a book called "C Primer Plus, 5th edition". I am learning it because I am pursuing a career in programming for embedded systems and devices, device drivers, low-level stuff, etc. My question is very simple, but I have not yet gotten a straight answer from the textbook & from various posts on SO that are similar to my question.
How do you determine the size of integer data types like SHORT, INT, or LONG? I know that this is a simple question that has been asked a lot, but everyone seems to answer the question with "depends on architecture/compiler", which leaves me clueless and doesn't help someone like me who is a novice.
Is there a hidden chart somewhere on the internet that will clearly describe these incompatibilities or is there some numerical method of looking at a compiler (16-bit, 24-bit, 32-bit, 64-bit, etc) and being able to tell what the data type will be? Or is manually using the sizeof operator with a compiler on a particular system the only way to tell what these data types will hold?
You just need the right docs. In your case that is the document that defines the standard, and you should name at least one version of it when asking this kind of question; for example, C99 is one of the most popular versions of the language, and it is defined in ISO/IEC 9899:1999.
The C standard doesn't define the sizes in absolute terms; it goes for minimums instead (given as the minimum range of values each type must be able to hold), and sometimes not even that.
The notable exception is char, which is guaranteed to be 1 byte in size. But here is another potential pitfall for you: the C standard doesn't fix how big a byte is (only that it has at least 8 bits), so it says that char is 1 byte, but you can't say anything more for sure without knowing your platform.
You always need to know both the standard and your platform. If you want to do this programmatically, there is the limits.h header with macros for your platform.
You're looking for limits.h. It defines various macros such as INT_MAX (the maximum value of type int) or CHAR_BIT (the number of bits in a char). You can use these values to calculate the size of each type.
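As a rough sketch of that calculation: the macro BITS_IN and the function report_sizes below are names made up for this example, not anything from limits.h itself. It counts storage bits via sizeof and CHAR_BIT, which includes any padding bits (there are none on mainstream platforms, but that is an assumption).

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: storage size of a type in bits (padding bits included). */
#define BITS_IN(type) (sizeof(type) * CHAR_BIT)

/* Print the sizes and limits that this particular compiler/platform uses. */
void report_sizes(void)
{
    printf("CHAR_BIT = %d\n", CHAR_BIT);
    printf("short: %zu bits, max %d\n", BITS_IN(short), SHRT_MAX);
    printf("int:   %zu bits, max %d\n", BITS_IN(int), INT_MAX);
    printf("long:  %zu bits, max %ld\n", BITS_IN(long), LONG_MAX);
}
```

Calling report_sizes() from main on each target you care about gives you the "chart" for that compiler; the numbers printed will differ between platforms.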
From Modern C by Jens Gustedt,
Representations of values on a computer can vary “culturally” from architecture to architecture or are determined by the type the programmer gave to the value. Therefore, we should try to reason primarily about values and not about representations if we want to write portable code.
If you already have some experience in C and in manipulating bytes and bits, you will need to make an effort to actively “forget” your knowledge for most of this section. Thinking about concrete representations of values on your computer will inhibit you more than it helps.
Takeaway - C programs primarily reason about values and not about their representation.
Question 1: What kind of 'representations' of values, is author talking about? Could I be given an example, where this 'representation' varies from architecture to architecture and also an example of how representations of values are determined by type programmer gave to value?
Question 2: What's the purpose of specifying a data type in C language, I mean that's the rule of the language but I have heard that's how a compiler knows how much memory to allocate to an object? Is that the only use, albeit crucial? I've heard there isn't a need to specify a data type in Python.
What kind of 'representations' of values, is author talking about?
https://en.wikipedia.org/wiki/Two%27s_complement vs https://en.wikipedia.org/wiki/Ones%27_complement vs https://en.wikipedia.org/wiki/Offset_binary. Generally https://en.wikipedia.org/wiki/Signed_number_representations.
But also the vast space of floating point number formats https://en.wikipedia.org/wiki/Floating-point_arithmetic#IEEE_754:_floating_point_in_modern_computers - IEEE 754, minifloat, bfloat16, etc.
Could I be given an example, where this 'representation' varies from architecture to architecture
Your PC uses twos complement vs https://superuser.com/questions/1137182/is-there-any-existing-cpu-implementation-which-uses-ones-complement .
Ach - but of course, most notably https://en.wikipedia.org/wiki/Endianness .
also an example of how representations of values are determined by type programmer gave to value?
(float)1 is represented in IEEE 754 as 0b00111111100000000000000000000000 https://www.h-schmidt.net/FloatConverter/IEEE754.html .
(unsigned)1 with 32-bit int is represented as 0b00.....0001.
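To see those two bit patterns yourself, here is a minimal sketch. float_bits is a hypothetical helper written for this answer, and it assumes float is a 32-bit IEEE 754 binary32 value, which the C standard does not promise.

```c
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: float occupies exactly 32 bits. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes a 32-bit float");

/* Hypothetical helper: copy the object representation of a float
   into an integer so the raw bits can be inspected. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

On an IEEE 754 platform float_bits((float)1) is 0x3F800000, while the value (unsigned)1 is simply 0x00000001: the same value 1, two different representations chosen by the type.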
What's the purpose of specifying a data type in C language,
Use computer resources efficiently. There is no point in reserving 2 gigabytes to store 8 bits of data. Type determines the range of values that can be "contained" in a variable. You communicate that "upper/lower range" of allowed values to the compiler, and the compiler generates nice and fast code. (There is also Ada, where you literally specify the range of types, like type Day_type is range 1 .. 31;).
Programs are written using https://en.wikipedia.org/wiki/Harvard_architecture . Variables at block scope are put on stack https://en.wikipedia.org/wiki/Stack_(abstract_data_type)#Hardware_stack . The idea is that you have to know in advance how many bytes to reserve from the stack. Types communicate just that.
have heard that's how a compiler knows how much memory to allocate to an object?
Type communicates to the compiler how much memory to allocate for an object, but it also communicates the range of values and the representation (float and _Float32 might look similar, but they are different types). Overflowing addition of two ints is undefined behavior; overflowing addition of two unsigned values is fine and wraps around. There are differences.
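A small sketch of that last difference. The function names add_wrapping and add_checked are invented for this illustration; the pre-check is one common way to avoid signed overflow, not the only one.

```c
#include <limits.h>

/* Unsigned arithmetic is defined to wrap modulo UINT_MAX + 1. */
unsigned add_wrapping(unsigned a, unsigned b)
{
    return a + b;
}

/* Signed overflow is undefined, so test before adding.
   Returns 1 and stores the sum on success, 0 if a + b would overflow. */
int add_checked(int a, int b, int *sum)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return 0;
    *sum = a + b;
    return 1;
}
```

add_wrapping(UINT_MAX, 1u) is 0 on every conforming implementation, whereas add_checked(INT_MAX, 1, &s) refuses to perform the addition at all.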
Is that the only use, albeit crucial?
The most important use of types is to clearly communicate the purpose of your code to other developers.
char character;
int numerical_variable;
uint_least8_t variable_with_8_bits_that_is_optimized_for_size;
uint_fast8_t variable_with_8_bits_that_is_optimized_for_speed;
wchar_t wide_character;
FILE *this_is_a_file;
I've heard there isn't a need to specify a data type in Python.
This is literally the difference between statically typed programming languages and dynamically typed programming languages. https://en.wikipedia.org/wiki/Type_system#Type_checking
I read here that it depends on the specific compiler, so you always have to use sizeof() to find out the size. However, when reading about datatype sizes I always read things such as "on a typical 64-bit machine datatype_x is x_bytes long" or "on a 16-bit machine...".
Why is it like this?
What's the correlation between datatype size and the machine architecture?
Edit: The reason why I posted this question despite there being similar duplicates is that I'm not content with the answer "It depends on the compiler, and the compiler is usually made to achieve the best performance on the system". I wanted to know why a certain size for the datatype on a given bit system is considered to give the best performance. I guess it has to do with how instructions are processed by the CPU, but I didn't want to go and read a bunch about CPUs and such; I just want to know the part that's relevant to this question.
"What's the correlation between datatype size and the machine architecture?"
There is no defined correlation. There is a tendency or more like guidelines than actual rules that int corresponds to the processor's integer width and is at least 16 bits. Minimum "limit of size_t ... 65535" C11 §7.20.3 1
Why is it like this?
That is the strength of C. size_t follows the processor's "best/native" size, making for good performance, tightness of executable code and a fit to the platform's memory capacity. Yet there are exceptions to the guideline.
It is also a weakness of C in that it varies from platform to platform.
Use fixed width types like int32_t if the code's goals require it.
I'm much more of a sysadmin than a programmer. But I do spend an inordinate amount of time grovelling through programmers' code trying to figure out what went wrong. And a disturbing amount of that time is spent dealing with problems when the programmer expected one definition of __u_ll_int32_t or whatever (yes, I know that's not real), but either expected the file defining that type to be somewhere other than it is, or (and this is far worse but thankfully rare) expected the semantics of that definition to be something other than it is.
As I understand C, it deliberately doesn't make width definitions for integer types (and that this is a Good Thing), but instead gives the programmer char, short, int, long, and long long, in all their signed and unsigned glory, with defined minima which the implementation (hopefully) meets. Furthermore, it gives the programmer various macros that the implementation must provide to tell you things like the width of a char, the largest unsigned long, etc. And yet the first thing any non-trivial C project seems to do is either import or invent another set of types that give them explicitly 8, 16, 32, and 64 bit integers. This means that as the sysadmin, I have to have those definition files in a place the programmer expects (that is, after all, my job), but then not all of the semantics of all those definitions are the same (this wheel has been re-invented many times) and there's no non-ad-hoc way that I know of to satisfy all of my users' needs here. (I've resorted at times to making a <bits/types_for_ralph.h>, which I know makes puppies cry every time I do it.)
What does trying to define the bit-width of numbers explicitly (in a language that specifically doesn't want to do that) gain the programmer that makes it worth all this configuration management headache? Why isn't knowing the defined minima and the platform-provided MAX/MIN macros enough to do what C programmers want to do? Why would you want to take a language whose main virtue is that it's portable across arbitrarily-bitted platforms and then typedef yourself into specific bit widths?
When a C or C++ programmer (hereinafter addressed in second-person) is choosing the size of an integer variable, it's usually in one of the following circumstances:
You know (at least roughly) the valid range for the variable, based on the real-world value it represents. For example,
numPassengersOnPlane in an airline reservation system should accommodate the largest supported airplane, so needs at least 10 bits. (Round up to 16.)
numPeopleInState in a US Census tabulating program needs to accommodate the most populous state (currently about 38 million), so needs at least 26 bits. (Round up to 32.)
In this case, you want the semantics of int_leastN_t from <stdint.h>. It's common for programmers to use the exact-width intN_t here, when technically they shouldn't; however, 8/16/32/64-bit machines are so overwhelmingly dominant today that the distinction is merely academic.
You could use the standard types and rely on constraints like “int must be at least 16 bits”, but a drawback of this is that there's no standard maximum size for the integer types. If int happens to be 32 bits when you only really needed 16, then you've unnecessarily doubled the size of your data. In many cases (see below), this isn't a problem, but if you have an array of millions of numbers, then you'll get lots of page faults.
Your numbers don't need to be that big, but for efficiency reasons, you want a fast, “native” data type instead of a small one that may require time wasted on bitmasking or zero/sign-extension.
These are the int_fastN_t types in <stdint.h>. However, it's common to just use the built-in int here, which in the 16/32-bit days had the semantics of int_fast16_t. It's not the native type on 64-bit systems, but it's usually good enough.
The variable is an amount of memory, array index, or casted pointer, and thus needs a size that depends on the amount of addressable memory.
This corresponds to the typedefs size_t, ptrdiff_t, intptr_t, etc. You have to use typedefs here because there is no built-in type that's guaranteed to be memory-sized.
The variable is part of a structure that's serialized to a file using fread/fwrite, or called from a non-C language (Java, COBOL, etc.) that has its own fixed-width data types.
In these cases, you truly do need an exact-width type.
You just haven't thought about the appropriate type, and use int out of habit.
Often, this works well enough.
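Putting the circumstances above side by side, a sketch might look like this. The struct census_row, its record_id field and the function total_passengers are hypothetical names chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: each field uses the <stdint.h> family matching its need. */
struct census_row {
    int_least32_t num_people_in_state; /* known range: needs at least 26 bits */
    int_least16_t num_passengers;      /* known range: needs at least 10 bits */
    uint32_t      record_id;           /* goes into a file: exact width wanted */
};

/* Sum n passenger counts. size_t is the memory-sized index type;
   int_fast32_t is the "fast" accumulator that is at least 32 bits wide. */
int_fast32_t total_passengers(const int_least16_t *counts, size_t n)
{
    int_fast32_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += counts[i];
    return total;
}
```

The array stays compact because its elements are int_least16_t, while the running total is free to use whatever width the machine handles best.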
So, in summary, all of the typedefs from <stdint.h> have their use cases. However, the usefulness of the built-in types is limited due to:
Lack of maximum sizes for these types.
Lack of a native memsize type.
The arbitrary choice between LP64 (on Unix-like systems) and LLP64 (on Windows) data models on 64-bit systems.
As for why there are so many redundant typedefs of fixed-width (WORD, DWORD, __int64, gint64, FINT64, etc.) and memsize (INT_PTR, LPARAM, VPTRDIFF, etc.) integer types, it's mainly because <stdint.h> came late in C's development, and people are still using older compilers that don't support it, so libraries need to define their own. Same reason why C++ has so many string classes.
Sometimes it is important. For example, most image file formats require an exact number of bits/bytes be used (or at least specified).
If you only wanted to share a file created by the same compiler on the same computer architecture, you would be correct (or at least things would work). But in real life, things like file specifications and network packets are created by a variety of computer architectures and compilers, so we have to care about the details in these cases (at least).
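For instance, a file or packet format that says "32-bit field, most significant byte first" can be honored on any host with something like the sketch below. put_u32_be and get_u32_be are made-up helper names, and big-endian is just the byte order picked for this example.

```c
#include <stdint.h>

/* Store a 32-bit value as 4 bytes, most significant byte first,
   regardless of the byte order of the machine running this code. */
void put_u32_be(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)((v >> 24) & 0xFFu);
    out[1] = (unsigned char)((v >> 16) & 0xFFu);
    out[2] = (unsigned char)((v >> 8) & 0xFFu);
    out[3] = (unsigned char)(v & 0xFFu);
}

/* Rebuild the value from the same 4-byte layout. */
uint32_t get_u32_be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

Because the field width is pinned to 32 bits and the byte order is written out explicitly, the bytes on disk or on the wire look the same whichever machine produced them.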
The main reason the fundamental types can't be fixed is that a few machines don't use 8-bit bytes. Enough programmers don't care, or actively want not to be bothered with support for such beasts, that the majority of well-written code demands a specific number of bits wherever overflow would be a concern.
It's better to specify a required range than to use int or long directly, because asking for "relatively big" or "relatively small" is fairly meaningless. The point is to know what inputs the program can work with.
By the way, usually there's a compiler flag that will adjust the built-in types. See INT_TYPE_SIZE for GCC. It might be cleaner to stick that into the makefile, than to specialize the whole system environment with new headers.
If you want portable code, you want the code you write to function identically on all platforms. If you have
int i = 32767;
you can't say for certain what i+1 will give you on all platforms.
This is not portable. Some compilers (on the same CPU architecture!) will give you -32768 and some will give you 32768. Some perverted ones will give you 0. That's a pretty big difference. Granted if it overflows, this is Undefined Behavior, but you don't know it is UB unless you know exactly what the size of int is.
If you use the fixed-width integer definitions (from <stdint.h>, introduced by the ISO/IEC 9899:1999 standard), then you know exactly what adding 1 will give.
int16_t i = 32767;
i+1 no longer fits in an int16_t (and on most compilers, storing it back makes i appear to be -32768)
uint16_t j = 32767;
j+1 gives 32768;
int8_t i = 32767; // should be a warning but maybe not. most compilers will set i to -1
i+1 gives 0; // in this case, the addition didn't overflow
uint8_t j = 32767; // should be a warning but maybe not. most compilers will set j to 255
j+1, stored back into j, gives 0;
int32_t i = 32767;
i+1 gives 32768;
uint32_t j = 32767;
j+1 gives 32768;
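The unsigned cases can be checked with a short sketch, since conversion to an unsigned fixed-width type is fully defined (the value is reduced modulo 2^N). The helper names low8, inc8, inc16 and inc32 are invented here. Note that the arithmetic itself happens after integer promotion, so the cast back to the narrow type is what produces the wraparound.

```c
#include <stdint.h>

/* Keep only the low 8 bits of v (well-defined conversion to unsigned). */
uint8_t low8(int v) { return (uint8_t)v; }

/* j + 1 is computed in a promoted type; the cast back reduces it modulo 2^N. */
uint8_t  inc8(uint8_t j)   { return (uint8_t)(j + 1); }
uint16_t inc16(uint16_t j) { return (uint16_t)(j + 1); }

/* Signed: the caller must make sure i < INT32_MAX, otherwise this overflows. */
int32_t inc32(int32_t i) { return i + 1; }
```

With these, low8(32767) is 255, inc8(255) is 0, inc16(32767) is 32768 and inc32(32767) is 32768 on any platform that provides the exact-width types.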
There are two opposing forces at play here:
The need for C to adapt to any CPU architecture in a natural way.
The need for data transferred to/from a program (network, disk, file, etc.) so that a program running on any architecture can correctly interpret it.
The "CPU matching" need has to do with inherent efficiency. There is a CPU quantity that is most easily handled as a single unit, on which all arithmetic operations are performed easily and efficiently, and which needs the fewest bits of instruction encoding. That type is int. It could be 16 bits, 18 bits*, 32 bits, 36 bits*, 64 bits, or even 128 bits on some machines. (* These were some not-well-known machines from the 1960s and 1970s which may have never had a C compiler.)
Data transfer needs when transferring binary data require that record fields are the same size and alignment. For this it is quite important to have control of data sizes. There is also endianness and maybe binary data representations, like floating point representations.
A program which forces all integer operations to be 32 bit in the interests of size compatibility will work well on some CPU architectures, but not others (especially 16 bit, but also perhaps some 64-bit).
Using the CPU's native register size is preferable if all data interchange is done in a non-binary format, like XML or SQL (or any other ASCII encoding).
When should one use the datatypes from stdint.h?
Is it right to always use them as a convention?
What was the purpose of the design of nonspecific size types like int and short?
When should one use the datatypes from stdint.h?
When the programming tasks specify the integer width especially to accommodate some file or communication protocol format.
When a high degree of portability between platforms is required over performance.
Is it right to always use them as a convention (then)?
Things are leaning that way. The fixed width types are a more recent addition to C. Original C had char, short, int, long and that was progressive as it tried, without being too specific, to accommodate the various integer sizes available across a wide variety of processors and environments. As C is 40ish years old, it speaks to the success of that strategy. Much C code has been written and successfully copes with the soft integer size specification. With increasing needs for consistency, char, short, int, long and long long are not enough (or at least not so easy), and so int8_t, int16_t, int32_t, int64_t were born. New languages tend to require very specific fixed integer size types and 2's complement. As they are successful, that Darwinian pressure will push on C. My crystal ball says we will see a slow migration to increasing use of fixed width types in C.
What was the purpose of the design of nonspecific size types like int and short?
It was a good first step to accommodate the wide variety of various integer widths (8,9,12,18,36, etc.) and encodings (2's, 1's, sign/mag). So much coding today uses power-of-2 size integers with 2's complement, that one may not realize that many other arrangements existed beforehand. See this answer also.
My work demands that I use them and I actually love using them.
I find it useful when I have to implement a protocol and use them inside a structure which can be a message that needs to be sent out or a holder of certain information.
If I have to use a sequence number that needs to be incremented, I wouldn't use int because sequence numbers aren't supposed to be negative. I use uint32_t instead. I will hence know the sequence number space and can plan/code accordingly.
The code we write will be running on 32-bit as well as 64-bit machines, so using "int" on machines of different widths results in subtle bugs which can be a pain to identify. Using uint16_t will allocate 16 bits on either a 32-bit or a 64-bit architecture.
No, I would say it's never a good idea to use those for general-purpose programming.
If you really care about number of bits, then go ahead and use them but for most general use you don't care so then use the general types. The general types might be faster, and they are certainly easier to read and write.
Fixed width datatypes should be used only when really required (e.g. when implementing transfer protocols, accessing hardware, or requiring a certain range of values; in the last case you should use the ..._least_... variant). Otherwise your program won't adapt to changed environments (e.g. using uint32_t for file sizes might have been OK 10 years ago, but off_t will adapt to recent needs). As others have pointed out, there might be a performance impact, as int might be faster than uint32_t on 16-bit platforms.
int itself is very problematic due to its signedness; it is better to use e.g. size_t when variable holds result of strlen() or sizeof().
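A minimal sketch of that advice, using a made-up function count_char: the length returned by strlen(), the loop index and the resulting count are all size_t, so no signed/unsigned mixing occurs.

```c
#include <stddef.h>
#include <string.h>

/* Count how many times c occurs in the string s. */
size_t count_char(const char *s, char c)
{
    size_t len = strlen(s);   /* strlen() returns size_t, keep it that way */
    size_t count = 0;
    for (size_t i = 0; i < len; i++)
        if (s[i] == c)
            count++;
    return count;
}
```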
I already know that stdint is used when you need specific variable sizes for portability between platforms. I don't really have such an issue for now, but what are the pros and cons of using it besides the fact already mentioned above?
Looking for this on stackoverflow and other sites, I found 2 links that deal with the topic:
codealias.info - this one talks about the portability of the stdint.
stackoverflow - this one is more specific about uint8_t.
These two links are great, especially if one is looking to know more about the main reason for this header: portability. But for me, what I like most about it is that I think uint8_t is cleaner than unsigned char (for storing an RGB channel value, for example), int32_t looks more meaningful than simply int, etc.
So, my question is, exactly what are the pros and cons of using stdint besides the portability? Should I use it just in some specific parts of my code, or everywhere? If everywhere, how can I use functions like atoi(), strtok(), etc. with it?
Thanks!
Pros
Using well-defined types makes the code far easier and safer to port, as you won't get any surprises when for example one machine interprets int as 16-bit and another as 32-bit. With stdint.h, what you type is what you get.
Using int etc also makes it hard to detect dangerous type promotions.
Another advantage is that by using int8_t instead of char, you know that you always get a signed 8 bit variable. char can be signed or unsigned, it is implementation-defined behavior and varies between compilers. Therefore, the default char is plain dangerous to use in code that should be portable.
If you want to hint to the compiler that a variable should be optimized for speed, you can use uint_fastx_t, which tells the compiler to use the fastest possible integer type that is at least as large as 'x'. Most of the time this doesn't matter; the compiler is smart enough to make optimizations on type sizes no matter what you have typed in. Between sequence points, the compiler can implicitly change the type to another one than specified, as long as it doesn't affect the result.
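The point above about char signedness can be observed with a sketch like this one. The function names are hypothetical, and the conversions of 0xFF to a signed type are implementation-defined; mainstream compilers wrap them to -1, which is what the comments assume.

```c
#include <limits.h>
#include <stdint.h>

/* 1 if plain char is signed on this implementation, 0 if it is unsigned. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* The byte 0xFF read back through plain char:
   typically -1 where char is signed, 255 where it is unsigned. */
int byte_ff_as_char(void)
{
    char c = (char)0xFF;
    return c;
}

/* The same byte through int8_t: always a signed 8-bit value (typically -1). */
int byte_ff_as_int8(void)
{
    int8_t c = (int8_t)0xFF;
    return c;
}
```

Code that compares or shifts such a byte behaves differently on the two kinds of platform when it uses plain char, and the same everywhere when it uses int8_t or uint8_t.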
Cons
None.
Reference: MISRA-C:2004 rule 6.3."typedefs that indicate size and signedness shall be used in place of the basic types".
EDIT : Removed incorrect example.
The only reason to use uint8_t rather than unsigned char (aside from aesthetic preference) is if you want to document that your program requires char to be exactly 8 bits. uint8_t exists if and only if CHAR_BIT==8, per the requirements of the C standard.
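One way to make that requirement explicit is sketched below. The typedef name octet and the function has_exact_uint8 are invented for this example; the check relies on UINT8_MAX being defined exactly when uint8_t is provided.

```c
#include <limits.h>
#include <stdint.h>

#ifdef UINT8_MAX
/* uint8_t exists here, so bytes must be exactly 8 bits wide. */
_Static_assert(CHAR_BIT == 8, "uint8_t implies 8-bit bytes");
typedef uint8_t octet;
#else
/* No exact 8-bit type: fall back to the smallest type with at least 8 bits. */
typedef uint_least8_t octet;
#endif

/* Report at run time which branch was taken. */
int has_exact_uint8(void)
{
#ifdef UINT8_MAX
    return 1;
#else
    return 0;
#endif
}
```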
The rest of the intX_t and uintX_t types are useful in the following situations:
reading/writing disk/network (but then you also have to use endian conversion functions)
when you want unsigned wraparound behavior at an exact cutoff (but this can be done more portably with the & operator).
when you're controlling the exact layout of a struct because you need to ensure no padding exists (e.g. for memcmp or hashing purposes).
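The "& operator" alternative mentioned in the list could look like this sketch. inc_mod_65536 is a hypothetical name; it wraps at 2^16 using uint_least16_t, which every implementation must provide, instead of depending on uint16_t.

```c
#include <stdint.h>

/* Increment a counter that must wrap at 2^16, masking explicitly
   so the cutoff does not depend on the width of the type. */
uint_least16_t inc_mod_65536(uint_least16_t x)
{
    return (uint_least16_t)((x + 1u) & 0xFFFFu);
}
```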
On the other hand, the uint_least8_t, etc. types are useful anywhere that you want to avoid using wastefully large or slow types but need to ensure that you can store values of a certain magnitude. For example, while long long is at least 64 bits, it might be 128-bit on some machines, and using it when what you need is just a type that can store 64 bit numbers would be very wasteful on such machines. int_least64_t solves the problem.
I would avoid using the [u]int_fastX_t types entirely since they've sometimes changed on a given machine (breaking the ABI) and since the definitions are usually wrong. For instance, on x86_64, the 64-bit integer type is considered the "fast" one for 16-, 32-, and 64-bit values, but while addition, subtraction, and multiplication are exactly the same speed whether you use 32-bit or 64-bit values, division is almost surely slower with larger-than-necessary types, and even if they were the same speed, you're using twice the memory for no benefit.
Finally, note that the arguments some answers have made about the inefficiency of using int32_t for a counter when it's not the native integer size are technically mostly correct, but it's irrelevant to correct code. Unless you're counting some small number of things where the maximum count is under your control, or some external (not in your program's memory) thing where the count might be astronomical, the correct type for a count is almost always size_t. This is why all the standard C functions use size_t for counts. Don't consider using anything else unless you have a very good reason.
cons
The primary reason the C language does not specify the size of int or long, etc. is computational efficiency. Each architecture has a natural, most-efficient size, and the designers specifically empowered and intended the compiler implementor to use the natural native data size for speed and code size efficiency.
In years past, communication with other machines was not a primary concern—most programs were local to the machine—so the predictability of each data type's size was of little concern.
Insisting that a particular architecture use a particular size int to count with is a really bad idea, even though it would seem to make other things easier.
In a way, thanks to XML and its brethren, data type size again is no longer much of a concern. Shipping machine-specific binary structures from machine to machine is again the exception rather than the rule.
I use stdint types for one reason only, when the data I hold in memory shall go on disk/network/descriptor in binary form. You only have to fight the little-endian/big-endian issue but that's relatively easy to overcome.
The obvious reason not to use stdint is when the code is size-independent, in maths terms everything that works over the rational integers. It would produce ugly code duplicates if you provided a uint*_t version of, say, qsort() for every expansion of *.
I use my own types in that case, derived from size_t when I'm lazy or the largest supported unsigned integer on the platform when I'm not.
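To illustrate the qsort() remark: the standard function is already size-independent, so one call shape serves every integer width and only the comparison callback knows the element type. The wrappers sort_u16 and sort_u64 and their comparators are hypothetical names for this sketch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Comparators: the only place where the element type matters. */
static int cmp_u16(const void *a, const void *b)
{
    uint16_t x = *(const uint16_t *)a, y = *(const uint16_t *)b;
    return (x > y) - (x < y);
}

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* The same generic qsort() handles both widths via the element size argument. */
void sort_u16(uint16_t *v, size_t n) { qsort(v, n, sizeof *v, cmp_u16); }
void sort_u64(uint64_t *v, size_t n) { qsort(v, n, sizeof *v, cmp_u64); }
```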
Edit, because I ran into this issue earlier:
I think it's noteworthy that at least uint8_t, uint32_t and uint64_t are broken in Solaris 2.5.1.
So for maximum portability I still suggest avoiding stdint.h (at least for the next few years).