What is the purpose of "Macros for minimum-width integer constants"

What is the purpose of "Macros for minimum-width integer constants" - c

In C99 standard Section 7.18.4.1 "Macros for minimum-width integer constants", some macros defined as [U]INT[N]_C(x) for casting constant integers to least data types where N = 8, 16, 32, 64. Why are these macros defined since I can use L, UL, LL or ULL modifiers instead? For example, when I want to use at least 32 bits unsigned constant integer, I can simply write 42UL instead of UINT32_C(42). Since long data type is at least 32 bits wide it is also portable.
So, what is the purpose of these macros?

You'd need them in places where you want to make sure that they don't become too wide
#define myConstant UINT32_C(42)
and later
printf( "%" PRId32 " is %s\n", (hasproperty ? toto : myConstant), "rich");
here, if the constant would be UL the expression might be ulong and the variadic function could put a 64bit value on the stack that would be misinterpreted by printf.

They use the smallest integer type with a width of at least N, so UINT32_C(42) is only equivalent to 42UL on systems where int is smaller than 32 bits. On systems where int is 32 bits or greater, UINT32_C(42) is equivalent to 42U. You could even imagine a system where a short is 32 bits wide, in which case UINT32_C(42) would be equivalent to (unsigned short)42.
EDIT: #obareey It seems that most, if not all, implementations of the standard library do not comply with this part of the standard, perhaps because it is impossible. [glibc bug 2841] [glibc commit b7398be5]

The macros essentially possibly add an integer constant suffix such as
L, LL, U, UL, o UL to their argument, which basically makes them almost equivalent to the corresponding cast, except that the suffix won't ever downcast.
E.g.,
UINT32_C(42000000000) (42 billion) on an LLP64 architecture will turn into 42000000000U, which will have type UL subject to the rules explained here. The corresponding cast, on the other hand ((uint32_t)42000000000), would truncate it down to uint32_t (unsigned int on LLP64).
I can't think of a good use case, but I imagine it could be usable in some generic bit-twiddling macros that need at least X bits to work, but don't want to remove any extra bits if the user passes in something bigger.

Related

Determine data model by C preprocessor

I want to write a .h file conforming to C89 that would be understood by most C preprocessors like gcc, cl (Visual Studio) etc. and that would determine the data model used, i.e. how many bits the (unsigned) short, (unsigned) int and (unsigned) long types occupy. Where can I find the necessary macros? For instance, are there macros I can evaluate in order to find out whether the data model is e.g. ILP32, LP64, LLP64 or something else? It is fine for me to use compiler-specific macros, but I do not want to use architecture-specific or OS-specific macros. If possible, please also provide the necessary macros to check which compiler is used. Thank you!
ADDED 2: The goal is to allow for type definitions depending on the data model. For instance, if long is at least 48-bit wide, a 48-bit key type could be defined as long, but if not, I would need a struct for that. On the other hand, I do not want to rely on anything not guaranteed by C89, like "long is either 32-bit or 64-bit, so if ULONG_MAX != 0xFFFFFFFFlu, then long is wider than 48-bit", which does not have to be true on all C89-conforming compilers.
ADDED 1: Here, the predefined GCC macros are described. Hence I can do the following:
#if defined(__GNUC__)
#define SHRT_BIT __SHRT_WIDTH__
#define INT_BIT __INT_WIDTH__
#define LONG_BIT __LONG_WIDTH__
#elif ???
# ???
#endif
printf(
"short is %u-bit\n"
"int is %u-bit\n"
"long is %u-bit\n",
SHRT_BIT, INT_BIT, LONG_BIT
);
Are there similar macros for other widely used compilers like cl (Visual Studio), which I could add at the location of the ??? in the code?

You can #include <limits.h> and test the values of SHRT_MAX, USHRT_MAX, INT_MAX, UINT_MAX, LONG_MAX, and ULONG_MAX.
For the signed types, if the maximum value is at least 1,073,741,824 (230), the type has at least 32 bits (31 for the value and 1 for the sign), and similarly for other powers of two. For the unsigned types, one fewer bit is indicated (no sign bit), so comparing to 2,147,483,648 (231) would indicate whether the type has at least 32 bits.
These of course give you only the width of a type, the number of bits used for the value and sign (if present). Its actual size may be larger due to padding bits.

What is the difference between using INTXX_C macros and performing type cast to literals?

For example this code is broken (I've just fixed it in actual code..).
uint64_t a = 1 << 60
It can be fixed as,
uint64_t a = (uint64_t)1 << 60
but then this passed my brain.
uint64_t a = UINT64_C(1) << 60
I know that UINT64_C(1) is a macro that expands usually as 1ul in 64-bit systems, but then what makes it different than just doing a type cast?

There is no obvious difference or advantage, these macros are kind of redundant. There are some minor, subtle differences between the cast and the macro:
(uintn_t)1 might be cumbersome to use for preprocessor purposes, whereas UINTN_C(1) expands into a single pp token.
The resulting type of the UINTN_C is actually uint_leastn_t and not uintn_t. So it is not necessarily the type you expected.
Static analysers for coding standards like MISRA-C might moan if you type 1 rather than 1u in your code, since shifting signed integers isn't a brilliant idea regardless of their size.
(uint64_t)1u is MISRA compliant, UINT64_c(1) might not be, or at least the analyser won't be able to tell since it can't expand pp tokens like a compiler. And UINT64_C(1u) will likely not work, since this macro implementation probably looks something like this:
#define UINT64_C(n) ((uint_least64_t) n ## ull)
// BAD: 1u##ull = 1uull
In general, I would recommend to use an explicit cast. Or better yet wrap all of this inside a named constant:
#define MY_BIT ( (uint64_t)1u << 60 )

(uint64_t)1 is formally an int value 1 casted to uint64_t, whereas 1ul is a constant 1 of type unsigned long which is probably the same as uint64_t on a 64-bit system. As you are dealing with constants, all calculations will be done by the compiler and the result is the same.
The macro is a portable way to specify the correct suffix for a constant (literal) of type uint64_t. The suffix appended by the macro (ul, system specific) can be used for literal constants only.
The cast (uint64_t) can be used for both constant and variable values. With a constant, it will have the same effect as the suffix or suffix-adding macro, whereas with a variable of a different type it may perform a truncation or extension of the value (e.g., fill the higher bits with 0 when changing from 32 bits to 64 bits).
Whether to use UINT64_C(1) or (uint64_t)1 is a matter of taste. The macro makes it a bit more clear that you are dealing with a constant.
As mentioned in a comment, 1ul is a uint32_t, not a uint64_t on windows system. I expect that the macro UINT64_C will append the platform-specific suffix corresponding to uint64_t, so it might append uLL in this case. See also https://stackoverflow.com/a/52490273/10622916.

UINT64_C(1) produces a single token via token pasting, whereas ((uint64_t)1) is a constant expression with the same value.
They can be used interchangeably in the sample code posted, but not in preprocessor directives such as #if expressions.
XXX_C macros should be used to define constants that can be used in #if expressions. They are only needed if the constant must have a specific type, otherwise just spelling the constant in decimal or hexadecimal without a suffix is sufficient.

Defending "U" suffix after Hex literals

There is some debate between my colleague and I about the U suffix after hexadecimally represented literals. Note, this is not a question about the meaning of this suffix or about what it does. I have found several of those topics here, but I have not found an answer to my question.
Some background information:
We're trying to come to a set of rules that we both agree on, to use that as our style from that point on. We have a copy of the 2004 Misra C rules and decided to use that as a starting point. We're not interested in being fully Misra C compliant; we're cherry picking the rules that we think will most increase efficiency and robustness.
Rule 10.6 from the aforementioned guidelines states:
A “U” suffix shall be applied to all constants of unsigned type.
I personally think this is a good rule. It takes little effort, looks better than explicit casts and more explicitly shows the intention of a constant. To me it makes sense to use it for all unsigned contants, not just numerics, since enforcing a rule doesn't happen by allowing exceptions, especially for a commonly used representation of constants.
My colleague, however, feels that the hexadecimal representation doesn't need the suffix. Mostly because we almost exclusively use it to set micro-controller registers, and signedness doesn't matter when setting registers to hex constants.
My Question
My question is not one about who is right or wrong. It is about finding out whether there are cases where the absence or presence of the suffix changes the outcome of an operation. Are there any such cases, or is it a matter of consistency?
Edit: for clarification; Specifically about setting micro-controller registers by assigning hexadecimal values to them. Would there be a case where the suffix could make a difference there? I feel like it wouldn't. As an example, the Freescale Processor Expert generates all register assignments as unsigned.

Appending a U suffix to all hexadecimal constants makes them unsigned as you already mentioned. This may have undesirable side-effects when these constants are used in operations along with signed values, especially comparisons.
Here is a pathological example:
#define MY_INT_MAX 0x7FFFFFFFU // blindly applying the rule
if (-1 < MY_INT_MAX) {
printf("OK\n");
} else {
printf("OOPS!\n");
}
The C rules for signed/unsigned conversions are precisely specified, but somewhat counter-intuitive so the above code will indeed print OOPS.
The MISRA-C rule is precise as it states A “U” suffix shall be applied to all constants of unsigned type. The word unsigned has far reaching consequences and indeed most constants should not really be considered unsigned.
Furthermore, the C Standard makes a subtile difference between decimal and hexadecimal constants:
A hexadecimal constant is considered unsigned if its value can be represented by the unsigned integer type and not the signed integer type of the same size for types int and larger.
This means that on 32-bit 2's complement systems, 2147483648 is a long or a long long whereas 0x80000000 is an unsigned int. Appending a U suffix may make this more explicit in this case but the real precaution to avoid potential problems is to mandate the compiler to reject signed/unsigned comparisons altogether: gcc -Wall -Wextra -Werror or clang -Weverything -Werror are life savers.
Here is how bad it can get:
if (-1 < 0x8000) {
printf("OK\n");
} else {
printf("OOPS!\n");
}
The above code should print OK on 32-bit systems and OOPS on 16-bit systems. To make things even worse, it is still quite common to see embedded projects use obsolete compilers which do not even implement the Standard semantics for this issue.
For your specific question, the defined values for micro-processor registers used specifically to set them via assignment (assuming these registers are memory-mapped), need not have the U suffix at all. The register lvalue should have an unsigned type and the hex value will be signed or unsigned depending on its value, but the operation will proceed the same. The opcode for setting a signed number or an unsigned number is the same on your target architecture and on any architectures I have ever seen.

With all integer-constants
Appending u/U insures the integer-constant will be some unsigned type.
Without a u/U
For a decimal-constant, the integer-constant will be some signed type.
For a hexadecimal/octal-constant, the integer-constant will be signed or unsigned type, depending of value and integer type ranges.
Note: All integer-constants have positive values.
// +-------- unary operator
// |+-+----- integer-constant
int x = -123;
absence or presence of the suffix changes the outcome of an operation?
When is this important?
With various expressions, the sign-ness and width of the math needs to be controlled and preferable not surprising.
// Examples: assume 32-bit `unsigned`, `long`, 64-bit `long long`
// Bad signed int overflow (UB)
unsigned a = 4000 * 1000 * 1000;
// OK
unsigned b = 4000u * 1000 * 1000;
// undefined behavior
unsigned c = 1 << 31
// OK
unsigned d = 1u << 31
printf("Size %zu\n", sizeof(0xFFFFFFFF)); // 8 type is `long long`
printf("Size %zu\n", sizeof(0xFFFFFFFFu)); // 4 type is `unsigned`
// 2 ** 63
long long e = -9223372036854775808; // C99: bad "9223372036854775808" not representable
long long f = -9223372036854775807 - 1; // ok
long long g = -9223372036854775808u; // implementation defined behavior **
some_unsigned_type h_max = -1; OK, max value for the target type.
some_unsigned_type i_max = -1u; OK, but not max value for wide unsigned types
// when negating a negative `int`
unsigned j = 0 - INT_MIN; // typically int overflow or UB
unsigned k = 0u - INT_MIN; // Never UB
** or an implementation-defined signal is raised.

For the specific question, which was loading register(s), then the U makes it an unsigned value, but whether the compiler treats the n-bit word pattern as a signed or unsigned value it will move the same bit pattern, assuming there isn't any size extension that would propagate an MSB. The difference that might matter is if the register load operation will set any processor condition flags based on a signed or unsigned loading. As an overall guide if the processor supports storing a constant to configuration register or a memory address then loading a peripheral register is unlikely to set the processor's NEG condition flag. Loading a general purpose register connected to an ALU, one that can be the target of an arithmetic operation like add increment or decrement, might set a negative flag on loading so that e.g. a trailing "branch (if) negative" opcode would execute the branch. You would want to check the processor's references to be sure. Small instruction set processors tend to have only a load register instruction, while larger instruction sets are more likely to have a load unsigned variant of the load instruction that doesn't set the NEG bit in the processor's flags, but again, check the processor's references. if you don't have access to the processor's errata (the boo-boo list) and need a specific flag state. All of this only tends to come up when an optimizing compiler re-aranges code with an inline assembly instruction and other uncommon situations. Examine the generate assembly code, turn off some or all compiler optimizations for the module when needed, etc.

When should I use UINT32_C(), INT32_C(),... macros in C?

I switched to fixed-length integer types in my projects mainly because they help me think about integer sizes more clearly when using them. Including them via #include <inttypes.h> also includes a bunch of other macros like the printing macros PRIu32, PRIu64,...
To assign a constant value to a fixed length variable I can use macros like UINT32_C() and INT32_C(). I started using them whenever I assigned a constant value.
This leads to code similar to this:
uint64_t i;
for (i = UINT64_C(0); i < UINT64_C(10); i++) { ... }
Now I saw several examples which did not care about that. One is the stdbool.h include file:
#define bool _Bool
#define false 0
#define true 1
bool has a size of 1 byte on my machine, so it does not look like an int. But 0 and 1 should be integers which should be turned automatically into the right type by the compiler. If I would use that in my example the code would be much easier to read:
uint64_t i;
for (i = 0; i < 10; i++) { ... }
So when should I use the fixed length constant macros like UINT32_C() and when should I leave that work to the compiler(I'm using GCC)? What if I would write code in MISRA C?

As a rule of thumb, you should use them when the type of the literal matters. There are two things to consider: the size and the signedness.
Regarding size:
An int type is guaranteed by the C standard values up to 32767. Since you can't get an integer literal with a smaller type than int, all values smaller than 32767 should not need to use the macros. If you need larger values, then the type of the literal starts to matter and it is a good idea to use those macros.
Regarding signedness:
Integer literals with no suffix are usually of a signed type. This is potentially dangerous, as it can cause all manner of subtle bugs during implicit type promotion. For example (my_uint8_t + 1) << 31 would cause an undefined behavior bug on a 32 bit system, while (my_uint8_t + 1u) << 31 would not.
This is why MISRA has a rule stating that all integer literals should have an u/U suffix if the intention is to use unsigned types. So in my example above you could use my_uint8_t + UINT32_C(1) but you can as well use 1u, which is perhaps the most readable. Either should be fine for MISRA.
As for why stdbool.h defines true/false to be 1/0, it is because the standard explicitly says so. Boolean conditions in C still use int type, and not bool type like in C++, for backwards compatibility reasons.
It is however considered good style to treat boolean conditions as if C had a true boolean type. MISRA-C:2012 has a whole set of rules regarding this concept, called essentially boolean type. This can give better type safety during static analysis and also prevent various bugs.

It's for using smallish integer literals where the context won't result in the compiler casting it to the correct size.
I've worked on an embedded platform where int is 16 bits and long is 32 bits. If you were trying to write portable code to work on platforms with either 16-bit or 32-bit int types, and wanted to pass a 32-bit "unsigned integer literal" to a variadic function, you'd need the cast:
#define BAUDRATE UINT32_C(38400)
printf("Set baudrate to %" PRIu32 "\n", BAUDRATE);
On the 16-bit platform, the cast creates 38400UL and on the 32-bit platform just 38400U. Those will match the PRIu32 macro of either "lu" or "u".
I think that most compilers would generate identical code for (uint32_t) X as for UINT32_C(X) when X is an integer literal, but that might not have been the case with early compilers.

will ~ operator change the data type?

When I read someone's code I find that he bothered to write an explicite type cast.
#define ULONG_MAX ((unsigned long int) ~(unsigned long int) 0)
When I write code
1 #include<stdio.h>
2 int main(void)
3 {
4 unsigned long int max;
5 max = ~(unsigned long int)0;
6 printf("%lx",max);
7 return 0;
8 }
it works as well. Is it just a meaningless coding style?

The code you read is very bad, for several reasons.
First of all user code should never define ULONG_MAX. This is a reserved identifier and must be provided by the compiler implementation.
That definition is not suitable for use in a preprocessor #if. The _MAX macros for the basic integer types must be usable there.
(unsigned long)0 is just crap. Everybody should just use 0UL, unless you know that you have a compiler that is not compliant with all the recent C standards with that respect. (I don't know of any.)
Even ~0UL should not be used for that value, since unsigned long may (theoretically) have padding bits. -1UL is more appropriate, because it doesn't deal with the bit pattern of the value. It uses the guaranteed arithmetic properties of unsigned integer types. -1 will always be the maximum value of an unsigned type. So ~ may only be used in a context where you are absolutely certain that unsigned long has no padding bits. But as such using it makes no sense. -1 serves better.
"recasting" an expression that is known to be unsigned long is just superfluous, as you observed. I can't imagine any compiler that bugs on that.
Recasting of expression may make sense when they are used in the preprocessor, but only under very restricted circumstances, and they are interpreted differently, there.
#if ((uintmax_t)-1UL) == SOMETHING
..
#endif
Here the value on the left evalues to UINTMAX_MAX in the preprocessor and in later compiler phases. So
#define UINTMAX_MAX ((uintmax_t)-1UL)
would be an appropriate definition for a compiler implementation.
To see the value for the preprocessor, observe that there (uintmax_t) is not a cast but an unknown identifier token inside () and that it evaluates to 0. The minus sign is then interpreted as binary minus and so we have 0-1UL which is unsigned and thus the max value of the type. But that trick only works if the cast contains a single identifier token, not if it has three as in your example, and if the integer constant has a - or + sign.

They are trying to ensure that the type of the value 0 is unsigned long. When you assign zero to a variable, it gets cast to the appropriate type.
In this case, if 0 doesn't happen to be an unsigned long then the ~ operator will be applied to whatever other type it happens to be and the result of that will be cast.
This would be a problem if the compiler decided that 0 is a short or char.
However, the type after the ~ operator should remain the same. So they are being overly cautious with the outer cast, but perhaps the inner cast is justified.
They could of course have specified the correct zero type to begin with by writing ~0UL.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight