Legal to initialize uint8_t array with string literal? [duplicate] - c

This question already has an answer here:
Why is it ok to use a string literal to initialize an unsigned char array but not to initialize an unsigned char pointer?
(1 answer)
Closed 1 year ago.
Is it OK to initialize a uint8_t array from a string literal? Does it work as expected or does it mangle some bytes due to signed-unsigned conversion? (I want it to just stuff the literal's bits in there unchanged.) GCC doesn't complain with -Wall and it seems to work.
const uint8_t hello[] = "Hello World"
I am using an API that takes a string as uint8_t *. Right now I am using a cast, otherwise I would get a warning:
const char* hello = "Hello World\n"
HAL_UART_Transmit(uart, (uint8_t *)hello, 12, 50);
// HAL_UART_Transmit(uart, hello, 12, 50);
// would give a warning such as:
// pointer targets in passing argument 2 of 'HAL_UART_Transmit' differ in signedness [-Wpointer-sign]
On this platform, char is 8 bits and signed. Is it under that circumstance OK to use uint8_t instead of char? Please don't focus on the constness issue, the API should take const uint8_t * but doesn't. This API call is just the example that brought me to this question.
Annoyingly this question is now closed, I would like to answer it myself. Apologies for adding this info here, I don't have the permission to reopen.
All of the following work with gcc -Wall -pedantic, but the fourth warns about converting signed to unsigned. The bit pattern in memory will be identical, and if you cast such an object to (uint8_t *) it will have the same behavior. According to the marked duplicate, this is because you may assign string literals to any char array.
const char string1[] = "Hello";
const uint8_t string2[] = "Hello";
uint8_t string3[] = "Hello";
uint8_t* string4 = "Hello";
char* string5 = "Hello";
Of course, only the first two are recommendable, since you shouldn't attempt to modify string literals. In the concrete case above, you could either create a wrapper function/macro, or just leave the cast inside as a concession to the API and call it a day.

C 2018 6.7.9 14 tells us “An array of character type may be initialized by a character string literal or UTF–8 string literal…”
C 2018 6.2.5 15 tells us “The three types char, signed char, and unsigned char are collectively called the character types.”
C 2018 6.2.5 4 and 6.2.5 6 says there may be extended integer types.
There is no statement that any extended integer types are character types.
C 2018 7.20 4 tells us “For each type described herein that the implementation provides, <stdint.h> shall declare that typedef name…” and 7.20.1 5 tells us “When typedef names differing only in the absence or presence of the initial u are defined, they shall denote corresponding signed and unsigned types as described in 6.2.5…”
Therefore, a C implementation could provide an unsigned 8-bit type that is an extended integer type, not an unsigned char, and may define uint8_t to be this type, and then 6.7.9 14 does not tell us that an array of this type may be initialized by a character string literal.
If an implementation is allowing you to initialize an array of uint8_t with a string literal, then either it defines uint8_t to be an unsigned char or to be unsigned char, or it defines uint8_t to be an extended integer type but allows you to initialize the array as an extension to the C standard. It would be up to the C implementation to define the behavior of that extension, but I would expect it to work just as initializing for an array of character type.
(Conceivable, defining uint8_t to be an extended integer type and disallowing its treatment as a character type could be useful for distinguish the character types, which are allowed to alias any objects, from pure integer types, which would not allow such aliasing. This might allow the compiler to perform additional optimizations, since it would know the aliasing could not occur, or possibly to diagnose certain errors.)
The elements of a string literal have type char (by C 2018 5.2.1 6). C 2018 6.7.9 14 tells us that “Successive bytes of the string literal… initialize the elements of the array.” Each byte should initialize an array element in the usual way, including conversion to the destination type per C 2018 6.7.9 11. For the string you show, "Hello World", the character values are all non-negative, so there is no issue in converting their char values to uint8_t. If you had negative characters in the string, they should be converted to uint8_t in the usual way.
(If you have octal or hexadecimal escape sequences that have values not represented in a char, there could be some language-lawyer weirdness in the initialization.)

Related

Why do we have the type of char in C, if a character literal is always of type int? Isn´t the whole type of char in C redundant?

As opposed to C++, In C, a character literal is implemented as to be always of type int.
But why we have then the type of char for holding a character value?
In the question Why are C character literals ints instead of chars?,
it is discussed, why character literals are of type int in C. But this is not what my question is about.
Inside the question If character constants are of type `int', why are they assigned to variables of type `char`? then it is going a little more into the deep with the question, why we actually assign character literals to variables of type char if they are of type int, but the provided answers left the concern, why we need the type of char in general.
My Questions are now:
Why we have the type of char if any character literals are always of int type?
Isn´t the type of char redundant then?
What is the purpose of type char, if it is seemingly redundant?
Just be cause a character constant in C source code has type int doesn't mean that the type char has no use.
The type char occupies 1 byte of space. So you can use it anyplace where the values are in the range of a char, which includes ASCII characters. You can read and write those characters from either the console or a file as single byte entities. The fact that a character constant in source code has a different type doesn't change that.
Using char in an array also means you're using less memory than if you had an array of int, which can be useful in situations where space is at a premium. This is especially true if you're using it as a binary format to store data on disk or send it over a network.
A char * can also be used to access the individual bytes of any object if you need to see how that object is represented.
The type char allows to address each byte (the smallest addressable unit of a CPU). So for example it allows to specify a memory extent of any number of bytes for example for using in memcpy or memmove.
Also how to declare a character array without the type char?
If you declare it as an integer array when there will be redundant allocated memory.
Why do we have the type of char in C, if a character literal is always of type int?
char, unsigned, char, signed char are the minimal size object. Character literal constants are type int for simplicity of the language and no strong need otherwise. (C++ choose a different path - computers could handle more complex things 20 years later.) There are no integer constants narrower than int.
Isn't the whole type of char in C redundant?
Why we have the type of char if any character literals are always of int type?
Isn't the type of char redundant then?
No. Object sizes benefit with a variety of sizes, constants less so.
What is the purpose of type char, if it is seemingly redundant?
Concerning int and constants, char is not redundant. Concerning signed char, unsigned char, char is redundant and reflects a compromise of early implementations of char as unsigned or signed. This allows char to be signed (which is symmetric other integer type lacking signed or unsigned as conceptually characters are usually thought as.
Code can form a compound literal of type char if a "char` literal" is needed.
char a = (char){'B'};

C programming preferring uint8 over char

The code I am handling has a lot of castings that are being made from uint8 to char, and then the C library functions are called upon this castings.I was trying to understand why would the writer prefer uint8 over char.
For example:
uint8 *my_string = "XYZ";
strlen((char*)my_string);
What happens to the \0, is it added when I cast?
What happens when I cast the other way around?
Is this a legit way to work, and why would anybody prefer working with uint8 over char?
The casts char <=> uint8 are fine. It is always allowed to access any defined memory as unsigned characters, including string literals, and then of course to cast a pointer that points to a string literal back to char *.
In
uint8 *my_string = "XYZ";
"XYZ" is an anonymous array of 4 chars - including the terminating zero. This decays into a pointer to the first character. This is then implicitly converted to uint8 * - strictly speaking, it should have an explicit cast though.
The problem with the type char is that the standard leaves it up to the implementation to define whether it is signed or unsigned. If there is lots of arithmetic with the characters/bytes, it might be beneficial to have them unsigned by default.
A particularly notorious example is the <ctype.h> with its is* character class functions - isspace, isalpha and the like. They require the characters as unsigned chars (converted to int)! A piece of code that does the equivalent of char c = something(); if (isspace(c)) { ... } is not portable and a compiler cannot even warn about this! If the char type is signed on the platform (default on x86!) and the character isn't ASCII (or, more properly, a member of the basic execution character set), then the behaviour is undefined - it would even abort on MSVC debug builds, but unfortunately just causes silent undefined behaviour (array access out of bounds) on glibc.
However, a compiler would be very loud about using unsigned char * or its alias as an argument to strlen, hence the cast.

In C11, string literals as char[], unsigned char[], char* and unsigned char*

Usually string literals is type of const char[]. But when I treat it as other type I got strange result.
unsigned char *a = "\355\1\23";
With this compiler throw warning saying "pointer targets in initialization differ in signedness", which is quite reasonable since sign information can be discarded.
But with following
unsigned char b[] = "\355\1\23";
There's no warning at all. I think there should be a warning for the same reason above. How can this be possible?
FYI, I use GCC version 4.8.4.
The type of string literals in C is char[], which decays to char*. Note that C is different from C++, where they are of type const char[].
In the first example, you try to assign a char* to an unsigned char*. These are not compatible types, so you get a compiler diagnostic message.
In the second example, the following applies, C11 6.7.9/14:
An array of character type may be initialized by a character string literal or UTF−8 string
literal, optionally enclosed in braces. Successive bytes of the string literal (including the
terminating null character if there is room or if the array is of unknown size) initialize the
elements of the array.
Meaning that the code is identical to this:
unsigned char b[] =
{
'\355',
'\1',
'\23',
'\0'
};
This may yield warnings too, but is valid code. C has lax type safety when it comes to assignment1 between different integer types, but much stricter when it comes to assignment between pointer types.
For the same reason as we can write unsigned int x=1; instead of unsigned int x=1u;.
As a side note, I have no idea what you wish to achieve with an octal escape sequence of value 355. Perhaps you meant to write "\35" "5\1\23"?
1 The type rules of initialization are the same as for assignment. 6.5.16.1 "Simple assignment" applies.
The first is the initialization of a pointer, the target types of pointers must agree on signedness.
The second is the initialization of an array. The special rules for initialization with string literals have it that the value of each character of the literal is taken to initialize the individual elements of the array.
BTW, other than you state, string literals are not const qualified in C. You don't have the right to modify them, but this is not reflected in the type.

Quoted initializer for unsigned char array in C

This seems to work in GCC and Visual C without comment:
static const **unsigned** char foo[] = "bar";
This is a salt being used in a unit test. There are other ways to do it, but this is simplest and involves the least casting down the line.
Will this cause trouble with other compilers?
This is safe, at least if you're using a conforming C compiler.
N1570 6.7.9 paragraph 14 says:
An array of character type may be initialized by a character string
literal or UTF−8 string literal, optionally enclosed in braces.
Successive bytes of the string literal (including the terminating null
character if there is room or if the array is of unknown size)
initialize the elements of the array.
The C90 standard has essentially the same wording, so you don't need to worry about older compilers.
The character types are char, signed char, and unsigned char.
Interestingly, there's no corresponding guarantee for pointer initialization, so this:
const char *ptr = "hello";
is safe, but this:
const unsigned char *uptr = "hello";
is not -- and there doesn't seem to be a simple workaround.

Using unsigned char* with string methods such as strcpy

I recalled now at some places in my code I might have passed
unsigned char* variables as parameters to functions such as strcpy and strtok -- which expect char *. My question is: is it a bad idea? Could it have caused issues?
e.g.
unsigned char * x = // .... some val, null terminated
unsigned char * y = // ... same here;
strcpy(x,y); // ps assuming there is space allocated for x
e.g., unsigned char * x = strtok(NULL,...)
It's guaranteed to be ok (after you cast the pointer), because the "Strict Aliasing Rule" has a special exception for looking at the same object via both signed and unsigned variants.
See here for the rule itself. Other answers on that page explain it.
The C aliasing rules have exceptions for signed/unsigned variants and for char access in general. So no trouble here.
Quote from the standard:
An object shall have its stored value accessed only by an lvalue expression that has one of
the following types:88)
— a type compatible with the effective type of the object,
— a qualified version of a type compatible with the effective type of the object,
— a type that is the signed or unsigned type corresponding to the effective type of the
object,
— a type that is the signed or unsigned type corresponding to a qualified version of the
effective type of the object,
— an aggregate or union type that includes one of the aforementioned types among its
members (including, recursively, a member of a subaggregate or contained union), or
— a character type.
All standard library functions treat any char arguments as unsigned char, so passing char*, unsigned char* or signed char* is treated the same.
Quote from the intro of <string.h>:
For all functions in this subclause, each character shall be interpreted as if it had the type
unsigned char (and therefore every possible object representation is valid and has a
different value).
Still, your compiler should complain if you get the signed-ness wrong, especially if you enable all warnings (you should, always).
The only problem with converting unsigned char * into char * (or vice versa) is that it's supposed to be an error. Fix it with a cast.
e.g,
function((char *) buff, len);
That being said, strcpy needs to have the null-terminating character (\0) to properly work. The alternative is to use memcpy.
But you shouldn't use unsigned char arrays with string handling functions. In C strings are char arrays, not unsigned char arrays. Since passing to strcpy discards the unsigned qualifier, the compiler warns.
As a general rule, don't make things unsigned when you don't have to.

Resources