C programming: preferring uint8 over char

The code I am maintaining contains a lot of casts from uint8 to char, after which C library functions are called on the results. I was trying to understand why the writer would prefer uint8 over char.
For example:
uint8 *my_string = "XYZ";
strlen((char*)my_string);
What happens to the \0, is it added when I cast?
What happens when I cast the other way around?
Is this a legit way to work, and why would anybody prefer working with uint8 over char?

The casts char <=> uint8 are fine. It is always allowed to access any defined memory as unsigned characters, including string literals, and then of course to cast a pointer that points to a string literal back to char *.
In
uint8 *my_string = "XYZ";
"XYZ" is an anonymous array of 4 chars, including the terminating zero. It decays into a pointer to the first character, which is then converted to uint8 *. Strictly speaking, this initialization requires an explicit cast; without one, it is a constraint violation that most compilers merely warn about.
The problem with the type char is that the standard leaves it up to the implementation to define whether it is signed or unsigned. If there is lots of arithmetic with the characters/bytes, it might be beneficial to have them unsigned by default.
A particularly notorious example is <ctype.h> with its is* character-class functions, isspace, isalpha and the like. They require their argument as an unsigned char value (converted to int)! Code that does the equivalent of char c = something(); if (isspace(c)) { ... } is not portable, and a compiler cannot even warn about it. If the char type is signed on the platform (the default on x86!) and the character isn't ASCII (or, more properly, a member of the basic execution character set), the behaviour is undefined. MSVC debug builds will even abort on this; glibc, unfortunately, just exhibits silent undefined behaviour (an out-of-bounds array access).
However, a compiler would be very loud about using unsigned char * or its alias as an argument to strlen, hence the cast.

Related

Legal to initialize uint8_t array with string literal? [duplicate]

This question already has an answer here:
Why is it ok to use a string literal to initialize an unsigned char array but not to initialize an unsigned char pointer?
Is it OK to initialize a uint8_t array from a string literal? Does it work as expected or does it mangle some bytes due to signed-unsigned conversion? (I want it to just stuff the literal's bits in there unchanged.) GCC doesn't complain with -Wall and it seems to work.
const uint8_t hello[] = "Hello World";
I am using an API that takes a string as uint8_t *. Right now I am using a cast, otherwise I would get a warning:
const char* hello = "Hello World\n";
HAL_UART_Transmit(uart, (uint8_t *)hello, 12, 50);
// HAL_UART_Transmit(uart, hello, 12, 50);
// would give a warning such as:
// pointer targets in passing argument 2 of 'HAL_UART_Transmit' differ in signedness [-Wpointer-sign]
On this platform, char is 8 bits and signed. Is it under that circumstance OK to use uint8_t instead of char? Please don't focus on the constness issue, the API should take const uint8_t * but doesn't. This API call is just the example that brought me to this question.
Annoyingly, this question is now closed, but I would like to answer it myself. Apologies for adding this info here; I don't have the permission to reopen.
All of the following work with gcc -Wall -pedantic, but the fourth warns that the pointer targets differ in signedness. The bit pattern in memory will be identical in all cases, and if you cast such an object to (uint8_t *) it will behave the same. According to the marked duplicate, this is because a string literal may initialize any array of character type.
const char string1[] = "Hello";
const uint8_t string2[] = "Hello";
uint8_t string3[] = "Hello";
uint8_t* string4 = "Hello";
char* string5 = "Hello";
Of course, string4 and string5 are the problematic ones: they point directly at the string literal, which you must never attempt to modify. The array forms copy the literal's bytes into the array, so even string3 may legally be modified. In the concrete case above, you could either create a wrapper function/macro, or just leave the cast in as a concession to the API and call it a day.
C 2018 6.7.9 14 tells us “An array of character type may be initialized by a character string literal or UTF–8 string literal…”
C 2018 6.2.5 15 tells us “The three types char, signed char, and unsigned char are collectively called the character types.”
C 2018 6.2.5 4 and 6.2.5 6 says there may be extended integer types.
There is no statement that any extended integer types are character types.
C 2018 7.20 4 tells us “For each type described herein that the implementation provides, <stdint.h> shall declare that typedef name…” and 7.20.1 5 tells us “When typedef names differing only in the absence or presence of the initial u are defined, they shall denote corresponding signed and unsigned types as described in 6.2.5…”
Therefore, a C implementation could provide an unsigned 8-bit type that is an extended integer type, not an unsigned char, and may define uint8_t to be this type, and then 6.7.9 14 does not tell us that an array of this type may be initialized by a character string literal.
If an implementation is allowing you to initialize an array of uint8_t with a string literal, then either it defines uint8_t to be unsigned char, or it defines uint8_t to be an extended integer type but allows you to initialize the array as an extension to the C standard. It would be up to the C implementation to define the behavior of that extension, but I would expect it to work just as initializing an array of character type does.
(Conceivably, defining uint8_t to be an extended integer type and disallowing its treatment as a character type could be useful for distinguishing the character types, which are allowed to alias any objects, from pure integer types, which would not allow such aliasing. This might allow the compiler to perform additional optimizations, since it would know the aliasing could not occur, or possibly to diagnose certain errors.)
The elements of a string literal have type char (by C 2018 5.2.1 6). C 2018 6.7.9 14 tells us that “Successive bytes of the string literal… initialize the elements of the array.” Each byte should initialize an array element in the usual way, including conversion to the destination type per C 2018 6.7.9 11. For the string you show, "Hello World", the character values are all non-negative, so there is no issue in converting their char values to uint8_t. If you had negative characters in the string, they should be converted to uint8_t in the usual way.
(If you have octal or hexadecimal escape sequences that have values not represented in a char, there could be some language-lawyer weirdness in the initialization.)

In C11, string literals as char[], unsigned char[], char* and unsigned char*

Usually string literals are of type const char[]. But when I treat one as another type, I get strange results.
unsigned char *a = "\355\1\23";
With this, the compiler throws a warning saying "pointer targets in initialization differ in signedness", which is quite reasonable, since sign information could be discarded.
But with the following
unsigned char b[] = "\355\1\23";
there's no warning at all. I think there should be a warning for the same reason as above. How is this possible?
FYI, I use GCC version 4.8.4.
The type of string literals in C is char[], which decays to char*. Note that C is different from C++, where they are of type const char[].
In the first example, you try to assign a char* to an unsigned char*. These are not compatible types, so you get a compiler diagnostic message.
In the second example, the following applies, C11 6.7.9/14:
An array of character type may be initialized by a character string literal or UTF−8 string
literal, optionally enclosed in braces. Successive bytes of the string literal (including the
terminating null character if there is room or if the array is of unknown size) initialize the
elements of the array.
Meaning that the code is identical to this:
unsigned char b[] =
{
    '\355',
    '\1',
    '\23',
    '\0'
};
This may yield warnings too, but is valid code. C has lax type safety when it comes to assignment [1] between different integer types, but is much stricter when it comes to assignment between pointer types.
For the same reason as we can write unsigned int x=1; instead of unsigned int x=1u;.
As a side note, I have no idea what you wish to achieve with an octal escape sequence of value 355. Perhaps you meant to write "\35" "5\1\23"?
[1] The type rules of initialization are the same as for assignment. 6.5.16.1 "Simple assignment" applies.
The first is the initialization of a pointer, the target types of pointers must agree on signedness.
The second is the initialization of an array. The special rules for initialization with string literals have it that the value of each character of the literal is taken to initialize the individual elements of the array.
BTW, contrary to what you state, string literals are not const qualified in C. You don't have the right to modify them, but this is not reflected in the type.

Is it safe to convert char * to const unsigned char *?

We are using char * and the lib we use is using const unsigned char *, so we convert to const unsigned char *
// lib
int asdf(const unsigned char *vv);
//ours
char *str = "somestring";
asdf((const unsigned char*)str);
Is it safe? any pitfall?
It is safe.
char *str = "somestring";
str points to a string literal; you can change str to point to another string:
str = "some"; // right
but you cannot modify the string str currently points to:
str[0] = 'q'; // wrong
So it is safe to use const when converting a pointer to constant data.
If asdf() only uses the string for display, like printf() or puts(), you don't strictly need the const, because the string is not modified.
Using const is still safer: when implementing asdf(), it makes sure you can't write wrong code like str[0] = 'q', because it won't compile.
Without const, you would only find the error when running the program.
If it's being treated as a string by that interface, there should be no problem at all. You don't even need to add the const in your cast if you don't want to - that part is automatic.
Technically, it is always safe to convert pointers from one type to another (except between function pointers and data pointers, unless your implementation provides an extension that allows that). It's only the dereferencing that is unsafe.
But dereferencing a pointer to any signedness of char is always safe (from a type aliasing perspective, at least).
It probably won't break anything in the compiler if you pass a char* where an unsigned char* was expected by using a cast. The const part is unproblematic -- it's just indicating that the function won't modify the argument via its pointer.
In many cases it doesn't matter whether char values are treated as signed or unsigned -- it's usually only a problem when performing arithmetic or comparing the sizes of values. However, if the function is expressly defined to take an unsigned char*, I guess there's a chance that it really requires the input data to be unsigned, for some arithmetical reason. If you're treating your character data as signed elsewhere, then it's possible that there is an incompatibility between your data and the data expected by the function.
In many cases, however, developers write "unsigned" to mean "I will not be doing arithmetic on this data", so the signedness probably won't matter.

What is the best way to represent characters in C?

I know that a char is allowed to be signed or unsigned depending on the implementation. This doesn't really bother me if all I want to do is manipulate bytes. (In fact, I don't think of the char datatype as a character, but a byte).
But, if I understand correctly, string literals are signed chars (actually they're not, but see the update below), and the function fgetc() returns unsigned chars cast to int. So if I want to manipulate characters, is it preferred style to use signed, unsigned, or ambiguous characters? Why does reading characters from a file follow a different convention than literals?
I ask because I have some code in c that does string comparison between string literals and the contents of files, but having a signed char * vs unsigned char * might really make my code error prone.
Update 1
Ok as a few people pointed out (in answers and comments) string literals are in fact char arrays, not signed char arrays. That means I really should use char * for string literals, and not think about whether they are signed or unsigned. This makes me perfectly happy (until I have to start making conversion/comparisons with unsigned chars).
However the important question remains, how do I read characters from a file, and compare them to a string literal. The crux of which is the conversion from the int read using fgetc(), which explicitly reads an unsigned char from the file, to the char type, which is allowed to be either signed or unsigned.
Allow me to provide a more detailed example.
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *someFile = fopen("ThePathToSomeRealFile.html", "r");
    assert(someFile);
    char substringFromFile[25];
    memset((void*)substringFromFile, 0, sizeof(substringFromFile));
    //Alright, the real example is to read the first few characters from the file
    //And then compare them to the string I expect
    const char *expectedString = "<!DOCTYPE";
    for( size_t counter = 0; counter < strlen(expectedString); ++counter )
    {
        //Read it as an integer, because the function returns an `int`
        const int oneCharacter = fgetc(someFile);
        if( ferror(someFile) )
            return EXIT_FAILURE;
        if( oneCharacter == EOF || feof(someFile) )
            break;
        assert(counter < sizeof(substringFromFile)/sizeof(*substringFromFile));
        //HERE IS THE PROBLEM:
        //I know the data contained in oneCharacter must be an unsigned char
        //Therefore, this is valid
        const unsigned char uChar = (unsigned char)oneCharacter;
        //But then how do I assign it to the char?
        substringFromFile[counter] = (char)oneCharacter;
    }
    //and ultimately here's my goal
    int headerIsCorrect = strncmp(substringFromFile, expectedString, 9);
    if(headerIsCorrect != 0)
        return EXIT_SUCCESS;
    //else
    return EXIT_FAILURE;
}
Essentially, I know my fgetc() function is returning something that (after some error checking) is representable as an unsigned char. I know that char may or may not be unsigned char. That means, depending on the implementation of the C standard, a cast to char may involve no reinterpretation at all. However, if the system implements char as signed, I have to worry about values that are representable as an unsigned char but not as a char (i.e. those in the range (INT8_MAX, UINT8_MAX]).
tl;dr
The question is this: should I (1) copy the underlying data read by fgetc() (by casting pointers - don't worry, I know how to do that), or (2) convert down from unsigned char to char (which is only safe if I know the values can't exceed INT8_MAX, or those values can be ignored for whatever reason)?
The historical reasons are (as I've been told, I don't have a reference) that the char type was poorly specified from the beginning.
Some implementations used "consistent integer types" where char, short, int and so on were all signed by default. This makes sense because it makes the types consistent with each other.
Other implementations used unsigned characters, since there never existed any symbol tables with negative indices (that would be stupid) and since they saw a need for more than 128 characters (a very valid concern).
By the time C got standardized properly, it was too late to change this, too many different compilers and programs written for them were already out on the market. So the signedness of char was made implementation-defined, for backwards compatibility reasons.
The signedness of char does not matter if you only use it to store characters/strings. It only matters when you decide to involve the char type in arithmetic expressions or use it to store integer values - this is a very bad idea.
For characters/string, always use char (or wchar_t).
For any other form of 1 byte large data, always use uint8_t or int8_t.
But, if I understand, string literals are signed char
No, string literals are char arrays.
the function fgetc() returns unsigned chars casted into int
Not quite: it returns the character read, as an unsigned char converted to an int (or EOF). The return type is int because it must also be able to represent EOF, which is an integer constant and not a character constant.
having a signed char * vs unsigned char * might really make my code error prone.
No, not really. Formally, this rule from the standard applies:
A pointer to an object type may be converted to a pointer to a different object type. If the
resulting pointer is not correctly aligned for the referenced type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer.
There exists no case where casting from pointer to signed char to pointer to unsigned char or vice versa, would cause any alignment issues or other issues.
I know that a char is allowed to be signed or unsigned depending on the implementation. This doesn't really bother me if all I want to do is manipulate bytes.
If you're going to do comparison or assign char to other integer types, it should bother you.
But, if I understand, string literals are signed chars
They are of type char[], so if char === unsigned char, all string literals are unsigned char[].
the function fgetc() returns unsigned chars casted into int.
That's correct, and it is required in order to avoid undesired sign extension.
So if I want to manipulate characters, is it preferred style to use signed, unsigned, or ambiguous characters?
For portability, I'd advise following the practice adopted by various libc implementations: use char, but before processing cast to unsigned char (char* to unsigned char*). This way, implicit integer promotions won't turn characters in the range 0x80 to 0xff into negative numbers of wider types.
In short: (signed char)a < (signed char)b is NOT always equivalent to (unsigned char)a < (unsigned char)b. Here is an example.
Why does reading characters from a file have a different convention than literals?
getc() needs a way to return EOF such that it couldn't be confused with any real char.

Pass unsigned char pointer to atoi without cast

On some embedded device, I have passed an unsigned char pointer to atoi without a cast.
unsigned char c[10]="12";
atoi(c);
Question: is it well defined?
I saw somewhere it is ok for string functions, but was not sure about atoi.
Edit: Btw., some concerns have been expressed in one of the answers below that it might not be OK even for string functions such as strcpy - though, if I understood correctly, the author also meant that in practice it can be OK.
Also, while I am here: is it OK to do the following assignment to an unsigned char pointer too? I ask because I used a tool which complains about "Type mismatch (assignment) (ptrs to signed/unsigned)"
unsigned char *ptr = strtok(unscharbuff,"-");
// is assignment also ok to unsigned char?
No, it's not well defined. It's a constraint violation, requiring a compile-time diagnostic. In practice it's very very likely to work as you expect, but it's not guaranteed to do so, and IMHO it's poor style.
The atoi function is declared in <stdlib.h> as:
int atoi(const char *nptr);
You're passing an unsigned char* argument to a function that expects a char* argument. The two types are not compatible, and there is no implicit conversion from one to the other. A conforming compiler may issue a warning (that counts as a diagnostic) and then proceed to generate an executable, but the behavior of that executable is undefined.
As of C99, a call to a function with no visible declaration is a constraint violation, so you can't get away with it by omitting the #include <stdlib.h>.
C does still permit calls to functions with a visible declaration where the declaration is not a prototype (i.e., doesn't define the number or type(s) of the parameters). So, rather than the usual #include <stdlib.h>, you could add your own declaration:
int atoi();
which would permit calling it with an unsigned char* argument.
This will almost certainly "work", and it might be possible to construct an argument from the standard that its behavior is well defined. The char and unsigned char values of '1' and '2' are guaranteed to have the same representation.
But it's far easier to add the cast than to prove that it's not necessary -- or, better yet, to define c as an array of char rather than as an array of unsigned char, since it's intended to hold a string.
unsigned char *ptr = strtok(unscharbuff,"-");
This is also a constraint violation. There is no implicit conversion from unsigned char* to char* for the first argument in the strtok call, and there is no implicit conversion from char* to unsigned char* for the initialization of ptr.
Yes, these will function perfectly fine. Your compiler settings determine whether you get a warning about the type mismatch. I usually compile with -Wall, to turn on all warnings, and then add an explicit cast in the code for each and every case, so that I know I have carefully examined them. The end result is zero errors and zero warnings, and any change that triggers a warning in the future will really stand out rather than get lost among 100 tolerated messages.
