Say I have a UTF-8 encoded string. Inside it, words are delimited using ";".
But every byte (except the ";" separators) inside this string has a value > 128.
Say I store this string inside unsigned char array:
unsigned char buff[]="someutf8string;separated;with;";
Is it safe to pass this buff to the strtok function? (If I just want to extract the words using the ";" symbol.)
My concern is that strtok (and also strcpy) expects char pointers, but some bytes inside my string will have values > 128.
So is this behaviour defined?
No, it is not safe -- but if it compiles it will almost certainly work as expected.
unsigned char buff[]="someutf8string;separated;with;";
This is fine; the standard specifically permits arrays of character type (including unsigned char) to be initialized with a string literal. Successive bytes of the string literal initialize the elements of the array.
strtok(buff, ";")
This is a constraint violation, requiring a compile-time diagnostic. (That's about as close as the C standard gets to saying that something is illegal.)
The first parameter of strtok is of type char*, but you're passing an argument of type unsigned char*. These two pointer types are not compatible, and there is no implicit conversion between them. A conforming compiler may reject your program if it contains a call like this (and, for example, gcc -std=c99 -pedantic-errors does reject it).
Many C compilers are somewhat lax about strict enforcement of the standard's requirements. In many cases, compilers issue warnings for code that contains constraint violations -- which is perfectly valid. But once a compiler has diagnosed a constraint violation and proceeded to generate an executable, the behavior of that executable is not defined by the C standard.
As far as I know, any actual compiler that doesn't reject this call will generate code that behaves just as you expect it to. The pointer types char* and unsigned char* almost certainly have the same representation and are passed the same way as arguments, and the types char and unsigned char are explicitly required to have the same representation for non-negative values. Even for values exceeding CHAR_MAX, like the ones you're using, a compiler would have to go out of its way to generate misbehaving code. You could have problems on a system that doesn't use 2's-complement for signed integers, but you're not likely to encounter such a system.
If you add an explicit cast:
strtok((char*)buff, ";")
then the cast removes the constraint violation and will probably silence any warning -- but the behavior is still strictly undefined.
In practice, though, most compilers try to treat char, signed char, and unsigned char almost interchangeably, partly to cater to code like yours, and partly because they'd have to go out of their way to do anything else.
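Putting it together, a minimal self-contained sketch of the cast-based approach (the buffer content here is placeholder ASCII; in real use the non-separator bytes would be UTF-8 bytes > 127):
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char buff[] = "someutf8string;separated;with;";

    /* The cast removes the constraint violation; char* and unsigned char*
       have the same representation on any implementation you are likely
       to meet, so this behaves as expected in practice. */
    char *tok = strtok((char *)buff, ";");
    while (tok != NULL) {
        printf("token: %s\n", tok);
        tok = strtok(NULL, ";");
    }
    return 0;
}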
According to the C11 Standard (ISO/IEC 9899:2011 §7.24.1 String Handling Conventions, ¶3, emphasis added):
For all functions in this subclause, each character shall be
interpreted as if it had the type unsigned char (and therefore every
possible object representation is valid and has a different value).
Note: this paragraph was not present in the C99 standard.
So I do not see a problem.
Related
The code I am handling has a lot of casts from uint8 to char, and the C library functions are then called on the results. I was trying to understand why the writer would prefer uint8 over char.
For example:
uint8 *my_string = "XYZ";
strlen((char*)my_string);
What happens to the \0? Is it added when I cast?
What happens when I cast the other way around?
Is this a legit way to work, and why would anybody prefer working with uint8 over char?
The casts char <=> uint8 are fine. It is always allowed to access any defined memory as unsigned characters, including string literals, and then of course to cast a pointer that points to a string literal back to char *.
In
uint8 *my_string = "XYZ";
"XYZ" is an anonymous array of 4 chars - including the terminating zero. This decays into a pointer to the first character. This is then implicitly converted to uint8 * - strictly speaking, it should have an explicit cast though.
The problem with the type char is that the standard leaves it up to the implementation to define whether it is signed or unsigned. If there is lots of arithmetic with the characters/bytes, it might be beneficial to have them unsigned by default.
A particularly notorious example is <ctype.h> with its is* character classification functions - isspace, isalpha and the like. They require their argument to be representable as an unsigned char (or be EOF)! A piece of code that does the equivalent of char c = something(); if (isspace(c)) { ... } is not portable, and a compiler cannot even warn about this! If the char type is signed on the platform (the default on x86!) and the character isn't ASCII (or, more properly, a member of the basic execution character set), then the behaviour is undefined - it even aborts on MSVC debug builds, but unfortunately just causes silent undefined behaviour (an out-of-bounds array access) on glibc.
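The portable idiom is to convert the char to unsigned char yourself before calling the classification function - a minimal sketch (the helper name is hypothetical):
#include <ctype.h>

/* Hypothetical helper: a portable whitespace test for a plain char.
   The cast to unsigned char avoids undefined behaviour when char is
   signed and c holds a byte value above 127. */
int is_space_safe(char c)
{
    return isspace((unsigned char)c);
}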
However, a compiler would be very loud about using unsigned char * or its alias as an argument to strlen, hence the cast.
In plain C, by the standard there are three distinct "character" types:
plain char, whose signedness is implementation-defined.
signed char.
unsigned char.
Let's assume at least C99, where stdint.h is already present (so you have the int8_t and uint8_t types as recommended fixed-width alternatives to signed and unsigned char).
For now it seems to me that the plain char type is only really useful (or necessary) if you need to interface with standard library functions such as printf, and in all other scenarios it is rather to be avoided. Using char could lead to undefined behavior when it is signed on the implementation and you need, for any reason, to do arithmetic on such data.
The problem of choosing an appropriate type is probably most apparent when dealing with, for example, Unicode text (or any code page using values above 127 to represent characters), which could otherwise be handled as a plain C string. However, the relevant string.h functions all accept char, and if such data is typed char, that causes problems when trying to interpret it, for example in a display routine capable of handling its encoding.
What is the most recommendable method in such a case? Are there any particular reasons beyond this where it could be recommendable to use char over stdint.h's appropriate fixed-width types?
The char type is for characters and strings. It is the type expected and returned by all the string handling functions. (*) You really should never have to do arithmetic on char, especially not the kind where signed-ness would make a difference.
unsigned char is the type to be used for raw data. For example memcpy() or fread() interpret their void * arguments as arrays of unsigned char. The standard guarantees that any type can be also represented as an array of unsigned char. Any other conversion might be "signalling", i.e. triggering exceptions. (ISO/IEC 9899:2011, section 6.2.6 "Representation of Types"). (**)
signed char is for when you need a signed integer of char size (for arithmetic).
(*): The character handling functions in <ctype.h> are a bit oddball about this, as they cater for EOF (negative), and hence "force" the character values into the unsigned char range (ISO/IEC 9899:2011, section 7.4 Character handling). But since it is guaranteed that a char can be cast to unsigned char and back without loss of information as per section 6.2.6... you get the idea.
When signed-ness of char would make a difference -- the comparison functions like in strcmp() -- the standard dictates that char is interpreted as unsigned char (ISO/IEC 9899:2011, section 7.24.4 Comparison functions).
(**): Practically, it is hard to see how a conversion of raw data to char and back could be signalling where the same done with unsigned char would not be signalling. But unsigned char is what the section of the standard says. ;-)
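To illustrate the "raw data" role of unsigned char, here is a minimal sketch that inspects the object representation of a double byte by byte:
#include <stdio.h>
#include <string.h>

int main(void)
{
    double d = 3.14;
    unsigned char bytes[sizeof d];

    /* The standard guarantees that any object can be viewed as an
       array of unsigned char. */
    memcpy(bytes, &d, sizeof d);
    for (size_t i = 0; i < sizeof d; i++)
        printf("%02x ", bytes[i]);
    putchar('\n');
    return 0;
}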
Use char to store characters (standard defines the behaviour for basic execution character set elements only, roughly ASCII 7-bit characters).
Use signed char or unsigned char to get the corresponding arithmetic (signed or unsigned arithmetic have different properties for integers - char is an integer type).
This doesn't mean that you can't do arithmetic with plain chars, as stated:
6.2.5 Types, ¶3: An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.
So if you only use basic character set elements, arithmetic on them is well defined.
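For example, arithmetic on the decimal digits is portable, since they are basic-set members (hence nonnegative) and '0' through '9' are required to be contiguous - a minimal sketch:
#include <stdio.h>

int main(void)
{
    /* '0'..'9' are guaranteed contiguous and nonnegative, so this
       subtraction is well defined on every implementation. */
    char digit = '7';
    int value = digit - '0';
    printf("%d\n", value);   /* prints 7 */
    return 0;
}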
On some embedded device, I have passed an unsigned char pointer to atoi without a cast.
unsigned char c[10]="12";
atoi(c);
Question: is it well defined?
I saw somewhere it is ok for string functions, but was not sure about atoi.
Edit: By the way, concerns have been expressed in one of the answers below that it might not be OK even for string functions such as strcpy - but if I understood correctly, the author also meant that in practice it can be OK.
Also, while I am here: is the following assignment to an unsigned char pointer OK too? I used a tool which complains about "Type mismatch (assignment) (ptrs to signed/unsigned)":
unsigned char *ptr = strtok(unscharbuff,"-");
// is assignment also ok to unsigned char?
No, it's not well defined. It's a constraint violation, requiring a compile-time diagnostic. In practice it's very very likely to work as you expect, but it's not guaranteed to do so, and IMHO it's poor style.
The atoi function is declared in <stdlib.h> as:
int atoi(const char *nptr);
You're passing an unsigned char* argument to a function that expects a char* argument. The two types are not compatible, and there is no implicit conversion from one to the other. A conforming compiler may issue a warning (that counts as a diagnostic) and then proceed to generate an executable, but the behavior of that executable is undefined.
As of C99, a call to a function with no visible declaration is a constraint violation, so you can't get away with it by omitting the #include <stdlib.h>.
C does still permit calls to functions with a visible declaration where the declaration is not a prototype (i.e., doesn't specify the number or type(s) of the parameters). So, rather than the usual #include <stdlib.h>, you could add your own declaration:
int atoi();
which would permit calling it with an unsigned char* argument.
This will almost certainly "work", and it might be possible to construct an argument from the standard that its behavior is well defined. The char and unsigned char values of '1' and '2' are guaranteed to have the same representation.
But it's far easier to add the cast than to prove that it's not necessary -- or, better yet, to define c as an array of char rather than as an array of unsigned char, since it's intended to hold a string.
unsigned char *ptr = strtok(unscharbuff,"-");
This is also a constraint violation. There is no implicit conversion from unsigned char* to char* for the first argument in the strtok call, and there is no implicit conversion from char* to unsigned char* for the initialization of ptr.
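A sketch of the two clean fixes suggested above - declare the buffer as plain char when it holds a string, or cast at the call sites:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Preferred: the buffer holds a string, so declare it as char. */
    char c[10] = "12";
    int n = atoi(c);

    /* Alternatively, keep unsigned char and cast in both directions. */
    unsigned char unscharbuff[] = "12-34";
    unsigned char *ptr = (unsigned char *)strtok((char *)unscharbuff, "-");

    printf("%d %s\n", n, (char *)ptr);
    return 0;
}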
Yes, these will function perfectly fine. Your compiler settings determine whether you get a warning about the type mismatch. I usually compile with -Wall to turn on all warnings, and then use explicit casts in the code for each and every case, so that I know I have carefully examined them. The end result is zero errors and zero warnings, and any change that triggers a warning in the future will really stand out instead of getting lost among 100 tolerated messages.
I noticed I've used several string functions like
strcpy(userAmount, pch);
on unsigned char* buffers, without casting. E.g., these variables are defined like
u8 * pch = NULL;
u8 userAmount[255] = {0};
It has worked fine so far.
Is it expected behaviour? Shall I continue using it this way?
So far the buffer has stored ASCII text, but it may hold UTF-8 in the future - will things be different in that case (e.g., in terms of casting or buffer type)?
Why use some non-standard type (u8?) when a standard type (char) would do?
However, your approach should be ok.
From the C11-Standard (italics by me):
7.24.1 String function conventions
The header <string.h> declares one type and several functions, and defines one
macro useful for manipulating arrays of character type and other objects treated as arrays
of character type.
[...]
For all functions in this subclause, each character shall be interpreted as if it had the type
unsigned char (and therefore every possible object representation is valid and has a
different value).
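Under that guarantee, the existing pattern can be made warning-free with explicit casts - a minimal sketch, assuming u8 is a typedef for unsigned char as in the question:
#include <string.h>

typedef unsigned char u8;   /* assumed, as in the question */

void copy_amount(u8 *userAmount, const u8 *pch)
{
    /* The casts silence the pointer-type mismatch; the string
       functions treat each byte as unsigned char anyway. */
    strcpy((char *)userAmount, (const char *)pch);
}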
I have found that the C99 standard has a statement which denies compatibility between the type char and the types signed char/unsigned char.
Footnote 35 of the C99 standard:
CHAR_MIN, defined in limits.h, will have one of the values 0 or SCHAR_MIN, and this can be used to distinguish the two options. Irrespective of the choice made, char is a separate type from the other two and is not compatible with either.
My question is that why does the committee deny the compatibility? What is the rationale? If char is compatible with signed char or unsigned char, will something terrible happen?
The roots are in compiler history. There were (are) essentially two C dialects in the Eighties:
Where plain char is signed
Where plain char is unsigned
Which of these should C89 have standardized? C89 chose to standardize neither, because it would have invalidated a large number of assumptions made in C code already written--what standard folks call the installed base. So C89 did what K&R did: leave the signedness of plain char implementation-defined. If you required a specific signedness, qualify your char.
Modern compilers usually let you choose the dialect with an option (e.g. gcc's -funsigned-char).
The "terrible" thing that can happen if you ignore the distinction between (un)signed char and plain char is that if you do arithmetic and shifts without taking these details into account, you might get sign extensions when you don't expect them or vice versa (or even undefined behavior when shifting).
There's also some dumb advice out there that recommends to always declare your chars with an explicit signed or unsigned qualifier. This works as long as you only work with pointers to such qualified types, but it requires ugly casts as soon as you deal with strings and string functions, all of which operate on pointer-to-plain-char, which is assignment-incompatible without a cast. Such code suddenly gets plastered with tons of ugly-to-the-bone casts.
The basic rules for chars are (illustrated in the sketch after this list):
Use plain char for strings and if you need to pass pointers to functions taking plain char
Use unsigned char if you need to do bit twiddling and shifting on bytes
Use signed char if you need small signed values, but think about using int if space is not a concern
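A minimal sketch putting all three rules side by side:
#include <stdio.h>

int main(void)
{
    /* Plain char for strings and the standard string functions. */
    char name[] = "hello";

    /* unsigned char for bit twiddling and shifting on bytes:
       no sign extension can ever sneak in. */
    unsigned char byte = 0xA5;
    unsigned char hi = byte >> 4;

    /* signed char for small signed values. */
    signed char delta = -3;

    printf("%s %u %d\n", name, (unsigned)hi, delta);
    return 0;
}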
Think of signed char and unsigned char as the smallest arithmetic, integral types, just like signed short/unsigned short, and so forth with int, long int, long long int. Those types are all well-specified.
On the other hand, char serves a very different purpose: It's the basic type of I/O and communication with the system. It's not meant for computations, but rather as the unit of data. That's why you find char used in the command line arguments, in the definition of "strings", in the FILE* functions and in other read/write type IO functions, as well as in the exception to the strict aliasing rule. This char type is deliberately less strictly defined so as to allow every implementation to use the most "natural" representation.
It's simply a matter of separating responsibilities.
(It is true, though, that char is layout-compatible with both signed char and unsigned char, so you may explicitly convert one to the other and back.)