Using string methods on unsigned char * without casting - c

I noticed I've used several string methods like
strcpy(userAmount, pch);
on unsigned char* buffers, without casting. e.g., these variables are defined like
u8 * pch = NULL;
u8 userAmount[255] = {0};
It has worked fine so far.
Is this expected behaviour? Should I continue using it this way?
So far the buffer has stored ASCII text, but it may hold UTF-8 in the future. Will things be different in that case (e.g., in terms of casting or the buffer type)?

Why use some non-standard type (u8?) when a standard type (char) would do?
However, your approach should be ok.
From the C11 standard (emphasis mine):
7.24.1 String function conventions
The header <string.h> declares one type and several functions, and defines one macro useful for manipulating arrays of character type and other objects treated as arrays of character type.
[...]
For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value).
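
As a hedged sketch of the questioner's situation (assuming u8 is a typedef for unsigned char, as the definitions suggest, and using a placeholder value), the casts below make the conversions explicit and silence pointer-compatibility diagnostics:

#include <stdio.h>
#include <string.h>

typedef unsigned char u8; /* assumption: u8 is a typedef for unsigned char */

int main(void)
{
    u8 userAmount[255] = {0};
    const u8 *pch = (const u8 *)"123.45"; /* placeholder contents */

    /* cast at the call boundary to match strcpy's char * parameters */
    strcpy((char *)userAmount, (const char *)pch);

    printf("%s\n", (char *)userAmount); /* prints: 123.45 */
    return 0;
}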

Related

Why do we have the type char in C, if a character literal is always of type int? Isn't the whole char type in C redundant?

As opposed to C++, in C a character literal is always of type int.
But why, then, do we have the type char for holding a character value?
In the question Why are C character literals ints instead of chars?,
it is discussed why character literals are of type int in C. But this is not what my question is about.
The question If character constants are of type `int`, why are they assigned to variables of type `char`? goes a little deeper, asking why we actually assign character literals to variables of type char if they are of type int, but the answers provided there leave open the concern of why we need the type char in general.
My questions are now:
Why do we have the type char if character literals are always of type int?
Isn't the type char redundant then?
What is the purpose of the type char, if it is seemingly redundant?
Just because a character constant in C source code has type int doesn't mean that the type char has no use.
The type char occupies 1 byte of space, so you can use it anywhere the values are in the range of a char, which includes ASCII characters. You can read and write those characters from either the console or a file as single-byte entities. The fact that a character constant in source code has a different type doesn't change that.
Using char in an array also means you're using less memory than if you had an array of int, which can be useful in situations where space is at a premium. This is especially true if you're using it as a binary format to store data on disk or send it over a network.
A char * can also be used to access the individual bytes of any object if you need to see how that object is represented.
The type char also allows you to address each byte (the smallest addressable unit of a CPU). So, for example, it allows you to specify a memory extent of any number of bytes, for use in memcpy or memmove.
Also, how would you declare a character array without the type char?
If you declare it as an integer array, there will be redundantly allocated memory.
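
A small sketch of both points above (byte-level inspection of an object's representation, and byte-counted copying with memcpy):

#include <stdio.h>
#include <string.h>

int main(void)
{
    int n = 0x01020304;
    const unsigned char *bytes = (const unsigned char *)&n; /* view the object representation */

    for (size_t i = 0; i < sizeof n; i++)
        printf("byte %zu: %02x\n", i, (unsigned)bytes[i]); /* order depends on endianness */

    int copy;
    memcpy(&copy, &n, sizeof n); /* extent given as a number of bytes */
    printf("copy = %#x\n", copy);
    return 0;
}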
Why do we have the type char in C, if a character literal is always of type int?
char, unsigned char, and signed char are the minimal-size objects. Character literal constants are of type int for simplicity of the language, there being no strong need otherwise. (C++ chose a different path; computers could handle more complex things 20 years later.) There are no integer constants narrower than int.
Isn't the whole char type in C redundant?
Why do we have the type char if character literals are always of type int?
Isn't the type char redundant then?
No. Objects benefit from a variety of sizes; constants less so.
What is the purpose of type char, if it is seemingly redundant?
Concerning int and constants, char is not redundant. Concerning signed char and unsigned char, char is redundant; it reflects a compromise among early implementations, some of which made char signed and some unsigned. This allows char to be signed (symmetric with the other integer types, which are signed when they lack signed or unsigned), even though characters are conceptually usually thought of as unsigned.
Code can form a compound literal of type char if a "char literal" is needed:
char a = (char){'B'};

When to use the plain char type in C

In plain C, by the standard there are three distinct "character" types:
plain char, whose signedness is implementation-defined.
signed char.
unsigned char.
Let's assume at least C99, where stdint.h is already present (so you have the int8_t and uint8_t types as recommendable explicit-width alternatives to signed and unsigned char).
For now it seems to me that using the plain char type is only really useful (or necessary) if you need to interface with standard library functions such as printf, and in all other scenarios it is rather to be avoided. Using char could lead to undefined behavior when it is signed on the implementation and you need, for any reason, to do arithmetic on such data.
The problem of choosing an appropriate type is probably most apparent when dealing with, for example, Unicode text (or any code page using values above 127 to represent characters), which could otherwise be handled as a plain C string. However, the relevant string.h functions all accept char, and if such data is typed char, that imposes problems when trying to interpret it, for example in a display routine capable of handling its encoding.
What is the recommended approach in such a case? Are there any particular reasons beyond this where it could be preferable to use char over stdint.h's appropriate fixed-width types?
The char type is for characters and strings. It is the type expected and returned by all the string handling functions. (*) You really should never have to do arithmetic on char, especially not the kind where signed-ness would make a difference.
unsigned char is the type to be used for raw data. For example memcpy() or fread() interpret their void * arguments as arrays of unsigned char. The standard guarantees that any type can be also represented as an array of unsigned char. Any other conversion might be "signalling", i.e. triggering exceptions. (ISO/IEC 9899:2011, section 6.2.6 "Representation of Types"). (**)
signed char is when you need a signed integer of char size (for arithmetics).
(*): The character handling functions in <ctype.h> are a bit oddball about this, as they cater for EOF (negative), and hence "force" the character values into the unsigned char range (ISO/IEC 9899:2011, section 7.4 Character handling). But since it is guaranteed that a char can be cast to unsigned char and back without loss of information as per section 6.2.6... you get the idea.
When signed-ness of char would make a difference -- the comparison functions like in strcmp() -- the standard dictates that char is interpreted as unsigned char (ISO/IEC 9899:2011, section 7.24.4 Comparison functions).
(**): Practically, it is hard to see how a conversion of raw data to char and back could be signalling where the same done with unsigned char would not be signalling. But unsigned char is what the section of the standard says. ;-)
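
To illustrate the (*) footnote, a minimal sketch of the usual <ctype.h> idiom: cast to unsigned char before the call, since passing a negative char value (other than EOF) is undefined behavior:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    const char *s = "abc\xE9"; /* last byte > 127: negative if char is signed */

    for (size_t i = 0; s[i] != '\0'; i++) {
        /* force the value into the unsigned char range first */
        if (isalpha((unsigned char)s[i]))
            printf("byte %zu is alphabetic\n", i);
    }
    return 0;
}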
Use char to store characters (standard defines the behaviour for basic execution character set elements only, roughly ASCII 7-bit characters).
Use signed char or unsigned char to get the corresponding arithmetic (signed or unsigned arithmetic have different properties for integers - char is an integer type).
This doesn't mean that you can't do arithmetic with raw chars, as stated:
6.2.5 Types, ¶3: An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.
So if you only use basic character set elements, arithmetic on them is correctly defined.

passing unsigned char array to string functions

Say I have some UTF-8 encoded string. Inside it, words are delimited using ";".
But each character (except ";") inside this string has a UTF-8 byte value > 128.
Say I store this string inside an unsigned char array:
unsigned char buff[]="someutf8string;separated;with;";
Is it safe to pass this buff to the strtok function (if I just want to extract words using the ";" symbol)?
My concern is that strtok (and also strcpy) expect char pointers, but inside my string some values will be > 128.
So is this behaviour defined?
No, it is not safe -- but if it compiles it will almost certainly work as expected.
unsigned char buff[]="someutf8string;separated;with;";
This is fine; the standard specifically permits arrays of character type (including unsigned char) to be initialized with a string literal. Successive bytes of the string literal initialize the elements of the array.
strtok(buff, ";")
This is a constraint violation, requiring a compile-time diagnostic. (That's about as close as the C standard gets to saying that something is illegal.)
The first parameter of strtok is of type char *, but you're passing an argument of type unsigned char *. These two pointer types are not compatible, and there is no implicit conversion between them. A conforming compiler may reject your program if it contains a call like this (and, for example, gcc -std=c99 -pedantic-errors does reject it).
Many C compilers are somewhat lax about strict enforcement of the standard's requirements. In many cases, compilers issue warnings for code that contains constraint violations -- which is perfectly valid. But once a compiler has diagnosed a constraint violation and proceeded to generate an executable, the behavior of that executable is not defined by the C standard.
As far as I know, any actual compiler that doesn't reject this call will generate code that behaves just as you expect it to. The pointer types char* and unsigned char* almost certainly have the same representation and are passed the same way as arguments, and the types char and unsigned char are explicitly required to have the same representation for non-negative values. Even for values exceeding CHAR_MAX, like the ones you're using, a compiler would have to go out of its way to generate misbehaving code. You could have problems on a system that doesn't use two's complement for signed integers, but you're not likely to encounter such a system.
If you add an explicit cast:
strtok((char*)buff, ";")
it removes the constraint violation and will probably silence any warning -- but the behavior is still strictly undefined.
In practice, though, most compilers try to treat char, signed char, and unsigned char almost interchangeably, partly to cater to code like yours, and partly because they'd have to go out of their way to do anything else.
According to the C11 standard (ISO/IEC 9899:2011, §7.24.1 String function conventions, ¶3, emphasis added):
For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value).
Note: this paragraph was not present in the C99 standard.
So I do not see a problem.
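
For completeness, a hedged sketch of the tokenizing loop with the cast applied at the call boundary (using placeholder ASCII words in place of the UTF-8 data):

#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char buff[] = "alpha;beta;gamma;"; /* stand-in for the UTF-8 string */

    /* strtok modifies the buffer in place and returns char * tokens */
    for (char *tok = strtok((char *)buff, ";"); tok != NULL; tok = strtok(NULL, ";"))
        printf("word: %s\n", tok);

    return 0;
}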

Using unsigned char* with string methods such as strcpy

I realized that in some places in my code I might have passed
unsigned char * variables as parameters to functions such as strcpy and strtok -- which expect char *. My question is: is this a bad idea? Could it have caused issues?
e.g.
unsigned char * x = // .... some val, null terminated
unsigned char * y = // ... same here;
strcpy(x,y); // ps assuming there is space allocated for x
e.g., unsigned char * x = strtok(NULL,...)
It's guaranteed to be ok (after you cast the pointer), because the "Strict Aliasing Rule" has a special exception for looking at the same object via both signed and unsigned variants.
See here for the rule itself. Other answers on that page explain it.
The C aliasing rules have exceptions for signed/unsigned variants and for char access in general. So no trouble here.
Quote from the standard:
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
— a type compatible with the effective type of the object,
— a qualified version of a type compatible with the effective type of the object,
— a type that is the signed or unsigned type corresponding to the effective type of the object,
— a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
— an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
— a character type.
All standard library functions treat any char arguments as unsigned char, so passing char*, unsigned char* or signed char* is treated the same.
Quote from the intro of <string.h>:
For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value).
Still, your compiler should complain if you get the signed-ness wrong, especially if you enable all warnings (you should, always).
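
A short sketch of that aliasing exception: the same stored bytes may be read through the signed variant, the unsigned variant, or any character type:

#include <stdio.h>

int main(void)
{
    char text[] = "abc";

    /* permitted: signed/unsigned variants of the effective type, and character types */
    unsigned char *u = (unsigned char *)text;
    signed char *s = (signed char *)text;

    for (size_t i = 0; text[i] != '\0'; i++)
        printf("'%c' -> unsigned %u, signed %d\n", text[i], (unsigned)u[i], (int)s[i]);

    return 0;
}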
The only problem with converting unsigned char * into char * (or vice versa) is that it's supposed to be an error. Fix it with a cast.
e.g.,
function((char *) buff, len);
That being said, strcpy needs the null-terminating character (\0) to work properly. The alternative is to use memcpy.
But you shouldn't use unsigned char arrays with string handling functions. In C, strings are char arrays, not unsigned char arrays. Since unsigned char * is not compatible with strcpy's char * parameter, the compiler warns.
As a general rule, don't make things unsigned when you don't have to.
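
And where the data is raw bytes rather than a string, a minimal sketch of the memcpy alternative mentioned above, which copies a stated length instead of relying on a '\0' terminator:

#include <string.h>

/* copy n raw bytes; embedded zero bytes are copied like any other value */
void copy_raw(unsigned char *dst, const unsigned char *src, size_t n)
{
    memcpy(dst, src, n);
}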

What does the prefix L"..." stand for in GCC C without #including wchar?

That is, why does unsigned short var = L'ÿ'; work, but unsigned short var[] = L"ÿ"; does not?
L'ÿ' is of type wchar_t, which can be implicitly converted into an unsigned short. L"ÿ" is of type wchar_t[2], which cannot be implicitly converted into unsigned short[2].
L is the prefix for wide character literals and wide-character string literals. This is part of the language and not a header. It's also not GCC-specific. They would be used like so:
wchar_t some_wchar = L'ÿ';
wchar_t *some_wstring = L"ÿ"; // or wchar_t some_wstring[] = L"ÿ";
You can do unsigned short something = L'ÿ'; because a conversion is defined from wchar_t to unsigned short. There is no such conversion defined between wchar_t * and unsigned short *.
wchar_t is just a typedef for one of the standard integer types. The compiler implementor chooses a type that is large enough to hold all wide characters. If you don't include the header, this is still true and L'ß' is well defined; it's just that you as a programmer don't know which type it has.
Your initialization of an integer type works because there are rules for converting one integer type into another. Assigning a wide character string (i.e., the address of the first element of a wide character array) to an integer pointer is only possible if you correctly guess the integer type to which wchar_t corresponds. There is no automatic conversion between pointers of different types, unless one of them is void *.
Chris has already given the correct answer, but I'd like to offer some thoughts on why you may have made the mistake to begin with. On Windows, wchar_t was defined as 16-bit way back in the early days of Unicode, when it was intended to be a 16-bit character set. Unfortunately this turned out to be a bad decision (it makes it impossible for the C compiler to support non-BMP Unicode characters in a way that conforms to the C standard), but they were stuck with it.
Unix systems from the beginning have used 32-bit wchar_t, which of course means short * and wchar_t * are incompatible pointer types.
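
A small sketch showing both forms, plus a way to see your platform's wchar_t width (16 bits on Windows, 32 bits on typical Unix systems, as noted above); \xFF is used in place of 'ÿ' to sidestep source-encoding concerns:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t c = L'\xFF';   /* wide character literal (U+00FF, 'ÿ') */
    wchar_t s[] = L"\xFF"; /* wide string literal: wchar_t[2], including the terminator */

    printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));
    printf("c = %lu, s has %zu elements\n", (unsigned long)c, sizeof s / sizeof s[0]);
    return 0;
}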
From what I remember of C:
'y' (or whatever) is a char; you can cast it to an int, and therefore convert it to an L (wide) value.
"y" is a string constant, and you can't translate it into an integer value.
