Is it safe to convert char * to const unsigned char *? - c

We use char * in our code, but the library we call takes const unsigned char *, so we convert:
// lib
int asdf(const unsigned char *vv);
//ours
char *str = "somestring";
asdf((const unsigned char*)str);
Is it safe? Are there any pitfalls?

It is safe.
char *str = "somestring";
str points to a string literal. You may change str to point at another string:
str = "some"; // fine
but you must not modify the string it currently points to:
str[0] = 'q'; // undefined behavior
So converting a pointer to read-only data into a pointer to const is safe.
If asdf() only reads the string, like printf() or puts() do, you don't strictly need the const, because nothing modifies the string.
Using const is still safer: when implementing asdf(), the compiler will reject mistakes like str[0] = 'q'; at compile time.
Without const, you would only discover such an error when the program runs.

If it's being treated as a string by that interface, there should be no problem at all. You don't even need to add the const in your cast if you don't want to - that part is automatic.

Technically, it is always safe to convert pointers from one type to another (except between function pointers and data pointers, unless your implementation provides an extension that allows that). It's only the dereferencing that is unsafe.
But dereferencing a pointer to any signedness of char is always safe (from a type aliasing perspective, at least).

Passing a char * where an unsigned char * was expected, via a cast, probably won't break anything. The const part is unproblematic: it just indicates that the function won't modify the argument through that pointer.
In many cases it doesn't matter whether char values are treated as signed or unsigned -- it's usually only a problem when performing arithmetic or comparing the sizes of values. However, if the function is expressly defined to take an unsigned char*, I guess there's a chance that it really requires the input data to be unsigned, for some arithmetical reason. If you're treating your character data as signed elsewhere, then it's possible that there is an incompatibility between your data and the data expected by the function.
In many cases, however, developers write "unsigned" to mean "I will not be doing arithmetic on this data", so the signedness probably won't matter.

Related

Assignment: create my own memcpy. Why cast the destination and source pointers to unsigned char* instead of char*? [duplicate]

This question already has answers here:
Implement `memcpy()`: Is `unsigned char *` needed, or just `char *`?
I'm trying to create my own versions of C functions, and when I got to memcpy and memset I assumed that I should cast the destination and source pointers to char *. However, I've seen many examples where the pointers were cast to unsigned char * instead. Why is that?
void *mem_cpy(void *dest, const void *src, size_t n) {
    if (dest == NULL || src == NULL)
        return NULL;
    int i = 0;
    char *dest_arr = (char *)dest;
    char *src_arr = (char *)src;
    while (i < n) {
        dest_arr[i] = src_arr[i];
        i++;
    }
    return dest;
}
It doesn't matter for this case, but a lot of folks working with raw bytes will prefer to explicitly specify unsigned char (or with stdint.h types, uint8_t) to avoid weirdness if they have to do math with the bytes. char has implementation-defined signedness, and that means, when the integer promotions & usual arithmetic conversions are applied, a char with the high bit set is treated as a negative number if signed, and a positive number if unsigned.
While neither behavior is necessarily wrong for a given problem, the fact that the behavior can change between compilers or even with different flags set on the same compiler, means you often need to be explicit about signedness, using either signed char or unsigned char as appropriate, and 99% of the time, the behaviors of unsigned char are what you want, so people tend to default to it even when it's not strictly required.
There's no particular reason in this specific case, it's mostly stylistic.
But in general it is always best to stick to unsigned arithmetic when dealing with raw data. That is: unsigned char or uint8_t.
The char type is problematic because it has implementation-defined signedness and is therefore avoided in such code. Is char signed or unsigned by default?
NOTE: this is dangerous and poor style:
char *src_arr = (char *)src;
(The cast swept the problem under the carpet.)
Since you correctly used "const correctness" for src, the correct type is: const char *src_arr; I'd change the code to:
unsigned char *dest_arr = dest;
const unsigned char *src_arr = src;
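Putting those suggestions together, a revised mem_cpy might look like this; a sketch rather than the answer's literal code, it also uses a size_t counter and adds the restrict qualifiers discussed further down:

```c
#include <stddef.h>

/* mem_cpy revised along the lines above: unsigned char internally,
 * const-correct source, size_t counter, and no casts, since void *
 * converts to object pointer types implicitly in C. */
void *mem_cpy(void *restrict dest, const void *restrict src, size_t n) {
    unsigned char *dest_arr = dest;
    const unsigned char *src_arr = src;
    for (size_t i = 0; i < n; i++)
        dest_arr[i] = src_arr[i];
    return dest;
}
```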
A good rule of thumb for beginners is to never use a cast. I'm serious. Some 90% of all casts we see on SO in beginner-level programs are wrong, in one way or the other.
Btw (advanced topic) there's a reason why memcpy has the prototype as:
void *memcpy(void * restrict s1,
const void * restrict s2,
size_t n);
The restrict qualifier on the pointers tells the caller "hey, I'm counting on you not to pass two pointers to the same object, or pointers that may overlap". Overlap would cause problems in various situations and on various targets, so this is a good idea.
It's much more likely that a caller passes overlapping pointers than null pointers, so if you are going to have slow, superfluous error checking against NULL, you should at least also restrict-qualify the pointers.
If the caller passes null pointers, I'd just let the function crash, instead of slowing it down with extra branches that are pointless bloat in some 99% of all use cases.
Why ... unsigned char* instead of char*?
Short answer: Because the functionality differs in select operations when char is signed and the C spec specifies unsigned char like functionality for str...() and mem...().
When does it make a difference?
When a function (like memcmp(), strcmp(), etc.) compares for order and one byte is negative while the other is positive, the ordering of the two bytes differs. Example: -1 < 1, yet viewed as unsigned char: 255 > 1.
When does it not make a difference?
When copying data and comparing for equality*1.
Non-two's-complement
*1 Ones' complement and sign-magnitude encodings are expected to be dropped in the upcoming C2x version of the standard. Until then, those signed encodings support two zeros. For the str...() and mem...() functions, C specifies data access as unsigned char. This means only the +0 is a null character, and ordering depends on pure binary, unsigned encoding.

How to replicate the functionality of `strtod`, etc. without getting the warning "Assigning to 'char *' from 'const char *' discards qualifier"?

There are many posts about this particular warning, but I wasn't able to find one that specifically references a function signature like that of strtod and friends.
I have a function int foo(const char *str, char **end), and there's no way that I have discovered to set *end = str before returning the integer without encountering the aforementioned warning. Is there some additional qualifier that I'm leaving out, or is it something I'm going to have to "live with"?
If I remember correctly, this is actually a big fat wart in the modern C Standard. It's impossible to implement strtod without an explicit cast. There was a huge amount of discussion about this, back in the day, with radical proposals being made for bizarre extensions to the language to make it possible to write strtod "correctly". But the cures were all worse than the disease, so in the end, the radical proposals were not adopted, and the result it that it's tricky (but not impossible) to write strtod and the like.
In your implementation, you will typically have a pointer p that points to the first character you didn't parse. Since p's initial value was your input string, which was const char *, your p will typically be a const char *, too. (And this is fine, because you don't intend to use p to modify the string as you parse it.) But when it comes time to set endp, you're simply going to have to use a cast:
*endp = (char *)p;
It feels lousy to be "casting away constness" like this, but it's really the only way, and it's actually perfectly legal, as Eric P. explains in another answer here.
(Another possibility is *(const char **)endp = p;, but it's more typing, even more sketchy-looking, and it turns out not strictly legal.)
Normally the rule is that explicit casts like this are poor form. Normally the recommendation is to find a way to not need the explicit cast. And although the general rule is a good one, this is an exception, pure and simple. Based on everything else that's going on, the conclusion is that you need this cast here, and if you try to get by without it, you end up having to do something even worse elsewhere.
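As a concrete sketch, here is a toy foo() with the strtod-style signature from the question. The integer-parsing body is invented for illustration; the point is the *end = (char *)p; line, the one cast under discussion:

```c
#include <ctype.h>
#include <stddef.h>

/* Parse a non-negative decimal integer and report where parsing
 * stopped, strtod-style. */
int foo(const char *str, char **end) {
    const char *p = str;                 /* const while we parse */
    int value = 0;
    while (isdigit((unsigned char)*p)) {
        value = value * 10 + (*p - '0');
        p++;
    }
    if (end != NULL)
        *end = (char *)p;                /* the one unavoidable cast */
    return value;
}
```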
In answer to a question in a comment, the reason we can't make endp be a const char ** is that it makes things too inconvenient on the caller. The caller might be using pointers that are not const-qualified. Now, if the caller has
char *str = "123.456xyz";
and then calls
double d = strtod(str, NULL);
this is fine: it's okay to pass a regular char * to a function that expects const char *. But if the caller wants to get the end pointer back, and additionally declares
char *endp;
and then calls
double d = strtod(str, &endp);
and if strtod were declared as
double strtod(const char *, const char **);
it turns out it wouldn't work. You can pass a char * to a function that expects a const char *, but you cannot pass a char ** to a function that expects a const char **. The explanation for why you can't is rather obscure. There's a sort-of-coherent explanation in the C FAQ list.
… there's no way that I have discovered to set *end = str…
Simply use *end = (char *) str;. When you explicitly convert using a cast, the compiler will not warn you. Further, this is fully defined by the C standard.
C 2018 6.3.2.3 7 says a pointer to an object type (here str) may be converted to a pointer to a different object type (here char *). If the resulting pointer were not correctly aligned for the new type, the behavior would be undefined. However, any pointer is correctly aligned for char, so that is fine. That paragraph also tells us that when we convert a pointer to a character type, the result points to the lowest addressed byte of the object. And that is just what we have done, convert to a pointer to a character type.
These rules do not prohibit us from removing const in the cast: converting from the const char * that is str to char * is allowed. (A different rule says the behavior is undefined if an attempt is made to modify an object defined with const, but we are not going to do that. Merely pointing to it without const is fine.)
Then the caller gets this pointer back in their end object. It is a valid pointer to the byte that ended the parse. So they can use it. If they are working with a string defined without const, they can use this end pointer to read and/or write to the string. If they are working with a string defined with const, they can use this end pointer to read from the string, and they are also free to convert it to a const char *.

C programming preferring uint8 over char

The code I am handling has a lot of casts from uint8 to char, after which the C library functions are called on the results. I was trying to understand why the writer would prefer uint8 over char.
For example:
uint8 *my_string = "XYZ";
strlen((char*)my_string);
What happens to the \0, is it added when I cast?
What happens when I cast the other way around?
Is this a legit way to work, and why would anybody prefer working with uint8 over char?
The casts char <=> uint8 are fine. It is always allowed to access any defined memory as unsigned characters, including string literals, and then of course to cast a pointer that points to a string literal back to char *.
In
uint8 *my_string = "XYZ";
"XYZ" is an anonymous array of 4 chars - including the terminating zero. This decays into a pointer to the first character. This is then implicitly converted to uint8 * - strictly speaking, it should have an explicit cast though.
The problem with the type char is that the standard leaves it up to the implementation to define whether it is signed or unsigned. If there is lots of arithmetic with the characters/bytes, it might be beneficial to have them unsigned by default.
A particularly notorious example is the <ctype.h> with its is* character class functions - isspace, isalpha and the like. They require the characters as unsigned chars (converted to int)! A piece of code that does the equivalent of char c = something(); if (isspace(c)) { ... } is not portable and a compiler cannot even warn about this! If the char type is signed on the platform (default on x86!) and the character isn't ASCII (or, more properly, a member of the basic execution character set), then the behaviour is undefined - it would even abort on MSVC debug builds, but unfortunately just causes silent undefined behaviour (array access out of bounds) on glibc.
However, a compiler would be very loud about using unsigned char * or its alias as an argument to strlen, hence the cast.

What is the best way to represent characters in C?

I know that a char is allowed to be signed or unsigned depending on the implementation. This doesn't really bother me if all I want to do is manipulate bytes. (In fact, I don't think of the char datatype as a character, but a byte).
But, if I understand, string literals are signed chars (actually they're not, but see the update below), and the function fgetc() returns unsigned chars converted to int. So if I want to manipulate characters, is it preferred style to use signed, unsigned, or ambiguous characters? Why does reading characters from a file have a different convention than literals?
I ask because I have some code in c that does string comparison between string literals and the contents of files, but having a signed char * vs unsigned char * might really make my code error prone.
Update 1
Ok as a few people pointed out (in answers and comments) string literals are in fact char arrays, not signed char arrays. That means I really should use char * for string literals, and not think about whether they are signed or unsigned. This makes me perfectly happy (until I have to start making conversion/comparisons with unsigned chars).
However the important question remains, how do I read characters from a file, and compare them to a string literal. The crux of which is the conversion from the int read using fgetc(), which explicitly reads an unsigned char from the file, to the char type, which is allowed to be either signed or unsigned.
Allow me to provide a more detailed example.
int main(void)
{
    FILE *someFile = fopen("ThePathToSomeRealFile.html", "r");
    assert(someFile);
    char substringFromFile[25];
    memset((void *)substringFromFile, 0, sizeof(substringFromFile));
    //Alright, the real example is to read the first few characters from the file
    //and then compare them to the string I expect
    const char expectedString[] = "<!DOCTYPE";
    for( size_t counter = 0; counter < sizeof(expectedString) - 1; ++counter )
    {
        //Read it as an integer, because the function returns an `int`
        const int oneCharacter = fgetc(someFile);
        if( ferror(someFile) )
            return EXIT_FAILURE;
        if( oneCharacter == EOF || feof(someFile) )
            break;
        assert(counter < sizeof(substringFromFile)/sizeof(*substringFromFile));
        //HERE IS THE PROBLEM:
        //I know the data contained in oneCharacter must be an unsigned char
        //Therefore, this is valid
        const unsigned char uChar = (unsigned char)oneCharacter;
        //But then how do I assign it to the char?
        substringFromFile[counter] = (char)oneCharacter;
    }
    //and ultimately here's my goal
    int headerIsCorrect = strncmp(substringFromFile, expectedString, 9);
    if(headerIsCorrect == 0)
        return EXIT_SUCCESS;
    //else
    return EXIT_FAILURE;
}
Essentially, I know my fgetc() function is returning something that (after some error checking) is representable as an unsigned char. I know that char may or may not be an unsigned char. That means, depending on the implementation, a cast to char may or may not involve a reinterpretation. In the case that the system is implemented with a signed char, I have to worry about values that can be encoded by an unsigned char but are not encodable by char (i.e. those values in (INT8_MAX, UINT8_MAX]).
tl;dr
The question is this, should I (1) copy their underlying data read by fgetc() (by casting pointers - don't worry, I know how to do that), or (2) cast down from unsigned char to char (which is only safe if I know that the values can't exceed INT8_MAX, or those values can be ignored for whatever reason)?
The historical reasons are (as I've been told, I don't have a reference) that the char type was poorly specified from the beginning.
Some implementations used "consistent integer types" where char, short, int and so on were all signed by default. This makes sense because it makes the types consistent with each other.
Other implementations used unsigned for character, since there never existed any symbol tables with negative indices (that would be stupid) and since they saw a need for more than 128 characters (a very valid concern).
By the time C got standardized properly, it was too late to change this, too many different compilers and programs written for them were already out on the market. So the signedness of char was made implementation-defined, for backwards compatibility reasons.
The signedness of char does not matter if you only use it to store characters/strings. It only matters when you decide to involve the char type in arithmetic expressions or use it to store integer values - this is a very bad idea.
For characters/string, always use char (or wchar_t).
For any other form of 1 byte large data, always use uint8_t or int8_t.
But, if I understand, string literals are signed char
No, string literals are char arrays.
the function fgetc() returns unsigned chars casted into int
Almost: it returns each byte as an unsigned char converted to an int, which is how the standard defines fgetc(). The return type is int because it must also be able to return EOF, which is an integer constant and not a character constant.
having a signed char * vs unsigned char * might really make my code error prone.
No, not really. Formally, this rule from the standard applies:
A pointer to an object type may be converted to a pointer to a different object type. If the
resulting pointer is not correctly aligned for the referenced type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer.
There exists no case where casting from pointer to signed char to pointer to unsigned char or vice versa, would cause any alignment issues or other issues.
I know that a char is allowed to be signed or unsigned depending on the implementation. This doesn't really bother me if all I want to do is manipulate bytes.
If you're going to do comparison or assign char to other integer types, it should bother you.
But, if I understand, string literals are signed chars
They are of type char[]. So on an implementation where char is unsigned, every string literal holds exactly the values an unsigned char[] would.
the function fgetc() returns unsigned chars casted into int.
That's correct, and it is required in order to avoid unwanted sign extension.
So if I want to manipulate characters, is it preferred style to use signed, unsigned, or ambiguous characters?
For portability I'd advise following the practice adopted by various libc implementations: use char, but cast to unsigned char (char * to unsigned char *) before processing. This way, implicit integer promotions won't turn characters in the range 0x80 to 0xFF into negative numbers of wider types.
In short: (signed char)a < (signed char)b is NOT always equivalent to (unsigned char)a < (unsigned char)b.
Why does reading characters from a file have a different convention than literals?
getc() needs a way to return EOF such that it couldn't be confused with any real char.

Pass unsigned char pointer to atoi without cast

On some embedded device, I have passed an unsigned char pointer to atoi without a cast.
unsigned char c[10]="12";
atoi(c);
Question: is it well defined?
I saw somewhere it is ok for string functions, but was not sure about atoi.
Edit: By the way, some concerns were expressed in one of the answers below that this might not be OK even for string functions such as strcpy; though, if I understood correctly, the author also meant that in practice it can be OK.
While I am at it, is the following assignment to an unsigned char pointer OK too? I ask because a tool I used complains about "Type mismatch (assignment) (ptrs to signed/unsigned)":
unsigned char *ptr = strtok(unscharbuff,"-");
// is assignment also ok to unsigned char?
No, it's not well defined. It's a constraint violation, requiring a compile-time diagnostic. In practice it's very very likely to work as you expect, but it's not guaranteed to do so, and IMHO it's poor style.
The atoi function is declared in <stdlib.h> as:
int atoi(const char *nptr);
You're passing an unsigned char* argument to a function that expects a char* argument. The two types are not compatible, and there is no implicit conversion from one to the other. A conforming compiler may issue a warning (that counts as a diagnostic) and then proceed to generate an executable, but the behavior of that executable is undefined.
As of C99, a call to a function with no visible declaration is a constraint violation, so you can't get away with it by omitting the #include <stdlib.h>.
C does still permit calls to functions with a visible declaration where the declaration is not a prototype (i.e., doesn't define the number of type(s) of the parameters). So, rather than the usual #include <stdlib.h>, you could add your own declaration:
int atoi();
which would permit calling it with an unsigned char* argument.
This will almost certainly "work", and it might even be possible to construct an argument from the standard that its behavior is well defined, since the char and unsigned char values of '1' and '2' are guaranteed to have the same representation.
But it's far easier to add the cast than to prove that it's not necessary -- or, better yet, to define c as an array of char rather than as an array of unsigned char, since it's intended to hold a string.
unsigned char *ptr = strtok(unscharbuff,"-");
This is also a constraint violation. There is no implicit conversion from unsigned char* to char* for the first argument in the strtok call, and there is no implicit conversion from char* to unsigned char* for the initialization of ptr.
Yes, these will function perfectly fine. Your compiler settings determine whether you get a warning about the type mismatch. I usually compile with -Wall to turn on all warnings, and then use explicit casts in the code for each and every case, so that I know I have carefully examined them. The end result is zero errors and zero warnings, and any change that triggers a warning in the future will really stand out, instead of getting lost among 100 tolerated messages.
