Using unsigned char* with string methods such as strcpy - c

I just realized that in some places in my code I may have passed unsigned char* variables as arguments to functions such as strcpy and strtok, which expect char *. Is that a bad idea? Could it have caused issues?
e.g.
unsigned char * x = // .... some val, null terminated
unsigned char * y = // ... same here;
strcpy(x,y); // assuming there is space allocated for x
e.g., unsigned char * x = strtok(NULL,...)

It's guaranteed to be ok (after you cast the pointer), because the "Strict Aliasing Rule" has a special exception for looking at the same object via both signed and unsigned variants.
See here for the rule itself. Other answers on that page explain it.

The C aliasing rules have exceptions for signed/unsigned variants and for char access in general. So no trouble here.
Quote from the standard:
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
— a type compatible with the effective type of the object,
— a qualified version of a type compatible with the effective type of the object,
— a type that is the signed or unsigned type corresponding to the effective type of the
object,
— a type that is the signed or unsigned type corresponding to a qualified version of the
effective type of the object,
— an aggregate or union type that includes one of the aforementioned types among its
members (including, recursively, a member of a subaggregate or contained union), or
— a character type.
The <string.h> functions interpret each character as unsigned char, so the data is treated the same whether you pass char*, unsigned char* or signed char*.
Quote from the intro of <string.h>:
For all functions in this subclause, each character shall be interpreted as if it had the type
unsigned char (and therefore every possible object representation is valid and has a
different value).
Still, your compiler should complain if you get the signed-ness wrong, especially if you enable all warnings (you should, always).
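A minimal sketch of what that <string.h> wording means in practice, using only the standard library (sorts_after is a hypothetical helper name, not part of any API): because strcmp compares characters as unsigned char, a byte of 0x80 sorts after 0x7f even on platforms where plain char is signed.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns 1 if a sorts after b according to strcmp. */
static int sorts_after(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    /* strcmp interprets each character as unsigned char. */
    return strcmp(a, b) > 0;
}
```

With this helper, sorts_after("\x80", "\x7f") is true regardless of whether char is signed.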

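The only problem with converting unsigned char * into char * (or vice versa) is that doing it implicitly is supposed to be an error: the two are incompatible pointer types, so the compiler must at least issue a diagnostic. Fix it with an explicit cast.
e.g.,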
function((char *) buff, len);
That being said, strcpy needs to have the null-terminating character (\0) to properly work. The alternative is to use memcpy.
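As a rough illustration of both options (ustrcpy and ucopy are hypothetical wrapper names introduced here, not library functions), the cast can be kept in one place, and memcpy can be used when the data is not guaranteed to be null-terminated:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical wrapper: strcpy for null-terminated unsigned char strings. */
static unsigned char *ustrcpy(unsigned char *dst, const unsigned char *src)
{
    assert(dst != NULL && src != NULL);
    /* The casts change only the pointer type, not the bytes being copied. */
    return (unsigned char *) strcpy((char *) dst, (const char *) src);
}

/* Hypothetical wrapper: copy exactly len bytes; no terminator is required. */
static unsigned char *ucopy(unsigned char *dst, const unsigned char *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    return memcpy(dst, src, len); /* memcpy takes void *, so no cast is needed */
}
```

In both cases the caller is still responsible for making sure dst has enough room.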
But you shouldn't use unsigned char arrays with string handling functions. In C, strings are char arrays, not unsigned char arrays. Since char * and unsigned char * are incompatible pointer types (unsigned is part of the type, not a qualifier), the compiler warns when you pass one to strcpy.
As a general rule, don't make things unsigned when you don't have to.

Related

C - Conversion behavior between two pointers

Update 2020-12-11: Thanks to "Some programmer dude" for the suggestion in the comment.
My underlying problem is that our team is implementing a dynamic type storage engine. We allocate multiple 16-byte-aligned char array[PAGE_SIZE] buffers to store data of dynamic types (there is no fixed struct). For efficiency reasons, we cannot perform byte encoding or allocate additional space to use memcpy.
Since the alignment has been determined (i.e., 16), what remains is to cast the pointer to access objects of the specified type, for example:
int main() {
// simulate our 16-aligned malloc
_Alignas(16) char buf[4096];
// store some dynamic data:
*((unsigned long *) buf) = 0xff07;
*(((double *) buf) + 2) = 1.618;
}
But our team disputes whether this operation is undefined behavior.
I have read many similar questions, such as
Why does -Wcast-align not warn about cast from char* to int* on x86?
How to cast char array to int at non-aligned position?
C undefined behavior. Strict aliasing rule, or incorrect alignment?
SEI CERT C Coding Standard, EXP36-C
But these differ from my interpretation of the C standard, and I want to know whether I have misunderstood it.
The main confusion is about the section 6.3.2.3 #7 of C11:
A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned 68) for the referenced type, the behavior is undefined.
68) In general, the concept ‘‘correctly aligned’’ is transitive: if a pointer to type A is correctly aligned for a pointer to type B, which in turn is correctly aligned for a pointer to type C, then a pointer to type A is correctly aligned for a pointer to type C.
Does "the resulting pointer" here refer to the Pointer Object or the Pointer Value?
I think the answer is the Pointer Object, but most answers seem to indicate the Pointer Value.
Interpretation A: Pointer Object
My thoughts are as follows: A pointer itself is an object. According to 6.2.5 #28, different pointer types may have different representation and alignment requirements. Therefore, according to 6.3.2.3 #7, as long as two pointer types have the same alignment, they can be safely converted without undefined behavior, but there is no guarantee that the result can be dereferenced.
Express this idea in a program:
#include <stdio.h>
int main() {
char buf[4096];
char *pc = buf;
if (_Alignof(char *) == _Alignof(int *)) {
// cast safely, because they have the same alignment requirement?
int *pi = (int *) pc;
printf("pi: %p\n", (void *) pi);
} else {
printf("char * and int * don't have the same alignment.\n");
}
}
Interpretation B: Pointer Value
However, if the C11 standard is talking about the Pointer Value for the referenced type rather than the Pointer Object, the alignment check in the above code is meaningless.
Express this idea in a program:
#include <stdio.h>
int main() {
char buf[4096];
char *pc = buf;
/*
* undefined behavior, because:
* align of char is 1
* align of int is 4
*
* and we don't know whether the `value` of pc is 4-aligned.
*/
int *pi = (int *) pc;
printf("pi: %p\n", (void *) pi);
}
Which interpretation is correct?
Interpretation B is correct. The standard is talking about a pointer to an object, not the object itself. "Resulting pointer" is referring to the result of the cast, and a cast does not produce an lvalue, so it's referring to the pointer value after the cast.
Taking the code in your example, suppose that an int must be aligned on a 4-byte boundary, i.e. its address must be a multiple of 4. If the address of buf is 0x1001 then converting that address to int * is invalid because the pointer value is not properly aligned. If the address of buf is 0x1000 then converting it to int * is valid.
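The "multiple of 4" test can be sketched as a check on the pointer value itself. This is only an illustration under an assumption: is_aligned is a hypothetical helper, and it relies on uintptr_t being available and on the integer it yields reflecting the numeric address, which the standard leaves implementation-defined but which holds on common flat-address platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if the address in p is a multiple of align. */
static int is_aligned(const void *p, size_t align)
{
    assert(align != 0);
    /* Assumption: uintptr_t exists and reflects the numeric address. */
    return (uintptr_t) p % align == 0;
}
```

A caller could test is_aligned(buf, _Alignof(int)) before converting buf to int *; the check says nothing about whether dereferencing the result is allowed.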
Update:
The code you added addresses the alignment issue, so it's fine in that regard. It however has a different issue: it violates strict aliasing.
The array you defined contains objects of type char. By casting the address to a different pointer type and subsequently dereferencing the converted pointer, you're accessing objects of one type as objects of another type. This is not allowed by the C standard.
Though the term "strict aliasing" is not used in the standard, the concept is described in section 6.5 paragraphs 6 and 7:
6 The effective type of an object for an access to its stored value is the declared type of the object, if any.87) If a
value is stored into an object having no declared type through an
lvalue having a type that is not a character type, then the type of
the lvalue becomes the effective type of the object for that access
and for subsequent accesses that do not modify the stored value. If a
value is copied into an object having no declared type using memcpy
or memmove, or is copied as an array of character type, then the
effective type of the modified object for that access and for
subsequent accesses that do not modify the value is the effective type
of the object from which the value is copied, if it has one. For all
other accesses to an object having no declared type, the effective
type of the object is simply the type of the lvalue used for the
access.
7 An object shall have its stored value accessed only by an lvalue expression that has one of the following types:88)
a type compatible with the effective type of the object,
a qualified version of a type compatible with the effective type of the object,
a type that is the signed or unsigned type corresponding to the effective type of the object,
a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a
subaggregate or contained union), or
a character type.
...
87 ) Allocated objects have no declared type.
88 ) The intent of this list is to specify those circumstances in which
an object may or may not be aliased.
In your example, you're writing an unsigned long and a double on top of char objects. Neither of these types satisfies the conditions of paragraph 7.
In addition to that, the pointer arithmetic here is not valid:
*(((double *) buf) + 2) = 1.618;
This treats buf as an array of double, which it is not. At the very least, you would need to perform the necessary arithmetic on buf directly and cast the result at the end.
So why is this a problem for a char array and not a buffer returned by malloc? Because memory returned from malloc has no effective type until you store something in it, which is what paragraph 6 and footnote 87 describe.
So from a strict point of view of the standard, what you're doing is undefined behavior. But depending on your compiler you may be able to disable strict aliasing so this will work. If you're using gcc, you'll want to pass the -fno-strict-aliasing flag.
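For comparison, here is a sketch of a variant that stays within the aliasing rules by copying the bytes with memcpy and doing the offset arithmetic on the char pointer. store_double and load_double are hypothetical names. Compilers commonly turn a fixed-size memcpy like this into a plain load or store, but that is an optimization, not a guarantee, so whether it meets the efficiency constraint would need measuring.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: store a double at a byte offset inside a char buffer. */
static void store_double(char *buf, size_t offset, double value)
{
    assert(buf != NULL);
    /* Offset arithmetic is done on the char pointer itself. */
    memcpy(buf + offset, &value, sizeof value);
}

/* Hypothetical helper: read the double back from the same byte offset. */
static double load_double(const char *buf, size_t offset)
{
    double value;
    assert(buf != NULL);
    memcpy(&value, buf + offset, sizeof value);
    return value;
}
```

The caller must keep offset + sizeof(double) within the buffer; the helpers do not check it.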
The Standard does not require that implementations consider the possibility that code will ever observe a value in a T* that is not aligned for type T. In clang, for example, when targeting platforms whose "larger" load/store instructions do not support unaligned access, converting a pointer into a type whose alignment it doesn't satisfy and then using memcpy on it may result in the compiler generating code which will fail if the pointer isn't aligned, even though memcpy itself would not otherwise impose any alignment requirements.
When targeting an ARM Cortex-M0 or Cortex-M3, for example, given:
#include <string.h>
void test1(long long *dest, long long *src)
{
memcpy(dest, src, sizeof (long long));
}
void test2(char *dest, char *src)
{
memcpy(dest, src, sizeof (long long));
}
void test3(long long *dest, long long *src)
{
*dest = *src;
}
clang will generate for both test1 and test3 code which would fail if src or dest were not aligned, but for test2 it will generate code which is bigger and slower, but which will support arbitrary alignment of the source and destination operands.
To be sure, even on clang the act of converting an unaligned pointer into a long long* won't generally cause anything weird to happen by itself, but it is the fact that such a conversion would produce UB that exempts the compiler from any responsibility to handle the unaligned-pointer case in test1.
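Following the test2 pattern, a read from a possibly unaligned address can be sketched by keeping the pointer as a character pointer all the way to memcpy (load_ll is a hypothetical helper name):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: read a long long from a possibly unaligned address. */
static long long load_ll(const unsigned char *src)
{
    long long value;
    assert(src != NULL);
    /* src never becomes a long long *, so no alignment is promised. */
    memcpy(&value, src, sizeof value);
    return value;
}
```

The cost is the one described above: on targets without unaligned loads, the generated code may be larger and slower.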

Why void pointer if pointers can be casted into any type(in c)?

I want to understand the real need for a void pointer. For example, in the following code I use casting to be able to use the same pointer in different ways, so why is there really a void pointer if anything can be cast?
#include <stdio.h>
int main()
{
int x = 0xAABBCCDD;
int * y = &x;
short * c = (short *)y;
char * d = (char*)y;
*c = 0;
printf("x is %x\n",x);//aabb0000
d +=2;
*d = 0;
printf("x is %x\n",x);//aa000000
return 0;
}
Converting any pointer type to any other pointer type is not supported by base C (that is, C without any extensions or behavior not required by the C standard). The 2018 C standard says in clause 6.3.2.3, paragraph 7:
A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer…
In that passage, we see two limitations:
If the pointer is not properly aligned, the conversion may fail in various ways. In your example, converting an int * to a short * is unlikely to fail since int typically has stricter alignment than short. However, the reverse conversion is not supported by base C. Say you define an array with short x[20]; or char x[20];. Then the array will be aligned as needed for a short or char, but not necessarily as needed for an int, in which case the behavior of (int *) x would not be defined by the C standard.
The value that results from the conversion is mostly unspecified. This passage only guarantees that converting it back yields the original pointer (or something equivalent). It does not guarantee you can do anything useful with the pointer without converting it back; you cannot necessarily use a pointer converted from int * to access a short.
The standard does make some additional guarantees about certain pointer conversions. One of them is in the continuation of the passage above:
… When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object.
So you can use a pointer converted from int * to access the individual bytes that represent an int, and you can do the same to access the bytes of any other object type. But that guarantee is made only for accessing the individual bytes with a character type, not with a short type.
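A small sketch of that byte-wise access (count_nonzero_bytes is a hypothetical helper, chosen so the result does not depend on byte order):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the nonzero bytes in any object's representation. */
static size_t count_nonzero_bytes(const void *obj, size_t size)
{
    const unsigned char *bytes = obj; /* points to the lowest addressed byte */
    size_t count = 0;
    assert(obj != NULL);
    for (size_t i = 0; i < size; ++i) {
        if (bytes[i] != 0) {
            ++count;
        }
    }
    return count;
}
```

It works for an int, a double, or a struct alike, because only the unsigned char lvalue bytes[i] is used to read the object.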
From the above, we know that after the short * c = (short *)y; in your example, c does not necessarily point to any part of the x it originated from: the value resulting from the pointer conversion is not guaranteed to work as a short * at all. But, even if it does point to the place where x is, base C does not support using c to access those bytes, because 6.5 7 says:
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
— a type compatible with the effective type of the object,
— a qualified version of a type compatible with the effective type of the object,
— a type that is the signed or unsigned type corresponding to the effective type of the object,
— a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
— an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
— a character type.
So the *c = 0; in your example is not supported by C for two reasons: c does not necessarily point to any part of x or to any valid address, and, even if it does, the behavior of modifying part of the int x using short type is not defined by the C standard. It might appear to work in your C implementation, and it might even be supported by your C implementation, but it is not strictly conforming C code.
The C standard provides the void * type for use when a specific type is inadequate. 6.3.2.3 1 makes a similar guarantee for pointers to void as it does for pointers to objects:
A pointer to void may be converted to or from a pointer to any object type. A pointer to any object type may be converted to a pointer to void and back again; the result shall compare equal to the original pointer.
void * is used with routines that must work with arbitrary object types, such as qsort. char * could serve this purpose, but it is better to have a separate type that clearly denotes no specific type is associated with it. For example, if the parameter to a function were char *p, the function could inadvertently use *p and get a character that it does not want. If the parameter is void *p, then the function must convert the pointer to a specific type before using it to access an object. Thus having a special type for “generic pointers” can help avoid errors as well as indicate intent to people reading the code.
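The qsort case can be sketched as follows (compare_ints and sort_ints are hypothetical names; qsort itself is the standard <stdlib.h> function). The comparator receives const void * and has to convert back to the element type before it can read anything:

```c
#include <assert.h>
#include <stdlib.h>

/* Comparator in the shape qsort expects: both arguments arrive as const void *. */
static int compare_ints(const void *a, const void *b)
{
    const int *x = a; /* convert back to the real element type before use */
    const int *y = b;
    return (*x > *y) - (*x < *y);
}

/* Hypothetical helper: sort an int array in ascending order. */
static void sort_ints(int *values, size_t count)
{
    assert(values != NULL);
    qsort(values, count, sizeof *values, compare_ints);
}
```

The conversions inside compare_ints need no cast, since void * converts implicitly to any object pointer type in C.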
Why void pointer if pointers can be casted into any type(in c)?
C does not specify that void * can be cast into a pointer of any type. A void * may be cast into a pointer to any object type. In other words, a void * may be insufficient to completely store a function pointer.
need of having a void pointer
A void * is a universal pointer for object types. Setting aside pointers to const, volatile, etc. concerns, functions like malloc(), memset() provide universal ways to allocate and move/set data.
On more unusual architectures, an int *, a void * and other pointer types may have different sizes and interpretations. void * is the common pointer type for objects, complete enough to store the information needed to reconstitute the original pointer, regardless of the object type pointed to.

How is "signed or unsigned type" meant in this C90 undefined behaviour definition?

In the ANSI C90 standard, section 6.3 has this to say about expressions:
An object shall have its stored value accessed only by an lvalue that has one of the following types: [...] a type that is the signed or unsigned type corresponding to a qualified version of the declared type of the object
And there is this instance of undefined behaviour in Annex G.2:
The behavior in the following circumstances is undefined: [...] An object has its stored value accessed by an lvalue that does not have one of the following types: the declared type of the object, a qualified version of the declared type of the object, the signed or unsigned type corresponding to the declared type of the object, the signed or unsigned type corresponding to a qualified version of the declared type of the object, an aggregate or union type that (recursively) includes one of the aforementioned types among its members, or a character type (6.3).
I find the wording of the emphasised parts ambiguous and am struggling to interpret it.
Does it mean "the signed type corresponding to the original type if it was signed, or the unsigned type corresponding to the original type if it was unsigned"; or "the type (whether signed or unsigned doesn't matter) corresponding to the original type"? That is, is:
signed int a = -10;
unsigned int b = *((unsigned int *) &a);
...undefined?
If signed/unsigned doesn't matter, given that the standard makes the distinction between the three types char, signed char, and unsigned char, would accessing a char via signed char * or unsigned char * be defined?
It's saying that it's not undefined behavior to cast the value to a different signedness. If the object is declared signed int, you can access it using an unsigned int lvalue, and vice versa.
The case where the signedness is the same is already covered when it says "the declared type of the object", although this case could also be considered to say that.
In the case of char, both signed char and unsigned char are "the signed or unsigned type corresponding to" that type.
All together it's just saying that the signedness of the lvalue doesn't affect whether the access is well-defined.
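A minimal sketch of such an access (read_as_unsigned is a hypothetical helper). The access itself is permitted by the rule quoted above; the value obtained for a negative int depends on the representation, and the comment assumes two's complement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: read an int object through an unsigned int lvalue. */
static unsigned int read_as_unsigned(const int *p)
{
    assert(p != NULL);
    /* Allowed: unsigned int is the unsigned type corresponding to int.
       Assumption: on two's complement, a negative value reads back as
       the same result a value conversion to unsigned int would give. */
    return *(const unsigned int *) p;
}
```

For a non-negative int the result is simply the same number.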
Please note that Annex G is informative and the relevant part to quote is normative C90 6.3.
This refers to the precursor to the "strict aliasing rule" later introduced in C99. In C90, it was ambiguous what to do with objects that had no type, such as the data pointed at by the return from malloc.
It means that if the type of the object is either signed int or unsigned int, you can do an lvalue access either with signed int* or unsigned int*. These two pointer types are allowed to alias. So for example if you have a function like this:
void func (signed int* a, unsigned int* b)
then the compiler cannot assume that a and b point to different objects.
(Note that wildly exotic systems can in theory have padding bits and trap representations for signed types, so accessing an unsigned int through a signed int* could be UB for other reasons, in theory.)
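To make the consequence concrete, here is a sketch with a body filled in for illustration (store_both is a hypothetical function, not the func from the answer): because a and b may alias, the value returned has to be re-read after the store through b.

```c
#include <assert.h>
#include <stddef.h>

/* a and b may refer to the same object, so the final read of *a
   must observe the store made through b. */
static int store_both(signed int *a, unsigned int *b)
{
    assert(a != NULL && b != NULL);
    *a = 1;
    *b = 2u;
    return *a;
}
```

Called with both arguments pointing at the same int, this returns 2; called with distinct objects, it returns 1.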
The character types are indeed a special case compared to other integer types. But it doesn't matter here, since the rule has a special case too: "or a character type". char, unsigned char and signed char are all character types. This means that lvalue access through any of these three types is well-defined.
The object being accessed doesn't even need to have a character type! You can, for example, access an int through a signed char * lvalue and it is well-defined, but not the other way around.
When C89 was written, unsigned types were a sufficiently new addition to the language that a lot of code used int in places where unsigned--once it existed--would have made more sense. The authors of the Standard wanted to ensure that functions that used the newer unsigned type would be able to exchange data with those that had been written to use int because unsigned hadn't existed yet.
The Standard is a bit ambiguous as to whether a type like unsigned* has a "corresponding signed type" int*, or unsigned** has a "corresponding unsigned type" int**, etc. Given the purpose of allowing interaction between code that predates unsigned types and code that uses them, making a function that's written to operate on sequences of int* unusable by clients that have sequences of unsigned* would be contrary to that purpose and also to the Committee's charter. Upholding the stated purpose wouldn't require that int** be universally usable to access objects of type unsigned*, but would require that compilers given constructs like:
unsigned *foo[10];
actOnIntPtrs((int**)foo, 10);
recognize that the called function might affect objects of type unsigned* stored in foo.

C Guarantees Re: Pointers to Void and Character Types, Typecasting

My best-effort reading of the C specification (C99, primarily) makes me think that it is valid to cast (or implicitly convert, where void *'s implicit conversion behavior applies), between any of these types:
void *, char *, signed char *, unsigned char *
I expect that this will trigger no undefined behavior, and that those pointers are guaranteed to have the same underlying representation.
Consequently, it should be possible to take a pointer of either one of those four types that is already pointing to an address which can be legally dereferenced, typecast and/or assign it to one of the three char type pointers, and dereference it to access the same memory, with the only difference being whether your code will treat the data at that location as a char, signed char, or unsigned char.
Is this correct? Is there any version of the C standard (lack of void * type in pre-standardization C notwithstanding) where this is not true?
P.S. I believe that this question is answered piecemeal in passing in a lot of other questions, but I've never seen a single clear answer where this is explicitly stated/confirmed.
Consequently, it should be possible to take a pointer of either one of those four types that is already pointing to an address which can be legally dereferenced, typecast and/or assign it to one of the three char type pointers, and dereference it to access the same memory, with the only difference being whether your code will treat the data at that location as a char, signed char, or unsigned char.
This is correct. In fact you could take a valid pointer to an object of any type, convert it to any of those three, and access the memory.
You correctly mention the provision about void * and char * etc. having the same representation and alignment requirements, but that actually does not matter. That refers to the properties of the pointer itself, not the properties of the objects being pointed to.
The strict aliasing rule is not violated because that contains an explicit provision that a character type may be used to read or write any object.
Note that if we have, for example, signed char ch = -2; (or any other negative value), then (unsigned char)ch may differ from *(unsigned char *)&ch. On a system with 8-bit characters, the former is guaranteed to be 254, but the latter could be 254, 253, or 130 depending on whether the implementation uses two's complement, ones' complement, or sign-magnitude representation.
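The two expressions can be set side by side in a short sketch (both function names are hypothetical). On a system with 8-bit characters the first yields 254 for -2 on any implementation; the second yields 254 only where signed char uses two's complement, which is an assumption here.

```c
#include <assert.h>
#include <stddef.h>

/* Value conversion: reduces modulo UCHAR_MAX + 1, independent of representation. */
static unsigned char convert_value(signed char ch)
{
    return (unsigned char) ch;
}

/* Representation read: reinterprets the stored bits of the same object. */
static unsigned char read_representation(const signed char *p)
{
    assert(p != NULL);
    return *(const unsigned char *) p;
}
```

On two's complement implementations the two functions agree for every input.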

Passing an int to a function that expects char*

I want to pass an int to a function that expects a char *.
The function header looks like this:
void swapEndian(char *pElement, unsigned int uNumBytes)
and calling the function with an int (I know, this won't work..)
int numLine;
swapEndian(numLine, sizeof(int));
So what must I do to numLine to pass it to this function?
It sounds as though you just want:
swapEndian( (char*) &numLine, sizeof(int) );
The char* cast already suggested will work:
swapEndian( (char*) &numLine, sizeof(int) );
...however if you can change swapEndian to take a void* it would be better since it avoids needing the cast (this is basically what void pointers are meant for) and also avoids any potential problems or potential future problems to do with the casting operation itself...
If swapEndian works with pointers to any object type, then you can pass your pointer because, under the strict aliasing rule, a pointer to a character type can be dereferenced regardless of the type of the pointed-to object. You just need a cast.
swapEndian((char *)&numLine, sizeof numLine);
C11 (n1570), § 6.5 Expressions
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
— a type compatible with the effective type of the object,
— a qualified version of a type compatible with the effective type of the object,
— a type that is the signed or unsigned type corresponding to the effective type of the object,
— a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
— an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
— a character type.
In addition to the other answers: assuming swapEndian does what its name suggests, I believe it is wrongly declared. I suggest declaring it (if you can) as
void swapEndian(void *pElement, size_t uNumBytes);
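The thread never shows the body of swapEndian, so the following is only a guess at what a function with that name and the suggested signature might do: reverse the bytes of an object in place through an unsigned char pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical implementation: reverse the bytes of an object in place. */
void swapEndian(void *pElement, size_t uNumBytes)
{
    unsigned char *bytes = pElement; /* void * converts implicitly in C */
    assert(pElement != NULL);
    for (size_t lo = 0, hi = uNumBytes; lo + 1 < hi; ++lo) {
        --hi;
        unsigned char tmp = bytes[lo];
        bytes[lo] = bytes[hi];
        bytes[hi] = tmp;
    }
}
```

With this declaration the call becomes swapEndian(&numLine, sizeof numLine); with no cast at the call site.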

Resources