Implement `memcpy()`: Is `unsigned char *` needed, or just `char *`? - c

I was implementing a version of memcpy() to be able to use it with volatile.
Is it safe to use char * or do I need unsigned char *?
volatile void *memcpy_v(volatile void *dest, const volatile void *src, size_t n)
{
    const volatile char *src_c = (const volatile char *)src;
    volatile char *dest_c = (volatile char *)dest;
    for (size_t i = 0; i < n; i++) {
        dest_c[i] = src_c[i];
    }
    return dest;
}
I think unsigned should be necessary to avoid overflow problems if the data in any cell of the buffer is > INT8_MAX, which I think might be UB.

In theory, your code might run on a machine which forbids one bit pattern in a signed char. It might use ones' complement or sign-magnitude representations of negative integers, in which one bit pattern would be interpreted as a 0 with a negative sign. Even on two's-complement architectures, the standard allows the implementation to restrict the range of negative integers so that INT_MIN == -INT_MAX, although I don't know of any actual machine which does that.
So, according to §6.2.6.2p2, there may be one signed character value which an implementation might treat as a trap representation:
Which of these [representations of negative integers] applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two [sign-magnitude and two's complement]), or with sign bit and all value bits 1 (for ones' complement), is a trap representation or a normal value. In the case of sign and magnitude and ones’ complement, if this representation is a normal value it is called a negative zero.
(There cannot be any other trap values for character types, because §6.2.6.2 requires that signed char not have any padding bits, which is the only other way that a trap representation can be formed. For the same reason, no bit pattern is a trap representation for unsigned char.)
So, if this hypothetical machine has a C implementation in which char is signed, then it is possible that copying an arbitrary byte through a char will involve copying a trap representation.
For signed integer types other than char (if it happens to be signed) and signed char, reading a value which is a trap representation is undefined behaviour. But §6.2.6.1/5 allows reading and writing these values for character types only:
Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined. Such a representation is called a trap representation. (Emphasis added)
(The third sentence is a bit clunky, but to simplify: storing a value into memory is a "side effect that modifies all of the object", so it's permitted as well.)
In short, thanks to that exception, you can use char in an implementation of memcpy without worrying about undefined behaviour.
However, the same is not true of strcpy. strcpy must check for the trailing NUL byte which terminates a string, which means it needs to compare the value it reads from memory with 0. And the comparison operators (indeed, all arithmetic operators) first perform integer promotion on their operands, which will convert the char to an int. Integer promotion of a trap representation is undefined behaviour, as far as I know, so on the hypothetical C implementation running on the hypothetical machine, you would need to use unsigned char in order to implement strcpy.

Is it safe to use char * or do I need unsigned char *?
Perhaps
"String handling" functions such as memcpy() have the specification:
For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value). C11dr §7.24.1 3
Using unsigned char is the specified "as if" type. Little to be gained attempting others - which may or may not work.
Using char with memcpy() may work, but extending that paradigm to other like functions leads to problems.
A single big reason to avoid char for str...() and mem...() like functions is that sometimes it makes a functional difference unexpectedly.
memcmp(), strcmp() certainly differ with (signed) char vs. unsigned char.
Pedantic: On relic non-2's-complement machines with a signed char, only '\0' should end a string. Yet negative zero compares equal to 0 too, and a char holding negative zero should not indicate the end of a string.
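A small demonstration of that memcmp() difference (a minimal sketch, assuming CHAR_BIT == 8; the first result holds on implementations where char is signed):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char a = (char)0x80;  /* -128 where char is signed, 128 where unsigned */
    char b = 0x01;
    /* direct char comparison: negative vs. positive where char is signed */
    printf("a < b as char:   %d\n", a < b);                  /* 1 on signed-char platforms */
    /* memcmp() compares as unsigned char: 0x80 (128) > 0x01 (1) */
    printf("memcmp(a,b) > 0: %d\n", memcmp(&a, &b, 1) > 0);  /* 1 everywhere */
    return 0;
}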

You do not need unsigned.
Like so:
volatile void *memcpy_v(volatile void *dest, const volatile void *src, size_t n)
{
    const volatile char *src_c = (const volatile char *)src;
    volatile char *dest_c = (volatile char *)dest;
    for (size_t i = 0; i < n; i++) {
        dest_c[i] = src_c[i];
    }
    return dest;
}
Attempting to make a conforming implementation where char has a trap value will eventually lead to a contradiction:
fopen("", "rb") does not require use of only fread() and fwrite().
fgets() takes a char * as its first argument and can be used on binary files.
strlen() finds the distance to the next null byte from a given char *. Since fgets() is guaranteed to have written one, it will not read past the end of the array and therefore will not trap.

The unsigned is not needed, but there is no reason to use plain char for this function. Plain char should only be used for actual character strings. For other uses, the types unsigned char or uint8_t and int8_t are more precise as the signedness is explicitly specified.
If you want to simplify the function code, you can remove the casts:
volatile void *memcpy_v(volatile void *dest, const volatile void *src, size_t n) {
    const volatile unsigned char *src_c = src;
    volatile unsigned char *dest_c = dest;
    for (size_t i = 0; i < n; i++) {
        dest_c[i] = src_c[i];
    }
    return dest;
}

Related

Assignment: create my own memcpy. Why cast the destination and source pointers to unsigned char* instead of char*? [duplicate]

I'm trying to create my own versions of C functions, and when I got to memcpy and memset I assumed that I should cast the destination and source pointers to char *. However, I've seen many examples where the pointers were cast to unsigned char * instead. Why is that?
void *mem_cpy(void *dest, const void *src, size_t n) {
    if (dest == NULL || src == NULL)
        return NULL;
    int i = 0;
    char *dest_arr = (char *)dest;
    char *src_arr = (char *)src;
    while (i < n) {
        dest_arr[i] = src_arr[i];
        i++;
    }
    return dest;
}
It doesn't matter for this case, but a lot of folks working with raw bytes will prefer to explicitly specify unsigned char (or with stdint.h types, uint8_t) to avoid weirdness if they have to do math with the bytes. char has implementation-defined signedness, and that means, when the integer promotions & usual arithmetic conversions are applied, a char with the high bit set is treated as a negative number if signed, and a positive number if unsigned.
While neither behavior is necessarily wrong for a given problem, the fact that the behavior can change between compilers or even with different flags set on the same compiler, means you often need to be explicit about signedness, using either signed char or unsigned char as appropriate, and 99% of the time, the behaviors of unsigned char are what you want, so people tend to default to it even when it's not strictly required.
There's no particular reason in this specific case, it's mostly stylistic.
But in general it is always best to stick to unsigned arithmetic when dealing with raw data. That is: unsigned char or uint8_t.
The char type is problematic because it has implementation-defined signedness and is therefore avoided in such code. Is char signed or unsigned by default?
NOTE: this is dangerous and poor style:
char *src_arr = (char *)src;
(And the cast hid the problem underneath the carpet)
Since you correctly used "const correctness" for src, the correct type is const char *src_arr; and I'd change the code to:
unsigned char *dest_arr = dest;
const unsigned char *src_arr = src;
A good rule of thumb for beginners is to never use a cast. I'm serious. Some 90% of all casts we see on SO in beginner-level programs are wrong, in one way or the other.
Btw (advanced topic) there's a reason why memcpy has this prototype:
void *memcpy(void * restrict s1,
             const void * restrict s2,
             size_t n);
The restrict qualifier on the pointers tells the user of the function "hey, I'm counting on you not to pass two pointers to the same object or pointers that may overlap". Doing so would cause problems in various situations and for various targets, so this is a good idea.
It's much more likely that the user passes overlapping pointers than null pointers, so if you are to have slow, superfluous error checking against NULL, you should also restrict-qualify the pointers.
If the user passes null pointers I'd just let the function crash, instead of slowing it down with extra branches that are pointless bloat in some 99% of all use cases.
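To illustrate the overlap hazard that restrict documents (a sketch, not taken from the answer above): memcpy() on overlapping regions is undefined behaviour, while memmove() is specified to handle it.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[] = "abcdef";
    /* [buf, buf+5) and [buf+1, buf+6) overlap, so memcpy(buf + 1, buf, 5)
       would violate the restrict contract; memmove() handles overlap */
    memmove(buf + 1, buf, 5);
    printf("%s\n", buf);  /* prints "aabcde" */
    return 0;
}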
Why ... unsigned char* instead of char*?
Short answer: Because the functionality differs in select operations when char is signed, and the C spec specifies unsigned char semantics for the str...() and mem...() functions.
When does it make a difference?
When a function (like memcmp(), strcmp(), etc.) compares for order and one byte is negative while the other is positive, the relative order of the two bytes differs. Example: -1 < 1, yet when viewed as an unsigned char: 255 > 1.
When does it not make a difference?
When copying data and comparing for equality*1.
Non-2's complement
*1 Ones' complement and sign-magnitude encodings are expected to be dropped in the upcoming version, C2x. Until then, those signed encodings support two zeros. For the str...() and mem...() functions, C specifies data access as unsigned char. This means only the +0 is a null character, and order depends on the pure binary, unsigned encoding.

What is the best way to represent characters in C?

I know that a char is allowed to be signed or unsigned depending on the implementation. This doesn't really bother me if all I want to do is manipulate bytes. (In fact, I don't think of the char datatype as a character, but a byte).
But, if I understand, string literals are signed chars (actually they're not, but see the update below), and the function fgetc() returns unsigned chars casted into int. So if I want to manipulate characters, is it preferred style to use signed, unsigned, or ambiguous characters? Why does reading characters from a file have a different convention than literals?
I ask because I have some code in c that does string comparison between string literals and the contents of files, but having a signed char * vs unsigned char * might really make my code error prone.
Update 1
Ok, as a few people pointed out (in answers and comments), string literals are in fact char arrays, not signed char arrays. That means I really should use char * for string literals, and not think about whether they are signed or unsigned. This makes me perfectly happy (until I have to start making conversions/comparisons with unsigned chars).
However the important question remains, how do I read characters from a file, and compare them to a string literal. The crux of which is the conversion from the int read using fgetc(), which explicitly reads an unsigned char from the file, to the char type, which is allowed to be either signed or unsigned.
Allow me to provide a more detailed example.
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *someFile = fopen("ThePathToSomeRealFile.html", "r");
    assert(someFile);
    char substringFromFile[25];
    memset((void*)substringFromFile, 0, sizeof(substringFromFile));
    //Alright, the real example is to read the first few characters from the file
    //And then compare them to the string I expect
    const char *expectedString = "<!DOCTYPE";
    for( size_t counter = 0; counter < strlen(expectedString); ++counter )
    {
        //Read it as an integer, because the function returns an `int`
        const int oneCharacter = fgetc(someFile);
        if( ferror(someFile) )
            return EXIT_FAILURE;
        if( oneCharacter == EOF || feof(someFile) )
            break;
        assert(counter < sizeof(substringFromFile)/sizeof(*substringFromFile));
        //HERE IS THE PROBLEM:
        //I know the data contained in oneCharacter must be an unsigned char
        //Therefore, this is valid
        const unsigned char uChar = (const unsigned char)oneCharacter;
        //But then how do I assign it to the char?
        substringFromFile[counter] = (char)oneCharacter;
    }
    //and ultimately here's my goal
    int headerIsCorrect = strncmp(substringFromFile, expectedString, 9);
    if(headerIsCorrect == 0)
        return EXIT_SUCCESS;
    //else
    return EXIT_FAILURE;
}
Essentially, I know my fgetc() function is returning something that (after some error checking) is code-able as an unsigned char. I know that char may or may not be an unsigned char. That means, depending on the implementation of the C standard, doing a cast to char will involve no reinterpretation. However, in the case that the system is implemented with a signed char, I have to worry about values that can be coded by an unsigned char that aren't code-able by char (i.e. those values in (INT8_MAX, UINT8_MAX]).
tl;dr
The question is this: should I (1) copy the underlying data read by fgetc() (by casting pointers - don't worry, I know how to do that), or (2) cast down from unsigned char to char (which is only safe if I know that the values can't exceed INT8_MAX, or those values can be ignored for whatever reason)?
The historical reasons are (as I've been told, I don't have a reference) that the char type was poorly specified from the beginning.
Some implementations used "consistent integer types" where char, short, int and so on were all signed by default. This makes sense because it makes the types consistent with each other.
Other implementations used unsigned for character, since there never existed any symbol tables with negative indices (that would be stupid) and since they saw a need for more than 128 characters (a very valid concern).
By the time C got standardized properly, it was too late to change this, too many different compilers and programs written for them were already out on the market. So the signedness of char was made implementation-defined, for backwards compatibility reasons.
The signedness of char does not matter if you only use it to store characters/strings. It only matters when you decide to involve the char type in arithmetic expressions or use it to store integer values - this is a very bad idea.
For characters/string, always use char (or wchar_t).
For any other form of 1 byte large data, always use uint8_t or int8_t.
But, if I understand, string literals are signed char
No, string literals are char arrays.
the function fgetc() returns unsigned chars casted into int
It returns the character as an unsigned char converted to an int. The return type is int because it must also be able to represent EOF, which is an integer constant and not a character constant.
having a signed char * vs unsigned char * might really make my code error prone.
No, not really. Formally, this rule from the standard applies:
A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer.
There exists no case where casting from pointer to signed char to pointer to unsigned char or vice versa, would cause any alignment issues or other issues.
I know that a char is allowed to be signed or unsigned depending on the implementation. This doesn't really bother me if all I want to do is manipulate bytes.
If you're going to do comparisons or assign char to other integer types, it should bother you.
But, if I understand, string literals are signed chars
They are of type char[], so if char is equivalent to unsigned char, all string literals are effectively unsigned char[].
the function fgetc() returns unsigned chars casted into int.
That's correct, and it is required to avoid undesired sign extension.
So if I want to manipulate characters, is it preferred style to use signed, unsigned, or ambiguous characters?
For portability I'd advise following the practice adopted by various libc implementations: use char, but before processing cast to unsigned char (char* to unsigned char*). This way implicit integer promotions won't turn characters in the range 0x80 -- 0xff into negative numbers of wider types.
In short: (signed char)a < (signed char)b is NOT always equivalent to (unsigned char)a < (unsigned char)b. Here is an example.
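A minimal illustration (the exact value of (char)0x80 where char is signed is implementation-defined, but on common two's-complement machines it is -128):

#include <stdio.h>

int main(void)
{
    char a = (char)0x80;  /* commonly -128 as signed char, 128 as unsigned char */
    char b = 0x01;
    printf("%d\n", (signed char)a < (signed char)b);     /* typically 1 */
    printf("%d\n", (unsigned char)a < (unsigned char)b); /* 0 */
    return 0;
}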
Why does reading characters from a file have a different convention than literals?
getc() needs a way to return EOF such that it couldn't be confused with any real char.
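That convention is what makes the canonical read loop work; a minimal sketch, keeping the result in an int so EOF stays distinguishable from every valid byte value:

#include <stdio.h>

int main(void)
{
    int c;  /* int, not char: a char would fold EOF onto a real character */
    while ((c = getchar()) != EOF)
        putchar(c);
    return 0;
}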

Is the strict aliasing rule really a "two-way street"?

In these comments user #Deduplicator insists that the strict aliasing rule permits access through an incompatible type if either of the aliased or the aliasing pointer is a pointer-to-character type (qualified or unqualified, signed or unsigned char *). So, his assertion is basically that both
long long foo;
char *p = (char *)&foo;
*p; // just in order to dereference 'p'
and
char foo[sizeof(long long)];
long long *p = (long long *)&foo[0];
*p; // just in order to dereference 'p'
are conforming and have defined behavior.
On my reading, however, only the first form is valid, that is, when the aliasing pointer is a pointer-to-char; one can't do it in the other direction, i.e. when the aliasing pointer points to an incompatible type (other than a character type) and the aliased pointer is a char *.
So, the second snippet above would have undefined behavior.
What's the case? Is this correct? For the record, I have already read this question and answer, and there the accepted answer explicitly states that
The rules allow an exception for char *. It's always assumed that char * aliases other types. However this won't work the other way, there's no assumption that your struct aliases a buffer of chars.
(emphasis mine)
You are correct to say that this is not valid. As you yourself have quoted (so I shall not re-quote here), the guaranteed-valid cast is only from any other type to char *.
The other form is indeed against the standard and causes undefined behaviour. However, as a little bonus, let us discuss some of the background behind this rule.
char is, on every significant architecture, the only type that allows completely unaligned access; this is because the byte-read instructions have to work on any byte, otherwise they would be all but useless. This means that an indirect read through a char pointer will always be valid on every CPU I know of.
The other way around, however, this does not apply: on most architectures you cannot read a uint64_t unless the pointer is aligned to 8 bytes.
However, there is a very common compiler extension allowing you to cast properly aligned pointers from char to other types and access them; this is non-standard. Also note that if you cast a pointer to any type to a pointer to char and then cast it back, the resulting pointer is guaranteed to compare equal to the original. Therefore this is OK:
struct x *mystruct = MakeAMyStruct();
char *foo = (char *)mystruct;
struct x *mystruct2 = (struct x *)foo;
And mystruct2 will equal mystruct. This also guarantees the struct is properly aligned for its needs.
So basically, if you want a pointer to char and a pointer to another type, always declare the pointer to the other type then cast to char. Or even better use a union, that is what they are basically for...
Note, there is a notable exception to the rule however. Some old implementations of malloc used to return a char*. This pointer is always guaranteed to be castable to any type successfully without breaking aliasing rules.
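A sketch of the union suggestion above (view is a hypothetical name): the union is aligned for its most demanding member, and C (unlike C++) permits inspecting the stored value through the unsigned char member.

#include <stdio.h>

union view {
    long long ll;
    unsigned char bytes[sizeof(long long)];
};

int main(void)
{
    union view v;
    v.ll = 0x1122334455667788LL;
    for (unsigned i = 0; i < sizeof v.bytes; i++)
        printf("%02x ", v.bytes[i]);  /* byte order depends on endianness */
    printf("\n");
    return 0;
}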
Deduplicator is correct. The undefined behaviour that allows compilers to implement "strict aliasing" optimizations doesn't apply when character values are being used to produce a representation of an object.
Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined. Such a representation is called a trap representation.
However your second example has undefined behaviour because foo is uninitialized. If you initialize foo then it only has implementation defined behaviour. It depends on the implementation defined alignment requirements of long long and whether long long has any implementation defined pad bits.
Consider if you change your second example to this:
#include <stdlib.h>

long long bar(void) {
    char *foo = malloc(sizeof(long long));
    char c;
    for (c = 0; c < sizeof(long long); c++)
        foo[c] = c;
    long long *p = (long long *)foo;
    return *p;
}
Now alignment is no longer an issue and this example depends only on the implementation-defined representation of long long. What value is returned depends on the representation of long long, but if that representation is defined as having no pad bits then this function must always return the same value, and it must also always be a valid value. Without pad bits this function can't generate a trap representation, and so the compiler cannot perform any strict-aliasing type optimizations on it.
You have to look pretty hard to find a standard conforming implementation of C that has implementation defined pad bits in any of its integer types. I doubt you'll find one that implements any sort of strict aliasing type of optimization. In other words, compilers don't use the undefined behaviour caused by accessing a trap representation to allow strict-aliasing optimizations because no compiler that implements strict-aliasing optimizations has defined any trap representations.
Note also that had foo been initialized with all zeros ('\0' characters) then this function wouldn't have any undefined or implementation-defined behaviour. An all-bits-zero representation of an integer type is guaranteed not to be a trap representation and guaranteed to have the value 0.
Now for a strictly conforming example that uses char values to create a guaranteed valid (possibly non-zero) representation of a long long value:
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv) {
    int i;
    long long l;
    char *buf;

    if (argc < 2) {
        return 1;
    }
    buf = malloc(sizeof l);
    if (buf == NULL) {
        return 1;
    }
    l = strtoll(argv[1], NULL, 10);
    for (i = 0; i < sizeof l; i++) {
        buf[i] = ((char *) &l)[i];
    }
    printf("%lld\n", *(long long *)buf);
    return 0;
}
This example has no undefined behaviour and is not dependent on the alignment or representation of long long. This is the sort of code that the character type exception on accessing objects was created for. In particular this means that Standard C lets you implement your own memcpy function in portable C code.

Using 'char' variables in bit operations

I use XLookupString, which maps a key event to an ASCII string, keysym, and ComposeStatus.
int XLookupString(event_structure, buffer_return, bytes_buffer, keysym_return, status_in_out)
    XKeyEvent *event_structure;
    char *buffer_return; /* Returns the resulting string (not NULL-terminated). Returned value of the function is the length of the string. */
    int bytes_buffer;
    KeySym *keysym_return;
    XComposeStatus *status_in_out;
Here is my code:
char mykey_string;
int arg = 0;
------------------------------------------------------------
case KeyPress:
    XLookupString( &event.xkey, &mykey_string, 1, 0, 0 );
    arg |= mykey_string;
But when using 'char' variables in bit operations, sign extension can generate unexpected results.
Is it possible to prevent this?
Thanks
char can be either signed or unsigned, so if you need unsigned char you should specify it explicitly; it makes your intention clear to those reading your code, as opposed to relying on compiler settings.
The relevant portion of the c99 draft standard is from 6.2.5 Types paragraph 15:
The three types char, signed char, and unsigned char are collectively called the character types. The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char.
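A minimal sketch of the usual fix: convert the char to unsigned char before the bitwise OR, so a high-bit byte widens to 0x80..0xFF instead of sign-extending (the 0xE9 value here is just an example standing in for what XLookupString stored):

#include <stdio.h>

int main(void)
{
    char mykey_string = (char)0xE9;  /* e.g. a Latin-1 'é' byte from XLookupString */
    int arg = 0;
    /* without the conversion, a signed char would sign-extend to 0xFFFFFFE9 */
    arg |= (unsigned char)mykey_string;
    printf("0x%X\n", arg);  /* prints 0xE9 */
    return 0;
}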

cast content of array into arithmetic type in C

I'm facing weird behavior with casting (or rather, dereferencing) single items from an array into a single arithmetic type.
Heres a reduced test case:
void test1()
{
    unsigned char test[10] = {0};
    unsigned long i = 0xffffffff;
    *((unsigned long *)(&test[3])) = i;
    int it;
    for ( it = 0 ; it < 10 ; it++ )
    {
        printf("%02x ", test[it]);
    }
}

void test2()
{
    unsigned char test[10] = {0};
    unsigned char test2[10] = {0};
    test[2] = 0xFF;
    test[3] = 0xFF;
    *((unsigned short *)(&test2[1])) = *((unsigned short *)(&test[2]));
    int it;
    for ( it = 0 ; it < 10 ; it++ )
    {
        printf("%02x ", test2[it]);
    }
}
In detail it is mainly this expression:
*((unsigned short *)(&test2[1]))
I'm getting access violations on some other platforms (mainly embedded platforms like PIC24).
So my question is: is this conforming C? I can't find anything in the C standard, but maybe I'm just blind.
Do you know any alternatives for doing this operation without such a cast (a byte-by-byte copy loop, unrolled or not, is not what I mean!) and where I don't need to know the byte order of the platform?
Thanks!
*((unsigned short *)(&test2[1]))
This is undefined behavior: you are violating alignment and aliasing rules. Don't do it.
Your test2 object is an array of unsigned char, and through the cast you are accessing its elements as unsigned short objects. There is no guarantee that the alignment requirement of unsigned char is the same as that of unsigned short.
In the C standard you can find information on alignment in 6.3.2.3p7 (C99) and on aliasing rules in 6.5p7.
A good rule of thumb is to always be very wary in the presence of casts on the left side of the = operator.
The line *((unsigned long *)(&test[3])) = i; has undefined behavior. Its effect depends on sizeof(long) and the endianness of your machine.
In general, you should not cast between different pointer types (except to and from void *).
The problem here is almost certainly that you're doing unaligned access. If chars are 1 byte and shorts are 2 (which is likely), then you're doing a write-short operation at an odd address. This is not always supported and is why you're most likely getting an access violation. If you really want to do this (which you probably don't), you could pad the char array by making it one char longer at the front and then just not use that first char (treating the array as 1-indexed rather than 0-indexed). That would probably work on the platforms where the current code doesn't, but even that's not guaranteed.
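As for an alternative that avoids the cast and doesn't depend on byte order, a common-practice sketch (not from the answers above): let memcpy() move the bytes. It has no alignment or aliasing constraints, and compilers typically lower a fixed-size memcpy() to a single load/store where the target allows.

#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char test[10] = {0};
    unsigned char test2[10] = {0};
    unsigned short tmp;

    test[2] = 0xFF;
    test[3] = 0xFF;

    memcpy(&tmp, &test[2], sizeof tmp);   /* read 2 bytes from any alignment */
    memcpy(&test2[1], &tmp, sizeof tmp);  /* write them back at an odd offset */

    for (int it = 0; it < 10; it++)
        printf("%02x ", test2[it]);
    printf("\n");
    return 0;
}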
