Using 'char' variables in bit operations - c

I use XLookupString that map a key event to ASCII string, keysym, and ComposeStatus.
int XLookupString(event_structure, buffer_return, bytes_buffer, keysym_return, status_in_out)
XKeyEvent *event_structure;
char *buffer_return; /* Returns the resulting string (not NULL-terminated). Returned value of the function is the length of the string. */
int bytes_buffer;
KeySym *keysym_return;
XComposeStatus *status_in_out;
Here is my code:
char mykey_string;
int arg = 0;
------------------------------------------------------------
case KeyPress:
XLookupString( &event.xkey, &mykey_string, 1, 0, 0 );
arg |= mykey_string;
But using 'char' variables in bit operations, sign extension can generate unexpected results.
I is possible to prevent this?
Thanks

char can be either signed or unsigned so if you need unsigned char you should specify it explicitly, it makes it clear to those reading you code your intention as opposed to relying on compiler settings.
The relevant portion of the c99 draft standard is from 6.2.5 Types paragraph 15:
The three types char, signed char, and unsigned char are collectively called
the character types. The implementation shall define char to have the same range,
representation, and behavior as either signed char or unsigned char

Related

Comparison between 'int' and 'char'?

The result of the below code is "No" but I am not sure 'Why' ? Searched in google and found some information but I am confused... Can someone explain me ? Thanks !
int i = 23;
char c = -23;
if (i < c)
{
printf("Yes");
}
else
{
printf("No");
}
Unfortunately (imho) char is considered an integer type and you can treat it as an integer type without any explicit cast.
char is a different type than signed char and unsigned char. Whether char is unsigned or signed is implementation defined.
When used in arithmetic operations (including comparisons) integer types with rank less or equal to rank of int undergo integer promotion so your code is equivalent with:
if ((int)i < (int)c)
Another use for char is to access raw memory. C doesn't have a byte type and char is ... the byte type.
Just because you can doesn't mean you should. Use char for ... well ... characters. For memory access a lot of API uses char*, but if you can you should use unsigned char*. For small integers if you really need to save the space use int8_t and uint8_t which should be aliases for signed char and unsigned char respectively.

Implement `memcpy()`: Is `unsigned char *` needed, or just `char *`?

I was implementing a version of memcpy() to be able to use it with volatile.
Is it safe to use char * or do I need unsigned char *?
volatile void *memcpy_v(volatile void *dest, const volatile void *src, size_t n)
{
const volatile char *src_c = (const volatile char *)src;
volatile char *dest_c = (volatile char *)dest;
for (size_t i = 0; i < n; i++) {
dest_c[i] = src_c[i];
}
return dest;
}
I think unsigned should be necessary to avoid overflow problems if the data in any cell of the buffer is > INT8_MAX, which I think might be UB.
In theory, your code might run on a machine which forbids one bit pattern in a signed char. It might use ones' complement or sign-magnitude representations of negative integers, in which one bit pattern would be interpreted as a 0 with a negative sign. Even on two's-complement architectures, the standard allows the implementation to restrict the range of negative integers so that INT_MIN == -INT_MAX, although I don't know of any actual machine which does that.
So, according to §6.2.6.2p2, there may be one signed character value which an implementation might treat as a trap representation:
Which of these [representations of negative integers] applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two [sign-magnitude and two's complement]), or with sign bit and all value bits 1 (for ones' complement), is a trap representation or a normal value. In the case of sign and magnitude and ones’ complement, if this representation is a normal value it is called a negative zero.
(There cannot be any other trap values for character types, because §6.2.6.2 requires that signed char not have any padding bits, which is the only other way that a trap representation can be formed. For the same reason, no bit pattern is a trap representation for unsigned char.)
So, if this hypothetical machine has a C implementation in which char is signed, then it is possible that copying an arbitrary byte through a char will involve copying a trap representation.
For signed integer types other than char (if it happens to be signed) and signed char, reading a value which is a trap representation is undefined behaviour. But §6.2.6.1/5 allows reading and writing these values for character types only:
Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined. Such a representation is called a trap representation. (Emphasis added)
(The third sentence is a bit clunky, but to simplify: storing a value into memory is a "side effect that modifies all of the object", so it's permitted as well.)
In short, thanks to that exception, you can use char in an implementation of memcpy without worrying about undefined behaviour.
However, the same is not true of strcpy. strcpy must check for the trailing NUL byte which terminates a string, which means it needs to compare the value it reads from memory with 0. And the comparison operators (indeed, all arithmetic operators) first perform integer promotion on their operands, which will convert the char to an int. Integer promotion of a trap representation is undefined behaviour, as far as I know, so on the hypothetical C implementation running on the hypothetical machine, you would need to use unsigned char in order to implement strcpy.
Is it safe to use char * or do I need unsigned char *?
Perhaps
"String handling" functions such as memcpy() have the specification:
For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value). C11dr §7.23.1 3
Using unsigned char is the specified "as if" type. Little to be gained attempting others - which may or may not work.
Using char with memcpy() may work, but extending that paradigm to other like functions leads to problems.
A single big reason to avoid char for str...() and mem...() like functions is that sometimes it makes a functional difference unexpectedly.
memcmp(), strcmp() certainly differ with (signed) char vs. unsigned char.
Pedantic: On relic non-2's complement with signed char, only '\0' should end a string. Yet negative_zero == 0 too and a char with negative_zero should not indicate the end of a string.
You do not need unsigned.
Like so:
volatile void *memcpy_v(volatile void *dest, const volatile void *src, size_t n)
{
const volatile char *src_c = (const volatile char *)src;
volatile char *dest_c = (volatile char *)dest;
for (size_t i = 0; i < n; i++) {
dest_c[i] = src_c[i];
}
return dest;
}
Attemping to make a confirming implementation where char has a trap value will eventually lead to a contradiction:
fopen("", "rb") does not require use of only fread() and fwrite()
fgets() takes a char * as its first argument and can be used on binary files.
strlen() finds the distance to the next null from a given char *. Since fgets() is guaranteed to have written one, it will not read past the end of the array and therefore will not trap
The unsigned is not needed, but there is no reason to use plain char for this function. Plain char should only be used for actual character strings. For other uses, the types unsigned char or uint8_t and int8_t are more precise as the signedness is explicitly specified.
If you want to simplify the function code, you can remove the casts:
volatile void *memcpy_v(volatile void *dest, const volatile void *src, size_t n) {
const volatile unsigned char *src_c = src;
volatile unsigned char *dest_c = dest;
for (size_t i = 0; i < n; i++) {
dest_c[i] = src_c[i];
}
return dest;
}

When to use the plain char type in C

In plain C, by the standard there are three distinct "character" types:
plain char which one's signedness is implementation defined.
signed char.
unsigned char.
Let's assume at least C99, where stdint.h is already present (so you have the int8_t and uint8_t types as recommendable alternatives with explicit width to signed and unsigned chars).
For now for me it seems like using the plain char type is only really useful (or necessary) if you need to interface functions of the standard library such as printf, and in all other scenarios, rather to be avoided. Using char could lead to undefined behavior when it is signed on the implementation, and for any reason you need to do any arithmetic on such data.
The problem of using an appropriate type is probably the most apparent when dealing for example with Unicode text (or any code page using values above 127 to represent characters), which otherwise could be handled as a plain C string. However the relevant string.h functions all accept char, and if such data is typed char, that imposes problems when trying to interpret it for example for a display routine capable to handle its encoding.
What is the most recommendable method in such a case? Are there any particular reasons beyond this where it could be recommendable to use char over stdint.h's appropriate fixed-width types?
The char type is for characters and strings. It is the type expected and returned by all the string handling functions. (*) You really should never have to do arithmetic on char, especially not the kind where signed-ness would make a difference.
unsigned char is the type to be used for raw data. For example memcpy() or fread() interpret their void * arguments as arrays of unsigned char. The standard guarantees that any type can be also represented as an array of unsigned char. Any other conversion might be "signalling", i.e. triggering exceptions. (ISO/IEC 9899:2011, section 6.2.6 "Representation of Types"). (**)
signed char is when you need a signed integer of char size (for arithmetics).
(*): The character handling functions in <ctype.h> are a bit oddball about this, as they cater for EOF (negative), and hence "force" the character values into the unsigned char range (ISO/IEC 9899:2011, section 7.4 Character handling). But since it is guaranteed that a char can be cast to unsigned char and back without loss of information as per section 6.2.6... you get the idea.
When signed-ness of char would make a difference -- the comparison functions like in strcmp() -- the standard dictates that char is interpreted as unsigned char (ISO/IEC 9899:2011, section 7.24.4 Comparison functions).
(**): Practically, it is hard to see how a conversion of raw data to char and back could be signalling where the same done with unsigned char would not be signalling. But unsigned char is what the section of the standard says. ;-)
Use char to store characters (standard defines the behaviour for basic execution character set elements only, roughly ASCII 7-bit characters).
Use signed char or unsigned char to get the corresponding arithmetic (signed or unsigned arithmetic have different properties for integers - char is an integer type).
This doesn't means that you can't make arithmetic with raw chars, as stated:
6.2.5 Types - 3. An object declared as type char is large enough to store any member of
the basic execution character set. If a member of the basic execution
character set is stored in a char object, its value is guaranteed to
be nonnegative.
Then if you only use character set elements arithmetic on them is correctly defined.

What is difference between signed char and char in C language? [duplicate]

This question already has answers here:
What is an unsigned char?
(16 answers)
Closed 5 years ago.
I have seen in my legacy embedded code that people are using signed char as return type. What's the need to put signed there? Isn't that implicit that char is nothing but signed char.
char, signed char, and unsigned char are all distinct types.
Your implementation can set char to be either signed or unsigned.
unsigned char has a distinct range set by the standard. The problem with using simply char is that your code can behave differently on different platforms. It could even be a 1's complement or signed magnitude type with a range -127 to +127.
Because no-one in the candidate duplicate answer cited the right paragraph from the spec:
6.2.5 [Types], paragraph 15
The three types char, signed char, and unsigned char are collectively called the character types. The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char
So char could be either. (FWIW, I believe I ran into char = unsigned char in JNI.)
From CppReference:
Character types
signed char - type for signed character representation.
unsigned char - type for unsigned character representation. Also used to inspect object representations (raw memory).
char - type for character representation. Equivalent to either signed char or unsigned char (which one is implementation-defined and may be controlled by a compiler commandline switch), but char is a distinct type, different from both signed char and unsigned char.
So if you want to ensure that you're using either an unsigned char or a signed char, specify it explicitly. Otherwise there'd be no guarantee whether it'll be signed or not.

What is the best way to represent characters in C?

I know that a char is allowed to be signed or unsigned depending on the implementation. This doesn't really bother me if all I want to do is manipulate bytes. (In fact, I don't think of the char datatype as a character, but a byte).
But, if I understand, string literals are signed chars (actually they're not, but see the update below), and the function fgetc() returns unsigned chars casted into int. So if I want to manipulate characters, is it preferred style to use signed, unsigned, or ambiguous characters? Why does reading characters from a file have a different convention than literals?
I ask because I have some code in c that does string comparison between string literals and the contents of files, but having a signed char * vs unsigned char * might really make my code error prone.
Update 1
Ok as a few people pointed out (in answers and comments) string literals are in fact char arrays, not signed char arrays. That means I really should use char * for string literals, and not think about whether they are signed or unsigned. This makes me perfectly happy (until I have to start making conversion/comparisons with unsigned chars).
However the important question remains, how do I read characters from a file, and compare them to a string literal. The crux of which is the conversion from the int read using fgetc(), which explicitly reads an unsigned char from the file, to the char type, which is allowed to be either signed or unsigned.
Allow me to provide a more detailed example.
int main(void)
{
FILE *someFile = fopen("ThePathToSomeRealFile.html", "r");
assert(someFile);
char substringFromFile[25];
memset((void*)substringFromFile,0,sizeof(substringFromFile));
//Alright, the real example is to read the first few characters from the file
//And then compare them to the string I expect
const char *expectedString = "<!DOCTYPE";
for( int counter = 0; counter < sizeof(expectedString)/sizeof(*expectedString); ++counter )
{
//Read it as an integer, because the function returns an `int`
const int oneCharacter = fgetc(someFile);
if( ferror(someFile) )
return EXIT_FAILURE;
if( int == EOF || feof(someFile) )
break;
assert(counter < sizeof(substringFromFile)/sizeof(*substringFromFile));
//HERE IS THE PROBLEM:
//I know the data contained in oneCharacter must be an unsigned char
//Therefore, this is valid
const unsigned char uChar = (const unsigned char)oneCharacter;
//But then how do I assign it to the char?
substringFromFile[counter] = (char)oneCharacter;
}
//and ultimately here's my goal
int headerIsCorrect = strncmp(substringFromFile, expectedString, 9);
if(headerIsCorrect != 0)
return EXIT_SUCCESS;
//else
return EXIT_FAILURE;
}
Essentially, I know my fgetc() function is returning something that (after some error checking) is code-able as an unsigned char. I know that char may or may not be an unsigned char. That means, depending on the implementation of the c standard, doing a cast to char will involve no reinterpretation. However, in the case that the system is implemented with a signed char, I have to worry about values that can be coded by an unsigned char that aren't code-able by char (i.e. those values between (INT8_MAX UINT8_MAX]).
tl;dr
The question is this, should I (1) copy their underlying data read by fgetc() (by casting pointers - don't worry, I know how to do that), or (2) cast down from unsigned char to char (which is only safe if I know that the values can't exceed INT8_MAX, or those values can be ignored for whatever reason)?
The historical reasons are (as I've been told, I don't have a reference) that the char type was poorly specified from the beginning.
Some implementations used "consistent integer types" where char, short, int and so on were all signed by default. This makes sense because it makes the types consistent with each other.
Other implementations used unsigned for character, since there never existed any symbol tables with negative indices (that would be stupid) and since they saw a need for more than 128 characters (a very valid concern).
By the time C got standardized properly, it was too late to change this, too many different compilers and programs written for them were already out on the market. So the signedness of char was made implementation-defined, for backwards compatibility reasons.
The signedness of char does not matter if you only use it to store characters/strings. It only matters when you decide to involve the char type in arithmetic expressions or use it to store integer values - this is a very bad idea.
For characters/string, always use char (or wchar_t).
For any other form of 1 byte large data, always use uint8_t or int8_t.
But, if I understand, string literals are signed char
No, string literals are char arrays.
the function fgetc() returns unsigned chars casted into int
No, it returns a char converted to an int. It is int because the return type may contain EOF, which is an integer constant and not a character constant.
having a signed char * vs unsigned char * might really make my code error prone.
No, not really. Formally, this rule from the standard applies:
A pointer to an object type may be converted to a pointer to a different object type. If the
resulting pointer is not correctly aligned for the referenced type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer.
There exists no case where casting from pointer to signed char to pointer to unsigned char or vice versa, would cause any alignment issues or other issues.
I know that a char is allowed to be signed or unsigned depending on the implementation. This doesn't really bother me if all I want to do is manipulate bytes.
If you're going to do comparison or assign char to other integer types, it should bother you.
But, if I understand, string literals are signed chars
They are of type char[], so if char === unsigned char, all string literals are unsigned char[].
the function fgetc() returns unsigned chars casted into int.
That's correct and is required to omit undesired sign extension.
So if I want to manipulate characters, is it preferred style to use signed, unsigned, or ambiguous characters?
For portability I'd advise to follow practice adapted by various libc implementations: use char, but before processing cast to unsigned char (char* to unsigned char*). This way implicit integer promotions won't turn characters in the range 0x80 -- 0xff into negative numbers of wider types.
In short: (signed char)a < (signed char)b is NOT always equivalent to (unsigned char)a < (unsigned char)b. Here is an example.
Why does reading characters from a file have a different convention than literals?
getc() needs a way to return EOF such that it couldn't be confused with any real char.

Resources