Unsigned char plaguing my code - c

In the C library an implementation of memcpy might look like this:
#include <stddef.h> /* size_t */
void *memcpy(void *dest, const void *src, size_t n)
{
char *dp = dest;
const char *sp = src;
while (n--)
*dp++ = *sp++;
return dest;
}
Nice, clean and type agnostic right? But when following a kernel tutorial, the prototypes look like this:
unsigned char *memcpy(unsigned char *dest, const unsigned char *src, int count);
I've attempted to implement it like this:
unsigned char *memcpy(unsigned char *dest, const unsigned char *src, int count)
{
unsigned char *dp = dest;
const unsigned char *sp = src;
while (count--)
*dp++ = *sp++;
return dest;
}
But I'm quite wary of seeing unsigned char everywhere and potentially nasty bugs resulting from casts.
Should I attempt to use uint8_t and other variants wherever possible instead of unsigned TYPE?
Sometimes I see unsigned char * instead of const char*. Should this be considered a bug?

The C library implementation of memcpy follows the interface that the C standard defines. What you see in the kernel tutorial, on the other hand, depends entirely on the tutorial's author and how he designed his kernel and code. As far as I understand, for a tutorial that shows how to write a kernel and how its internal components are glued together and built, you can avoid thinking too much about whether to use uint8_t or unsigned char (that kind of decision matters more once you want your kernel to grow to a certain size); they are the same type in practice, and the choice mainly affects the readability of your project.
Regarding the second point: if you are really confident that const char * should be used instead of unsigned char *, then yes, fix it. But also concentrate on how the kernel works and how memory, video, and peripheral devices are set up and initialized.

Technically, as far as C99 is concerned, (u)int8_t must be available if the underlying machine has a primitive 8-bit integer (with no padding)... and not otherwise. For C99 the basic unit of memory is the char, which technically may have any number of bits. So, from that perspective, unsigned char is more correct than uint8_t. POSIX, on the other hand, has decided that the 8-bit byte has achieved tablet-of-stone status, so CHAR_BIT == 8, forever, and the difference between (u)int8_t and (unsigned/signed) char is academic.
The ghastly legacy of the ambiguous signed-ness of char, we just get to live with.
I like to typedef unsigned char byte pretty early, and have done with.
I note that memcmp() is defined to compare on the basis of unsigned char. So I can see some logic in treating all memxxx() as taking unsigned char.
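For example (a minimal sketch): the byte 0x80 compares greater than 0x01 under memcmp() precisely because the comparison is done on unsigned char values; viewed as signed char on a typical two's-complement machine, 0x80 would be -128 and would compare less than 1.
#include <stdio.h>
#include <string.h>
int main(void)
{
/* memcmp() compares bytes as unsigned char, so 0x80 (128) > 0x01 (1) */
unsigned char a[1] = { 0x80 };
unsigned char b[1] = { 0x01 };
printf("%d\n", memcmp(a, b, 1) > 0); /* prints 1 */
return 0;
}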
I'm not sure I would describe memcpy(void* dst, ...) as "type agnostic"... what memcpy is doing is moving chars, but by declaring the arguments in this way the programmer is auto-magically relieved of casting to char*. Anyone who passes an int* and a count of ints, expecting memcpy to do the usual pointer arithmetic magic, is destined for a short, sharp shock! So, in this case, if you are required to cast to unsigned char*, then I would argue that it is clearer and safer. (Everybody learns very quickly what memxxx() do, so the extra clarity would generally be seen as a pain in the hindquarters. I agree that casts in general should be treated as "taking off the seat belt and pressing the pedal to the metal", but not so much in this case.)
Clearly unsigned char* and const char* are quite different to each other (how different would depend on the signed-ness of char), but whether the appearance of one or the other is a bug or not would depend on the context.

Related

Determining endianness of machine architecture in C the correct way

I just wrote the following function to determine the endianness of the machine architecture (writing for an ARM Cortex-M7 architecture based MCU though, but wanted functionality to make the code portable):
#include <stdint.h> /* uint32_t, uint8_t */
uint8_t is_little_endian(void)
{
static const union test {
uint32_t num;
uint8_t bytes[sizeof(uint32_t)];
} p = {.num = 1U };
return (p.bytes[0] == 1U);
}
I was just wanting to know if there will be any false results if I use unsigned int and char here instead of uint32_t and uint8_t in the above code? If yes, why?
To answer your immediate question, unsigned and char will work just as well if CHAR_BIT < 16. That's because the C standard requires an unsigned to have at least 16 value bits and every type must have a storage size that's a multiple of a char (a byte). So as long as your char has fewer than 16 bits, an unsigned must consist of at least 2 bytes and the endianness check will work this way.
Using char actually has the benefit that it's allowed to alias any other type. So I'd suggest something like this:
#include <limits.h>
#if CHAR_BIT > 15
#error exotic platform
#endif
int is_little_endian(void)
{
unsigned x = 1U;
unsigned char *r = (unsigned char *)&x;
return !!*r;
}
I used unsigned char here just to be sure.
Be aware this assumes there's no exotic byte order (like "middle-endian"). Also, I personally think such code is a waste of space in the program; if you really need endianness information, it's probably better to have your build system determine it for your target and just #define it (e.g. in a config.h file).
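As a sketch of that approach (the macro name and the contents of config.h are assumptions; generate them however your build system prefers):
/* config.h -- hypothetical file emitted by the build system for the target */
#define TARGET_LITTLE_ENDIAN 1

/* in the sources */
#include "config.h"
#if TARGET_LITTLE_ENDIAN
/* little-endian code path */
#else
/* big-endian code path */
#endif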
I was just wanting to know if there will be false results if I use unsigned int and char here instead of uint32_t and uint8_t? If yes, why?
Yes, it may.
The types mentioned (unsigned int and char) have implementation-defined sizes: they may depend on the compiler, the machine, compiler options, and so on. If you need fixed widths, look at the types declared in stdint.h. This header is part of the standard library, so it is expected (though technically not guaranteed) to be available everywhere. Among the types declared there are int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, and uint64_t.
You can simply return ntohs(12345) != 12345.
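Wrapped up as a function (a sketch assuming a POSIX system, where ntohs() is declared in <arpa/inet.h>):
#include <arpa/inet.h> /* ntohs() */
int is_little_endian(void)
{
/* ntohs() is a no-op on a big-endian host, so any change implies little-endian */
return ntohs(12345) != 12345;
}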

Why libxml2 uses "BAD_CAST" everywhere in a C/C++ code?

It's said that libxml2 uses unsigned char as its storage type to make encoding/decoding between character sets convenient, but isn't it ugly to force users to write "BAD_CAST" everywhere in their code when reading or creating XML node names, contents, text nodes, etc.?
Is there a way to avoid writing "BAD_CAST" everywhere? The design seems really bad.
This design decision is unfortunate, but rooted in another unfortunate choice made 40 years ago: allowing char to be potentially signed by default, which is inconsistent with the behavior of getchar(), strcmp()...
You can use an inline function to convert from char * to unsigned char * and vice-versa with a single hidden cast, and use these when passing arguments:
static inline unsigned char *c2uc(char *s) { return (unsigned char*)s; }
static inline char *uc2c(unsigned char *s) { return (char*)s; }
These wrappers are much safer to use than basic casts as they can only be applied to one type and convert to its unsigned counterpart. Regular casts are problematic as they can be applied to any type and hide type conversion errors. static inline functions are expanded by the compiler at no runtime cost.
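A usage sketch, assuming the c2uc() wrapper above together with libxml2's xmlNewNode() and xmlNodeSetContent() (both of which take const xmlChar *, i.e. const unsigned char *):
#include <libxml/tree.h>
static xmlNodePtr make_item_node(void)
{
/* instead of xmlNewNode(NULL, BAD_CAST "item") */
xmlNodePtr node = xmlNewNode(NULL, c2uc("item"));
xmlNodeSetContent(node, c2uc("some text"));
return node;
}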

Explicit cast of pointer to long long

I need to cast a pointer to a long long, and would prefer to do it in a way that gcc doesn't complain on either 32 or 64-bit architectures about converting pointer to ints of different size. And before anyone asks, yes, I know what I'm doing, and I know what I'm casting to -- my specific use case is wanting to send a stack trace (the pointers themselves being the subject here) over the network when an application error occurs, so there is no guarantee the sender and receiver will have the same word size. I've therefore built a struct holding the message data with, among other entries, an array of "unsigned long long" values (guaranteed minimum 64-bits) to hold the pointers. And yes, I know "long long" is not guaranteed to be only 64-bits, but all compilers I'm using for both source and destination implement it as 64-bits. Because the header (and source) with the struct will be used on both architectures, "uintptr_t" doesn't seem like a workable solution (because, according to the definition in stdint.h, its size is architecture-dependent).
I thought about getting tricky with anonymous unions, but this feels a little too hackish to me...I'm hoping there's a way with some double-cast magic or something to do this in C99 (since anonymous unions weren't standard until C11).
EDIT:
typedef struct error_msg_t {
int msgid;
int len;
pid_t pid;
int si_code;
int signum;
int errno;
unsigned long long stack[20];
char err_msg[];
} error_msg_t;
...
void **stack;
...
msg.msgid = ERROR_MSG;
msg.len = sizeof(error_msg_t) + strlen(err_msg) + 1;
msg.pid = getpid();
...
for (i=0; i<stack_depth; i++)
msg.stack[i] = (unsigned long long)stack[i];
Warning (on a 32-bit compile) about casting to integer of different size occurs on the last line.
Probably your best bet is to double cast to spell it out to the compiler what you want to do (as suggested by Max).
I would recommend wrapping it up into a macro so that the code intention is clear from the macro name.
#define PTR_TO_UINT64(x) (uint64_t)(uintptr_t)(x)
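A sketch of how that looks applied to the loop from the question (pack_stack is just an illustrative name, and the extra outer parentheses on the macro are mine):
#include <stdint.h> /* uint64_t, uintptr_t */
#define PTR_TO_UINT64(x) ((uint64_t)(uintptr_t)(x))

/* The intermediate uintptr_t cast silences the "different size" warning on
32-bit builds, and the uint64_t result widens cleanly into the
unsigned long long array. */
static void pack_stack(unsigned long long *out, void **stack, int depth)
{
for (int i = 0; i < depth; i++)
out[i] = PTR_TO_UINT64(stack[i]);
}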

Pass char* to method expecting unsigned char*

I am working on an embedded device which has an SDK. It has a method like:
MessageBox(u8*, u8*); // u8 is typedefed unsigned char when I checked
But I have seen in their examples calling code like:
MessageBox("hi","hello");
passing char pointers without a cast. Can this be well defined? I am asking because I ran a tool over the code, and it complained about the above mismatch:
messageBox("Status", "Error calculating \rhash");
diy.c 89 Error 64: Type mismatch (arg. no. 1) (ptrs to signed/unsigned)
diy.c 89 Error 64: Type mismatch (arg. no. 2) (ptrs to signed/unsigned)
Sometimes I get differing opinions on this, which confuses me even more. So to sum up: by using their API the way described above, is this a problem? Will it crash the program?
Also, it would be nice to hear what the correct way is to pass a string to SDK methods expecting unsigned char * without causing a constraint violation.
It is a constraint violation, so technically it is not well defined, but in practice, it is not a problem. Yet you should cast these arguments to silence these warnings. An alternative to littering your code with ugly casts is to define an inline function:
static inline unsigned char *ucstr(const char *str) { return (unsigned char *)str; }
And use that function wherever you need to pass strings to the APIs that (mistakenly) take unsigned char * arguments:
messageBox(ucstr("hi"), ucstr("hello"));
This way you will not get warnings while keeping some type safety.
Also note that messageBox should take const char * arguments. This SDK uses questionable conventions.
The problem comes down to it being implementation-defined whether char is unsigned or signed.
Compilers for which there is no error will be those for which char is actually unsigned. Some of those (notably the ones that are actually C++ compilers, where char and unsigned char are distinct types) will issue a warning. With these compilers, converting the pointer to unsigned char * will be safe.
Compilers which report an error will be those for which char is actually signed. If the compiler (or host) uses an ASCII or similar character set, and the characters in the string are printable, then converting the string to unsigned char * (or, better, to const unsigned char *, which avoids dropping constness from string literals) is technically safe. However, those conversions are potentially unsafe for implementations that use different character sets OR for strings that contain non-printable characters (e.g. values of type signed char that are negative, and values of unsigned char greater than 127). I say potentially unsafe because what happens depends on what the called function does: for example, does it check the values of individual characters? Does it check the individual bits of individual characters in the string? The latter is, if the called function is well designed, one reason it would accept a pointer to unsigned char in the first place.
What you need to do therefore comes down to what you can assume about the target machine, and its char and unsigned char types - and what the function is doing with its argument. The most general approach (in the sense that it works for all character sets, and regardless of whether char is signed or unsigned) is to create a helper function which copies the array of char to a different array of unsigned char. The working of that helper function will depend on how (and if) you need to handle the conversion of signed char values with values that are negative.
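A minimal sketch of such a helper (the name, the buffer-size convention, and the plain value-preserving conversion are all assumptions; adjust the per-character step if negative signed char values need special handling):
#include <stddef.h>
static void copy_to_uchar(unsigned char *dst, const char *src, size_t dstsize)
{
size_t i;
/* copy up to dstsize - 1 characters and always NUL-terminate */
for (i = 0; i + 1 < dstsize && src[i] != '\0'; i++)
dst[i] = (unsigned char)src[i];
if (dstsize > 0)
dst[i] = '\0';
}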

strange unsigned char casting

What is the purpose / advantage / difference of using
/* C89 compliant way to cast 'char' to 'unsigned char'. */
static inline unsigned char
to_uchar (char ch)
{
return ch;
}
versus a standard cast ?
Edit:
Found in base64 code in gnulib.
Maybe the programmer who wrote the function doesn't like the cast syntax ...
foo(to_uchar(ch)); /* function call */
foo((unsigned char)ch); /* cast */
But I'd let the compiler worry about it anyway :)
void foo(unsigned char);
char s[] = "bar";
foo(s[2]); /* compiler implicitly casts the `s[2]` char to unsigned char */
Purpose
To:
Cast from char to unsigned char
Do more with the cast than a conventional cast allows
Open your mind to the possibilities of customized casting between other types and select from the advantages below for those also
Advantage
One could:
Break on these kinds of casts when debugging
Track and quantify the use of casts through profiling tools
Add limits checking code (pretty grim for char conversions but potentially very useful for larger/smaller type casts; see the sketch after these lists)
Have delusions of grandeur
There is a single point of casting, allowing you to carefully analyze and modify what code is generated
You could select from a range of casting techniques based on the environment (for example in C++ you could use numeric_limits<>)
The cast is explicit and will never generate warnings (or at least you can confine and silence them in one place)
Difference
Slower with poor compilers or good compilers with the necessary optimization flags turned off
The language doesn't force consistency, you might not notice you've forgotten to use the cast in places
Kind of strange, and Java-esque, one should probably accept and study C's weak typing and deal with it case by case rather than trying to conjure special functions to cope
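For instance, the "limits checking" advantage above might look like this for a wider-to-narrower conversion (the function name and the assert-based policy are just illustrative):
#include <assert.h>
#include <limits.h>
static inline unsigned char to_uchar_checked(int v)
{
/* a single place to break on, log, or reject out-of-range casts */
assert(v >= 0 && v <= UCHAR_MAX);
return (unsigned char)v;
}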
