What is the purpose / advantage / difference of using
/* C89 compliant way to cast 'char' to 'unsigned char'. */
static inline unsigned char
to_uchar (char ch)
{
return ch;
}
versus a standard cast?
Edit:
Found in base64 code in gnulib.
Maybe the programmer who wrote the function doesn't like the cast syntax ...
foo(to_uchar(ch)); /* function call */
foo((unsigned char)ch); /* cast */
But I'd let the compiler worry about it anyway :)
void foo(unsigned char);
char s[] = "bar";
foo(s[2]); /* the compiler implicitly converts the `s[2]` char to unsigned char */
Purpose
To:
Cast from char to unsigned char
Do more with the cast than a conventional cast allows
Open your mind to the possibilities of customized casting between other types, and select from the advantages below for those as well
Advantage
One could:
Break on these kinds of casts when debugging
Track and quantify the use of casts through profiling tools
Add limits checking code (pretty grim for char conversions, but potentially very useful for larger/smaller type casts; see the sketch after this list)
Have delusions of grandeur
There is a single point of casting, allowing you to carefully analyze and modify what code is generated
You could select from a range of casting techniques based on the environment (for example in C++ you could use numeric_limits<>)
The cast is explicit and will never generate warnings (or at least you can silence them in one place)
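For instance, a minimal sketch of the limits-checking idea from the list above (the helper name and the assert() check are mine, not gnulib's):

#include <assert.h>
#include <limits.h>

/* Hypothetical checked variant of to_uchar(): the single point of
   conversion is where range checking can be bolted on. */
static inline unsigned char
to_uchar_checked (int value)
{
  assert (value >= 0 && value <= UCHAR_MAX);  /* limits check */
  return (unsigned char) value;
}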
Difference
Slower with poor compilers, or with good compilers when the necessary optimization flags are turned off
The language doesn't enforce consistency; you might not notice you've forgotten to use the function in places
Kind of strange and Java-esque; one should probably accept and study C's weak typing and deal with it case by case, rather than conjuring special functions to cope
Related
Does C's void* lend some benefit in the form of compiler optimization etc, or is it just an idiomatic equivalent for char*? I.e. if every void* in the C standard library was replaced by char*, would anything be damaged aside from code legibility?
In the original K&R C there was no void * type; char * was used as the generic pointer.
void * serves two main purposes:
It makes it clear when a pointer is being used as a generic pointer. Previously, you couldn't tell whether char * was being used as a pointer to an actual character buffer or as a generic pointer.
It allows the compiler to catch errors where you try to dereference or perform arithmetic on a generic pointer, rather than casting it to the appropriate type first. These operations aren't permitted on void * values. Unfortunately, some compilers (e.g. gcc) allow arithmetic on void * pointers by default, so it's not as safe as it could be (see the sketch after this answer).
C got along just fine for years with char * as the generic pointer type before void * was introduced, so it clearly isn't strictly necessary. But programming languages are communication: not just telling a compiler what to do, but telling other human beings reading the program what is intended. Anything that makes a language more expressive is a good thing.
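A minimal sketch of the second point (the function name f is mine):

void f(void *p)
{
    /* *p;      error: a void * cannot be dereferenced              */
    /* p + 1;   constraint violation in ISO C (gcc permits it as an */
    /*          extension, treating sizeof(void) as 1)              */
    int *ip = p;  /* fine: void * converts implicitly, no cast needed */
    (void) ip;    /* suppress the unused-variable warning */
}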
Does C's void* serve any purpose other than making code more idiomatic?
if every void* in the C standard library was replaced by char*, would anything be damaged aside from code legibility?
C11 introduced _Generic. Because of that, code such as _Generic(some_C_standard_library_function(), ...) can compile quite differently depending on whether the return type is void *, char *, int *, etc.
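A minimal sketch of that observation (the PTR_KIND macro is mine): memchr() returns void * while strchr() returns char *, and _Generic can tell them apart:

#include <stdio.h>
#include <string.h>

#define PTR_KIND(x) _Generic((x), \
    void *: "void *",             \
    char *: "char *",             \
    default: "something else")

int main(void)
{
    char buf[] = "abc";
    puts(PTR_KIND(memchr(buf, 'a', 3)));  /* prints "void *" */
    puts(PTR_KIND(strchr(buf, 'a')));     /* prints "char *" */
    return 0;
}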
It's said that libxml2 uses unsigned char as its storage type to make encoding/decoding between character sets convenient. But isn't it ugly to force users to write "BAD_CAST" everywhere in the code when reading/creating XML node names, contents, text nodes, etc.?
Is there a way to avoid writing such "BAD_CAST" everywhere? The design seems really bad.
This design decision is unfortunate, but rooted in another unfortunate choice made 40 years ago: allowing char to be potentially signed by default, which is inconsistent with the behavior of getchar(), strcmp()...
You can use an inline function to convert from char * to unsigned char * and vice-versa with a single hidden cast, and use these when passing arguments:
static inline unsigned char *c2uc(char *s) { return (unsigned char*)s; }
static inline char *uc2c(unsigned char *s) { return (char*)s; }
These wrappers are much safer to use than plain casts, as they can only be applied to one type and convert it to its unsigned counterpart. Regular casts are problematic because they can be applied to any type and hide type conversion errors. static inline functions are expanded by the compiler at no runtime cost.
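For example, a sketch of how a call site might read with these wrappers (assuming libxml2's xmlNewTextChild(); the helper add_title is mine):

#include <libxml/tree.h>

static inline unsigned char *c2uc(char *s) { return (unsigned char *) s; }

/* Hypothetical helper: create a <title> child without BAD_CAST noise. */
void add_title(xmlNodePtr parent, char *title)
{
    xmlNewTextChild(parent, NULL, c2uc("title"), c2uc(title));
}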
I have been reading about the strict aliasing rule for a while, and I'm starting to get really confused. First of all, I have read these questions and some answers:
strict-aliasing-rule-and-char-pointers
when-is-char-safe-for-strict-pointer-aliasing
is-the-strict-aliasing-rule-really-a-two-way-street
According to them (as far as I understand), accessing a char buffer using a pointer to another type violates the strict aliasing rule. However, the glibc implementation of strlen() contains code like this (with comments and the 64-bit implementation removed):
size_t strlen(const char *str)
{
    const char *char_ptr;
    const unsigned long int *longword_ptr;
    unsigned long int longword, magic_bits, himagic, lomagic;

    for (char_ptr = str;
         ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
         ++char_ptr)
        if (*char_ptr == '\0')
            return char_ptr - str;

    longword_ptr = (unsigned long int *) char_ptr;

    himagic = 0x80808080L;
    lomagic = 0x01010101L;

    for (;;)
    {
        longword = *longword_ptr++;

        if (((longword - lomagic) & himagic) != 0)
        {
            const char *cp = (const char *) (longword_ptr - 1);

            if (cp[0] == 0)
                return cp - str;
            if (cp[1] == 0)
                return cp - str + 1;
            if (cp[2] == 0)
                return cp - str + 2;
            if (cp[3] == 0)
                return cp - str + 3;
        }
    }
}
The longword_ptr = (unsigned long int *) char_ptr; line obviously makes an unsigned long int pointer alias a char buffer. I fail to understand what makes this possible. I see that the code takes care of alignment problems, so no issues there, but I think alignment is not related to the strict aliasing rule.
The accepted answer for the third linked question says:
However, there is a very common compiler extension allowing you to cast properly aligned pointers from char to other types and access them, however this is non-standard.
The only thing that comes to mind is the -fno-strict-aliasing option; is this the case? I could not find it documented anywhere what the glibc implementors depend on, and the comments somehow imply that this cast is done without any concerns, as if it were obvious that there would be no problems. That makes me think it is indeed obvious and I am missing something silly, but my search failed me.
In ISO C this code would violate the strict aliasing rule. (And also violate the rule that you cannot define a function with the same name as a standard library function). However this code is not subject to the rules of ISO C. The standard library doesn't even have to be implemented in a C-like language. The standard only specifies that the implementation implements the behaviour of the standard functions.
In this case, we could say that the implementation is in a C-like GNU dialect, and if the code is compiled with the writer's intended compiler and settings then it would implement the standard library function successfully.
When writing the aliasing rules, the authors of the Standard only considered forms that would be useful, and should thus be mandated, on all implementations. C implementations are targeted toward a variety of purposes, and the authors of the Standard make no attempt to specify what a compiler must do to be suitable for any particular purpose (e.g. low-level programming) or, for that matter, any purpose whatsoever.
Code like the above, which relies upon low-level constructs, should not be expected to run on compilers that make no claim of being suitable for low-level programming. On the flip side, any compiler which can't support such code should be viewed as unsuitable for low-level programming. Note that compilers can employ type-based aliasing assumptions and still be suitable for low-level programming if they make a reasonable effort to recognize common aliasing patterns. Some compiler writers are very highly invested in a view of code which fits neither common low-level coding patterns nor the C Standard, but anyone writing low-level code should simply recognize that those compilers' optimizers are unsuitable for use with low-level code.
The wording of the standard is actually a bit weirder than the actual compiler implementations: the C standard talks about declared object types, but the compilers only ever see pointers to these objects. As such, when a compiler sees a cast from a char* to an unsigned long*, it has to assume that the char* actually aliases an object with a declared type of unsigned long, making the cast correct.
A word of caution: I assume that strlen() is compiled into a library that is later only linked to the rest of the application. As such, the optimizer does not see the use of the function when compiling it, forcing it to assume that the cast to unsigned long* is indeed legit. If you called strlen() with
short myString[] = {0x666f, 0x6f00, 0};
size_t length = strlen((char*)myString); //implementation now invokes undefined behavior!
the cast within strlen() is undefined behavior, and your compiler would be allowed to strip pretty much the entire body of strlen() if it saw your use while compiling strlen() itself. The only thing that allows strlen() to behave as expected in this call is the fact that strlen() is compiled separately as a library, hiding the undefined behavior from the optimizer, so the optimizer has to assume the cast to be legit when compiling strlen().
So, assuming that the optimizer cannot see the undefined behavior, the reason why casts from char* to anything else are dangerous is not aliasing, but alignment. On some hardware, weird stuff starts happening if you try to access a misaligned pointer: the hardware might load data from the wrong address, raise an interrupt, or just process the requested memory load extremely slowly. That is why the C standard generally declares such casts undefined behavior.
Nevertheless, you see that the code in question actually handles the alignment issue explicitly (the first loop that contains the (unsigned long int) char_ptr & (sizeof (longword) - 1) subcondition). After that, the char* is properly aligned to be reinterpreted as unsigned long*.
Of course, all of this is not really compliant with the C standard, but it is compliant with the C implementation of the compiler that this code is meant to be compiled with. If the gcc people modified their compiler to break on this bit of code, the glibc people would complain loudly enough that gcc would be changed back to handle this kind of cast correctly.
At the end of the day, standard C library implementations simply must violate strict aliasing rules to work properly and be efficient. strlen() just needs to violate those rules to be efficient; the malloc()/free() pair must be able to take a memory region that had a declared type of Foo and turn it into a memory region of declared type Bar. And there is no malloc() call inside the malloc() implementation that would give the object a declared type in the first place. The abstraction of the C language simply breaks down at this level.
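For contrast, a strictly conforming way to load a word out of a char buffer does exist: copy it with memcpy() (a sketch of the general technique, not what glibc actually does):

#include <string.h>

/* Portable word load: copying via memcpy() never violates strict
   aliasing; the caller must still guarantee sizeof(w) valid bytes
   at p. */
static unsigned long load_word(const char *p)
{
    unsigned long w;
    memcpy(&w, p, sizeof w);
    return w;
}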
The underlying assumption is probably that the function is separately compiled, and not available for inlining or other cross-function optimizations. This means that no compile-time information flows into or out of the function.
The function doesn't try to modify anything through a pointer, so there is no conflict.
I thought I knew C pretty well, but I'm confused by the following code:
typedef struct {
    int type;
} cmd_t;

typedef struct {
    int size;
    char data[];
} pkt_t;

int func(pkt_t *b)
{
    int *typep;
    char *ptr;

    /* #1: Generates warning */
    typep = &((cmd_t *)(&(b->data[0])))->type;

    /* #2: Doesn't generate warning */
    ptr = &b->data[0];
    typep = &((cmd_t *)ptr)->type;

    return *typep;
}
When I compile with GCC, I get the "dereferencing type-punned pointer will break strict-aliasing rules" warning.
Why am I getting this warning at all? I'm dereferencing a char array. Casting a char * to anything is legal. Is this one of those cases where an array is not exactly the same as a pointer?
Why aren't both assignments generating the warning? The 2nd assignment is the equivalent of the first, isn't it?
When strict aliasing is turned on, the compiler is allowed to assume that two pointers of different type (char* vs cmd_t* in this instance) will not point to the same memory location. This allows for a greater range of optimizations which you would otherwise not want to be applied if they do indeed point to the same memory location. Note that the char exception runs only one way: an access through a char* may alias anything, but accessing a char buffer through a cmd_t* is not covered. Various examples/horror-stories can be found in this question.
This is why, under strict-aliasing, you have to be careful how you do type punning. The standard does in fact allow type punning through unions (C99, as clarified by TC3, explicitly permits reading a union member other than the one last written), and compilers implement this:
#include <stdint.h>

union double_to_int {
    double d;
    uint64_t i;
};

union double_to_int ftoi;
ftoi.d = 1.0;
uint64_t bits = ftoi.i;  /* reinterpret the representation of 1.0 */
Unfortunately, this doesn't quite work for your situation, as you would have to memcpy the content of the array into the union, which is less than ideal (see the sketch below). A simpler approach would be simply to turn off strict-aliasing via the -fno-strict-aliasing switch. This will ensure that your code is correct, and it's unlikely to have a significant performance impact (do measure if performance matters).
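For the question's concrete case, the memcpy route would look something like this (func_safe is my name for it; cmd_t and pkt_t as defined in the question):

#include <string.h>

int func_safe(pkt_t *b)
{
    cmd_t cmd;
    /* Copy the bytes instead of punning the pointer; byte-wise
       copying out of the char buffer is always allowed. */
    memcpy(&cmd, b->data, sizeof cmd);
    return cmd.type;
}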
As for why the warning doesn't show up when the line is broken up, I don't know. Chances are that the modifications to the source code manage to confuse the compiler's static analysis pass enough that it doesn't see the type-punning. Note that the static analysis pass responsible for detecting type-punning is unrelated to, and doesn't talk to, the various optimization passes that assume strict-aliasing. You can think of any static analysis done by compilers (unless otherwise specified) as a best-effort type of thing. In other words, the absence of a warning doesn't mean that there are no errors, which means that simply breaking up the line doesn't magically make your type-punning safe.
I have a structure to represent strings in memory looking like this:
typedef struct {
size_t l;
char *s;
} str_t;
I believe using size_t makes sense for specifying the length of a char string. I'd also like to print this string using printf("%.*s\n", str.l, str.s). However, the * precision expects an int argument, not size_t. I haven't been able to find anything relevant about this. Is there some way to use this structure correctly, without a cast to int in the printf() call?
printf("%.*s\n", (int)str.l, str.s)
// ^^^^^ use a type cast
Edit
OK, I didn't read the question properly. You don't want to use a type cast, but I think, in this case: tough.
Either that, or simply use fwrite:
fwrite(str.s, str.l, 1, stdout);
printf("\n");
You could do a macro
#define STR2(STR) (int const){ (STR).l }, (char const*const){ (STR).s }
and then use this as printf("%.*s\n", STR2(str)).
Beware that this evaluates STR twice, so be careful with side effects, but you probably knew that already.
Edit:
I am using compound literals so that these are implicit conversions. If things go wrong, there is a better chance that the compiler will warn you than with an explicit cast.
E.g., if STR has a field .l that is a pointer and you only put a cast to int, all compilers would happily convert that pointer to int. Similarly for the .s field: it really has to correspond to a char* or something compatible, otherwise you'd see a warning or error.
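A minimal compilable demo, using the question's str_t (the test string is mine):

#include <stdio.h>

typedef struct {
    size_t l;
    char *s;
} str_t;

#define STR2(STR) (int const){ (STR).l }, (char const *const){ (STR).s }

int main(void)
{
    str_t str = { 5, "hello" };
    printf("%.*s\n", STR2(str));  /* prints "hello" */
    return 0;
}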
There is no guarantee that a size_t is an int, or that its value can be represented within an int. It's just part of C's legacy of not defining the exact size of an int, coupled with concerns that size_t might need to address large memory areas (ones that hold more than INT_MAX values).
The most common error concerning size_t is to assume that it is equivalent to unsigned int. Such old bugs were common, and from personal experience they make porting from a 32-bit to a 64-bit architecture a pain, as you need to undo this assumption.
At best, you can use a cast. If you really want to get rid of the cast, you could alternatively discard the use of size_t.
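If you do keep size_t, the cast can at least be centralized and range-checked; a sketch (the helper name prec_of is mine):

#include <limits.h>
#include <stdio.h>

/* Clamp a size_t to a usable printf precision; values above INT_MAX
   cannot be passed as the int that * expects. */
static int prec_of(size_t n)
{
    return n > INT_MAX ? INT_MAX : (int) n;
}

/* usage: printf("%.*s\n", prec_of(str.l), str.s); */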