It's said that libxml2 uses unsigned char for storage to make encoding and decoding between character sets convenient. But isn't it ugly to force users to write "BAD_CAST" everywhere in their code when reading or creating XML node names, contents, text nodes, and so on?
Is there a way to avoid writing "BAD_CAST" everywhere? The design seems really bad.
This design decision is unfortunate, but it is rooted in another unfortunate choice made 40 years ago: allowing char to be potentially signed by default, which is inconsistent with the behavior of getchar(), strcmp(), and friends.
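To see the inconsistency in action, here is a minimal sketch: getchar() hands back each byte as an unsigned char converted to int, so the same byte can look different once it is stored in a plain char.

#include <stdio.h>

int main(void)
{
    /* getchar() returns the byte as unsigned char converted to int,
       so a byte >= 0x80 reads as 128..255 even where char is signed. */
    int c = getchar();
    if (c != EOF) {
        char ch = (char)c;  /* may become negative if char is signed */
        printf("getchar saw %d; stored in a char it is %d\n", c, ch);
    }
    return 0;
}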
You can use inline functions to convert from char * to unsigned char * and vice versa, each with a single hidden cast, and use these when passing arguments:
static inline unsigned char *c2uc(char *s) { return (unsigned char*)s; }
static inline char *uc2c(unsigned char *s) { return (char*)s; }
These wrappers are much safer to use than basic casts as they can only be applied to one type and convert to its unsigned counterpart. Regular casts are problematic as they can be applied to any type and hide type conversion errors. static inline functions are expanded by the compiler at no runtime cost.
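For example, assuming the wrappers above are in scope, a libxml2 call site that would otherwise need BAD_CAST might look like the sketch below (xmlNewNode and xmlNodeSetContent take const xmlChar *, i.e. const unsigned char *):

#include <libxml/tree.h>

/* Sketch: building a node without writing BAD_CAST at every call site. */
xmlNodePtr make_greeting(void)
{
    xmlNodePtr node = xmlNewNode(NULL, c2uc("greeting"));
    xmlNodeSetContent(node, c2uc("hello"));
    return node;
}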
Related
Does C's void* lend some benefit in the form of compiler optimization etc, or is it just an idiomatic equivalent for char*? I.e. if every void* in the C standard library was replaced by char*, would anything be damaged aside from code legibility?
In the original K&R C there was no void * type, char * was used as the generic pointer.
void * serves two main purposes:
It makes it clear when a pointer is being used as a generic pointer. Previously, you couldn't tell whether char * was being used as a pointer to an actual character buffer or as a generic pointer.
It allows the compiler to catch errors where you try to dereference or perform arithmetic on a generic pointer, rather than casting it to the appropriate type first. These operations aren't permitted on void * values. Unfortunately, some compilers (e.g. gcc) allow arithmetic on void * pointers by default, so it's not as safe as it could be.
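For instance, a conforming compiler must diagnose the commented-out lines below, while the implicit conversion in the other direction needs no cast in C:

#include <stdlib.h>

int main(void)
{
    void *p = malloc(16);
    int *q = p;   /* OK: void * converts implicitly in C, no cast needed */
    /* p + 1; */  /* constraint violation: arithmetic on void *
                     (gcc accepts it by default as an extension) */
    /* *p;   */   /* constraint violation: dereferencing void * */
    (void)q;
    free(p);
    return 0;
}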
C got along just fine for years with char * as the generic pointer type before void * was introduced, so void * clearly isn't strictly necessary. But programming languages are communication: not just telling a compiler what to do, but telling other human beings reading the program what is intended. Anything that makes a language more expressive is a good thing.
Does C's void* serve any purpose other than making code more idiomatic?
if every void* in the C standard library was replaced by char*, would anything be damaged aside from code legibility?
C11 introduced _Generic. Because of that, _Generic(some_C_standard_library_function()) ... code can compile quite differently depending on whether the return type is a void *, char *, int *, etc.
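A minimal sketch of that dispatch (the macro name is mine):

#include <stdio.h>

/* Hypothetical macro: maps an expression's pointer type to a name. */
#define PTR_KIND(x) _Generic((x), \
    void *: "void *",             \
    char *: "char *",             \
    int *:  "int *",              \
    default: "something else")

int main(void)
{
    int n = 0;
    printf("%s\n", PTR_KIND(&n));          /* prints "int *"  */
    printf("%s\n", PTR_KIND((void *)&n));  /* prints "void *" */
    return 0;
}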
I am working on an embedded device with an SDK. The SDK has a method like:
MessageBox(u8*, u8*); // u8 is typedefed unsigned char when I checked
But I have seen calling code like this in their examples:
MessageBox("hi","hello");
passing char pointers without a cast. Can this be well defined? I am asking because I ran a static analysis tool over the code, and it complained about the mismatch:
messageBox("Status", "Error calculating \rhash");
diy.c 89 Error 64: Type mismatch (arg. no. 1) (ptrs to signed/unsigned)
diy.c 89 Error 64: Type mismatch (arg. no. 2) (ptrs to signed/unsigned)
Sometimes I get different opinions on this, which confuses me even more. So to sum up: by using their API the way described above, is this a problem? Will it crash the program?
It would also be nice to hear the correct way to pass a string to SDK methods expecting unsigned char * without causing a constraint violation.
It is a constraint violation, so technically it is not well defined; in practice, though, it is not a problem. Still, you should cast these arguments to silence the diagnostics. An alternative to littering your code with ugly casts is to define an inline function:
static inline unsigned char *ucstr(const char *str) { return (unsigned char *)str; } /* casts away const: the SDK prototype is not const-correct */
And use that function wherever you need to pass strings to the APIs that (mistakenly) take unsigned char * arguments:
messageBox(ucstr("hi"), ucstr("hello"));
This way you will not get warnings while keeping some type safety.
Also note that messageBox should take const char * arguments. This SDK uses questionable conventions.
The problem comes down to it being implementation-defined whether char is unsigned or signed.
Compilers for which there is no error will be those for which char is actually unsigned. Some of those (notably the ones that are actually C++ compilers, where char and unsigned char are distinct types) will issue a warning. With these compilers, converting the pointer to unsigned char * will be safe.
Compilers which report an error will be those for which char is actually signed. If the compiler (or host) uses ASCII or a similar character set, and the characters in the string are printable, then converting the string to unsigned char * (or, better, to const unsigned char *, which avoids dropping the const-ness of string literals) is technically safe.
However, those conversions are potentially unsafe for implementations that use different character sets, or for strings that contain non-printable characters (e.g. values of type signed char that are negative, or values of unsigned char greater than 127). I say potentially unsafe because what happens depends on what the called function does. Does it check the values of individual characters? Does it check individual bits of the characters in the string? The latter, if the called function is well designed, is one reason it would accept a pointer to unsigned char in the first place.
What you need to do therefore comes down to what you can assume about the target machine, and its char and unsigned char types - and what the function is doing with its argument. The most general approach (in the sense that it works for all character sets, and regardless of whether char is signed or unsigned) is to create a helper function which copies the array of char to a different array of unsigned char. The working of that helper function will depend on how (and if) you need to handle the conversion of signed char values with values that are negative.
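A minimal sketch of such a helper, assuming a plain byte-for-byte copy is acceptable (the function name is mine; the caller owns the result):

#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicates a NUL-terminated char string into a
   freshly allocated unsigned char buffer; caller must free() it. */
static unsigned char *dup_as_uchar(const char *s)
{
    size_t n = strlen(s) + 1;  /* include the terminating NUL */
    unsigned char *out = malloc(n);
    if (out != NULL)
        memcpy(out, s, n);     /* bytes are reinterpreted as unsigned */
    return out;
}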
In the C library an implementation of memcpy might look like this:
#include <stddef.h> /* size_t */
void *memcpy(void *dest, const void *src, size_t n)
{
    char *dp = dest;
    const char *sp = src;
    while (n--)
        *dp++ = *sp++;
    return dest;
}
Nice, clean, and type-agnostic, right? But when following a kernel tutorial, the prototype looks like this:
unsigned char *memcpy(unsigned char *dest, const unsigned char *src, int count);
I've attempted to implement it like this:
unsigned char *memcpy(unsigned char *dest, const unsigned char *src, int count)
{
    unsigned char *dp = dest;
    const unsigned char *sp = src;
    while (count--)
        *dp++ = *sp++;
    return dest;
}
But I'm quite wary of seeing unsigned char everywhere and potentially nasty bugs resulting from casts.
Should I attempt to use uint8_t and other variants wherever possible instead of unsigned TYPE?
Sometimes I see unsigned char * instead of const char*. Should this be considered a bug?
What you've seen in the C library is the interface as the C standard defines it: that is how the standard specifies memcpy. What you've seen in the kernel tutorial depends entirely on the tutorial's author and how he designed his kernel and code. As far as showing how to write a kernel and how its internal components need to be glued together and built, you can avoid agonizing over whether to use uint8_t or unsigned char (that kind of decision matters mainly if you want your kernel to grow to a certain scale); they are the same, and the choice mostly affects the readability of your project.
Regarding the second point: if you are really confident that const char * should be used instead of unsigned char *, then yes, fix it. But also concentrate on how the kernel works and how memory, video, and peripheral devices are set up and initialized.
Technically, as far as C99 is concerned, (u)int8_t must be available if the underlying machine has a primitive 8-bit integer type (with no padding), and not otherwise. For C99 the basic unit of memory is the char, which may technically have any number of bits. So, from that perspective, unsigned char is more correct than uint8_t. POSIX, on the other hand, has decided that the 8-bit byte has achieved tablet-of-stone status, so CHAR_BIT == 8, forever, and the difference between (u)int8_t and (unsigned/signed) char is academic.
The ghastly legacy of the ambiguous signedness of char is one we just get to live with.
I like to typedef unsigned char as byte pretty early on, and have done with it.
I note that memcmp() is defined to compare on the basis of unsigned char. So I can see some logic in treating all memxxx() as taking unsigned char.
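A quick sketch of what that unsigned comparison means in practice: the byte 0x80 compares greater than 0x7F even on a platform where plain char is signed.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char a[1] = { (char)0x80 };  /* -128 if char is signed */
    char b[1] = { 0x7F };        /* 127 */
    /* memcmp compares as unsigned char: 0x80 (128) > 0x7F (127) */
    printf("%s\n", memcmp(a, b, 1) > 0 ? "a > b" : "a <= b");
    return 0;
}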
I'm not sure I would describe memcpy(void* dst, ...) as "type agnostic": what memcpy is doing is moving chars, but by declaring the arguments this way the programmer is auto-magically relieved of casting to char *. Anyone who passes an int * and a count of ints, expecting memcpy to do the usual pointer-arithmetic magic, is destined for a short, sharp shock! So, in this case, if you are required to cast to unsigned char *, I would argue that it is clearer and safer. (Everybody learns very quickly what the memxxx() functions do, so the extra clarity would generally be seen as a pain in the hindquarters. I agree that casts in general should be treated as "taking off the seat belt and pressing the pedal to the metal", but not so much in this case.)
Clearly unsigned char* and const char* are quite different to each other (how different would depend on the signed-ness of char), but whether the appearance of one or the other is a bug or not would depend on the context.
I have a structure to represent strings in memory looking like this:
typedef struct {
    size_t l;
    char *s;
} str_t;
I believe using size_t makes sense for specifying the length of a char string. I'd also like to print this string using printf("%.*s\n", str.l, str.s). However, the * precision expects an int argument, not size_t. I haven't been able to find anything relevant about this. Is there some way to use this structure correctly, without a cast to int in the printf() call?
printf("%.*s\n", (int)str.l, str.s)
// ^^^^^ use a type cast
Edit
OK, I didn't read the question properly. You don't want to use a type cast, but I think, in this case: tough.
Either that, or simply use fwrite:
fwrite(str.s, str.l, 1, stdout);
printf("\n");
You could do it with a macro:
#define STR2(STR) (int const){ (STR).l }, (char const*const){ (STR).s }
and then use this as printf("%.*s\n", STR2(str)).
Beware that this evaluates STR twice, so be careful with side effects, but you probably knew that already.
Edit:
I am using compound literals so that the conversions are implicit. If things go wrong, there is a better chance the compiler will warn you than with an explicit cast.
E.g. if STR has a field .l that is a pointer and you had only put a cast to int, all compilers would happily convert that pointer to int. Similarly, the .s field really has to correspond to a char * or something compatible; otherwise you'd see a warning or error.
There is no guarantee that size_t is an int, or that a size_t value can be represented within an int. It's just part of C's legacy of not defining the exact size of an int, coupled with the concern that size_t may need to address large memory areas (ones holding more than INT_MAX values).
The most common error concerning size_t is to assume that it is equivalent to unsigned int. Such old bugs were common, and from personal experience it makes porting from a 32 bit to a 64 bit architecture a pain, as you need to undo this assumption.
At best, you can use a cast. If you really want to get rid of the cast, you could alternatively discard the use of size_t.
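If you do keep size_t, one option is to confine the conversion to a single helper that clamps the value; a sketch (the helper name is mine):

#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: clamps a size_t length to the int range that
   printf's "%.*s" precision argument requires. */
static inline int len_as_int(size_t n)
{
    return n > INT_MAX ? INT_MAX : (int)n;
}

The call then becomes printf("%.*s\n", len_as_int(str.l), str.s), with the cast hidden in one audited place.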
What is the purpose / advantage / difference of using
/* C89 compliant way to cast 'char' to 'unsigned char'. */
static inline unsigned char
to_uchar (char ch)
{
    return ch;
}
versus a standard cast?
Edit:
Found in the base64 code in gnulib.
Maybe the programmer who wrote the function doesn't like the cast syntax ...
foo(to_uchar(ch)); /* function call */
foo((unsigned char)ch); /* cast */
But I'd let the compiler worry about it anyway :)
void foo(unsigned char);
char s[] = "bar";
foo(s[2]); /* the compiler implicitly converts the char s[2] to unsigned char */
Purpose
To:
Cast from char to unsigned char
Do more with the conversion than a plain cast can
Open your mind to the possibilities of customized casting between other types, and select from the advantages below for those as well
Advantage
One could:
Break on these kinds of casts when debugging
Track and quantify the use of casts through profiling tools
Add limits-checking code (pretty grim for char conversions, but potentially very useful for larger/smaller type casts; see the sketch after this list)
Have delusions of grandeur
There is a single point of casting, allowing you to carefully analyze and modify what code is generated
You could select from a range of casting techniques based on the environment (for example in C++ you could use numeric_limits<>)
The cast is explicit and will never generate warnings (or at least you can force choke them in one place)
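As a sketch of the single-point-of-casting and limits-checking ideas (the assert policy here is illustrative, not something gnulib does):

#include <assert.h>
#include <limits.h>

/* Hypothetical checked variant: one choke point where every
   char -> unsigned char conversion can be breakpointed or audited. */
static inline unsigned char to_uchar_checked(char ch)
{
#if CHAR_MIN < 0
    /* Illustrative policy: trap negative chars while debugging.
       Drop this if bytes >= 0x80 are legitimate in your data. */
    assert(ch >= 0);
#endif
    return (unsigned char)ch;
}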
Difference
Slower with poor compilers, or with good compilers when the necessary optimization flags are turned off
The language doesn't force consistency; you might not notice you've forgotten to use the cast in places
Kind of strange, and Java-esque; one should probably accept and study C's weak typing and deal with it case by case rather than trying to conjure special functions to cope