What are the possible values for a valid pointer?

Does the standard restrict possible memory addresses (which I would interpret as possible values for a pointer)? Can your code rely on some values to be never used and still be fully portable?
I just saw this in our code base (it's a C library), and I was wondering whether it would always be OK. I'm not sure what the author meant by this, but it's obviously not just a check for a possible null.
int LOG(const char* func, const char* file,
        int lineno, severity level, const char* msg)
{
    const unsigned long u_func = (unsigned long)func;
    const unsigned long u_file = (unsigned long)file;
    const unsigned long u_msg  = (unsigned long)msg;
    if (u_func < 0x400 || u_file < 0x400 || u_msg < 0x400 ||
        (unsigned)lineno > 10000 || (unsigned)level > LOG_DEBUG)
    {
        fprintf(stderr, "log function called with bad args");
        return -1;
    }
    //...
}
Another possible use case for this would be storing boolean flags inside a pointer, instead of in a separate member variable, as an optimization. I think the C++11 small string optimization does this, but I might be wrong.
EDIT:
If it's implementation defined, as some of you have mentioned, can you detect it at compile time?

Does the standard restrict possible memory addresses (which I would interpret as possible values for a pointer)?
Neither the C++ standard nor, to my knowledge, the C standard restricts possible memory addresses.
Can your code rely on some values to be never used and still be fully portable?
A program that unconditionally relies on such an implementation-defined (or unspecified-by-the-standard) detail would not be fully portable to all concrete and theoretical standard implementations.
However, using platform detection macros, it may be possible to make the program portable by relying on the detail conditionally only on systems where the detail is reliable.
P.S. Another thing that you cannot technically rely on: unsigned long is not guaranteed to be able to represent all pointer values (uintptr_t is).
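For illustration, here is a minimal sketch of the flags-inside-a-pointer idea from the question, written against uintptr_t; the helper names are hypothetical, stealing pointer bits remains implementation-defined territory, and the _Static_assert shows one limited compile-time check in the spirit of the EDIT (it verifies that the low bit of an int* is free, not that any address range is unused):
#include <stdint.h>

// Assumption: int objects are aligned to at least 2 bytes, so the low
// bit of a valid int* is always zero and can carry a boolean flag.
_Static_assert(_Alignof(int) >= 2, "cannot steal the low pointer bit");

static void *tag_ptr(int *p, int flag)
{
    return (void *)((uintptr_t)p | (flag ? 1u : 0u));
}

static int *untag_ptr(void *p, int *flag_out)
{
    uintptr_t u = (uintptr_t)p;
    *flag_out = (int)(u & 1u);          // recover the flag
    return (int *)(u & ~(uintptr_t)1u); // recover the original pointer
}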

The standard term is 'safely-derived pointer'. It is implementation-defined whether you can use pointers that are not safely derived (for example, ones created from numeric constants) in your program.
You can check pointer safety model with std::pointer_safety: https://en.cppreference.com/w/cpp/memory/gc/pointer_safety

Related

Is it valid to compare 2 strings using type punning?

I had just discovered the mindf*** that is type-punning when learning C and while experimenting I ran this code:
char* str="abc";
void* n=(void*)str;
uint32_t str_in_int=*(uint32_t*)n;
printf("%u", str_in_int);
which obviously gave out a uint32_t integer.
Since I thought this was a pointer operation, I expected that different addresses would give different results, but each time I ran it I got the same result. I also stored a duplicate value in another variable and compared it with the original (in case there were some addressing shenanigans going on under the hood), and it still came out the same. The code:
char* str="abc";
char* str2="abc";
void* n=(void*)str;
void* n2=(void*)str2;
uint32_t str_in_int=*(uint32_t*)n;
uint32_t str_in_int2=*(uint32_t*)n2;
printf("%u %u", str_in_int,str_in_int2);
Is this a viable form of string comparison in case of smaller strings as an alternative to strcmp or comparing character by character? Also an example where the resulting uint is the same for different strings is also welcome if it exists.
It is Undefined Behaviour, as you break the strict aliasing rules.
The correct way of doing it:
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

char *str = "abc";
uint32_t x;
memcpy(&x, str, sizeof(x)); // copies 'a', 'b', 'c' and the terminating '\0'
printf("%"PRIu32"\n", x);
Most optimizing compilers will not emit an actual call to memcpy, and the performance will be the same as with the dangerous pointer punning.
https://godbolt.org/z/fnrn7b9jK
Is this a viable form of string comparison in case of smaller strings as an alternative to strcmp or comparing character by character?
Yes and no. Implementations of strcmp and memcmp may use techniques like this to compare multiple bytes at once. However, because an implementation of the C standard library is coordinated with the compiler, the code in the library implementation may use things that are not completely defined by the C standard, because they are defined by the compiler. Further, the library will be written to respect alignment requirements and memory mapping issues.
When similar code is written in an ordinary program, the semantics that are not fully defined by the C standard may be changed by the compiler, particularly when high optimization is requested. Most particularly, if an object is defined as an array of char but you use it as a uint32_t, the behavior is not defined by the C standard. Sometimes these issues can be worked around, as the library implementors do, but doing so requires a good knowledge of the C standard and the particular features of the compiler being used.
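To address the last part of the question with a concrete example (my own, not from the original post): reading only the first four bytes makes distinct longer strings compare equal, so even the memcpy variant is only a sound equality test when both strings fit entirely, terminator included, in the word being compared:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *a = "abcde";
    const char *b = "abcdf";    // differs only in the fifth character
    uint32_t ua, ub;
    memcpy(&ua, a, sizeof ua);  // first four bytes: 'a' 'b' 'c' 'd'
    memcpy(&ub, b, sizeof ub);  // same first four bytes
    printf("%d %d\n", ua == ub, strcmp(a, b) == 0); // prints "1 0"
    return 0;
}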

Thread-safely initializing a pointer just once

I'm writing a library function, say, count_char(const char *str, int len, char ch) that detects the supported SIMD extensions of the CPU it's running on and dispatches the call to, say, an AVX2- or SSE4.2-optimized version. Since I'd like to avoid the penalty of doing a couple of cpuid instructions per each call, I'm trying to do this just once the first time the function is called (which might be called by different threads simultaneously).
In C++ land I'd just do something like
int count_char(const char *str, int len, char ch) {
    static const auto fun_ptr = select_simd_function();
    return (*fun_ptr)(str, len, ch);
}
and rely on C++ semantics of static to guarantee that it's called exactly once without any race conditions. But what's the best way to do this in pure C?
This is what I've come up with:
1. Using atomic variables (which are also present in C) — rather error-prone and a bit harder to maintain.
2. Using pthread_once — not sure about what overhead it has, plus it might give headaches on Windows.
3. Forcing the library user to call another library function to initialize the pointer — in short, it won't work in my case, since this is actually the C part of a library for another language.
4. Aligning the pointer to 8 bytes and relying on x86 word-sized accesses being atomic — unportable to other architectures (should I later implement some PowerPC- or ARM-specific SIMD versions, say) and technically UB (at least in C++).
5. Using thread-local storage, marking fun_ptr as thread_local, and then doing something like
static thread_local fun_ptr_t fun_ptr = NULL;
if (!fun_ptr) {
    fun_ptr = select_simd_function();
}
return (*fun_ptr)(str, len, ch);
The upside is that the code is very clear and apparently correct, but I'm not sure about the performance implications of TLS, plus every thread will have to call select_simd_function() once (but that's probably not a big deal).
For me personally, (5) is the winner so far, followed closely by (1) (I'd probably even go with (1) if it weren't somebody else's very foundational library and I didn't want to embarrass myself with a likely faulty implementation).
So, what'd be the best option? Did I miss anything else?
If you can use C11, this would work (assuming your implementation supports threads - it's an optional feature):
#include <threads.h>

static fun_ptr_t fun_ptr = NULL;

static void init_fun_ptr(void)
{
    fun_ptr = select_simd_function();
}

fun_ptr_t get_simd_function(void)
{
    static once_flag flag = ONCE_FLAG_INIT;
    call_once(&flag, init_fun_ptr);
    return fun_ptr;
}
Of course, you mentioned Windows. I doubt MSVC supports this.
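If <threads.h> turns out to be unavailable, here is a minimal sketch of the atomics route (option 1 from the question); the typedef and extern declaration are assumptions matching the question's usage, and the data race is benign because select_simd_function() is assumed to return the same pointer in every thread:
#include <stdatomic.h>

typedef int (*fun_ptr_t)(const char *str, int len, char ch);
extern fun_ptr_t select_simd_function(void);

static _Atomic(fun_ptr_t) fun_ptr = NULL;

int count_char(const char *str, int len, char ch)
{
    fun_ptr_t f = atomic_load_explicit(&fun_ptr, memory_order_acquire);
    if (f == NULL) {
        // Several threads may race here; each computes the same pointer,
        // so at worst the cpuid detection runs once per racing thread.
        f = select_simd_function();
        atomic_store_explicit(&fun_ptr, f, memory_order_release);
    }
    return f(str, len, ch);
}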

Casting function pointers with different return types

This question is about using function pointers, which are not precisely compatible, but which I hope I can use nevertheless as long as my code relies only on the compatible parts. Let's start with some code to get the idea:
typedef void (*funcp)(int* val);

static void myFuncA(int* val) {
    *val *= 2;
    return;
}

static int myFuncB(int* val) {
    *val *= 2;
    return *val;
}

int main(void) {
    funcp f = NULL;
    int v = 2;

    f = myFuncA;
    f(&v);
    // now v is 4

    // explicit cast so the compiler will not complain
    f = (funcp)myFuncB;
    f(&v);
    // now v is 8

    return 0;
}
While the arguments of myFuncA and myFuncB are identical and fully compatible, the return values are not and are thus just ignored by the calling code. I tried the above code and it works correctly using GCC.
What I learned so far from here and here is that the functions are incompatible by definition of the standard and may cause undefined behavior. My intuition, however, tells me that my code example will still work correctly, since it does not rely in any way on the incompatible parts (the return value). However, in the answers to this question a possible corruption of the stack has been mentioned.
So my question is: Is my example above valid C code so it will always work as intended (without any side effects), or does it depend on the compiler?
EDIT:
I want to do this in order to use a "more powerful" function through a "less powerful" interface. In my example, funcp is the interface, but I would like to provide additional functionality like myFuncB for optional use.
Agreed it is undefined behaviour, don't do that!
Yes, the code functions, i.e. it doesn't fall over, but any value you read back after calling through a pointer with the wrong return type is undefined.
In a very old version of "C", the return type was unspecified, and int and void functions could be 'safely' intermixed, the integer value being returned in a designated accumulator register. I remember writing code using this 'feature'!
For almost anything else you might return, the results are likely to be fatal.
Going forward a few years: floating-point return values were often returned in an fp coprocessor register (we are still in the 80s), so you couldn't mix int and float return types; the state of the coprocessor would be confused if the caller did not strip off the value, or stripped off a value that was never placed there, causing an fp exception. Worse, if you built with fp emulation, the fp value might be returned on the stack, as described next. Also, on 32-bit builds it is possible to return 64-bit objects (on 16-bit builds, 32-bit objects), which would be returned either in multiple registers or on the stack. If they are on the stack and allocated with the wrong size, some stomping of locals will occur.
Now, C supports struct return types and return-value copy optimisations. All bets are off if you don't match the types correctly.
Also, in some calling conventions the caller allocates stack space for the parameters of the call, but the function itself releases the stack. Disagreement between caller and implementation on the number or types of parameters and return values would be fatal.
By default, C function names are exported and linked undecorated - just the function name defines the symbol - so different modules of your program can have conflicting views of a function's signature that survive linking and potentially generate very interesting runtime errors.
In C++, function names are highly decorated, primarily to allow overloading, but this also helps avoid signature mismatches. It keeps arguments in step; however (as noted by @Jens), the return type is not encoded into the decorated name, primarily because the return type is not used for overload resolution (though I think it can now occasionally influence it).
Yes, this is undefined behaviour, and you should never rely on undefined behaviour if you want to write portable code.
Functions with different return types can have different calling conventions. Your example will probably work for small return types, but when returning large structs (e.g., larger than 32 bits), some compilers will generate code where the struct is returned in a temporary memory area which must be cleaned up by the caller.
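For completeness, a sketch of the usual portable alternative, reusing the declarations from the question: instead of casting, wrap the "more powerful" function in a small adapter whose type matches funcp exactly, so every call goes through a correctly typed function pointer:
typedef void (*funcp)(int* val);

static int myFuncB(int* val) {
    *val *= 2;
    return *val;
}

// Adapter with exactly the signature funcp expects; calling it through
// funcp is well defined, and myFuncB's result is simply discarded here.
static void myFuncB_as_void(int* val) {
    (void)myFuncB(val);
}

int main(void) {
    funcp f = myFuncB_as_void; // no cast needed, the types match
    int v = 2;
    f(&v);                     // v is now 4
    return 0;
}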

Strict aliasing rule and strlen implementation of glibc

I have been reading about the strict aliasing rule for a while, and I'm starting to get really confused. First of all, I have read these questions and some answers:
strict-aliasing-rule-and-char-pointers
when-is-char-safe-for-strict-pointer-aliasing
is-the-strict-aliasing-rule-really-a-two-way-street
According to them (as far as I understand), accessing a char buffer using a pointer to another type violates the strict aliasing rule. However, the glibc implementation of strlen() has such code (with comments and the 64-bit implementation removed):
size_t strlen(const char *str)
{
    const char *char_ptr;
    const unsigned long int *longword_ptr;
    unsigned long int longword, magic_bits, himagic, lomagic;

    for (char_ptr = str; ((unsigned long int) char_ptr
                          & (sizeof (longword) - 1)) != 0; ++char_ptr)
        if (*char_ptr == '\0')
            return char_ptr - str;

    longword_ptr = (unsigned long int *) char_ptr;

    himagic = 0x80808080L;
    lomagic = 0x01010101L;

    for (;;)
    {
        longword = *longword_ptr++;

        if (((longword - lomagic) & himagic) != 0)
        {
            const char *cp = (const char *) (longword_ptr - 1);

            if (cp[0] == 0)
                return cp - str;
            if (cp[1] == 0)
                return cp - str + 1;
            if (cp[2] == 0)
                return cp - str + 2;
            if (cp[3] == 0)
                return cp - str + 3;
        }
    }
}
The longword_ptr = (unsigned long int *) char_ptr; line obviously makes an unsigned long int pointer alias a char buffer. I fail to understand what makes this possible. I see that the code takes care of alignment problems, so no issues there, but I think that is not related to the strict aliasing rule.
The accepted answer for the third linked question says:
However, there is a very common compiler extension allowing you to cast properly aligned pointers from char to other types and access them, however this is non-standard.
The only thing that comes to my mind is the -fno-strict-aliasing option - is this the case? I could not find it documented anywhere what the glibc implementors depend on, and the comments somehow imply that this cast is done without any concern, as if it were obvious that there will be no problems. That makes me think that it is indeed obvious and I am missing something silly, but my search failed me.
In ISO C this code would violate the strict aliasing rule (and also the rule that you cannot define a function with the same name as a standard library function). However, this code is not subject to the rules of ISO C. The standard library doesn't even have to be implemented in a C-like language; the standard only specifies that the implementation implements the behaviour of the standard functions.
In this case, we could say that the implementation is in a C-like GNU dialect, and if the code is compiled with the writer's intended compiler and settings then it would implement the standard library function successfully.
When writing the aliasing rules, the authors of the Standard only considered forms that would be useful, and should thus be mandated, on all implementations. C implementations are targeted toward a variety of purposes, and the authors of the Standard make no attempt to specify what a compiler must do to be suitable for any particular purpose (e.g. low-level programming) or, for that matter, any purpose whatsoever.
Code like the above, which relies upon low-level constructs, should not be expected to run on compilers that make no claim of being suitable for low-level programming. On the flip side, any compiler which can't support such code should be viewed as unsuitable for low-level programming. Note that compilers can employ type-based aliasing assumptions and still be suitable for low-level programming if they make a reasonable effort to recognize common aliasing patterns. Some compiler writers are very highly invested in a view of code which fits neither common low-level coding patterns nor the C Standard, but anyone writing low-level code should simply recognize that those compilers' optimizers are unsuitable for use with low-level code.
The wording of the standard is actually a bit more weird than the actual compiler implementations: The C standard talks about declared object types, but the compilers only ever see pointers to these objects. As such, when a compiler sees a cast from a char* to an unsigned long*, it has to assume that the char* is actually aliasing an object with a declared type of unsigned long, making the cast correct.
A word of caution: I assume that strlen() is compiled into a library that is later only linked to the rest of the application. As such, the optimizer does not see the use of the function when compiling it, forcing it to assume that the cast to unsigned long* is indeed legit. If you called strlen() with
short myString[] = {0x666f, 0x6f00, 0};
size_t length = strlen((char*)myString); //implementation now invokes undefined behavior!
the cast within strlen() is undefined behavior, and your compiler would be allowed to strip pretty much the entire body of strlen() if it saw your use while compiling strlen() itself. The only thing that allows strlen() to behave as expected in this call is the fact that strlen() is compiled separately as a library, hiding the undefined behavior from the optimizer, so the optimizer has to assume the cast to be legit when compiling strlen().
So, assuming that the optimizer cannot see the undefined behavior, the reason why casts from char* to anything else are dangerous is not aliasing, but alignment. On some hardware, weird stuff starts happening if you try to access memory through a misaligned pointer: the hardware might load data from the wrong address, raise an interrupt, or just process the requested memory load extremely slowly. That is why the C standard generally declares such casts undefined behavior.
Nevertheless, you see that the code in question actually handles the alignment issue explicitly (the first loop that contains the (unsigned long int) char_ptr & (sizeof (longword) - 1) subcondition). After that, the char* is properly aligned to be reinterpreted as unsigned long*.
Of course, all of this is not really compliant with the C standard, but it is compliant with the C implementation of the compiler that this code is meant to be compiled with. If the gcc people modified their compiler to act up on this bit of code, the glibc people would complain loudly enough that gcc would be changed back to handle this kind of cast correctly.
At the end of the day, standard C library implementations simply must violate strict aliasing rules to work properly and be efficient: strlen() needs to violate those rules for speed, and the malloc()/free() pair must be able to take a memory region that had a declared type of Foo and turn it into a memory region of declared type Bar, even though there is no malloc() call inside the malloc() implementation that would give the object a declared type in the first place. The abstraction of the C language simply breaks down at this level.
The underlying assumption is probably that the function is separately compiled, and not available for inlining or other cross function optimizations. This means that no compile time information flows inside or outside the function.
The function doesn't try to modify anything through a pointer, so there is no conflict.
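As an aside, the magic constants deserve a word. Below is a small self-contained demonstration of the zero-byte test from the glibc code above (using uint32_t for brevity; glibc sizes the constants to unsigned long), including the false-positive case that explains why the code re-checks each byte after a hit:
#include <stdint.h>
#include <stdio.h>

// Returns nonzero if some byte of w *might* be zero. The subtraction
// borrows through a zero byte and the mask catches the flipped high bit,
// but bytes >= 0x81 trigger false positives, hence the per-byte re-check.
static int maybe_has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & 0x80808080u) != 0;
}

int main(void)
{
    printf("%d\n", maybe_has_zero_byte(0x61626364u)); // "abcd": prints 0
    printf("%d\n", maybe_has_zero_byte(0x61620064u)); // zero byte: prints 1
    printf("%d\n", maybe_has_zero_byte(0xFF626364u)); // false positive: prints 1
    return 0;
}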

Variables defined and assigned at the same time

A coding style presentation that I attended lately in office advocated that variables should NOT be assigned (to a default value) when they are defined. Instead, they should be assigned a default value just before their use.
So, something like
int a = 0;
should be frowned upon.
Obviously, an example of 'int' is simplistic but the same follows for other types also like pointers etc.
Further, it was also mentioned that the C99 compatible compilers now throw up a warning in the above mentioned case.
The above approach looks useful to me only for structures i.e. you memset them only before use. This would be efficient if the structure is used (or filled) only in an error leg.
For all other cases, I find defining and assigning to a default value a prudent exercise as I have encountered a lot of bugs because of un-initialized pointers both while writing and maintaining code. Further, I believe C++ via constructors also advocates the same approach i.e. define and assign.
I am wondering why (if) the C99 standard does not like defining and assigning together. Is there any considerable merit in doing what the coding style presentation advocated?
Usually I'd recommend initialising variables when they are defined if the value they should have is known, and leave variables uninitialised if the value isn't. Either way, put them as close to their use as scoping rules allow.
Instead, they should be assigned a default value just before their use.
Usually you shouldn't use a default value at all. In C99 you can mix code and declarations, so there's no point defining the variable before you assign a value to it. If you know the value it's supposed to take, then there is no point in having a default value.
Further, it was also mentioned that the C99 compatible compilers now throw up a warning in the above mentioned case.
Not for the case you show - you don't get a warning for having int x = 0;. I strongly suspect that someone got this mixed up. Compilers warn if you use a variable without assigning a value to it, and if you have:
... some code ...
int x;
if ( a )
    x = 1;
else if ( b )
    x = 2;
// oops, forgot the last case: else x = 3;
return x * y;
then you will get a warning that x may be used without being initialised, at least with gcc.
You won't get a warning if you assign a value to x before the if, but it is irrelevant whether the assignment is done as an initialiser or as a separate statement.
Unless you have a particular reason to assign the value twice for two of the branches, there's no point assigning a default value to x first, as it stops the compiler from warning you when you haven't covered every branch.
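For reference, here is a complete, compilable version of that snippet (the file and function names are mine); with gcc, -Wall together with optimization enables the may-be-used-uninitialized analysis:
// warn.c - compile with: gcc -Wall -O2 -c warn.c
int f(int a, int b, int y)
{
    int x;
    if (a)
        x = 1;
    else if (b)
        x = 2;
    // oops, forgot the last case: else x = 3;
    return x * y; // gcc: 'x' may be used uninitialized
}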
There's no such requirement (or even guideline that I'm aware of) in C99, nor does the compiler warn you about it. It's simply a matter of style.
As far as coding style is concerned, I think you took things too literally. For example, your statement is right in the following case...
int i = 0;
for (; i < n; i++)
    do_something(i);
... or even in ...
int i = 1;
[some code follows here]
while (i < a)
    do_something(i);
... but there are other cases that, in my mind, are better handled with an early "declare and assign". Consider structures constructed on the stack or various OOP constructs, like in:
struct foo {
    int bar;
    void *private;
};

int my_callback(struct foo *foo)
{
    struct my_struct *my_struct = foo->private;
    [do something with my_struct]
    return 0;
}
Or like in (C99 struct initializers):
void do_something(int a, int b, int c)
{
    struct foo foo = {
        .a = a,
        .b = b + 1,
        .c = c / 2,
    };
    write_foo(&foo);
}
I sort of concur with the advice, even though I'm not altogether sure the standard says anything about it, and I very much doubt the bit about compiler warnings is true.
The thing is, modern compilers can and do detect the use of uninitialised variables. If you set your variables to default values at initialisation, you lose that detection. And default values can cause bugs too; certainly in the case of your example, int a = 0;. Who says 0 is an appropriate value for a?
In the 1990s, the advice would've been wrong. Nowadays, it's correct.
I find it highly useful to pre-assign some default data to variables so that I don't have to do (as many) null checks in code.
I have seen so many bugs due to uninitialized pointers that I always advocate declaring each pointer with NULL_PTR and each primitive with some invalid/default value.
Since I work on RTOSes and high-performance but low-resource systems, it is possible that the compilers we use do not catch non-initialized usage. Then again, I doubt even modern compilers can be relied on 100%.
In large projects where macros are extensively used, I have seen rare scenarios where even Klocwork/Purify failed to find non-initialized usage.
So I say stick with it as long as you are using plain old C/C++.
Modern languages like .NET can guarantee initialized variables, or give a compiler error for uninitialized variable usage. The following link does a performance analysis and finds a 10-20% performance hit for .NET; the analysis is quite detailed and explained well.
http://www.codeproject.com/KB/dotnet/DontInitializeVariables.aspx
