Thread-safely initializing a pointer just once - c

I'm writing a library function, say, count_char(const char *str, int len, char ch) that detects the supported SIMD extensions of the CPU it's running on and dispatches the call to, say, an AVX2- or SSE4.2-optimized version. Since I'd like to avoid the penalty of executing a couple of cpuid instructions on every call, I'm trying to do this just once, the first time the function is called (and it might be called by different threads simultaneously).
In C++ land I'd just do something like
int count_char(const char *str, int len, char ch) {
static const auto fun_ptr = select_simd_function();
return (*fun_ptr)(str, len, ch);
}
and rely on C++ semantics of static to guarantee that it's called exactly once without any race conditions. But what's the best way to do this in pure C?
This is what I've come up with:
1. Using atomic variables (that are also present in C): rather error-prone and a bit harder to maintain.
2. Using pthread_once: not sure about what overhead it has, plus it might give headaches on Windows.
3. Forcing the library user to call another library function to initialize the pointer: in short, it won't work in my case since this is actually the C bits of a library for another language.
4. Aligning the pointer by 8 bytes and relying on x86 word-sized accesses being atomic: unportable to other architectures (should I later implement some PowerPC or ARM-specific SIMD versions, say), technically UB (at least in C++).
5. Using thread-local storage and marking fun_ptr as thread_local and then doing something like
static thread_local fun_ptr_t fun_ptr = NULL;
if (!fun_ptr) {
fun_ptr = select_simd_function();
}
return (*fun_ptr)(str, len, ch);
The upside is that the code is very clear and apparently correct, but I'm not sure about the performance implications of TLS, plus every thread will have to call select_simd_function() once (but that's probably not a big deal).
For me personally, (5) is the winner so far, followed closely by (1) (I'd probably even go with (1) if it weren't somebody else's very foundational library and I didn't want to embarrass myself with a likely faulty implementation).
So, what'd be the best option? Did I miss anything else?

If you can use C11, this would work (assuming your implementation supports threads - it's an optional feature):
#include <threads.h>
static fun_ptr_t fun_ptr = NULL;
static void init_fun_ptr( void )
{
fun_ptr = select_simd_function();
}
fun_ptr_t get_simd_function( void )
{
static once_flag flag = ONCE_FLAG_INIT;
call_once( &flag, init_fun_ptr);
return ( fun_ptr );
}
Of course, you mentioned Windows. I doubt MSVC supports this.
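For comparison, option (1) from the question can be sketched with C11 atomics. This is only a sketch under assumptions: count_fn and count_char_scalar are hypothetical names, and select_simd_function() here is a stand-in that always returns the scalar fallback. It relies on the selector being idempotent, since two racing threads may both call it and both store the same pointer. Relaxed ordering is enough only because the pointer refers to code that already exists; if the selector also filled in lookup tables, release/acquire ordering would be needed.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef int (*count_fn)(const char *str, int len, char ch);

/* Hypothetical portable fallback; a real library would add SIMD variants. */
static int count_char_scalar(const char *str, int len, char ch)
{
    int n = 0;
    for (int i = 0; i < len; i++)
        n += (str[i] == ch);
    return n;
}

/* Stand-in for the question's select_simd_function(). It must be
 * idempotent, because two racing threads may both call it. */
static count_fn select_simd_function(void)
{
    return count_char_scalar;
}

/* Static storage duration: zero-initialized, so it starts out as NULL. */
static _Atomic(count_fn) fun_ptr;

int count_char(const char *str, int len, char ch)
{
    count_fn f = atomic_load_explicit(&fun_ptr, memory_order_relaxed);
    if (f == NULL) {
        /* First call on this thread (or a race): select and publish. */
        f = select_simd_function();
        atomic_store_explicit(&fun_ptr, f, memory_order_relaxed);
    }
    return f(str, len, ch);
}
```

Compared with call_once, the worst case is that the selector runs more than once, which is harmless as long as it always returns the same answer.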

Related

Questions regarding (non-)volatile and optimizing compilers

I have the following C code:
/* the memory entry points to can be changed from another thread but
* it is not declared volatile */
struct myentry *entry;
bool isready(void)
{
return entry->status == 1;
}
bool isready2(int idx)
{
struct myentry *x = entry + idx;
return x->status == 1;
}
int main(void) {
/* busy loop */
while (!isready())
;
while (!isready2(5))
;
}
As I note in the comment, entry is not declared as volatile even though the array it points to can be changed from another thread (or actually even directly from kernel space).
Is the above code incorrect / unsafe? My thinking is that no optimization could be performed in the bodies of isready, isready2 and since I repeatedly perform function calls from within main the appropriate memory location should be read on every call.
On the other hand, the compiler could inline these functions. Is it possible that it does it in a way that results in a single read happening (hence causing an infinite loop) instead of multiple reads (even if these reads come from a load/store buffer)?
And a second question. Is it possible to prevent the compiler from doing optimizations by casting to volatile only in certain places like that?
void func(void)
{
entry->status = 1;
while (((volatile struct myentry *) entry)->status != 2)
;
}
Thanks.
If the memory entry points to can be modified by another thread, then the program has a data race and therefore the behaviour is undefined. This is still true even if volatile is used.
To have a variable accessed concurrently by multiple threads, in ISO C11, it must either be an atomic type, or protected by correct synchronization.
If using an older standard revision then there are no guarantees provided by the Standard about multithreading so you are at the mercy of any idiosyncratic behaviour of your compiler.
If using POSIX threads, there are no portable atomic operations, but it does define synchronization primitives.
See also:
Why is volatile not considered useful in multithreaded C or C++ programming?
The second question is a bugbear. I would suggest not doing it, because different compilers may interpret the meaning differently, and the behaviour is still formally undefined either way.
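As a rough illustration of the C11 route, the status field can be made atomic so that every poll is a real, race-free load. This is a sketch under assumptions: the layout of struct myentry, the entries array and the mark_ready() helper are made up for the example, and it only covers threads inside the same program, not memory written directly from kernel space.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct myentry {
    atomic_int status; /* written by one thread, polled by another */
};

/* Hypothetical backing storage; the question does not show where
 * entry points. */
static struct myentry entries[8];
static struct myentry *entry = entries;

bool isready(void)
{
    /* The atomic load cannot be hoisted out of the caller's loop. */
    return atomic_load_explicit(&entry->status, memory_order_acquire) == 1;
}

bool isready2(int idx)
{
    return atomic_load_explicit(&entry[idx].status, memory_order_acquire) == 1;
}

/* Hypothetical writer side, to be called from the other thread. */
void mark_ready(int idx)
{
    atomic_store_explicit(&entry[idx].status, 1, memory_order_release);
}
```

The busy loops in main can then stay exactly as written in the question.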

Avoiding variable-length stack arrays at compiletime

I've implemented a function that requires some temporary stack space, the amount of which depends on one of its inputs. That smells like variable-length stack memory allocation, which is not always considered a good idea (e.g., it's not part of C90 or C++, and, in that context, only available in gcc through an extension). However, my situation is slightly different: I do know how many bytes I'll end up allocating at compile-time, it's just that it's different for several different calls to this function, sprinkled around my codebase.
C99 seems to be fine with this, but that's not what e.g. Visual Studio implements, and thus my CI runs on Windows are failing.
It seems that I have a few options, none of which are great. I hope this question can either convince me of one of these, or provide a more idiomatic alternative.
1. Allocate the stack space outside of the function call, based on the compile-time constant that I'd otherwise pass as a parameter, and then pass a pointer.
2. Turn my function into a macro.
3. Turn my function into a wrapper-macro that then allocates the stack space and passes it on to the 'real' function (essentially combining 1 and 2).
4. Somehow convince Visual Studio that this is fine (relevant NMakefile).
The goal here is not only to get something that works and is reasonably performant but also that is readable and clean, as that strongly aligns with the context of the project this is part of. I should note that allocation on the heap is also not an option here.
How can I best deal with this?
If you prefer more hands-on, real-world context, here's a Github comment where I describe my specific instance of this problem.
Apparently MSVC does handle C99 compound literals (§6.5.2.5), so you can pass stack-allocated arrays directly to the called function as additional arguments. You might want to use a macro to simplify the call syntax.
Here's an example:
/* Function which needs two temporary arrays. Both arrays and the size
* are passed as arguments
*/
int process(const char* data, size_t n_elems, char* stack, int* height) {
/* do the work */
}
/* To call the function with n_elems known at compile-time */
int result = process(data, N, (char[N]){0}, (int[N]){0});
/* Or you might use a macro like this: */
#define process_FIXED(D, N) (process(D, N, (char[N]){0}, (int[N]){0}))
int result = process_FIXED(data, N);
The process function doesn't need to know how the temporaries are allocated; the caller could just as well malloc the arrays (and free them after the call) or use a VLA or alloca to stack-allocate them.
Compound literals are always initialised, which has a cost. But they cannot be too large anyway, because otherwise you risk stack overflow, so the overhead shouldn't be excessive. That's your call. Note that in C, an initialiser list cannot be empty, although GCC seems to accept (char[N]){} without complaint. MSVC complains, or at least the online compiler I found for it complains.
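To make the pattern concrete, here is a small self-contained sketch. The names median_of, MEDIAN_FIXED and demo_median are hypothetical and not from the question. The worker receives its scratch array from the caller, and the macro supplies it as a compound literal whose size is an integer constant expression, so no VLA is involved.

```c
#include <stddef.h>

/* Hypothetical worker: insertion-sorts a copy of data into the
 * caller-provided scratch array and returns the middle element. */
int median_of(const int *data, size_t n, int *scratch)
{
    for (size_t i = 0; i < n; i++) {
        int v = data[i];
        size_t j = i;
        while (j > 0 && scratch[j - 1] > v) {
            scratch[j] = scratch[j - 1];
            j--;
        }
        scratch[j] = v;
    }
    return scratch[n / 2];
}

/* N must be an integer constant expression, so (int[N]){0} is an
 * ordinary fixed-size array living in the caller's block. */
#define MEDIAN_FIXED(data, N) (median_of((data), (N), (int[N]){0}))

int demo_median(void)
{
    static const int samples[5] = { 9, 2, 7, 4, 5 };
    return MEDIAN_FIXED(samples, 5);
}
```

The compound literal lives until the end of the enclosing block, so it stays valid for the whole call.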
You could try to offer both:
module.h
// Helper macro for calculating correct buffer size
#define CALC_SIZE(quantity) (/* expands to integer constant expression */)
// C90 compatible function
void func(uint8_t * data, int quantity);
// Optional function for newer compilers
// uses CALC_SIZE internally for simpler API similarly to 'userFunc' below
#if NOT_ANCIENT_COMPILER
void funcVLA(int quantity);
#endif
user.c
#include "module.h"
void userFunc(void) {
uint8_t buffer[CALC_SIZE(MY_QUANTITY)];
func(buffer, MY_QUANTITY);
}

What are the possible values for a valid pointer?

Does the standard restrict possible memory addresses (which I would interpret as possible values for a pointer)? Can your code rely on some values to be never used and still be fully portable?
I just saw this in our code base (it's a C library), and I was wondering if it would be always OK. I'm not sure what the author meant by this, but it's obviously not just a check for possible null.
int LOG(const char* func, const char* file,
int lineno, severity level, const char* msg)
{
const unsigned long u_func = (unsigned long)func;
const unsigned long u_file = (unsigned long)file;
const unsigned long u_msg = (unsigned long)msg;
if(u_func < 0x400 || u_file < 0x400 || u_msg < 0x400 ||
(unsigned)lineno > 10000 || (unsigned)level > LOG_DEBUG)
{
fprintf(stderr, "log function called with bad args");
return -1;
}
//...
}
Another possible use case for this would be storing boolean flags inside a pointer, instead in a separate member variable as an optimization. I think the C++11 small string optimization does this, but I might be wrong.
EDIT:
If it's implementation defined, as some of you have mentioned, can you detect it at compile time?
Does the standard restrict possible memory addresses (which I would interpret as possible values for a pointer)?
Neither the C++ standard nor (to my knowledge) the C standard restricts the possible memory addresses.
Can your code rely on some values to be never used and still be fully portable?
A program that unconditionally relies on such implementation defined (or unspecified by standard) detail would not be fully portable to all concrete and theoretical standard implementations.
However, using platform detection macros, it may be possible to make the program portable by relying on the detail conditionally only on systems where the detail is reliable.
P.S. Another thing that you cannot technically rely on: unsigned long is not guaranteed to be able to represent all pointer values (uintptr_t is, on implementations that provide it).
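A hedged sketch of how the check from the question could at least use the right integer type and fail at compile time where that type is missing. The names LOW_ADDRESS_LIMIT and looks_suspicious are made up for illustration, and whether addresses below 0x400 are really never valid remains a property of the platform, not of the language.

```c
#include <stdint.h>
#include <stddef.h>

/* uintptr_t is optional in the C standard; UINTPTR_MAX is defined
 * only when the type exists. */
#ifndef UINTPTR_MAX
#error "this implementation does not provide uintptr_t"
#endif

/* Hypothetical threshold, copied from the question's 0x400 check. */
#define LOW_ADDRESS_LIMIT ((uintptr_t)0x400)

/* Returns nonzero if the integer form of p is below the limit. What
 * that integer means is implementation-defined. */
int looks_suspicious(const void *p)
{
    return (uintptr_t)p < LOW_ADDRESS_LIMIT;
}
```

On common desktop platforms a null pointer converts to 0 and is therefore flagged, while pointers to real objects are not.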
The standard term is 'safely derived pointer'. It is implementation defined if you can use not safely derived pointers - for example, numeric constants - in your program.
You can check pointer safety model with std::pointer_safety: https://en.cppreference.com/w/cpp/memory/gc/pointer_safety

How much speed gain if using __INLINE__?

In my understanding, INLINE can speed up code execution. Is that right?
How much speed can we gain from it?
Ripped from here:
Yes and no. Sometimes. Maybe.
There are no simple answers. inline functions might make the code faster, they might make it slower. They might make the executable larger, they might make it smaller. They might cause thrashing, they might prevent thrashing. And they might be, and often are, totally irrelevant to speed.
inline functions might make it faster: As shown above, procedural integration might remove a bunch of unnecessary instructions, which might make things run faster.
inline functions might make it slower: Too much inlining might cause code bloat, which might cause "thrashing" on demand-paged virtual-memory systems. In other words, if the executable size is too big, the system might spend most of its time going out to disk to fetch the next chunk of code.
inline functions might make it larger: This is the notion of code bloat, as described above. For example, if a system has 100 inline functions each of which expands to 100 bytes of executable code and is called in 100 places, that's an increase of 1MB. Is that 1MB going to cause problems? Who knows, but it is possible that that last 1MB could cause the system to "thrash," and that could slow things down.
inline functions might make it smaller: The compiler often generates more code to push/pop registers/parameters than it would by inline-expanding the function's body. This happens with very small functions, and it also happens with large functions when the optimizer is able to remove a lot of redundant code through procedural integration — that is, when the optimizer is able to make the large function small.
inline functions might cause thrashing: Inlining might increase the size of the binary executable, and that might cause thrashing.
inline functions might prevent thrashing: The working set size (number of pages that need to be in memory at once) might go down even if the executable size goes up. When f() calls g(), the code is often on two distinct pages; when the compiler procedurally integrates the code of g() into f(), the code is often on the same page.
inline functions might increase the number of cache misses: Inlining might cause an inner loop to span across multiple lines of the memory cache, and that might cause thrashing of the memory-cache.
inline functions might decrease the number of cache misses: Inlining usually improves locality of reference within the binary code, which might decrease the number of cache lines needed to store the code of an inner loop. This ultimately could cause a CPU-bound application to run faster.
inline functions might be irrelevant to speed: Most systems are not CPU-bound. Most systems are I/O-bound, database-bound or network-bound, meaning the bottleneck in the system's overall performance is the file system, the database or the network. Unless your "CPU meter" is pegged at 100%, inline functions probably won't make your system faster. (Even in CPU-bound systems, inline will help only when used within the bottleneck itself, and the bottleneck is typically in only a small percentage of the code.)
There are no simple answers: You have to play with it to see what is best. Do not settle for simplistic answers like, "Never use inline functions" or "Always use inline functions" or "Use inline functions if and only if the function is less than N lines of code." These one-size-fits-all rules may be easy to write down, but they will produce sub-optimal results.
Copyright (C) Marshall Cline
Using inline makes the system use the substitution model of evaluation, but this is not guaranteed to be used all the time. If it is used, the generated code will be longer and may be faster, but when other optimizations are active, the substitution model is not always faster.
The reason I use inline function specifier (specifically, static inline), is not because of "speed", but because
static part tells the compiler the function is only visible in the current translation unit (the current file being compiled and included header files)
inline part tells the compiler it can include the implementation of the function at the call site, if it wants to
static inline tells the compiler that it can skip the function completely if it is not used at all in the current translation unit
(Specifically, the compiler that I use most with the options I use most, gcc -Wall, does issue a warning if a function marked static is unused; but will not issue a warning if a function marked static inline is unused.)
static inline tells us humans that the function is a macro-like helper function, with type checking added on top of the macro-like behavior.
Thus, in my opinion, the assumption that inline has anything to do with speed per se, is incorrect. Answering the stated question with a straight answer would be misleading.
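The "macro-like but type-checked" point can be seen in a tiny sketch. All names here are hypothetical. The macro pastes its argument in twice, so a side effect in the argument runs twice, while the static inline function evaluates its argument exactly once and checks it against the declared parameter type.

```c
/* Macro version: the argument text is substituted twice. */
#define SQUARE_MACRO(x) ((x) * (x))

/* static inline version: the argument is evaluated once and must be
 * convertible to int. */
static inline int square_inline(int x)
{
    return x * x;
}

static int calls = 0;

/* Helper with a visible side effect: counts how often it is called. */
static int next_value(void)
{
    return ++calls;
}

/* Number of next_value() calls made when squaring via the macro. */
int demo_macro_calls(void)
{
    calls = 0;
    (void)SQUARE_MACRO(next_value());
    return calls;
}

/* Same experiment via the static inline function. */
int demo_inline_calls(void)
{
    calls = 0;
    (void)square_inline(next_value());
    return calls;
}
```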
In my code, you see them associated with some data structures, or occasionally global variables.
A typical example is when I want to implement a Xorshift pseudorandom number generator in my own C code:
#include <inttypes.h>
static uint64_t prng_state = 1; /* Any nonzero uint64_t seed is okay */
static inline uint64_t prng_u64(void)
{
uint64_t state;
state = prng_state;
state ^= state >> 12;
state ^= state << 25;
state ^= state >> 27;
prng_state = state;
return state * UINT64_C(2685821657736338717);
}
The static uint64_t prng_state = 1; means that prng_state is a variable of type uint64_t, visible only in the current compilation unit, and initialized to 1. The prng_u64() function returns an unsigned 64-bit pseudorandom integer. However, if you do not use prng_u64(), the compiler will not generate code for it either.
Another typical use case is when I have data structures, and they need accessor functions. For example,
#ifndef GRID_H
#define GRID_H
#include <stdlib.h>
#include <stdio.h> /* for FILE, used by grid_load() and grid_save() */
typedef struct {
int rows;
int cols;
unsigned char *cell;
} grid;
#define GRID_INIT { 0, 0, NULL }
#define GRID_OUTSIDE -1
static inline int grid_get(grid *const g, const int row, const int col)
{
if (!g || row < 0 || col < 0 || row >= g->rows || col >= g->cols)
return GRID_OUTSIDE;
return g->cell[row * (size_t)(g->cols) + col];
}
static inline int grid_set(grid *const g, const int row, const int col,
const unsigned char value)
{
if (!g || row < 0 || col < 0 || row >= g->rows || col >= g->cols)
return GRID_OUTSIDE;
return g->cell[row * (size_t)(g->cols) + col] = value;
}
static inline void grid_init(grid *g)
{
g->rows = 0;
g->cols = 0;
g->cell = NULL;
}
static inline void grid_free(grid *g)
{
free(g->cell);
g->rows = 0;
g->cols = 0;
g->cell = NULL;
}
int grid_create(grid *g, const int rows, const int cols,
const unsigned char initial_value);
int grid_load(grid *g, FILE *handle);
int grid_save(grid *g, FILE *handle);
#endif /* GRID_H */
That header file defines some useful helper functions, and declares the functions grid_create(), grid_load(), and grid_save(), that would be implemented in a separate .c file.
(Yes, those three functions could be implemented in the header file just as well, but it would make the header file quite large. If you had a large project, spread over many translation units (.c source files), each one including the header file would get their own local copies of the functions. The accessor functions defined as static inline above are short and trivial, so it is perfectly okay for them to be copied here and there. The three functions I omitted are much larger.)

Best strategy to call an arbitrary function without using JMP or LCALL

In embedded C it is quite natural to have some fixed/generic algorithm but more than one possible implementation. This is due to several product presentations: sometimes options, other times it's just part of product roadmap strategies, such as optional RAM, a different IP-set MCU, an uprated frequency, etc.
In most of my projects I deal with that by decoupling the core stuff, the algorithm and the logic architecture, from the actual functions that implement outside state evaluation, time keeping, memory storage, etc.
Naturally, I use the C function pointers mechanism, and I use a set of meaningful names for those pointers. E.G.
unsigned char (*ucEvalTemperature)(int *);
That one stores temperature in an int, and the return is OKness.
Now imagine that for a specific configuration of the product I have a
unsigned char ucReadI2C_TMP75(int *);
a function that reads temperature on the I2C bus from the TMP75 device and
unsigned char ucReadCh2_ADC(unsigned char *);
a function that reads a diode voltage drop, read by an ADC, which is a way to evaluate temperature in very broad strokes.
It's the same basic functionality but on different option-set products.
In some configs I'll have ucEvalTemperature set up to point to ucReadI2C_TMP75, while in others I'll have ucReadCh2_ADC. In this mild case, to avoid problems, I should change the argument type to int, because the pointer is always the same size, but the function signature isn't, and the compiler will complain. OK, that's not a killer.
The issue becomes apparent on functions that may need to have different set of arguments. The signatures won't ever be right, and the compiler will not be able to resolve my Fpointers.
So, I have three ways:
use a global stack of arguments, and all functions are unsigned char Func(void);
use a helper function to each implementation that lets me switch on the right assignment to make/call;
use JMP / LCALL assembly calls (which of course is horrible), potentially causing major problems on the call stack.
None of these is elegant or composed. What's your approach/advice?
I usually prefer to have a layered architecture. The communication with the hardware is achieved with "drivers". The algorithm layers call functions (readTemp), which are implemented by the driver. The key point is that an interface needs to be defined and must be honoured by all driver implementations.
The higher layer should know nothing about how the temperature is read (It doesn't matter if you use TMP75 or an ADC). The disadvantage of the drivers architecture is that you generally can't switch a driver at runtime. For most embedded projects this is not a problem. If you want to do it, define function pointers to the functions exposed by the driver (which follow the common interface) and not to the implementation functions.
If you can, use struct "interfaces":
struct Device {
int (*read_temp)(int*, struct Device*);
} *dev;
call it:
dev->read_temp(&ret, dev);
If you need additional arguments, pack them inside Device
struct COMDevice {
struct Device d;
int port_nr;
};
and when you use this, just downcast.
Then, you'll create functions for your devices:
int foo_read_temp(int* ret, struct Device* dev) /* dev is unused here */
{
*ret = 100;
return 0;
}
int com_device_read_temp(int* ret, struct Device* dev)
{
struct COMDevice* cdev = (struct COMDevice*)dev; /* could be made as a macro */
... communicate with device on com port cdev->port_nr ...
*ret = ... what you got ...
return error;
}
and create the devices like this:
struct Device foo_device = { .read_temp = foo_read_temp };
struct COMDevice com_device_1 = { .d = { .read_temp = com_device_read_temp },
.port_nr = 0x3e8
};
struct COMDevice com_device_2 = { .d = { .read_temp = com_device_read_temp },
.port_nr = 0x3f8
};
You'll pass the Device structure around to function that need to read the temperature.
This (or something similar) is used in the Linux kernel, except that they don't put the function pointers directly inside the device structs; instead they create a special static struct for them and store a pointer to that struct inside the Device struct. This is almost exactly how object-oriented languages like C++ implement polymorphism.
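A minimal sketch of that "table of operations" variant, assuming made-up names (device_ops, fixed_ops, port_ops, device_read_temp) and placeholder driver bodies instead of real hardware access. Each driver type has one shared, read-only table, and every device instance only stores a pointer to it.

```c
struct device; /* forward declaration for the table below */

/* One shared, read-only operations table per driver type. */
struct device_ops {
    int (*read_temp)(struct device *dev, int *out);
};

struct device {
    const struct device_ops *ops; /* which driver handles this device */
    int port_nr;                  /* hypothetical per-instance data */
};

static int fixed_read_temp(struct device *dev, int *out)
{
    (void)dev; /* this driver needs no per-instance data */
    *out = 100;
    return 0;
}

static int port_read_temp(struct device *dev, int *out)
{
    /* Placeholder: a real driver would talk to the hardware here. */
    *out = dev->port_nr;
    return 0;
}

static const struct device_ops fixed_ops = { .read_temp = fixed_read_temp };
static const struct device_ops port_ops = { .read_temp = port_read_temp };

/* Generic code only knows struct device and dispatches via the table. */
int device_read_temp(struct device *dev, int *out)
{
    return dev->ops->read_temp(dev, out);
}
```

With many devices of the same type this costs one pointer per instance instead of one pointer per operation.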
If you put the functions in a separate compilation unit, including the Device struct that refer to them, you can still save space and leave them out while linking.
If you need different types of arguments, or fewer arguments, just forget it. This means you cannot design a common interface (in any sense) for things you want to change, but without a common interface, no changeable implementation is possible. You can use compile-time polymorphism (eg. typedefs & separate compilation units for different implementations, one of which would be linked in your binary), but it still has to be at least source-compatible, that is, be called in the same way.
The correct way is to use a helper function. Sure, the unsigned char ucReadCh2_ADC(unsigned char *); may look like it stores the result as a value [0,255]. But who says the actual range is [0,255] ? And even if it did, what would those values represent?
On the other hand, if you'd typedef unsigned long milliKelvin, the typedef unsigned char (*EvalTemperature)(milliKelvin *out); is a lot clearer. And for every function, it becomes clear how it should be wrapped - quite often a trivial function.
Note that I removed the "uc" prefix from the typedef since the function didn't return an unsigned char anyway. It returns a boolean, OK-ness.
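A sketch of what such wrappers might look like. The two driver functions are stand-ins with invented return values, and both unit conversions are assumptions for illustration only: the I2C driver is assumed to report hundredths of a degree Celsius, and the ADC calibration is entirely made up. Neither is taken from a datasheet.

```c
typedef unsigned long milliKelvin;
typedef unsigned char (*EvalTemperature)(milliKelvin *out);

/* Stand-ins for the question's drivers; real ones access hardware. */
static unsigned char ucReadI2C_TMP75(int *raw)
{
    *raw = 2500; /* assumed unit: 0.01 degC, i.e. 25.00 degC */
    return 1;
}

static unsigned char ucReadCh2_ADC(unsigned char *raw)
{
    *raw = 100; /* raw ADC count */
    return 1;
}

/* Wrapper: 0.01 degC to millikelvin (0 degC is 273150 mK). */
static unsigned char eval_tmp75(milliKelvin *out)
{
    int centi_c;
    if (!ucReadI2C_TMP75(&centi_c))
        return 0;
    *out = (milliKelvin)(273150L + 10L * centi_c);
    return 1;
}

/* Wrapper: made-up linear calibration, 250 K plus 0.5 K per count. */
static unsigned char eval_adc(milliKelvin *out)
{
    unsigned char raw;
    if (!ucReadCh2_ADC(&raw))
        return 0;
    *out = 250000UL + 500UL * raw;
    return 1;
}
```

The algorithm layer then holds a single EvalTemperature pointer and never sees the raw types or units of either driver.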
(Colbert fans may want to use a float to indicate truthiness ;-) )
