Is this a decent home-made mutex implementation? Criticism? Potential Problems? - c

I'm wondering if anyone sees anything that would likely cause problems in this code. I know there are other ways/API calls I could used to have done this, but I'm trying to lay the foundation for my own platform independant? / cross-platform mutex framework.
Obviously I need to do some #ifdef's and define some macros for the Win32 Sleep() and GetCurrentThreadID() calls...
typedef struct aec {
unsigned long long lastaudibleframe; /* time stamp of last audible frame */
unsigned short aiws; /* Average mike input when speaker is playing */
unsigned short aiwos; /*Average mike input when speaker ISNT playing */
unsigned long long t_aiws, t_aiwos; /* Internal running total */
unsigned int c_aiws, c_aiwos; /* Internal counters */
unsigned long lockthreadid;
int stlc; /* Same thread lock count */
} AEC;
char lockecho( AEC *ec ) {
unsigned long tid=0;
static int inproc=0;
while (inproc) {
Sleep(1);
}
inproc=1;
if (!ec) {
inproc=0;
return 0;
}
tid=GetCurrentThreadId();
if (ec->lockthreadid==tid) {
inproc=0;
ec->stlc++;
return 1;
}
while (ec->lockthreadid!=0) {
Sleep(1);
}
ec->lockthreadid=tid;
inproc=0;
return 1;
}
char unlockecho( AEC *ec ) {
unsigned long tid=0;
if (!ec)
return 1;
tid=GetCurrentThreadId();
if (tid!=ec->lockthreadid)
return 0;
if (tid==ec->lockthreadid) {
if (ec->stlc>0) {
ec->stlc--;
} else {
ec->lockthreadid=0;
}
}
return 1;
}

No it's not, AFAIK you can't implement a mutex with plain C code without some low-level atomic operations (RMW, Test and Set... etc).. In your particular example, consider what happens if a context switch interrupts the first thread before it gets a chance to set inproc, then the second thread will resume and set it to 1 and now both threads "think" they have exclusive access to the struct.. this is just one of many things that could go wrong with your approach.
Also note that even if a thread gets a chance to set inproc, assignment is not guranteed to be atomic (it could be interrupted in the middle of assigning the variable).

As mux points out, your proposed code is incorrect due to many race conditions. You could solve this using atomic instructions like "Compare and Set", but you'll need to define those separately for each platform anyway. You're better off just defining a high-level "Lock" and "Unlock" interface, and implementing those using whatever the platform provides.

Related

Memory ordering for a spin-lock "call once" implementation

Suppose I wanted to implement a mechanism for calling a piece of code exactly once (e.g. for initialization purposes), even when multiple threads hit the call site repeatedly. Basically, I'm trying to implement something like pthread_once, but with GCC atomics and spin-locking. I have a candidate implementation below, but I'd like to know if
a) it could be faster in the common case (i.e. already initialized), and,
b) is the selected memory ordering strong enough / too strong?
Architectures of interest are x86_64 (primarily) and aarch64.
The intended use API is something like this
void gets_called_many_times_from_many_threads(void)
{
static int my_once_flag = 0;
if (once_enter(&my_once_flag)) {
// do one-time initialization here
once_commit(&my_once_flag);
}
// do other things that assume the initialization has taken place
}
And here is the implementation:
int once_enter(int *b)
{
int zero = 0;
int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
if (got_lock) return 1;
while (2 != __atomic_load_n(b, __ATOMIC_ACQUIRE)) {
// on x86, insert a pause instruction here
};
return 0;
}
void once_commit(int *b)
{
(void) __atomic_store_n(b, 2, __ATOMIC_RELEASE);
}
I think that the RELAXED ordering on the compare exchange is okay, because we don't skip the atomic load in the while condition even if the compare-exchange gives us 2 (in the "zero" variable), so the ACQUIRE on that load synchronizes with the RELEASE in once_commit (I think), but maybe on a successful compare-exchange we need to use RELEASE? I'm unclear here.
Also, I just learned that lock cmpxchg is a full memory barrier on x86, and since we are hitting the __atomic_compare_exchange_n in the common case (initialization has already been done), that barrier it is occurring on every function call. Is there an easy way to avoid this?
UPDATE
Based on the comments and accepted answer, I've come up with the following modified implementation. If anybody spots a bug please let me know, but I believe it's correct. Basically, the change amounts to implementing double-check locking. I also switched to using SEQ_CST because:
I mainly care that the common (already initialized) case is fast.
I observed that GCC doesn't emit a memory fence instruction on x86 for the first read (and it does do so on ARM even with ACQUIRE).
#ifdef __x86_64__
#define PAUSE() __asm __volatile("pause")
#else
#define PAUSE()
#endif
int once_enter(int *b)
{
if(2 == __atomic_load_n(b, __ATOMIC_SEQ_CST)) return 0;
int zero = 0;
int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
if (got_lock) return 1;
while (2 != __atomic_load_n(b, __ATOMIC_SEQ_CST)) {
PAUSE();
};
return 0;
}
void once_commit(int *b)
{
(void) __atomic_store_n(b, 2, __ATOMIC_SEQ_CST);
}
a, What you need is a double-checked lock.
Basically, instead of entering the lock every time, you do an acquiring-load to see if the initialisation has been done yet, and only invoke once_enter if it has not.
void gets_called_many_times_from_many_threads(void)
{
static int my_once_flag = 0;
if (__atomic_load_n(&my_once_flag, __ATOMIC_ACQUIRE) != 2) {
if (once_enter(&my_once_flag)) {
// do one-time initialization here
once_commit(&my_once_flag);
}
}
// do other things that assume the initialization has taken place
}
b, I believe this is enough, your initialisation happens before the releasing store of 2 to my_once_flag, and every other thread has to observe the value of 2 with an acquiring load from the same variable.

would this be a functioning spinlock in c?

I am currently learning about operating systems and was wondering, using these definitions, would this be working as expected or am I missing some atomic operations?
int locked = 0;
void lsLock(){
while(1) {
if (locked){
continue;
} else {
locked = 1;
break;
}
}
}
void lsUnlock(){
locked = 0;
}
Thanks in advance!
The first problem is that the compiler assumes that nothing will be altered by anything "external" (unless it's marked volatile); which means that the compiler is able to optimize this:
int locked = 0;
void lsLock(){
while(1) {
if (locked){
continue;
} else {
locked = 1;
break;
}
}
}
..into this:
int locked = 0;
void lsLock(){
if (locked){
while(1) { }
}
locked = 1;
}
Obviously that won't work if something else modifies locked - once the while(1) {} starts it won't stop. To fix that, you could use volatile int locked = 0;.
This only prevents the compiler from assuming locked didn't change. Nothing prevents a theoretical CPU from playing similar tricks (e.g. even if volatile is used, a non-cache coherent CPU could not realize a different CPU altered locked). For a guarantee you need to use atomics or something else (e.g. memory barriers).
However; with volatile int locked it may work, especially for common 80x86 CPUs. Please note that "works" can be considered the worst possibility - it leads to assuming that the code is fine and then having a spectacular (and nearly impossible to debug due to timing issues) disaster when you compile the same code for a different CPU.

Fast non blocking keyboard IO in C under MinGW

I have written a CPU emulator in C on windows for fun, and I want it to handle its own IO in a non-blocking fashion: if there has been a keypress, return the char value of that keypress, else return 0.
At the moment I am using the following:
#include <conio.h>
...
unsigned int input(){
unsigned int input_data;
if (_kbhit()){
input_data = (unsigned int)_getch();
}
else{
input_data = 0;
}
return input_data;
}
And in terms of function, it is fine. The one problem I have is that it is very detrimental to the speed of the emulator - the emulator can go from 60-100 million instructions per second to the scale of tens or hundreds of thousands, just by running programs with lots of IO instructions. Is there a faster way to do this, whilst still keeping the same functionality?
Two options comes to my mind:
The first option is the easiest one. Do not check it every time. OS calls are expensive, and if your emulator calls this very often, it will slow everything down.
#include <conio.h>
...
unsigned int input(){
static int cheat = 0;
cheat = (cheat + 1) % 128;
if (cheat){
return 0;
}
unsigned int input_data;
if (_kbhit()){
input_data = (unsigned int)_getch();
}
else{
input_data = 0;
}
return input_data;
}
Second option is to receive the actual keyboard input async and store the input data into a buffer. And your input() function checks this buffer. This removes the call to the OS all together in the tight loop.

C: Minimising code duplication using functions in a header file

this is a bit of a strange use case so searching for existing discussion is difficult. I'm programming for embedded systems (Microchip PIC24 using XC16 compiler) and am currently implementing a communication protocol identically across 3 separate UART channels (each UART will grab data from a master data table).
The way I started out writing the project was to have each UART handled by a separate module, with a lot of code duplication, along the lines of the following pseudocode:
UART1.c:
static unsigned char buffer[128];
static unsigned char pointer = 0;
static unsigned char packet_received = 0;
void interrupt UART1Receive (void) {
buffer[pointer++] = UART1RX_REG;
if (end of packet condition) packet_received = 1;
}
void processUART1(void) { // This is called regularly from main loop
if (packet_received) {
// Process packet
}
}
UART2.c:
static unsigned char buffer[128];
static unsigned char pointer = 0;
static unsigned char packet_received = 0;
void interrupt UART2Receive (void) {
buffer[pointer++] = UART2RX_REG;
if (end of packet condition) packet_received = 1;
}
void processUART2(void) { // This is called regularly from main loop
if (packet_received) {
// Process packet
}
}
While the above is neat and works well, in practice the communication protocol itself is quite complex, so having it duplicated three times (simply with changes to references to the UART registers) is increasing the opportunity for bugs to be introduced. Having a single function and passing pointers to it is not an option, since this will have too great an impact on speed. The code needs to be physically duplicated in memory for each UART.
I gave it a lot of thought and despite knowing the rules of never putting functions in a header file, decided to try a specific header file that included the duplicate code, with references as #defined values:
protocol.h:
// UART_RECEIVE_NAME and UART_RX_REG are just macros to be defined
// in calling file
void interrupt UART_RECEIVE_NAME (void) {
buffer[pointer++] = UART_RX_REG;
if (end of packet condition) packet_received = 1;
}
UART1.c:
static unsigned char buffer[128];
static unsigned char pointer = 0;
static unsigned char packet_received = 0;
#define UART_RECEIVE_NAME UART1Receive
#define UART_RX_REG UART1RX_REG
#include "protocol.h"
void processUART1(void) { // This is called regularly from main loop
if (packet_received) {
// Process packet
}
}
UART2.c:
static unsigned char buffer[128];
static unsigned char pointer = 0;
static unsigned char packet_received = 0;
#define UART_RECEIVE_NAME UART2Receive
#define UART_RX_REG UART2RX_REG
#include "protocol.h"
void processUART2(void) { // This is called regularly from main loop
if (packet_received) {
// Process packet
}
}
I was slightly surprised when the code compiled without any errors! It does seem to work though, and post compilation MPLAB X can even work out all of the symbol references so that every macro reference in UART1.c and UART2.c don't get identified as an unresolvable identifier. I did then realise I should probably rename the protocol.h file to protocol.c (and update the #includes accordingly), but that's not practically a big deal.
There is only one downside: the IDE has no idea what to do while stepping through code included from protocol.h while simulating or debugging. It just stays at the calling instruction while the code executes, so debugging will be a little more difficult.
So how hacky is this solution? Will the C gods smite me for even considering this? Are there any better alternatives that I've overlooked?
An alternative is to define a function macro that contains the body of code. Some token pasting operators can automatically generate the symbol names required. Multi-line macros can be generated by using \ at the end of all but the last line.
#define UART_RECEIVE(n) \
void interrupt UART##n##Receive (void) { \
buffer[pointer++] = UART##n##RX_REG; \
if (end of packet condition) packet_received = 1; \
}
UART_RECEIVE(1)
UART_RECEIVE(2)
Using macros for this purpose seems for mee to be a bad idea. Making debugging impossible is just one disadvantage. It also makes it difficult to understand, by hiding the real meaning of symbols. And interrupt routines should realy be kept independant and short, with common functions hidden in handler functions.
The first thing I would do is to define a common buffer struct for each UART. This makes it possible with simultanous communications. If each uart needs a separate handler function for the messages, it can be included as a function pointer. The syntax is a bit
complicated, but it results in efficient code.
typedef struct uart_buf uart_buf_t;
struct uart_buf {
uint8_t* buffer;
int16_t inptr;
bool packet_received;
void (*handler_func)(uart_buf_t*);
};
uart_buf_t uart_buf_1;
uart_buf_t uart_buf_2;
Then each interrupt handler will be like this:
void interrupt UART1Receive (void) {
handle_input(UART1RX_REG, &uart_buf_1);
}
void interrupt UART2Receive (void) {
handle_input(UART2RX_REG, &uart_buf_2);
}
And the common handler will be:
void handle_input(uint8_t in_char, *buff) {
buf->buffer[buf->inptr++] = in_char;
if (in_char=LF)
buf->packet_received = true;
buf->handler_func(buf);
}
}
And the message handler is:
void hadle_packet(uart_buf_t* buf) {
... code to handle message
buf->packet_received=0;
}
And the function pointers must be initialized:
void init() {
uart_buf_1.handler_func=handler1;
uart_buf_2.handler_func=handler1;
}
The resulting code is very flexible, and can be easily changed. Single-steping the code is no problem.

What's a good way to implement simple clone() based multithread library?

I'm trying to build simple multithread library based on linux using clone() and other kernel utilities.I've come to a point where I'm not really sure what's the correct way to do things. I tried going trough original NPTL code but it's a bit too much.
That's how for instance I imagine the create method:
typedef int sk_thr_id;
typedef void *sk_thr_arg;
typedef int (*sk_thr_func)(sk_thr_arg);
sk_thr_id sk_thr_create(sk_thr_func f, sk_thr_arg a){
void* stack;
stack = malloc( 1024*64 );
if ( stack == 0 ){
perror( "malloc: could not allocate stack" );
exit( 1 );
}
return ( clone(f, (char*) stack + FIBER_STACK, SIGCHLD | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_VM, a ) );
}
1: I'm not really sure what the correct clone() flags should be. I just found these being used in a simple example. Any general directions here will be welcome.
Here are parts of the mutex primitives created using futexes(not my own code for now):
#define cmpxchg(P, O, N) __sync_val_compare_and_swap((P), (O), (N))
#define cpu_relax() asm volatile("pause\n": : :"memory")
#define barrier() asm volatile("": : :"memory")
static inline unsigned xchg_32(void *ptr, unsigned x)
{
__asm__ __volatile__("xchgl %0,%1"
:"=r" ((unsigned) x)
:"m" (*(volatile unsigned *)ptr), "0" (x)
:"memory");
return x;
}
static inline unsigned short xchg_8(void *ptr, char x)
{
__asm__ __volatile__("xchgb %0,%1"
:"=r" ((char) x)
:"m" (*(volatile char *)ptr), "0" (x)
:"memory");
return x;
}
int sys_futex(void *addr1, int op, int val1, struct timespec *timeout, void *addr2, int val3)
{
return syscall(SYS_futex, addr1, op, val1, timeout, addr2, val3);
}
typedef union mutex mutex;
union mutex
{
unsigned u;
struct
{
unsigned char locked;
unsigned char contended;
} b;
};
int mutex_init(mutex *m, const pthread_mutexattr_t *a)
{
(void) a;
m->u = 0;
return 0;
}
int mutex_lock(mutex *m)
{
int i;
/* Try to grab lock */
for (i = 0; i < 100; i++)
{
if (!xchg_8(&m->b.locked, 1)) return 0;
cpu_relax();
}
/* Have to sleep */
while (xchg_32(&m->u, 257) & 1)
{
sys_futex(m, FUTEX_WAIT_PRIVATE, 257, NULL, NULL, 0);
}
return 0;
}
int mutex_unlock(mutex *m)
{
int i;
/* Locked and not contended */
if ((m->u == 1) && (cmpxchg(&m->u, 1, 0) == 1)) return 0;
/* Unlock */
m->b.locked = 0;
barrier();
/* Spin and hope someone takes the lock */
for (i = 0; i < 200; i++)
{
if (m->b.locked) return 0;
cpu_relax();
}
/* We need to wake someone up */
m->b.contended = 0;
sys_futex(m, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
return 0;
}
2: The main question for me is how to implement the "join" primitive? I know it's supposed to be based on futexes too. It's a struggle for me for now to come up with something.
3: I need some way to cleanup stuff(like the allocated stack) after a thread has finished. I can't really thing of a good way to do this too.
Probably for these I'll need to have additional structure in user space for every thread with some information saved in it. Can someone point me in good direction for solving these issues?
4: I'll want to have a way to tell how much time a thread has been running, how long it's been since it's last being scheduled and other stuff like that. Are there some kernel calls providing such info?
Thanks in advance!
The idea that there can exist a "multithreading library" as a third-party library separate from the rest of the standard library is an outdated and flawed notion. If you want to do this, you'll have to first drop all use of the standard library; particularly, your call to malloc is completely unsafe if you're calling clone yourself, because:
malloc will have no idea that multiple threads exist, and therefore may fail to perform proper synchronization.
Even if it knew they existed, malloc will need to access an unspecified, implementation-specific structure located at the address given by the thread pointer. As this structure is implementation-specific, you have no way of creating such a structure that will be interpreted correctly by both the current and all future versions of your system's libc.
These issues don't apply just to malloc but to most of the standard library; even async-signal-safe functions may be unsafe to use, as they might dereference the thread pointer for cancellation-related purposes, performing optimal syscall mechanisms, etc.
If you really insist on making your own threads implementation, you'll have to abstain from using glibc or any modern libc that's integrated with threads, and instead opt for something much more naive like klibc. This could be an educational experiment, but it would not be appropriate for a deployed application.
1) You are using an example of LinuxThreads. I will not rewrite good references for directions, but I advise you "The Linux Programming interface" of Michael Kerrisk, chapter 28. It explains in 25 pages, what you need.
2) If you set the CLONE_CHILD_CLEARID flag, when the child terminates, the ctid argument of clone is cleared. If you treat that pointer as a futex, you can implement the join primitive. Good luck :-) If you don't want to use futexes, have also a look to wait3 and wait4.
3) I do not know what you want to cleanup, but you can use the clone tls arugment. This is a thread local storage buffer. If the thread is finished, you can clean that buffer.
4) See getrusage.

Resources