Lock-free stack implementation idea - currently broken - C

I came up with an idea I am trying to implement for a lock-free stack that does not rely on reference counting to resolve the ABA problem, and also handles memory reclamation properly. It is similar in concept to RCU, and relies on two features: marking a list entry as removed, and tracking readers traversing the list. The former is simple: it just uses the LSB of the pointer. The latter is my "clever" attempt at an approach to implementing an unbounded lock-free stack.
Basically, when any thread attempts to traverse the list, one atomic counter (list.entries) is incremented. When the traversal is complete, a second counter (list.exits) is incremented.
Node allocation is handled by push, and deallocation is handled by pop.
The push and pop operations are fairly similar to those of the naive lock-free stack implementation, but the nodes marked for removal must be traversed to arrive at a non-marked entry. Push is therefore basically much like a linked-list insertion.
The pop operation similarly traverses the list, but it uses atomic_fetch_or to mark the nodes as removed while traversing, until it reaches a non-marked node.
After traversing the list of 0 or more marked nodes, a thread that is popping will attempt to CAS the head of the stack. At least one thread concurrently popping will succeed, and after this point all readers entering the stack will no longer see the formerly marked nodes.
The thread that successfully updates the list then loads the atomic list.entries, and basically spin-loads atomic.exits until that counter finally exceeds list.entries. This should imply that all readers of the "old" version of the list have completed. The thread then simply frees the list of marked nodes that it swapped off the top of the list.
So the implications from the pop operation should be (I think) that there can be no ABA problem because the nodes that are freed are not returned to the usable pool of pointers until all concurrent readers using them have completed, and obviously the memory reclamation issue is handled as well, for the same reason.
So anyhow, that is the theory, but I'm still scratching my head over the implementation, because it is currently not working (in the multithreaded case). It seems like I am getting some write-after-free issues among other things, but I'm having trouble spotting the problem, or maybe my assumptions are flawed and it just won't work.
Any insights would be greatly appreciated, both on the concept, and on approaches to debugging the code.
Here is my current (broken) code (compile with gcc -D_GNU_SOURCE -std=c11 -Wall -O0 -g -pthread -o list list.c):
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <stdio.h>
#include <unistd.h>
#define NUM_THREADS 8
#define NUM_OPS (1024 * 1024)
typedef uint64_t list_data_t;
typedef struct list_node_t {
struct list_node_t * _Atomic next;
list_data_t data;
} list_node_t;
typedef struct {
list_node_t * _Atomic head;
int64_t _Atomic size;
uint64_t _Atomic entries;
uint64_t _Atomic exits;
} list_t;
enum {
NODE_IDLE = (0x0),
NODE_REMOVED = (0x1 << 0),
NODE_FREED = (0x1 << 1),
NODE_FLAGS = (0x3),
};
static __thread struct {
uint64_t add_count;
uint64_t remove_count;
uint64_t added;
uint64_t removed;
uint64_t mallocd;
uint64_t freed;
} stats;
#define NODE_IS_SET(p, f) (((uintptr_t)p & f) == f)
#define NODE_SET_FLAG(p, f) ((void *)((uintptr_t)p | f))
#define NODE_CLR_FLAG(p, f) ((void *)((uintptr_t)p & ~f))
#define NODE_POINTER(p) ((void *)((uintptr_t)p & ~NODE_FLAGS))
list_node_t * list_node_new(list_data_t data)
{
list_node_t * new = malloc(sizeof(*new));
new->data = data;
stats.mallocd++;
return new;
}
void list_node_free(list_node_t * node)
{
free(node);
stats.freed++;
}
static void list_add(list_t * list, list_data_t data)
{
atomic_fetch_add_explicit(&list->entries, 1, memory_order_seq_cst);
list_node_t * new = list_node_new(data);
list_node_t * _Atomic * next = &list->head;
list_node_t * current = atomic_load_explicit(next, memory_order_seq_cst);
do
{
stats.add_count++;
while ((NODE_POINTER(current) != NULL) &&
NODE_IS_SET(current, NODE_REMOVED))
{
stats.add_count++;
current = NODE_POINTER(current);
next = &current->next;
current = atomic_load_explicit(next, memory_order_seq_cst);
}
atomic_store_explicit(&new->next, current, memory_order_seq_cst);
}
while(!atomic_compare_exchange_weak_explicit(
next, &current, new,
memory_order_seq_cst, memory_order_seq_cst));
atomic_fetch_add_explicit(&list->exits, 1, memory_order_seq_cst);
atomic_fetch_add_explicit(&list->size, 1, memory_order_seq_cst);
stats.added++;
}
static bool list_remove(list_t * list, list_data_t * pData)
{
uint64_t entries = atomic_fetch_add_explicit(
&list->entries, 1, memory_order_seq_cst);
list_node_t * start = atomic_fetch_or_explicit(
&list->head, NODE_REMOVED, memory_order_seq_cst);
list_node_t * current = start;
stats.remove_count++;
while ((NODE_POINTER(current) != NULL) &&
NODE_IS_SET(current, NODE_REMOVED))
{
stats.remove_count++;
current = NODE_POINTER(current);
current = atomic_fetch_or_explicit(&current->next,
NODE_REMOVED, memory_order_seq_cst);
}
uint64_t exits = atomic_fetch_add_explicit(
&list->exits, 1, memory_order_seq_cst) + 1;
bool result = false;
current = NODE_POINTER(current);
if (current != NULL)
{
result = true;
*pData = current->data;
current = atomic_load_explicit(
&current->next, memory_order_seq_cst);
atomic_fetch_add_explicit(&list->size,
-1, memory_order_seq_cst);
stats.removed++;
}
start = NODE_SET_FLAG(start, NODE_REMOVED);
if (atomic_compare_exchange_strong_explicit(
&list->head, &start, current,
memory_order_seq_cst, memory_order_seq_cst))
{
entries = atomic_load_explicit(&list->entries, memory_order_seq_cst);
while ((int64_t)(entries - exits) > 0)
{
pthread_yield();
exits = atomic_load_explicit(&list->exits, memory_order_seq_cst);
}
list_node_t * end = NODE_POINTER(current);
list_node_t * current = NODE_POINTER(start);
while (current != end)
{
list_node_t * tmp = current;
current = atomic_load_explicit(&current->next, memory_order_seq_cst);
list_node_free(tmp);
current = NODE_POINTER(current);
}
}
return result;
}
static list_t list;
pthread_mutex_t ioLock = PTHREAD_MUTEX_INITIALIZER;
void * thread_entry(void * arg)
{
sleep(2);
int id = *(int *)arg;
for (int i = 0; i < NUM_OPS; i++)
{
bool insert = random() % 2;
if (insert)
{
list_add(&list, i);
}
else
{
list_data_t data;
list_remove(&list, &data);
}
}
struct rusage u;
getrusage(RUSAGE_THREAD, &u);
pthread_mutex_lock(&ioLock);
printf("Thread %d stats:\n", id);
printf("\tadded = %lu\n", stats.added);
printf("\tremoved = %lu\n", stats.removed);
printf("\ttotal added = %ld\n", (int64_t)(stats.added - stats.removed));
printf("\tadded count = %lu\n", stats.add_count);
printf("\tremoved count = %lu\n", stats.remove_count);
printf("\tadd average = %f\n", (float)stats.add_count / stats.added);
printf("\tremove average = %f\n", (float)stats.remove_count / stats.removed);
printf("\tmallocd = %lu\n", stats.mallocd);
printf("\tfreed = %lu\n", stats.freed);
printf("\ttotal mallocd = %ld\n", (int64_t)(stats.mallocd - stats.freed));
printf("\tutime = %f\n", u.ru_utime.tv_sec
+ u.ru_utime.tv_usec / 1000000.0f);
printf("\tstime = %f\n", u.ru_stime.tv_sec
+ u.ru_stime.tv_usec / 1000000.0f);
pthread_mutex_unlock(&ioLock);
return NULL;
}
int main(int argc, char ** argv)
{
struct {
pthread_t thread;
int id;
}
threads[NUM_THREADS];
for (int i = 0; i < NUM_THREADS; i++)
{
threads[i].id = i;
pthread_create(&threads[i].thread, NULL, thread_entry, &threads[i].id);
}
for (int i = 0; i < NUM_THREADS; i++)
{
pthread_join(threads[i].thread, NULL);
}
printf("Size = %ld\n", atomic_load(&list.size));
uint32_t count = 0;
list_data_t data;
while(list_remove(&list, &data))
{
count++;
}
printf("Removed %u\n", count);
}

You mention you are trying to solve the ABA problem, but the description and code are actually an attempt to solve a harder problem: the memory reclamation problem.
This problem typically arises in the "deletion" functionality of lock-free collections implemented in languages without garbage collection. The core issue is that a thread removing a node from a shared structure often doesn't know when it is safe to free the removed node, because other readers may still hold a reference to it. Solving this problem often, as a side effect, also solves the ABA problem, which is specifically about a CAS operation succeeding even though the underlying pointer (and the state of the object) has been changed at least twice in the meantime, ending up with the original value but presenting a totally different state.
The ABA problem is easier in the sense that there are several straightforward solutions to the ABA problem specifically that don't lead to a solution to the "memory reclamation" problem. It is also easier in the sense that hardware that can detect the modification of the location, e.g., with LL/SC or transactional memory primitives, might not exhibit the problem at all.
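To make the ABA failure mode concrete, here is the classic interleaving for a naive lock-free stack pop, written as a commented sketch (the thread split and the node names A and B are purely illustrative):
/* Thread 1 (pop)                      Thread 2
 * --------------                      --------
 * old  = head;          // reads A
 * next = old->next;     // reads B
 *                                      pops A, pops B, free(B),
 *                                      pushes A again (address reused)
 * CAS(&head, old, next)
 *   // succeeds: head still compares equal to A,
 *   // but "next" now points at the freed node B
 */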
So that said, you are hunting for a solution to the memory reclamation problem, and it will also avoid the ABA problem.
The core of your issue is this statement:
The thread that successfully updates the list then loads the atomic
list.entries, and basically spin-loads atomic.exits until that counter
finally exceeds list.entries. This should imply that all readers of
the "old" version of the list have completed. The thread then simply
frees the list of marked nodes that it swapped off the top of the
list.
This logic doesn't hold. Waiting for list.exits (you say atomic.exits but I think it's a typo as you only talk about list.exits elsewhere) to be greater than list.entries only tells you there have now been more total exits than there were entries at the point the mutating thread captured the entry count. However, these exits may have been generated by new readers coming and going: it doesn't at all imply that all the old readers have finished as you claim!
Here's a simple example. First a writing thread T1 and a reading thread T2 access the list around the same time, so list.entries is 2 and list.exits is 0. The writing thread pops a node, saves the current value (2) of list.entries, and waits for list.exits to be greater than 2. Now three more reading threads, T3, T4, T5, arrive, do a quick read of the list, and leave. Now list.exits is 3, your condition is met, and T1 frees the node. T2 hasn't gone anywhere, though, and blows up since it is reading a freed node!
The basic idea you have can work, but your two-counter approach in particular definitely doesn't.
This is a well-studied problem, so you don't have to invent your own algorithm (see the link above), or even write your own code, since things like liburcu and Concurrency Kit already exist.
For Educational Purposes
If you wanted to make this work for educational purposes, though, one approach would be to ensure that threads coming in after a modification has started use a different set of entry/exit counters. One way to do this would be a generation counter: when the writer wants to modify the list, it increments the generation counter, which causes new readers to switch to a different set of entry/exit counters.
Now the writer just has to wait for list.entries[old] == list.exits[old], which means all the old readers have left. You could also get away with a single counter per generation: you don't really need two entry/exit counters (although having two might help reduce contention).
Of course, you now have a new problem of managing this list of separate counters per generation... which looks a lot like the original problem of building a lock-free list! This problem is a bit easier, though, because you might put some reasonable bound on the number of generations "in flight" and just allocate them all up front, or you might implement a limited kind of lock-free list that is easier to reason about because additions and deletions only occur at the head or tail.
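For illustration only, a minimal sketch of the generation-counter idea might look like the following. All names are invented here, MAX_GENERATIONS is an assumed bound on generations in flight, and a complete version would still have to deal with slot reuse once the generation counter wraps around that bound:
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_GENERATIONS 4 /* assumed bound on generations "in flight" */

typedef struct {
    uint64_t _Atomic entries;
    uint64_t _Atomic exits;
} gen_counters_t;

static gen_counters_t counters[MAX_GENERATIONS];
static uint64_t _Atomic generation;

/* Reader side: register with the counters of the generation that is current
 * when the traversal starts. The re-check after the increment closes the
 * window in which a writer could retire a generation without seeing us. */
static uint64_t reader_enter(void)
{
    for (;;) {
        uint64_t g = atomic_load(&generation);
        atomic_fetch_add(&counters[g % MAX_GENERATIONS].entries, 1);
        if (atomic_load(&generation) == g)
            return g % MAX_GENERATIONS; /* safe: later writers will wait for us */
        /* A writer bumped the generation in between: undo and retry. */
        atomic_fetch_add(&counters[g % MAX_GENERATIONS].exits, 1);
    }
}

static void reader_exit(uint64_t slot)
{
    atomic_fetch_add(&counters[slot].exits, 1);
}

/* Writer side: after unlinking nodes from the list, retire the old
 * generation and wait only for its readers to drain before freeing. */
static void writer_wait_for_old_readers(void)
{
    uint64_t old = atomic_fetch_add(&generation, 1) % MAX_GENERATIONS;
    while (atomic_load(&counters[old].exits) !=
           atomic_load(&counters[old].entries))
        sched_yield();
}
The key difference from the two-counter scheme is that readers arriving after the writer's bump register in a fresh slot, so their comings and goings can no longer satisfy the writer's wait condition on the old slot.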

Related

Having a race condition in my MPSC ring buffer

I was trying to build an MPSC lock-free ring buffer for learning purposes, and am running into race conditions.
A description of the MPSC ring buffer:
It is guaranteed that poll() is never called when the buffer is empty.
Instead of mod'ing head and tail like a traditional ring buffer, it lets them proceed linearly, and AND's them before using them (since the buffer size is a power of 2, this works ok with overflow).
We keep MAX_PRODUCERS-1 slots open in the queue so that if multiple producers come and see one slot is available and proceed, they can all place their entries.
It uses 32-bit quantities for head and tail, so that it can snapshot them with a 64-bit atomic read without a lock.
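To make the overflow behaviour concrete, here is the arithmetic this relies on (values chosen purely for illustration):
/* Example with RING_BUF_SIZE = 16 and 32-bit head/tail that have wrapped:
 *   head = 0xFFFFFFFEu   (consumer has taken 2^32 - 2 entries)
 *   tail = 0x00000002u   (producers have placed 2^32 + 2 entries)
 *
 * Occupancy: RING_SUB(tail, head)
 *   = tail + (1ULL << 32) - head        (branch taken since tail < head)
 *   = 2 + 4294967296 - 4294967294
 *   = 4 entries currently in the buffer.
 *
 * Slot index: pos = tail & (RING_BUF_SIZE - 1) = 2 & 0xF = 2, the same slot
 * the un-wrapped count 2^32 + 2 would map to, because RING_BUF_SIZE is a
 * power of two.
 */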
My test involves a couple of threads writing some known set of values to the queue, and a consumer thread polling (when the buffer is not empty) and summing all, and verifying the correct result is obtained. With 2 or more producers, I get inconsistent sums (and with 1 producer, it works).
Any help would be much appreciated. Thank you!
Here is the code:
struct ring_buf_entry {
uint32_t seqn;
};
struct __attribute__((packed, aligned(8))) ring_buf {
union {
struct {
volatile uint32_t tail;
volatile uint32_t head;
};
volatile uint64_t snapshot;
};
volatile struct ring_buf_entry buf[RING_BUF_SIZE];
};
#define RING_SUB(x,y) ((x)>=(y)?((x)-(y)):((x)+(1ULL<<32)-(y)))
static void ring_buf_push(struct ring_buf* rb, uint32_t seqn)
{
size_t pos;
while (1) {
// rely on aligned, packed, and no member-reordering properties
uint64_t snapshot = __atomic_load_n(&(rb->snapshot), __ATOMIC_SEQ_CST);
// little endian.
uint64_t snap_head = snapshot >> 32;
uint64_t snap_tail = snapshot & 0xffffffffULL;
if (RING_SUB(snap_tail, snap_head) < RING_BUF_SIZE - MAX_PRODUCERS + 1) {
uint32_t exp = snap_tail;
if (__atomic_compare_exchange_n(&(rb->tail), &exp, snap_tail+1, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
pos = snap_tail;
break;
}
}
asm volatile("pause\n": : :"memory");
}
pos &= RING_BUF_SIZE-1;
rb->buf[pos].seqn = seqn;
asm volatile("sfence\n": : :"memory");
}
static struct ring_buf_entry ring_buf_poll(struct ring_buf* rb)
{
struct ring_buf_entry ret = rb->buf[__atomic_load_n(&(rb->head), __ATOMIC_SEQ_CST) & (RING_BUF_SIZE-1)];
__atomic_add_fetch(&(rb->head), 1, __ATOMIC_SEQ_CST);
return ret;
}

Linked list with a memory leak

I am scratching my head over this problem.
If I check my application with the top command in Linux, I see that VIRT stays the same (running for a couple of days) while RES increases a bit (between 4 and 32 bytes) after each operation. I perform an operation once every 60 minutes.
An operation consists of reading some frames over SPI, adding them to several linked lists and, after a while, extracting them in another thread.
I executed Valgrind with the following options:
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes -v ./application
Once I close it (my app would run forever if everything goes well) I check for leaks and I see nothing but the threads I haven't closed: only "possibly lost" bytes in threads, no "definitely lost" and no "indirectly lost".
I do not know if this is normal. I have used Valgrind in the past to find some leaks and it has always worked well. I am pretty sure it is working well right now too, but I can't explain the RES issue.
I have checked the linked list to see if I was leaving some nodes unfreed but, to me, it seems to be right.
I have 32 linked lists in one array. I have done this to make the push/pop operations easier without having 32 separate lists. I don't know if this could be causing the problem. If this is the case, I will split them:
typedef struct PR_LL_M_node {
uint8_t val[60];
struct PR_LL_M_node *next;
} PR_LL_M_node_t;
pthread_mutex_t PR_LL_M_lock[32];
PR_LL_M_node_t *PR_LL_M_head[32];
uint16_t PR_LL_M_counter[32];
int16_t LL_M_InitLinkedList(uint8_t LL_M_number) {
if (pthread_mutex_init(&PR_LL_M_lock[LL_M_number], NULL) != 0) {
printf("Mutex LL M %d init failed\n", LL_M_number);
return -1;
}
PR_LL_M_ready[LL_M_number] = 0;
PR_LL_M_counter[LL_M_number] = 0;
PR_LL_M_head[LL_M_number] = NULL;
pthread_mutex_unlock(&PR_LL_M_lock[LL_M_number]);
return PR_LL_M_counter[LL_M_number];
}
int16_t LL_M_Push(uint8_t LL_M_number, uint8_t *LL_M_frame, uint16_t LL_M_size) {
pthread_mutex_lock(&PR_LL_M_lock[LL_M_number]);
PR_LL_M_node_t *current = PR_LL_M_head[LL_M_number];
if (current != NULL) {
while (current->next != NULL) {
current = current->next;
}
/* now we can add a new variable */
current->next = malloc(sizeof(PR_LL_M_node_t));
memset(current->next->val, 0x00, 60);
/* Clean buffer before using it */
memcpy(current->next->val, LL_M_frame, LL_M_size);
current->next->next = NULL;
} else {
PR_LL_M_head[LL_M_number] = malloc(sizeof(PR_LL_M_node_t));
memcpy(PR_LL_M_head[LL_M_number]->val, LL_M_frame, LL_M_size);
PR_LL_M_head[LL_M_number]->next = NULL;
}
PR_LL_M_counter[LL_M_number]++;
pthread_mutex_unlock(&PR_LL_M_lock[LL_M_number]);
return PR_LL_M_counter[LL_M_number];
}
int16_t LL_M_Pop(uint8_t LL_M_number, uint8_t *LL_M_frame) {
PR_LL_M_node_t *next_node = NULL;
pthread_mutex_lock(&PR_LL_M_lock[LL_M_number]);
if ((PR_LL_M_head[LL_M_number] == NULL)) {
pthread_mutex_unlock(&PR_LL_M_lock[LL_M_number]);
return -1;
}
if ((PR_LL_M_counter[LL_M_number] == 0)) {
pthread_mutex_unlock(&PR_LL_M_lock[LL_M_number]);
return -1;
}
next_node = PR_LL_M_head[LL_M_number]->next;
memcpy(LL_M_frame, PR_LL_M_head[LL_M_number]->val, 60);
free(PR_LL_M_head[LL_M_number]);
PR_LL_M_counter[LL_M_number]--;
PR_LL_M_head[LL_M_number] = next_node;
pthread_mutex_unlock(&PR_LL_M_lock[LL_M_number]);
return PR_LL_M_counter[LL_M_number];
}
This way I pass the number of the linked list I want to manage and I operate on it. What do you think? Is RES a real problem? I think it could be related to other parts of the application, but I have commented out most of it and it always happens if the push/pop operations are used. If I leave push/pop out, RES keeps its number.
When I extract the values I use a do/while loop until I get -1 as the response from the pop operation.
Your observations do not seem to indicate a problem:
VIRT and RES are expressed in KiB (units of 1024 bytes). Depending on how virtual memory works on your system, the numbers should always be multiples of the page size, which is most likely 4KiB.
RES is the amount of resident memory, in other words the amount of RAM actually mapped for your program at a given time.
If the program goes to sleep for 60 minutes at a time, the system (Linux) may determine that some of its pages are good candidates for discarding or swapping should it need to map memory for other processes. RES will diminish accordingly. Note incidentally that top is one such process that needs memory and thus can disturb your process while observing it, a computing variant of Heisenberg's Principle.
When the process wakes up, whatever memory is accessed by the running thread is mapped back into memory, either from the executable file(s) if it was discarded or from the swap file, or from nowhere if the discarded page was all null bytes or an unused part of the stack. This causes RES to increase again.
There might be other problems in the code, especially in code that you did not post, but if VIRT does not change, you do not seem to have a memory leak.
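If you want to watch the resident size from inside the process instead of through top, one Linux-specific option (just a sketch) is to print the VmRSS line from /proc/self/status after each operation:
#include <stdio.h>
#include <string.h>

/* Print this process's resident set size as reported by the kernel. */
static void print_rss(void)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            fputs(line, stdout); /* e.g. "VmRSS:     1234 kB" */
            break;
        }
    }
    fclose(f);
}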

Ring Buffer with time stamp

I need a ring buffer (in C) which can hold objects of any type at run time (mostly the data will be different signals' values, such as current (100 ms and 10 ms) and temperature, etc.). I am not sure whether it has to be a fixed size or not, and it needs to be very high performance, although it is in a multi-tasking embedded environment.
Actually, I need this buffer as a backup, which means the embedded software will work as normal and save the data into the ring buffer, so that when an error occurs for any reason, I have a reference for the measured values; then I will be able to look at them and determine the problem. I also need a time stamp in the ring buffer, meaning every data item (signal value) stored in the ring buffer is stored together with the time of its measurement.
Any code or ideas would be greatly appreciated. Some of the operations required are:
Create a ring buffer with a specific size.
Link it with the whole software.
Put at the tail.
Get from the head.
At error, read the data and when it happened (time stamp).
Return the count.
Overwrite when the buffer is full.
#include<stdint.h>
#include<stdio.h>
#include<stdlib.h>
#include<string.h> /* for memcpy */
typedef struct ring_buffer
{
void * buffer; // data buffer
void * buffer_end; // end of data buffer
void * data_start; // pointer to head
void * data_end; // pointer to tail
uint64_t capacity; // maximum number of items in buffer
uint64_t count; // number of items in the buffer
uint64_t size; // size of each item in the buffer
} ring_buffer;
void rb_init (ring_buffer *rb, uint64_t size, uint64_t capacity )
{
rb->buffer = malloc(capacity * size);
if(rb->buffer == NULL){
// handle error
return;
}
rb->buffer_end = (char *)rb->buffer + capacity * size;
rb->capacity = capacity;
rb->count = 0;
rb->size = size;
rb->data_start = rb->buffer;
rb->data_end = rb->buffer;
}
void cb_free(ring_buffer *rb)
{
free(rb->buffer);
// clear out other fields too, just to be safe
}
void rb_push_back(ring_buffer *rb, const void *item)
{
if(rb->count == rb->capacity){
// handle error
}
memcpy(rb->data_start, item, rb->size);
rb->data_start = (char*)rb->data_start + rb->size;
if(rb->data_start == rb->buffer_end)
rb->data_start = rb->buffer;
rb->count++;
}
void rb_pop_front(ring_buffer *rb, void *item)
{
if(rb->count == 0){
// handle error
}
memcpy(item, rb->data_end, rb->size);
rb->data_end = (char*)rb->data_end + rb->size;
if(rb->data_end == rb->buffer_end)
rb->data_end = rb->buffer;
rb->count--;
}
Creating a ring buffer/FIFO with hardcopies of generic type is highly questionable design for embedded systems. You shouldn't need that high level of abstraction for code so close to the hardware.
Either you make a ring buffer with a data type tag (like an enum) plus a void* to data allocated elsewhere, or you make a ring buffer where all data is of the same type. Everything else is most likely confused program design ("XY problem").
You need some means to lock access to the ring buffer internally, to make it thread-safe/interrupt-safe. This, as well as the time stamp, has to be handled internally by the ring buffer ADT.
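As a rough sketch of the second option (all entries share one fixed type and the buffer stamps them itself), something like the following could work; get_timestamp(), the payload layout, and whatever locking or interrupt masking wraps these calls are assumptions about your platform, not a finished design:
#include <stdint.h>

#define RB_CAPACITY 64 /* example size */

typedef struct {
    uint64_t timestamp; /* filled in by the buffer on every put */
    uint16_t signal_id; /* which signal this sample belongs to */
    float value;        /* one fixed payload type, e.g. a measured sample */
} rb_entry_t;

typedef struct {
    rb_entry_t entries[RB_CAPACITY];
    uint32_t head;  /* index of the oldest entry */
    uint32_t count; /* number of valid entries */
} ts_ring_buffer_t;

/* Assumed to exist on the target: a monotonic time source. */
extern uint64_t get_timestamp(void);

void rb_put(ts_ring_buffer_t *rb, uint16_t signal_id, float value)
{
    uint32_t tail = (rb->head + rb->count) % RB_CAPACITY;
    rb->entries[tail].timestamp = get_timestamp();
    rb->entries[tail].signal_id = signal_id;
    rb->entries[tail].value = value;
    if (rb->count == RB_CAPACITY)
        rb->head = (rb->head + 1) % RB_CAPACITY; /* full: overwrite the oldest */
    else
        rb->count++;
}
At error time you walk forward from head over count entries, and each entry carries the time of its measurement.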

Audio samples producer multiple threads OSX

This question is a follow-up to a former question (Audio producer threads with OSX AudioComponent consumer thread and callback in C), including a test example, which works and behaves as expected but does not quite answer the question. I have substantially rephrased the question, and re-coded the example, so that it only contains plain-C code. (I've found that the few Objective-C portions of code in the former example only caused confusion and distracted the reader from what's essential in the question.)
In order to take advantage of multiple processor cores, as well as to make the CoreAudio pull-model render thread as lightweight as possible, the LPCM samples' producer routine clearly has to "sit" on a different thread, outside the real-time-priority render thread/callback. It must feed the samples to a circular buffer (TPCircularBuffer in this example), from which the system would schedule data pull-out in quanta of inNumberFrames.
The Grand Central Dispatch API offers a simple solution, which I've arrived at after some individual research (including trial-and-error coding). This solution is elegant, since it doesn't block anything nor conflict between push and pull models. Yet GCD, which is supposed to take care of "sub-threading", falls far short of the specific parallelization requirements for the worker threads of the producer code, so I had to explicitly spawn a number of POSIX threads, depending on the number of logical cores available. Although the results are already remarkable in terms of speeding up the computation, I still feel a bit uncomfortable mixing POSIX and GCD. In particular, this applies to the variable wait_interval, and computing it properly, rather than by predicting how many PCM samples the render thread may require for the next cycle.
Here's the shortened and simplified (pseudo)code for my test program, in plain-C.
Controller declaration:
#include "TPCircularBuffer.h"
#include <AudioToolbox/AudioToolbox.h>
#include <AudioUnit/AudioUnit.h>
#include <dispatch/dispatch.h>
#include <sys/sysctl.h>
#include <pthread.h>
typedef struct {
TPCircularBuffer buffer;
AudioComponentInstance toneUnit;
Float64 sampleRate;
AudioStreamBasicDescription streamFormat;
Float32* f; //array of updated frequencies
Float32* a; //array of updated amps
Float32* prevf; //array of prev. frequencies
Float32* preva; //array of prev. amps
Float32* val;
int* arg;
int* previous_arg;
UInt32 frames;
int state;
Boolean midif; //wip
} MyAudioController;
MyAudioController gen;
dispatch_semaphore_t mSemaphore;
Boolean multithreading, NF;
typedef struct data{
int tid;
int cpuCount;
}data;
Controller management:
void setup (void){
// Initialize circular buffer
TPCircularBufferInit(&(self->buffer), kBufferLength);
// Create the semaphore
mSemaphore = dispatch_semaphore_create(0);
// Setup audio
createToneUnit(&gen);
}
void dealloc (void) {
// Release buffer resources
TPCircularBufferCleanup(&buffer);
// Clean up semaphore
dispatch_release(mSemaphore);
// dispose of audio
if(gen.toneUnit){
AudioOutputUnitStop(gen.toneUnit);
AudioUnitUninitialize(gen.toneUnit);
AudioComponentInstanceDispose(gen.toneUnit);
}
}
Dispatcher call (launching producer queue from the main thread):
void dproducer (Boolean on, Boolean multithreading, Boolean NF)
{
if (on == true)
{
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^{
if((multithreading)||(NF))
producerSum(on);
else
producer(on);
});
}
return;
}
Threadable producer routine:
void producerSum(Boolean on)
{
int rc;
int num = getCPUnum();
pthread_t threads[num];
data thread_args[num];
void* resulT;
static Float32 frames [FR_MAX];
Float32 wait_interval;
int bytesToCopy;
Float32 floatmax;
while(on){
wait_interval = FACT*(gen.frames)/(gen.sampleRate);
Float32 damp = 1./(Float32)(gen.frames);
bytesToCopy = gen.frames*sizeof(Float32);
memset(frames, 0, FR_MAX*sizeof(Float32));
availableBytes = 0;
fbuffW = (Float32**)calloc(num + 1, sizeof(Float32*));
for (int i=0; i<num; ++i)
{
fbuffW[i] = (Float32*)calloc(gen.frames, sizeof(Float32));
thread_args[i].tid = i;
thread_args[i].cpuCount = num;
rc = pthread_create(&threads[i], NULL, producerTN, (void *) &thread_args[i]);
}
for (int i=0; i<num; ++i) rc = pthread_join(threads[i], &resulT);
for(UInt32 samp = 0; samp < gen.frames; samp++)
for(int i = 0; i < num; i++)
frames[samp] += fbuffW[i][samp];
//code for managing producer state and GUI updates
{ ... }
float *head = TPCircularBufferHead(&(gen.buffer), &availableBytes);
memcpy(head,(const void*)frames,MIN(bytesToCopy, availableBytes));//copies frames to head
TPCircularBufferProduce(&(gen.buffer),MIN(bytesToCopy,availableBytes));
dispatch_semaphore_wait(mSemaphore, dispatch_time(DISPATCH_TIME_NOW, wait_interval * NSEC_PER_SEC));
if(gen.state == stopped){gen.state = idle; on = false;}
for(int i = 0; i <= num; i++)
free(fbuffW[i]);
free(fbuffW);
}
return;
}
A single producer thread may look somewhat like this:
void *producerT (void *TN)
{
Float32 samples[FR_MAX];
data threadData;
threadData = *((data *)TN);
int tid = threadData.tid;
int step = threadData.cpuCount;
int *ret = calloc(1,sizeof(int));
do_something(tid, step, &samples);
{ … }
return (void*)ret;
}
Here is the render callback (CoreAudio real-time consumer thread):
static OSStatus audioRenderCallback(void *inRefCon,
AudioUnitRenderActionFlags *ioActionFlags,
const AudioTimeStamp *inTimeStamp,
UInt32 inBusNumber,
UInt32 inNumberFrames,
AudioBufferList *ioData) {
MyAudioController *THIS = (MyAudioController *)inRefCon;
// An event happens in the render thread- signal whoever is waiting
if (THIS->state == active) dispatch_semaphore_signal(mSemaphore);
// Mono audio rendering: we only need one target buffer
const int channel = 0;
Float32* targetBuffer = (Float32 *)ioData->mBuffers[channel].mData;
memset(targetBuffer,0,inNumberFrames*sizeof(Float32));
// Pull samples from circular buffer
int32_t availableBytes;
Float32 *buffer = TPCircularBufferTail(&THIS->buffer, &availableBytes);
//copy circularBuffer content to target buffer
int bytesToCopy = ioData->mBuffers[channel].mDataByteSize;
memcpy(targetBuffer, buffer, MIN(bytesToCopy, availableBytes));
{ … };
TPCircularBufferConsume(&THIS->buffer, availableBytes);
THIS->frames = inNumberFrames;
return noErr;
}
Grand Central Dispatch already takes care of dispatching operations to multiple processor cores and threads. In typical real-time audio rendering or processing, one never needs to wait on a signal or semaphore, as the circular buffer consumption rate is very predictable, and drifts extremely slowly over time. The AVAudioSession API (if available) and Audio Unit API and callback allow you to set and determine the callback buffer size, and thus the maximum rate at which the circular buffer can change. Thus you can dispatch all render operations on a timer, render the exact number needed per timer period, and let the buffer size and state compensate for any jitter in thread dispatch time.
In extremely long running audio renders, you might want to measure the drift between timer operations and real-time audio consumption (sample rate), and tweak the number of samples rendered or the timer offset.
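For illustration, a timer-driven producer along those lines might be set up like this; produce_fixed_quantum() stands in for your producer work, and the leeway value is just an example:
#include <dispatch/dispatch.h>
#include <stdint.h>

extern void produce_fixed_quantum(uint32_t frames); /* placeholder producer work */

/* Render a fixed quantum of frames per tick instead of waiting on a
 * semaphore signalled from the render callback. */
static dispatch_source_t start_producer_timer(double sampleRate, uint32_t framesPerTick)
{
    uint64_t interval_ns = (uint64_t)(framesPerTick / sampleRate * NSEC_PER_SEC);
    dispatch_queue_t q = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0);
    dispatch_source_t timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0, q);

    dispatch_source_set_timer(timer,
                              dispatch_time(DISPATCH_TIME_NOW, 0),
                              interval_ns,
                              interval_ns / 10); /* allow some leeway */
    dispatch_source_set_event_handler(timer, ^{
        produce_fixed_quantum(framesPerTick); /* keep the circular buffer topped up */
    });
    dispatch_resume(timer);
    return timer;
}
The circular buffer's fill level then absorbs any jitter in when the timer block actually runs, as described above.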

using UNIX Pipeline with C

I am required to create 6 threads to perform a task (increment/decrement a number) concurrently until the integer becomes 0. I am supposed to be using only UNIX commands (Pipelines to be specific) and I can't get my head around how pipelines work, or how I can implement this program.
This integer can be stored in a text file.
I would really appreciate it if anyone can explain how to implement this program
The book is right: pipes can be used to protect critical sections, although how to do so is non-obvious.
#include <stdlib.h>
#include <unistd.h>

void pipe_release(int *sem); /* forward declaration: used below before its definition */

int *make_pipe_semaphore(int initial_count)
{
int *ptr = malloc(2 * sizeof(int));
if (pipe(ptr)) {
free(ptr);
return NULL;
}
while (initial_count--)
pipe_release(ptr);
return ptr;
}
void free_pipe_semaphore(int *sem)
{
close(sem[0]);
close(sem[1]);
free(sem);
}
void pipe_wait(int *sem)
{
char x;
read(sem[0], &x, 1);
}
void pipe_release(int *sem)
{
char x;
write(sem[1], &x, 1);
}
The maximum free resources in the semaphore varies from OS to OS but is usually at least 4096. This doesn't matter for protecting a critical section where the initial and maximum values are both 1.
Usage:
/* Initialization section */
int *sem = make_pipe_semaphore(1);
/* critical worker */
{
pipe_wait(sem);
/* do work */
/* end critical section */
pipe_release(sem);
}
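Sketching how this could map onto the six-thread exercise itself (the starting value, the +1/-1 split between threads, and whether the mix of increments and decrements ever reaches zero are all details of the original assignment, not of this example); pipe_wait() and pipe_release() are the helpers defined above:
#include <pthread.h>

static int shared_value = 10; /* assumed starting value read from the text file */
static int *sem;              /* pipe semaphore created with initial count 1 */

static void *worker(void *arg)
{
    int delta = *(int *)arg; /* +1 for incrementing threads, -1 for decrementing */
    for (;;) {
        pipe_wait(sem); /* enter critical section */
        if (shared_value == 0) {
            pipe_release(sem); /* leave before stopping */
            break;
        }
        shared_value += delta;
        pipe_release(sem); /* leave critical section */
    }
    return NULL;
}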
