I am looking for a ring buffer implementation (or pseudocode) in C with the following characteristics:
multiple producer single consumer pattern (MPSC)
consumer blocks on empty
producers block on full
lock-free (I expect high contention)
So far I've been working only with SPSC buffers - one per producer - but I would like to avoid the continuous spinning of the consumer to check for new data over all its input buffers (and maybe to get rid of some marshaling threads in my system).
I develop for Linux on Intel machines.
See liblfds, a lock-free MPMC ringbuffer. It won't block at all; lock-free data structures tend not to, since the whole point of being lock-free is to avoid blocking. Instead, you have to handle the case where the data structure comes back to you with NULL: it returns NULL if you try to read on empty. Writing on full doesn't match your requirement, though; liblfds will throw away the oldest element and give you that one for your write.
However, it would only take a small modification to obtain that behaviour.
But there may be a better solution. The tricky part of a ringbuffer is, when full, getting hold of the oldest previous element and re-using it. You don't need this. I think you could take the memory-barrier-only SPSC circular buffer and rewrite it using atomic operations. That will be a lot more performant than the MPMC ringbuffer in liblfds (which is a combination of a queue and a stack).
I think I have what you are looking for. It is a lock-free ring buffer implementation that blocks the producer/consumer. You only need access to atomic primitives; in this example I will use gcc's __sync builtins.
It has a known bug: if you overflow the buffer by more than 100%, it is not guaranteed that the queue remains FIFO (it will still eventually process every element).
This implementation relies on reads and writes of the buffer elements being atomic (which is pretty much guaranteed for aligned pointers).
#include <semaphore.h>
#include <stdint.h>
#include <stdlib.h>

struct ringBuffer
{
    void** buffer;          /* slots; NULL marks a free slot */
    uint64_t writePosition; /* monotonically increasing ticket counter */
    size_t size;
    sem_t* semaphore;       /* counts elements available to the consumer */
};
//create the ring buffer
struct ringBuffer* buf = calloc(1, sizeof(struct ringBuffer));
buf->buffer = calloc(bufferSize, sizeof(void*));
buf->size = bufferSize;
buf->semaphore = malloc(sizeof(sem_t));
sem_init(buf->semaphore, 0, 0);
//producer
void addToBuffer(void* newValue, struct ringBuffer* buf)
{
    //atomically claim the next slot index
    uint64_t writePos = __sync_fetch_and_add(&buf->writePosition, 1) % buf->size;

    //spin until the claimed slot is free (NULL), then publish the value;
    //this is what blocks producers while the buffer is full
    while(!__sync_bool_compare_and_swap(&(buf->buffer[writePos]), NULL, newValue));

    //wake the consumer: one more element is available
    sem_post(buf->semaphore);
}
//consumer
void processBuffer(struct ringBuffer* buf)
{
    uint64_t readPos = 0;
    while(1)
    {
        //block until a producer has posted at least one element
        sem_wait(buf->semaphore);

        //process buf->buffer[readPos % buf->size]

        //release the slot so a spinning producer can claim it
        buf->buffer[readPos % buf->size] = NULL;
        readPos++;
    }
}
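A minimal usage sketch, assuming the code above plus pthreads (the thread count, item values, and producerMain are my own inventions). One caveat baked into the design: NULL marks a free slot, so you can never enqueue a NULL pointer.
#include <pthread.h>

#define N_PRODUCERS 4

static void* producerMain(void* arg)
{
    struct ringBuffer* buf = arg;
    //start at 1: NULL (i.e. 0) is reserved as the free-slot marker
    for (uintptr_t i = 1; i <= 1000; i++)
        addToBuffer((void*)i, buf);
    return NULL;
}

int main(void)
{
    struct ringBuffer* buf = calloc(1, sizeof(struct ringBuffer));
    buf->buffer = calloc(1024, sizeof(void*));
    buf->size = 1024;
    buf->semaphore = malloc(sizeof(sem_t));
    sem_init(buf->semaphore, 0, 0);

    pthread_t producers[N_PRODUCERS];
    for (int i = 0; i < N_PRODUCERS; i++)
        pthread_create(&producers[i], NULL, producerMain, buf);

    processBuffer(buf); //never returns in this sketch
}
Build with gcc -pthread. The semaphore does the consumer-side blocking; a full buffer blocks producers in the CAS loop.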
I would like to read (asynchronously) BLOCK_SIZE bytes of one file, then BLOCK_SIZE bytes of the second file, printing what has been read as soon as the respective buffer has been filled. Let me illustrate what I mean:
// in main()
int infile_fd = open(infile_name, O_RDONLY); // add error checking
int maskfile_fd = open(maskfile_name, O_RDONLY); // add error checking
char* buffer_infile = malloc(BLOCK_SIZE); // add error checking
char* buffer_maskfile = malloc(BLOCK_SIZE); // add error checking
struct aiocb cb_infile;
struct aiocb cb_maskfile;
// set AIO control blocks
memset(&cb_infile, 0, sizeof(struct aiocb));
cb_infile.aio_fildes = infile_fd;
cb_infile.aio_buf = buffer_infile;
cb_infile.aio_nbytes = BLOCK_SIZE;
cb_infile.aio_sigevent.sigev_notify = SIGEV_THREAD;
cb_infile.aio_sigevent.sigev_notify_function = print_buffer;
cb_infile.aio_sigevent.sigev_value.sival_ptr = buffer_infile;
memset(&cb_maskfile, 0, sizeof(struct aiocb));
cb_maskfile.aio_fildes = maskfile_fd;
cb_maskfile.aio_buf = buffer_maskfile;
cb_maskfile.aio_nbytes = BLOCK_SIZE;
cb_maskfile.aio_sigevent.sigev_notify = SIGEV_THREAD;
cb_maskfile.aio_sigevent.sigev_notify_function = print_buffer;
cb_maskfile.aio_sigevent.sigev_value.sival_ptr = buffer_maskfile;
// start both asynchronous reads
aio_read(&cb_infile); // add error checking
aio_read(&cb_maskfile); // add error checking
and the print_buffer() function is defined as follows:
void print_buffer(union sigval sv)
{
    printf("%s\n", __func__);
    printf("buffer address: %p\n", sv.sival_ptr);
    printf("buffer: %.128s\n", (char*)sv.sival_ptr);
}
By the end of the program I do the usual clean up, i.e.
// clean up
close(infile_fd); // add error checking
close(maskfile_fd); // add error checking
free(buffer_infile);
printf("buffer_inline freed\n");
free(buffer_maskfile);
printf("buffer_maskfile freed\n");
The problem is, every once in a while buffer_infile gets freed before print_buffer manages to print its contents to the console. In a usual case I would employ some kind of pthread_join(), but as far as I know this is impossible, since POSIX does not specify that sigev_notify_function must be implemented using threads, and besides, how would I get the TID of such a thread to call pthread_join() on?
Don't do it this way, if you can avoid it. If you can, just let process termination take care of it all.
Otherwise, the answer indicated in Andrew Henle's comment above is right on. You need to be sure that no more sigev_notify_functions will improperly reference the buffers.
The easiest way to do this is simply to countdown the number of expected notifications before freeing the buffers.
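For example, a minimal sketch of that countdown using a semaphore (done_sem and its placement are my additions): each notification posts it once, and main() waits for as many posts as there are outstanding reads before freeing anything.
#include <semaphore.h>
#include <signal.h>
#include <stdio.h>

static sem_t done_sem; // sem_init(&done_sem, 0, 0) before queuing the reads

void print_buffer(union sigval sv)
{
    printf("buffer: %.128s\n", (char*)sv.sival_ptr);
    sem_post(&done_sem); // count this notification as finished
}

// in main(), before the cleanup:
sem_wait(&done_sem); // wait for the first read's callback
sem_wait(&done_sem); // wait for the second read's callback
free(buffer_infile);
free(buffer_maskfile);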
Note: your SIGEV_THREAD function is executed in a separate thread, though not necessarily a new thread each time. (POSIX.1-2017 System Interfaces §2.4.2) Importantly, you are not meant to manage this thread's lifecycle: it is detached by default, with PTHREAD_CREATE_JOINABLE explicitly noted as undefined behavior.
As an aside, I'd suggest never using SIGEV_THREAD in robust code. Per spec, the signal mask of the sigev_notify_function thread is implementation-defined. Yikes. For me, that makes it per se unreliable. In my view, SIGEV_SIGNAL and a dedicated signal-handling thread are much safer.
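A sketch of that safer arrangement, assuming SIGRTMIN is free for this purpose: block the signal in every thread, then let one dedicated thread collect AIO completions synchronously with sigwaitinfo().
#include <aio.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>

static void* aio_event_thread(void* arg)
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN);

    for (;;) {
        siginfo_t si;
        if (sigwaitinfo(&set, &si) == SIGRTMIN) {
            // sigev_value travels with the queued signal
            char* buffer = si.si_value.sival_ptr;
            printf("buffer: %.128s\n", buffer);
        }
    }
    return NULL;
}

// in main(): block SIGRTMIN before creating any other thread, so it is
// only ever consumed by sigwaitinfo() above
sigset_t set;
sigemptyset(&set);
sigaddset(&set, SIGRTMIN);
pthread_sigmask(SIG_BLOCK, &set, NULL);

// per control block, instead of SIGEV_THREAD:
cb_infile.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
cb_infile.aio_sigevent.sigev_signo = SIGRTMIN;
cb_infile.aio_sigevent.sigev_value.sival_ptr = buffer_infile;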
I am right now trying to create a program where multiple threads are querying for data that needs to be processed and then written to disk. Currently I am using OpenMP pragmas with #pragma omp critical in order to ensure that the data is being written as intended.
This is quite costly though, as threads have to wait for one another. I read that it should be possible to have a single thread handle all writes to disk while the others focus on getting the incoming data and parsing it. How would I go about doing this?
The program is an XDP-based packet parser that only stores particular information regarding each packet. The code is based upon this project code here: https://github.com/xdp-project/xdp-tutorial/blob/master/tracing04-xdp-tcpdump/xdp_sample_pkts_user.c
static int print_bpf_output(void *data, int size)
{
    struct {
        __u16 cookie;
        __u16 pkt_len;
        __u8  pkt_data[SAMPLE_SIZE];
    } __packed *e = data;
    struct pcap_pkthdr h = {
        .caplen = SAMPLE_SIZE,
        .len    = e->pkt_len,
    };
    struct timespec ts;
    int i, err;

    if (e->cookie != 0xdead) {
        printf("BUG cookie %x sized %d\n",
               e->cookie, size);
        return LIBBPF_PERF_EVENT_ERROR;
    }

    err = clock_gettime(CLOCK_MONOTONIC, &ts);
    if (err < 0) {
        printf("Error with clock_gettime! (%i)\n", err);
        return LIBBPF_PERF_EVENT_ERROR;
    }
    h.ts.tv_sec  = ts.tv_sec;
    h.ts.tv_usec = ts.tv_nsec / NANOSECS_PER_USEC;

    if (verbose) {
        printf("pkt len: %-5d bytes. hdr: ", e->pkt_len);
        for (i = 0; i < e->pkt_len; i++)
            printf("%02x ", e->pkt_data[i]);
        printf("\n");
    }

    pcap_dump((u_char *) pdumper, &h, e->pkt_data);
    pcap_pkts++;
    return LIBBPF_PERF_EVENT_CONT;
}
This function would be called by numerous threads, and I want the pcap_dump calls to be executed by a single, different thread.
Yes, that is a common way to avoid delays where the disk is fast enough to handle the average data rate, but where occasional data peaks, disk cache writes, directory updates and other such things cause intermittent data loss.
You need a producer-consumer queue. Such a class or code/struct, using condvars or semaphores, is easily found on SO or elsewhere on the net. The queue only needs to queue up pointers.
Don't use a wide queue to queue up the bulk data. As soon as it is read from [wherever], read it into a malloced buffer/struct that holds the data, path, command and anything else that the write thread might need to perform the write. Queue up the struct pointer to the write thread. In the write thread, loop round the P-C queue pop: get the pointers, do the write (or whatever is commanded by the struct's command field), and, if no error, free the struct. If there is some problem, you could load an error message into some field of the struct and queue it off again to some error-logging thread, store it in a queue to try again later, whatever you want, really.
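A minimal sketch of such a queue (names and sizes are illustrative, and write_job stands in for whatever struct you settle on): producers block on full, the single write thread blocks on empty.
#include <pthread.h>
#include <stdlib.h>

#define JOB_QUEUE_SIZE 256

struct write_job {
    void*  data;    // malloced bulk data
    size_t len;
    int    command; // e.g. WRITE, ROTATE_FILE, ...
};

struct job_queue {
    struct write_job* jobs[JOB_QUEUE_SIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
    pthread_cond_t not_full;
};

void job_push(struct job_queue* q, struct write_job* job)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == JOB_QUEUE_SIZE)
        pthread_cond_wait(&q->not_full, &q->lock); // queue full: wait
    q->jobs[q->head] = job;
    q->head = (q->head + 1) % JOB_QUEUE_SIZE;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

struct write_job* job_pop(struct job_queue* q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock); // queue empty: wait
    struct write_job* job = q->jobs[q->tail];
    q->tail = (q->tail + 1) % JOB_QUEUE_SIZE;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return job;
}
The write thread is then just a loop: pop a job, act on its command field (pcap_dump in your case), free the struct.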
This way, you insulate the rest of your app from those unavoidable, occasional disk delays. That is very important with high-latency disks, e.g. those on a network. It also makes housekeeping operations much easier; for instance, some hourly timer could queue up a struct whose command field instructs the thread to open a new file with a date-time stamp in the filename, making it easier to track the data later without wading through one massive file:) Such operations, without the queue and write thread, would surely inflict a massive delay on your app:(
This must be a stupid question because this should be a very common and simple problem, but I haven't been able to find an answer anywhere, so I'll bite the bullet and ask.
How on earth should I go about reading from the standard input when there is no way of determining the size of the data? Obviously if the data ends in some kind of terminator like a NUL or EOF then this is quite trivial, but my data does not. This is simple IPC: the two programs need to talk back and forth and ending the file streams with EOF would break everything.
I thought this should be fairly simple. Clearly programs talk to each other over pipes all the time without needing any arcane tricks, so I hope there is a simple answer that I'm too stupid to have thought of. Nothing I've tried has worked.
Something obvious like (ignoring necessary realloc's for brevity):
int size = 0, max = 8192;
unsigned char *buf = malloc(max);
while (fread((buf + size), 1, 1, stdin) == 1)
    ++size;
won't work since fread() blocks and waits for data, so this loop won't terminate. As far as I know nothing in stdio allows nonblocking input, so I didn't even try any such function. Something like this is the best I could come up with:
struct mydata {
    unsigned char *data;
    int slen; /* size of data */
    int mlen; /* maximum allocated size */
};
...
struct mydata *buf = xmalloc(sizeof *buf);
buf->data = xmalloc((buf->mlen = 8192));
buf->slen = 0;

int nread = read(0, buf->data, 1);
if (nread == (-1))
    err(1, "read error");
buf->slen += nread;

fcntl(0, F_SETFL, oflags | O_NONBLOCK);
do {
    if (buf->slen >= (buf->mlen - 32))
        buf->data = xrealloc(buf->data, (buf->mlen *= 2));
    nread = read(0, (buf->data + buf->slen), 1);
    if (nread > 0)
        buf->slen += nread;
} while (nread == 1);
fcntl(0, F_SETFL, oflags);
where oflags is a global variable containing the original flags for stdin (cached at the start of the program, just in case). This dumb way of doing it works as long as all of the data is present immediately, but fails otherwise. Because this sets read() to be non-blocking, it just returns -1 if there is no data. The program communicating with mine generally sends responses whenever it feels like it, and not all at once, so if the data is at all large this exits too early and fails.
How on earth should I go about reading from the standard input when there is no way of determining the size of the data?
There always has to be a way to determine the size. Otherwise, the program would require infinite memory, and thus be impossible to run on a physical computer.
Think about it this way: even in the case of a never-ending stream of data, there must be some chunks or points at which you process it. For instance, a live-streamed video has to be decoded a portion at a time (e.g. a frame). Or consider a video game that processes messages one by one, even though the game has undetermined length.
This holds true regardless of the type of I/O you decide to use (blocking/non-blocking, synchronous/asynchronous...). For instance, if you want to use typical blocking synchronous I/O, what you have to do is process the data in a loop: each iteration, you read as much data as is available, and process as much as you can. Whatever you can not process (because you have not received enough yet), you keep for the next iteration. Then, the rest of the loop is the rest of the logic of the program.
In the end, regardless of what you do, you (or someone else, e.g. a library, the operating system, the hardware buffers...) have to buffer incoming data until it can be processed.
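As a concrete illustration of that loop (process_one() is hypothetical: it consumes at most one complete record from the front of the buffer and returns the number of bytes consumed, or 0 if the record is still incomplete):
#include <string.h>
#include <unistd.h>

size_t process_one(const unsigned char* buf, size_t len); /* hypothetical */

void consume_stream(int fd)
{
    static unsigned char acc[65536];
    size_t used = 0, c;

    for (;;) {
        ssize_t n = read(fd, acc + used, sizeof acc - used);
        if (n <= 0)
            break; /* EOF or error */
        used += (size_t)n;

        /* handle every complete record we have; keep the remainder */
        while ((c = process_one(acc, used)) > 0) {
            memmove(acc, acc + c, used - c);
            used -= c;
        }
    }
}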
Basically, you have two choices -- synchronous or asynchronous -- and both have their advantages and disadvantages.
For synchronous, you need either delimiters or a length field embedded in the record (or fixed-length records, but that is pretty inflexible). This works best for synchronous protocols like synchronous RPC or simplex client-server interactions where only one side talks at a time while the other side waits. For ASCII/text-based protocols, it is common to use a control-character delimiter like NL/EOL or NUL or ETX to mark the end of messages. Binary protocols more commonly use an embedded length field: the receiver first reads the length and then reads the full amount of (expected) data.
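For the binary case, a sketch of the embedded-length approach (the 4-byte big-endian prefix is an assumed framing, not anything standard): read() from a pipe can return short, so loop until the full amount has arrived.
#include <stdint.h>
#include <unistd.h>

static int read_full(int fd, void *dst, size_t len)
{
    unsigned char *p = dst;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1; /* EOF or error mid-message */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

static int read_message(int fd, unsigned char *buf, size_t max)
{
    unsigned char hdr[4];
    if (read_full(fd, hdr, sizeof hdr) < 0)
        return -1;
    uint32_t len = ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16)
                 | ((uint32_t)hdr[2] << 8) | (uint32_t)hdr[3];
    if (len > max)
        return -1; /* sender violated the protocol */
    if (read_full(fd, buf, len) < 0)
        return -1;
    return (int)len;
}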
For asynchronous, you use non-blocking mode. It IS possible to use non-blocking mode with stdio streams; it just requires some care. Out-of-data conditions show up to stdio like error conditions, so you need to use ferror() and clearerr() on the FILE * as appropriate.
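A minimal sketch of that care, assuming POSIX (the retry policy, i.e. when to come back and try again, is up to you):
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

size_t read_available(unsigned char* dst, size_t max)
{
    int flags = fcntl(STDIN_FILENO, F_GETFL);
    fcntl(STDIN_FILENO, F_SETFL, flags | O_NONBLOCK);

    size_t n = fread(dst, 1, max, stdin);
    if (n == 0 && ferror(stdin) && errno == EAGAIN)
        clearerr(stdin); /* out of data for now, not a real error */

    fcntl(STDIN_FILENO, F_SETFL, flags); /* restore blocking mode */
    return n;
}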
It's possible for both to be used. For example, in client-server interactions, the clients may use synchronous (they send a request and wait for a reply) while the server uses asynchronous (to be robust in the presence of misbehaving clients).
The read API on Linux or the ReadFile API on Windows will return immediately rather than waiting for the specified number of bytes to fill the buffer (when reading from a pipe or socket). read then returns the number of bytes read.
This means that, when reading from a pipe, you set a buffer size, read as much as is returned, and then process it. You then read the next bit. The only time you are blocked is if there is no data available at all.
This differs from fread, which only returns once the desired number of bytes have been read or the stream determines that doing so is impossible (like EOF).
I have a queue structure, that I attempted to implement using a circular buffer, which I am using in a networking application. I am looking for some guidance and feedback. First, let me present the relevant code.
typedef struct nwk_packet_type
{
    uint8_t dest_address[NRF24_ADDR_LEN];
    uint8_t data[32];
    uint8_t data_len;
} nwk_packet_t;

/* The circular fifo on which outgoing packets are stored */
nwk_packet_t nwk_send_queue[NWK_QUEUE_SIZE];
nwk_packet_t* send_queue_in;  /* pointer to queue head */
nwk_packet_t* send_queue_out; /* pointer to queue tail */

static nwk_packet_t* nwk_tx_pkt_allocate(void)
{
    /* Make sure the send queue is not full */
    if(send_queue_in == (send_queue_out - 1 + NWK_QUEUE_SIZE) % NWK_QUEUE_SIZE)
        return 0;

    /* return pointer to the next add and increment the tracker */
    return send_queue_in++; //TODO: it's not just ++, it has to be modular by packet size
}
/* External facing function for application layer to send network data */
// simply adds the packet to the network queue if there is space
// returns an appropriate error code if anything goes wrong
uint8_t nwk_send(uint8_t* address, uint8_t* data, uint8_t len)
{
    /* First check all the parameters */
    if(!address)
        return NWK_BAD_ADDRESS;
    if(!data)
        return NWK_BAD_DATA_PTR;
    if(!len || len > 32)
        return NWK_BAD_DATA_LEN;

    //TODO: PROBABLY NEED TO START BLOCKING HERE

    /* Allocate the packet on the queue */
    nwk_packet_t* packet;
    if(!( packet = nwk_tx_pkt_allocate() ))
        return NWK_QUEUE_FULL;

    /* Build the packet */
    memcpy(packet->dest_address, address, NRF24_ADDR_LEN);
    memcpy(packet->data, data, len);
    packet->data_len = len;

    //TODO: PROBABLY SAFE TO STOP BLOCKING HERE
    return NWK_SUCCESS;
}
/* Only called during NWK_IDLE, pushes the next item on the send queue out to the chip's "MAC" layer over SPI */
void nwk_transmit_pkt(void)
{
    nwk_packet_t* tx_pkt = send_queue_out;
    nrf24_send(tx_pkt->data, tx_pkt->data_len);
}

/* The callback for transceiver interrupt when a sent packet is either completed or ran out of retries */
void nwk_tx_result_cb(bool completed)
{
    if( (completed) && (nwk_tx_state == NWK_SENDING))
        send_queue_out++; //TODO: it's not just ++, it has to be modular by packet size within the buffer
}
Ok, now for a quick explanation and then my questions. So the basic idea is that I've got this queue for data which is being sent onto the network. The function nwk_send() can be called from anywhere in application code, which by the way will be running under a small pre-emptive task-based operating system (FreeRTOS), and thus can happen from lots of places in the code and be interrupted by the OS tick interrupt.
Now since that function is modifying the pointers into the global queue, I know it needs to be blocking while it is doing that. Am I correct in my comments in the code about where I should be blocking (i.e. disabling interrupts)? Also, would it be smarter to make a mutex using a global boolean variable or something, rather than just disabling interrupts?
Also, I think there's a second place I should be blocking when things are being taken off the queue, but I'm not sure where that is exactly. Is it in nwk_transmit_pkt() where I'm actually copying the data off the queue and into a local ram variable?
Final question, how do I achieve the modulus operation on my pointers within the arrays? I feel like it should look something like:
send_queue_in = ((send_queue_in + 1) % (NWK_QUEUE_SIZE*sizeof(nwk_packet_t))) + nwk_send_queue;
Any feedback is greatly appreciated, thank you.
About locking: it will be best to use an existing mutex primitive from the OS you use. I am not familiar with FreeRTOS, but it should have built-in primitives for locking between interrupt and user context.
For circular buffer you may use these:
check for empty queue
send_queue_in == send_queue_out
check for full queue
(send_queue_in + 1) % NWK_QUEUE_SIZE == send_queue_out
push element [pseudo code]
if (queue is full)
    return error;
queue[send_queue_in] = new element;
send_queue_in = (send_queue_in + 1) % NWK_QUEUE_SIZE;
pop element [pseudo code]
if (queue is empty)
    return error;
element = queue[send_queue_out];
send_queue_out = (send_queue_out + 1) % NWK_QUEUE_SIZE;
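Put together with your types, a sketch of the index-based version (indices rather than pointers make the wrap-around trivial; wrap whichever FreeRTOS locking you choose around these calls):
#include <stdbool.h>
#include <stdint.h>

static nwk_packet_t nwk_send_queue[NWK_QUEUE_SIZE];
static uint8_t send_queue_in;  /* index of next free slot */
static uint8_t send_queue_out; /* index of oldest queued packet */

static bool queue_push(const nwk_packet_t* pkt)
{
    uint8_t next = (send_queue_in + 1) % NWK_QUEUE_SIZE;
    if (next == send_queue_out)
        return false; /* full: one slot is deliberately kept unused */
    nwk_send_queue[send_queue_in] = *pkt; /* struct copy into the slot */
    send_queue_in = next;
    return true;
}

static bool queue_pop(nwk_packet_t* pkt)
{
    if (send_queue_out == send_queue_in)
        return false; /* empty */
    *pkt = nwk_send_queue[send_queue_out];
    send_queue_out = (send_queue_out + 1) % NWK_QUEUE_SIZE;
    return true;
}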
It looks like you copy, and do not just reference, the packet data before sending. This means that you only need to hold the lock until the copy is done.
Without an overall driver framework to develop with, and when communicating with interrupt-state on a uC, you need to be very careful.
You cannot use OS synchro primitives to communicate to interrupt state. Attempting to do so will certainly crash your OS, because interrupt-handlers cannot block.
Copying the actual bulk data should be avoided.
On an 8-bit uC, I suggest queueing an index onto a buffer array pool, where the number of buffers is <256. That means that only one byte needs to be queued up and so, with an appropriate queue class that stores the value before updating internal byte-size indexes, it is possible to safely communicate buffers into a tx handler without excessive interrupt-disabling.
Access to the pool array should be thread-safe and 'insertion/deletion' should be quick - I have 'succ/pred' byte-fields in each buffer struct, so forming a double-linked list, access protected by a mutex. As well as I/O, I use this pool of buffers for all inter-thread comms.
For tx, get a buffer struct from the pool, fill it with data, push the index onto a tx queue, and disable interrupts for only long enough to determine whether the tx interrupt needs 'priming'. If priming is required, shove in a FIFO-full of data before re-enabling interrupts.
When the tx interrupt-handler has sent the buffer, it can push the 'used' index back onto a 'scavenge' queue and signal a semaphore to make a handler thread run. This thread can then take the entry from the scavenge queue and return it to the pool.
This scheme only works if interrupt-handlers do not re-enable higher-priority interrupts using the same buffering scheme.
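A minimal sketch of the byte-index queue at the heart of this scheme (names are mine; single producer and single consumer assumed): on an 8-bit uC, single-byte loads and stores are atomic, and because each value is stored before the index that publishes it, the thread side and the interrupt side can share the queue without disabling interrupts.
#include <stdbool.h>
#include <stdint.h>

#define POOL_QUEUE_SIZE 64 /* <=256, so an index fits in one byte */

typedef struct {
    volatile uint8_t slots[POOL_QUEUE_SIZE]; /* queued buffer-pool indices */
    volatile uint8_t head; /* written by the producer only */
    volatile uint8_t tail; /* written by the consumer only */
} byte_queue_t;

static bool byte_queue_push(byte_queue_t* q, uint8_t buf_index)
{
    uint8_t next = (uint8_t)((q->head + 1) % POOL_QUEUE_SIZE);
    if (next == q->tail)
        return false;              /* full */
    q->slots[q->head] = buf_index; /* store the value first... */
    q->head = next;                /* ...then publish it */
    return true;
}

static bool byte_queue_pop(byte_queue_t* q, uint8_t* buf_index)
{
    if (q->tail == q->head)
        return false; /* empty */
    *buf_index = q->slots[q->tail];
    q->tail = (uint8_t)((q->tail + 1) % POOL_QUEUE_SIZE);
    return true;
}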
As stated in http://www.kernel.org/doc/htmldocs/kernel-hacking.html#routines-copy, these functions "can" sleep.
So, do I always have to do a lock (e.g. with mutexes) when using this functions or are there exceptions?
I'm currently working on a module and saw some kernel oopses on my system, but cannot reproduce them. I have a feeling they are fired because I currently do no locking around copy_[to/from]_user(). Maybe I'm wrong, but it smells like it has something to do with it.
I have something like:
static unsigned char user_buffer[BUFFER_SIZE];

static ssize_t mcom_write (struct file *file, const char *buf, size_t length, loff_t *offset) {
    ssize_t retval;
    size_t writeCount = (length < BUFFER_SIZE) ? length : BUFFER_SIZE;

    memset((void*)&user_buffer, 0x00, sizeof user_buffer);
    if (copy_from_user((void*)&user_buffer, buf, writeCount)) {
        retval = -EFAULT;
        return retval;
    }

    *offset += writeCount;
    retval = writeCount;
    cleanupNewline(user_buffer);
    dispatch(user_buffer);
    return retval;
}
Is this safe to do, or do I need to lock it against other accesses while copy_from_user is running?
It's a char device I read and write from, and if a special packet is received on the network, there can be concurrent access to this buffer.
You need to do locking iff the kernel side data structure that you are copying to or from might go away otherwise - but it is that data structure you should be taking a lock on.
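For instance, if user_buffer itself is shared with that network path, a sketch of guarding it (assuming every access happens in process context; a mutex cannot be taken from interrupt or softirq context):
#include <linux/fs.h>
#include <linux/mutex.h>
#include <linux/string.h>
#include <linux/uaccess.h>

static unsigned char user_buffer[BUFFER_SIZE];
static DEFINE_MUTEX(user_buffer_lock); /* guards user_buffer */

static ssize_t mcom_write(struct file *file, const char __user *buf,
                          size_t length, loff_t *offset)
{
    size_t writeCount = (length < BUFFER_SIZE) ? length : BUFFER_SIZE;

    /* a mutex (unlike a spinlock) may be held across copy_from_user(),
       which can sleep on a page fault */
    mutex_lock(&user_buffer_lock);
    memset(user_buffer, 0x00, sizeof user_buffer);
    if (copy_from_user(user_buffer, buf, writeCount)) {
        mutex_unlock(&user_buffer_lock);
        return -EFAULT;
    }
    cleanupNewline(user_buffer);
    dispatch(user_buffer);
    mutex_unlock(&user_buffer_lock);

    *offset += writeCount;
    return writeCount;
}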
I am guessing your function mcom_write is a procfs write function (or similar), right? In that case, you are most likely writing to the procfs file with your program blocked until mcom_write returns, so even if copy_[to/from]_user sleeps, your program wouldn't change the buffer.
You haven't stated how your program works, so it is hard to say anything. If your program is multithreaded and one thread writes while another can change its data, then yes, you need locking, but between the threads of the user-space program, not in your kernel module.
If you have one thread writing, then your write to the procfs file is blocked until mcom_write finishes, so no locking is needed and your problem is somewhere else (unless there is something else wrong with this function, but it's not copy_from_user).