I am using an ADC with DMA that captures analog values and generates a callback when the transfer is complete. I then transfer the data to a thread for processing, because processing takes some time and I don't want to block the callback function.
The buffer is of length 200. I am using it as a ping-pong buffer and generating callbacks on the ADC half-complete and full-complete events so that there should be no overlap of data in the same buffer.
Below is my current implementation on STM32 with CMSIS RTOS 2:
#include "cmsis_os2.h"

#define BUFFER_SIZE 100

static int16_t buffer[BUFFER_SIZE * 2] = {0};
static volatile int16_t *p_buf[2] = {&buffer[0], &buffer[BUFFER_SIZE]};
static osMessageQueueId_t queue;

typedef struct
{
    void *addr;
    uint32_t len;
} msg_t;

void HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef *hadc)
{
    msg_t msg;
    msg.addr = (void *)p_buf[0];
    msg.len = BUFFER_SIZE;
    osMessageQueuePut(queue, &msg, 0, 0);
}

void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef *hadc)
{
    msg_t msg;
    msg.addr = (void *)p_buf[1];
    msg.len = BUFFER_SIZE;
    osMessageQueuePut(queue, &msg, 0, 0);
}

static void process_thread(void *argument)
{
    msg_t msg;
    queue = osMessageQueueNew(4, sizeof(msg_t), NULL);
    while (1)
    {
        osMessageQueueGet(queue, &msg, NULL, osWaitForever);
        // process data
    }
}
What is the recommended way to transfer the data from a half buffer to a thread from a callback/ISR using CMSIS RTOS 2?
Currently the queue size is set to 4. If the processing thread takes too much time, the queue becomes useless because the buffer pointer will point to stale or ongoing data. How do I overcome this issue?
If you are using double buffering but only passing a pointer directly to the DMA buffer, there is no purpose in having a queue length of more than 1 (one buffer being processed, one buffer being written), and if your thread cannot process that in time, it is a flawed software design or unsuitable hardware (too slow). With a queue length of 1, if the receiving task has not completed processing in time, osMessageQueuePut in the ISR will return osErrorResource - better to detect the overrun than to let it happen with undefined consequences.
Generally you need to pass the data to a thread that is sufficiently deterministic that it is guaranteed to meet its deadlines. If you have some occasional non-deterministic or slow processing, that should be deferred to yet another, lower-priority task rather than disturbing the primary signal processing.
A simple solution is to copy the data into the message queue rather than passing a pointer, i.e. a queue of buffers rather than a queue of pointers to buffers. That will increase your interrupt processing time because of the memcpy, but it will still be deterministic (i.e. a constant processing time), and the aim is to meet deadlines rather than necessarily be "as fast as possible". It is robust, and if you fail to meet deadlines you will get no data (a gap in the signal) rather than inconsistent data. That condition is then detectable by virtue of the queue being full and osMessageQueuePut returning osErrorResource.
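For illustration, a minimal sketch of that approach (sample_msg_t and sample_queue are made-up names, the queue element size becomes the whole message, and error handling is kept minimal):
#include <string.h>

// Sketch only: the message carries a copy of the half-buffer, not a pointer to it.
typedef struct
{
    int16_t data[BUFFER_SIZE];
    uint32_t len;
} sample_msg_t;

static osMessageQueueId_t sample_queue;   // created with element size sizeof(sample_msg_t) before the ADC starts

void HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef *hadc)
{
    sample_msg_t msg;
    msg.len = BUFFER_SIZE;
    memcpy(msg.data, (const int16_t *)p_buf[0], sizeof(msg.data));   // constant-time copy in the ISR
    if (osMessageQueuePut(sample_queue, &msg, 0, 0) != osOK)
    {
        // queue full: a whole half-buffer is dropped, but no data is corrupted
    }
}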
A more complex but more efficient solution is to use the DMA controller's support for double-buffering (not available on all STM32 parts, but you have not specified). That differs from the circular half/full transfer mode in that the two buffers are independent (need not be contiguous memory) and can be changed dynamically. In that case you would have a memory block pool with as many blocks as your queue length. Then you assign two blocks as the DMA buffers and when each block becomes filled, in the ISR, you switch to the next block in the pool and pass the pointer to the just filled block on to the queue. The receiving task must return the received block back to the pool when it has completed processing it.
In CMSIS RTOS2 you can use the memory pool API to achieve that, but it is simple enough to do in any RTOS using a message queue pre-filled with pointers to memory blocks. You simply allocate by taking a pointer from the queue, and de-allocate by putting the pointer back on the queue. Equally however in this case you could simply have an array of memory blocks and maintain a circular index since the blocks will be used and returned sequentially and used exclusively by this driver. In that case overrun detection is when the queue is full rather than when block allocation fails. But if that is happening regardless of queue length, you have a scheduling/processing time problem.
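For example, the "message queue pre-filled with block pointers" allocator might be sketched like this (NUM_BLOCKS, pool_init, block_alloc and block_free are illustrative names; pool_init must run before the ADC/DMA is started):
#define NUM_BLOCKS 4

static int16_t blocks[NUM_BLOCKS][BUFFER_SIZE];
static osMessageQueueId_t free_q;                  // holds pointers to unused blocks

static void pool_init(void)
{
    free_q = osMessageQueueNew(NUM_BLOCKS, sizeof(int16_t *), NULL);
    for (int i = 0; i < NUM_BLOCKS; i++)
    {
        int16_t *p = blocks[i];
        osMessageQueuePut(free_q, &p, 0, 0);       // pre-fill with every block
    }
}

static int16_t *block_alloc(void)                  // callable from an ISR (timeout 0)
{
    int16_t *p = NULL;
    osMessageQueueGet(free_q, &p, NULL, 0);
    return p;                                      // NULL means overrun
}

static void block_free(int16_t *p)                 // called by the receiving task when done
{
    osMessageQueuePut(free_q, &p, 0, 0);
}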
To summarise, possible solutions are:
Ensure the receiving thread will process one buffer before the next is available (fast, deterministic and appropriate priority with respect to other threads), and use an appropriate queue length such that any overrun is detectable.
Use a queue of buffers rather than a queue of pointers: memcpy the data into the message and enqueue it.
Use true double-buffering (if supported) and switch DMA buffers dynamically from a memory pool.
One final point. If you are using an STM32 part with a data cache (e.g. STM32F7xx Cortex-M7 parts), your DMA buffers must be located in a non-cacheable region (by MPU configuration) - otherwise you will slow down your processor considerably by constantly invalidating the cache to read coherent DMA data, and you are unlikely to get correct data if you don't. If you use a CMSIS RTOS memory pool in that case, you will need to use the osMemoryPoolNew attr structure parameter to provide a suitable memory block rather than using the kernel memory allocator.
If you're stuck with half-buffer notifications because of a hardware limitation, one possibility is to copy from the half buffer into another buffer from a larger pool.
You'll (probably) eventually need to do this anyway so you don't lose data (as you're experiencing) and so you can bridge the non-cached/cached gap. Your hardware ping-pong DMA buffer is necessarily going to be non-cached, and you'll want whatever buffer you actually work with, particularly if you're doing filtering or other post-processing on it, to be cached.
You can wait on a queue in the ISR (same stipulation... timeout must be 0), so have the ISR get a buffer from the "empty" queue, fill it, then put it in the "filled" queue. The application takes from the "filled" queue, processes, and returns the buffer to the "empty" queue.
If the ISR ever encounters a situation where it can't get an "empty" buffer, you need to decide how to handle that (skip? halt?). It basically means the application repeatedly ran over its deadlines until the queue emptied. If it's a transient load, you can increase the queue depth and use more buffers to cover the transient. If it just slowly gets there and can't recover, you need to optimize your application or decide how to gracefully drop data, because you can't process data fast enough in general.
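A sketch of that pattern with CMSIS RTOS 2 queues, reusing p_buf and BUFFER_SIZE from the question (names are illustrative; empty_q must be pre-filled with pointers to pool buffers at init, and only the half-complete callback is shown):
#include <string.h>

static osMessageQueueId_t empty_q, filled_q;                  // both hold int16_t * elements

void HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef *hadc)
{
    int16_t *dst = NULL;
    if (osMessageQueueGet(empty_q, &dst, NULL, 0) == osOK)    // timeout 0: legal in an ISR
    {
        memcpy(dst, (const int16_t *)p_buf[0], BUFFER_SIZE * sizeof(int16_t));
        osMessageQueuePut(filled_q, &dst, 0, 0);
    }
    else
    {
        // no empty buffer: the application overran its deadlines; count or flag it here
    }
}

static void consumer_thread(void *argument)
{
    for (;;)
    {
        int16_t *src = NULL;
        osMessageQueueGet(filled_q, &src, NULL, osWaitForever);
        // process BUFFER_SIZE samples at src ...
        osMessageQueuePut(empty_q, &src, 0, 0);               // return the buffer to the pool
    }
}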
You can get away with using ring buffers where only one side modifies the write pointer and the other the read pointer, but if you've got OS queues that work across ISRs, they make the code cleaner and more obvious about what's going on.
Related question:
I have 2 buffers of size N. I want to write to the buffers from different threads without using locks.
I maintain a buffer index (0 or 1) and an offset where the next write to the buffer starts. If I can get the current offset and set the offset to offset + len_of_the_msg atomically, it will guarantee that different threads will not overwrite each other. I also have to take care of buffer overflow: once a buffer is full, switch buffers and set the offset to 0.
Tasks to do, in order:
set a = offset
increment offset by msg_len
if offset > N: switch buffer, set a to 0, set offset to msg_len
I am implementing this in C. Compiler is gcc.
How can I do these operations atomically without using locks? Is it possible to do so?
EDIT:
I don't have to use 2 buffers. What I want to do is "Collect log messages from different threads into a buffer and send the buffer to a server once some buffer usage threshold is reached".
re: your edit:
I don't have to use 2 buffers. What I want to do is: Collect log messages from different threads into a buffer and send the buffer to a server once some buffer usage threshold is reached
A lock-free circular buffer could maybe work, with the reader collecting all data up to the last written entry. Extending an existing MPSC or MPMC queue based on using an array as a circular buffer is probably possible; see below for hints.
Verifying that all entries have been fully written is still a problem, though, as are variable-width entries. Doing that in-band with a length + sequence number would mean you couldn't just send the byte-range to the server, and the reader would have to walk through the "linked list" (of length "pointers") to check the sequence numbers, which is slow when they inevitably cache miss. (And can possibly false-positive if stale binary data from a previous time through the buffer happens to look like the right sequence number, because variable-length messages might not line up the same way.)
Perhaps a secondary array of fixed-width start/end-position pairs could be used to track "done" status by sequence number. (Writers store a sequence number with a release-store after writing the message data. Readers seeing the right sequence number know that data was written this time through the circular buffer, not last time. Sequence numbers provide ABA protection vs. a "done" flag that the reader would have to unset as it reads. The reader can indicate its read position with an atomic integer.)
I'm just brainstorming ATM, I might get back to this and write up more details or code, but I probably won't. If anyone else wants to build on this idea and write up an answer, feel free.
It might still be more efficient to do some kind of non-lock-free synchronization that makes sure all writers have passed a certain point. Or if each writer stores the position it has claimed, the reader can scan that array (if there are only a few writer threads) and find the lowest not-fully-written position.
I'm picturing that a writer should wake the reader (or even perform the task itself) after detecting that its increment has pushed the used space of the queue up past some threshold. Make the threshold a little higher than you normally want to actually send with, to account for partially-written entries from previous writers not actually letting you read this far.
If you are set on switching buffers:
I think you probably need some kind of locking when switching buffers. (Or at least stronger synchronization to make sure all claimed space in a buffer has actually been written.)
But within one buffer, I think lockless is possible. Whether that helps a lot or a little depends on how you're using it. Bouncing cache lines around is always expensive, whether that's just the index, or whether that's also a lock plus some write-index. And also false sharing at the boundaries between two messages, if they aren't all 64-byte aligned (to cache line boundaries.)
The biggest problem is that the buffer-number can change while you're atomically updating the offset.
It might be possible with a separate offset for each buffer, and some extra synchronization when you change buffers.
Or you can pack the buffer-number and offset into a single 64-bit struct that you can attempt to CAS with atomic_compare_exchange_weak. That can let a writer thread claim that amount of space in a known buffer. You do want CAS, not fetch_add because you can't build an upper limit into fetch_add; it would race with any separate check.
So you read the current offset, check there's enough room, then try to CAS with offset+msg_len. On success, you've claimed that region of that buffer. On fail, some other thread got it first. This is basically the same as what a multi-producer queue does with a circular buffer, but we're generalizing to reserving a byte-range instead of just a single entry with CAS(&write_idx, old, old+1).
(Maybe possible to use fetch_add and abort if the final offset+len you got goes past the end of the buffer. If you can avoid doing any fetch_sub to undo it, that could be good, but it would be worse if you had multiple threads trying to undo their mistakes with more modifications. That would still leave the possible problem of a large message stopping other small messages from packing into the end of a buffer, given some orderings. CAS avoids that because only actually-usable offsets get swapped in.)
But then you also need a mechanism to know when that writer has finished storing to that claimed region of the buffer. So again, maybe extra synchronization around a buffer-change is needed for that reason, to make sure all pending writes have actually happened before we let readers touch it.
A MPMC queue using a circular buffer (e.g. Lock-free Progress Guarantees) avoids this by only having one buffer, and giving writers a place to mark each write as done with a release-store, after they claimed a slot and stored into it. Having fixed-size slots makes this much easier; variable-length messages would make that non-trivial or maybe not viable at all.
The "claim a byte-range" mechanism I'm proposing is very much what lock-free array-based queues, to, though. A writer tries to CAS a write-index, then uses that claimed space.
Obviously all of this would be done with C11 #include <stdatomic.h> for _Atomic size_t offsets[2], or with GNU C builtin __atomic_...
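As a rough sketch of just the claim step with C11 atomics (names are made up; this only reserves a byte range in the current buffer and deliberately does not attempt the buffer switch or the "has the previous writer finished?" problem discussed above):
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BUF_LEN 4096

typedef struct { uint32_t buf; uint32_t off; } pos_t;     // packed into 64 bits

static _Atomic uint64_t g_pos;                            // current {buffer, offset}
static char buffers[2][BUF_LEN];

static uint64_t pack(pos_t p)      { return ((uint64_t)p.buf << 32) | p.off; }
static pos_t    unpack(uint64_t v) { return (pos_t){ (uint32_t)(v >> 32), (uint32_t)v }; }

// Try to claim msg_len bytes in the current buffer; on success *out says where.
static bool claim(uint32_t msg_len, pos_t *out)
{
    uint64_t old = atomic_load(&g_pos);
    for (;;)
    {
        pos_t p = unpack(old);
        if (p.off + msg_len > BUF_LEN)
            return false;                    // buffer full: switching buffers needs extra synchronization
        pos_t q = { p.buf, p.off + msg_len };
        if (atomic_compare_exchange_weak(&g_pos, &old, pack(q)))
        {
            out->buf = p.buf;                // claimed [p.off, p.off + msg_len) in buffers[p.buf]
            out->off = p.off;
            return true;
        }
        // CAS failed: 'old' was reloaded with the current value, so just retry
    }
}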
I believe this is not solvable in a lock-free manner, unless you're only ruling out OS-level locking primitives and can live with brief spin locks in application code (which would be a bad idea).
For discussion, let's assume your buffers are organized this way:
#define MAXBUF 100

struct mybuffer {
    char data[MAXBUF];
    int  index;
};

struct mybuffer Buffers[2];

int currentBuffer = 0; // switches between 0 and 1
Though parts of this can be done with atomic-level primitives, in this case the entire operation has to be done atomically, so it is really one big critical section. I cannot imagine any compiler with a unicorn primitive for this.
Looking at the GCC __atomic_add_fetch() primitive: this adds a given value (the message size) to a variable (the current buffer index), returning the new value; this way you could test for overflow.
Looking at some rough code that is not correct:
// THIS IS ALL WRONG!
int oldIndex = Buffers[currentBuffer].index;
if (__atomic_add_fetch(&Buffers[currentBuffer].index, mysize, _ATOMIC_XX) > MAXBUF)
{
    // overflow, must switch buffers
    // do same thing with new buffer
    // recompute oldIndex
}
// copy your message into Buffers[currentBuffer] at oldIndex
This is wrong in every way, because at almost every point some other thread could sneak in and change things out from under you, causing havoc.
What if your code grabs the oldIndex that happens to be from buffer 0, but then some other thread sneaks in and changes the current buffer before your if test even gets to run?
The __atomic_add_fetch() would then be allocating data in the new buffer but you'd copy your data to the old one.
This is the NASCAR of race conditions; I do not see how you can accomplish this without treating the whole thing as a critical section, making other threads wait their turn.
void addDataTobuffer(const char *msg, size_t n)
{
    assert(n <= MAXBUF); // avoid danger

    // ENTER CRITICAL SECTION
    struct mybuffer *buf = &Buffers[currentBuffer];

    // is there room in this buffer for the entire message?
    // if not, switch to the other buffer.
    //
    // QUESTION: do messages have to fit entirely into a buffer
    //           (as this code assumes), or can they be split across buffers?
    if ((buf->index + n) > MAXBUF)
    {
        // QUESTION: there is unused data at the end of this buffer,
        //           do we have to fill it with NUL bytes or something?
        currentBuffer = (currentBuffer + 1) % 2; // switch buffers
        buf = &Buffers[currentBuffer];
    }

    int myindex = buf->index;
    buf->index += n;

    // copy your data into the buffer at myindex;
    // LEAVE CRITICAL SECTION
}
We don't know anything about the consumer of this data, so we can't tell how it gets notified of new messages, or if you can move the data-copy outside the critical section.
But everything inside the critical section MUST be done atomically, and since you're using threads anyway, you may as well use the primitives that come with thread support. Mutexes probably.
One benefit of doing it this way, in addition to avoiding race conditions, is that the code inside the critical section doesn't have to use any of the atomic primitives and can just be ordinary (but careful) C code.
An additional note: it's possible to roll your own critical section code with some interlocked exchange shenanigans, but this is a terrible idea because it's easy to get wrong, makes the code harder to understand, and avoids tried-and-true thread primitives designed for exactly this purpose.
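For completeness, a minimal sketch of that critical section guarded by a POSIX mutex, using the declarations from above (this assumes pthreads are available; the copy is kept inside the lock for simplicity, and the buffer reset on switch is one possible policy):
#include <assert.h>
#include <pthread.h>
#include <string.h>

static pthread_mutex_t bufferLock = PTHREAD_MUTEX_INITIALIZER;

void addDataToBuffer(const char *msg, size_t n)
{
    assert(n <= MAXBUF);

    pthread_mutex_lock(&bufferLock);              // ENTER CRITICAL SECTION

    struct mybuffer *buf = &Buffers[currentBuffer];
    if ((buf->index + n) > MAXBUF)                // not enough room: switch buffers
    {
        currentBuffer = (currentBuffer + 1) % 2;
        buf = &Buffers[currentBuffer];
        buf->index = 0;                           // assumes the consumer has already drained this buffer
    }
    int myindex = buf->index;
    buf->index += n;
    memcpy(&buf->data[myindex], msg, n);          // ordinary C inside the lock, no atomics needed

    pthread_mutex_unlock(&bufferLock);            // LEAVE CRITICAL SECTION
}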
I want some feedback or suggestions on how to design a server that handles variable-size messages.
To simplify the answer, let's assume:
single thread epoll() based
the protocol is: data-size + data
data is stored on a ringbuffer
the read code, with some simplification for clarity, looks like this:
if (client->readable) {
    if (client->remaining > 0) {
        /* SIMPLIFIED FOR CLARITY - assume we are always able to read 1+ bytes */
        rd = read(client->sock, client->buffer, client->remaining);
        client->buffer += rd;
        client->remaining -= rd;
    } else {
        /* SIMPLIFIED FOR CLARITY - assume we are always able to read 4 bytes */
        read(client->sock, &(client->remaining), 4);
        client->buffer = acquire_ringbuf_slot(client->remaining);
    }
}
Please do not focus on the 4 bytes; just assume we have the data size at the beginning. Whether it is compressed or not makes no difference for this discussion.
Now, the questions are:
1. What is the best way to do the above, assuming both small data (a few bytes) and large data (MBs)?
2. How can we reduce the number of read() calls? E.g. if we have 4 messages of 16 bytes each on the stream, it seems wasteful to make 8 calls to read().
3. Are there better alternatives to this design?
PART of the solution depends on the transport layer protocol you use.
I assume you are using TCP which provides connection oriented and reliable communication.
From your code I assume you understand that TCP is a stream-oriented protocol
(so when a client sends a piece of data, that data is stored in the socket send buffer, and TCP may use one or more TCP segments to convey it to the other end, the server).
So the code looks very good so far (considering you have error checks and other things in the real code).
Now for your questions, here are my responses, based on what I think is best from my experience (but there could be better solutions):
1-This is a solution with challenges similar to how an OS manages memory, dealing with fragmentation.
For handling different message sizes, you have to understand there are always trade-offs depending on your performance goals.
One solution to improve memory utilization and parallelization is to have a list of free buffer chunks of a certain size, say 4 KB.
You retrieve as many as you need for storing the received message. The last one will contain unused space; that is the internal-fragmentation trade-off.
The drawback comes when you need to apply certain kinds of processing (maybe a visitor pattern) to the message, such as parsing/routing/transformation/etc. It will be more complex and less efficient than processing one huge buffer of contiguous memory. On the other hand, the drawback of a huge buffer is much less efficient memory utilization, memory bottlenecks, and less parallelization.
You can implement something smarter in between (think about chunks that are also contiguous whenever available), always depending on your goals. Something useful is to implement an abstraction over the fragmented memory so that every function (or visitor) that is applied works as if it were dealing with contiguous memory.
If you use these chunks, when the message has been processed and dropped/forwarded/eaten/whatever, you return the unused chunks to the list of free chunks.
2-The number of read calls will depend on how fast TCP conveys the data from client to server. Remember this is stream oriented and you don't have much control over it. Of course, I'm assuming you try to read the max possible (remaining) data in each read.
If you use the chunks I mentioned above the max data to read will also depend on the chunk size.
Something you can do at the TCP layer is to increase the server's receive buffer. That way it can receive more data even when the server cannot read it fast enough.
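On POSIX systems that is a setsockopt() call on the listening/accepted socket; the size below is only illustrative, and the kernel may cap or round it:
#include <stdio.h>
#include <sys/socket.h>

static void grow_rcvbuf(int sock)
{
    int rcvbuf = 1 << 20;      // 1 MiB, for illustration only
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof rcvbuf) < 0)
        perror("setsockopt(SO_RCVBUF)");
}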
3-The ring buffer is OK; if you use chunks, the ring buffer should provide the abstraction. But I don't know why you need a ring buffer.
I like ring buffers because there is a way of implementing producer-consumer synchronization without locking (Linux Kernel uses this for moving packets from L2 to IP layer) but I don't know if that's your goal.
To pass messages to other components and/or upper-layers you could also use ring buffers of pointers to messages.
A better design may be as follows:
Set up your user-space socket read buffer to be the same size as the kernel socket buffer. If your user-space socket read buffer is smaller, then you would need more than one read syscall to read the kernel buffer. If your user-space buffer is bigger, then the extra space is wasted.
Your read function should read as much data as possible in a single read syscall. This function must not know anything about the protocol, so you do not need to re-implement it for different wire formats.
When your read function has read into the user-space buffer it should call a callback passing the iterators to the data available in the buffer. That callback is a parser function that should extract all available complete messages and pass these messages to another higher-level callback. Upon return the parser function should return the number of bytes consumed, so that these bytes can be discarded from the user-space socket buffer.
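A rough sketch of that layering (struct conn, parse_fn and on_readable are illustrative names; the epoll loop, EAGAIN handling and buffer sizing are left out):
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

// Parser callback: consumes complete messages from [data, data + len) and
// returns how many bytes it consumed (possibly 0 if a message is incomplete).
typedef size_t (*parse_fn)(const char *data, size_t len, void *ctx);

struct conn {
    int    sock;
    char   rdbuf[64 * 1024];   // ideally sized to match the kernel socket buffer
    size_t used;
};

// Protocol-agnostic read step: one read() syscall, then hand everything to the parser.
static int on_readable(struct conn *c, parse_fn parse, void *ctx)
{
    ssize_t rd = read(c->sock, c->rdbuf + c->used, sizeof(c->rdbuf) - c->used);
    if (rd <= 0)
        return -1;                           // error or EOF; real code would check errno
    c->used += (size_t)rd;

    size_t consumed = parse(c->rdbuf, c->used, ctx);
    if (consumed > 0) {                      // discard consumed bytes, keep the partial tail
        memmove(c->rdbuf, c->rdbuf + consumed, c->used - consumed);
        c->used -= consumed;
    }
    return 0;
}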
I have two processes, a producer and a consumer. IPC is done with OpenFileMapping/MapViewOfFile on Win32.
The producer receives video from another source, which it then passes over to the consumer; synchronization is done through two events.
For the producer:
Receive frame
Copy to shared memory using CopyMemory
Trigger DataProduced event
Wait for DataConsumed event
For the consumer:
Indefinitely wait for DataProducedEvent
Copy frame to own memory and send for processing
Signal DataConsumed event
Without any of this, the video averages at 5fps.
If I add the events on both sides, but without the CopyMemory, it's still around 5fps though a tiny bit slower.
When I add the CopyMemory operation, it goes down to 2.5-2.8fps. Memcpy is even slower.
I find it hard to believe that a simple memory copy can cause this kind of slowdown.
Any ideas on a remedy?
Here's my code to create the shared mem:
HANDLE fileMap = CreateFileMapping(INVALID_HANDLE_VALUE, 0, PAGE_READWRITE, 0, fileMapSize, L"foomap");
void* mapView = MapViewOfFile(fileMap, FILE_MAP_WRITE | FILE_MAP_READ, 0, 0, fileMapSize);
The size is 1024 * 1024 * 3
Edit - added the actual code:
On the producer:
void OnFrameReceived(...)
{
    // get buffer
    BYTE *buffer = 0;
    ...
    // copy data to shared memory
    CopyMemory(((BYTE*)mapView) + 1, buffer, length);
    // signal data event
    SetEvent(dataProducedEvent);
    // wait for it to be signaled back!
    WaitForSingleObject(dataConsumedEvent, INFINITE);
}
On the consumer:
while (WAIT_OBJECT_0 == WaitForSingleObject(dataProducedEvent, INFINITE))
{
    SetEvent(dataConsumedEvent);
}
Well, it seems that copying from the DirectShow buffer onto shared memory was the bottleneck after all. I tried using a Named Pipe to transfer the data over and guess what - the performance is restored.
Does anyone know of any reasons why this may be?
To add a detail that I didn't think was relevant before: the producer is injected and hooks onto a DirectShow graph to retrieve the frames.
Copying of memory involves certain operations under the hood, and for video this can be significant.
I'd try another route: create a shared block for each frame or group of frames. Name them consecutively, i.e. block1, block2, block3 etc., so that the recipient knows which block to read next. Now receive the frame directly into the allocated blockX, notify the consumer about the availability of the new block, and allocate and start using another block immediately. The consumer maps the block and doesn't copy it - the block belongs to the consumer now, and the consumer can use the original buffer in further processing. Once the consumer closes its mapping of the block, that mapping is destroyed. So you get a stream of blocks and avoid blocking.
If frame processing doesn't take much time and creation of a shared block does, you can create a pool of shared blocks, large enough to ensure that the producer and consumer never attempt to use the same block (you can complicate the scheme by using a semaphore or mutex to guard each block).
Hope my idea is clear - avoid copying by using the block in the producer, then in the consumer.
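For illustration, the producer side of that could look roughly like this (the block name, size and function are made up; error handling and the notification events are left out, and the consumer would OpenFileMapping the same name and process the frame in place):
#include <stdio.h>
#include <windows.h>

// Sketch: create one named mapping per frame and hand its view to the capture
// code so the frame lands there directly, with no intermediate copy.
static void *create_frame_block(int frameNumber, DWORD size, HANDLE *outMap)
{
    wchar_t name[64];
    swprintf(name, 64, L"block%d", frameNumber);          // "block1", "block2", ...

    HANDLE map = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                    0, size, name);
    if (map == NULL)
        return NULL;

    void *view = MapViewOfFile(map, FILE_MAP_WRITE | FILE_MAP_READ, 0, 0, size);
    if (view == NULL) {
        CloseHandle(map);
        return NULL;
    }
    *outMap = map;     // keep this handle open until the consumer has unmapped its view
    return view;       // the capture code writes the frame here directly
}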
The time it takes to copy 3MB of memory really shouldn't be at all noticeable. A quick test on my old (and busted) laptop was able to complete 10,000 memcpy(buf1, buf2, 1024 * 1024 * 3) operations in around 10 seconds. At 1/1000th of a second it shouldn't be slowing down your frame rate by a noticeable amount.
Regardless, it would seem that there is probably some optimisation that could be made to speed things up. Currently you seem to be either double- or triple-handling the data. Double handling because you "receive the frame" and then "copy to shared memory". Triple handling if "copy frame to own memory and send for processing" means that you truly copy to a local buffer and then process, instead of just processing from the buffer.
The alternative is to receive the frame into the shared buffer directly and process it directly out of that buffer. If, as I suspect, you want to be able to receive one frame while processing another, you just increase the size of the memory mapping to accommodate more than one frame and use it as a circular array. On the consumer side it would look something like this.
char *sharedMemory;
int frameNumber = 0;
...
WaitForSingleObject(...);   // Consume data produced event
char *frame = &sharedMemory[FRAME_SIZE * (frameNumber++ % FRAMES_IN_ARRAY_COUNT)];
processFrame(frame);
ReleaseSemaphore(...);      // Generate data consumed event
And the producer:
char *sharedMemory;
int frameNumber = 0;
...
WaitForSingleObject(...);   // Consume data consumed event
char *frame = &sharedMemory[FRAME_SIZE * (frameNumber++ % FRAMES_IN_ARRAY_COUNT)];
receiveFrame(frame);
ReleaseSemaphore(...);      // Generate data produced event
Just make sure that the data consumed semaphore is initialised to FRAMES_IN_ARRAY_COUNT and the data produced semaphore is initialised to 0.
Recently an interviewer asked me about the types of buffers. What types of buffers are there? This question actually came up when I said I would be writing all the system calls to a log file to monitor the system. He said it would be slow to write each and every call to a file and asked how to prevent that. I said I would use a buffer, and he asked what type of buffer. Can someone explain the types of buffers to me?
In C under UNIX (and probably other operating systems as well), there are usually two buffers, at least in your given scenario.
The first exists in the C runtime libraries where information to be written is buffered before being delivered to the OS.
The second is in the OS itself, where information is buffered until it can be physically written to the underlying media.
As an example, we wrote a logging library many moons ago that forced information to be written to the disk so that it would be there if either the program crashed or the OS crashed.
This was achieved with the sequence:
fflush (fh); fsync (fileno (fh));
The first of these actually ensured that the information was handed from the C runtime buffers to the operating system, the second that it was written to disk. Keep in mind that this is an expensive operation and should only be done if you absolutely need the information written immediately (we only did it at the SUPER_ENORMOUS_IMPORTANT logging level).
To be honest, I'm not entirely certain why your interviewer thought it would be slow unless you're writing a lot of information. The two levels of buffering already there should perform quite adequately. If it was a problem, then you could just introduce another layer yourself which wrote the messages to an in-memory buffer and then delivered them in a single fprintf-type call when the buffer was about to overflow.
But, unless you do it without any function calls, I can't see it being much faster than what the fprintf-type buffering already gives you.
Following clarification in comments that this question is actually about buffering inside a kernel:
Basically, you want this to be as fast, efficient and workable (not prone to failure or resource shortages) as possible.
Probably the best bet would be a buffer, either statically allocated or dynamically allocated once at boot time (you want to avoid the possibility that dynamic re-allocation will fail).
Others have suggested a ring (or circular) buffer but I wouldn't go that way (technically) for the following reason: the use of a classical circular buffer means that to write out the data when it has wrapped around will take two independent writes. For example, if your buffer has:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|s|t|r|i|n|g| |t|o| |w|r|i|t|e|.| | | | | | |T|h|i|s| |i|s| |t|h|e| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                 ^           ^
                                 |           |
                   Buffer next --+           +-- Buffer start
then you'll have to write "This is the " followed by "string to write.".
Instead, maintain the next pointer and, if the bytes in the buffer plus the bytes to be added are less than the buffer size, just add them to the buffer with no physical write to the underlying media.
Only if you are going to overflow the buffer do you start doing tricky stuff.
You can take one of two approaches:
Either flush the buffer as it stands, set the next pointer back to the start for processing the new message; or
Add part of the message to fill up the buffer, then flush it and set the next pointer back to the start for processing the rest of the message.
I would probably opt for the second given that you're going to have to take into account messages that are too big for the buffer anyway.
What I'm talking about is something like this:
initBuffer:
    create buffer of size 10240 bytes.
    set bufferEnd to end of buffer + 1
    set bufferPointer to start of buffer
    return

addToBuffer (size, message):
    while size != 0:
        xfersz = minimum (size, bufferEnd - bufferPointer)
        copy xfersz bytes from message to bufferPointer
        message = message + xfersz
        bufferPointer = bufferPointer + xfersz
        size = size - xfersz
        if bufferPointer == bufferEnd:
            write buffer to underlying media
            set bufferPointer to start of buffer
        endif
    endwhile
That basically handles messages of any size efficiently by reducing the number of physical writes. There will be optimisations of course - it's possible that the message may have been copied into kernel space so it makes little sense to copy it to the buffer if you're going to write it anyway. You may as well write the information from the kernel copy directly to the underlying media and only transfer the last bit to the buffer (since you have to save it).
In addition, you'd probably want to flush an incomplete buffer to the underlying media if nothing had been written for a time. That would reduce the likelihood of missing information on the off chance that the kernel itself crashes.
Aside: Technically, I guess this is sort of a circular buffer but it has special case handling to minimise the number of writes, and no need for a tail pointer because of that optimisation.
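In plain C (kernel specifics aside), that pseudocode might look roughly like this; write_to_media() stands in for whatever write primitive the underlying driver actually provides:
#include <stddef.h>
#include <string.h>

#define LOGBUF_SIZE 10240

static char   logBuf[LOGBUF_SIZE];
static size_t logUsed = 0;

extern void write_to_media(const char *data, size_t len);   // assumed to exist elsewhere

void addToBuffer(const char *message, size_t size)
{
    while (size != 0)
    {
        size_t room   = LOGBUF_SIZE - logUsed;
        size_t xfersz = (size < room) ? size : room;

        memcpy(logBuf + logUsed, message, xfersz);
        message += xfersz;
        logUsed += xfersz;
        size    -= xfersz;

        if (logUsed == LOGBUF_SIZE)          // buffer full: one physical write, then reuse it
        {
            write_to_media(logBuf, LOGBUF_SIZE);
            logUsed = 0;
        }
    }
}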
There are also ring buffers which have bounded space requirements and are probably best known in the Unix dmesg facility.
What comes to mind for me is time-based buffers and size-based. So you could either just write whatever is in the buffer to file once every x seconds/minutes/hours or whatever. Alternatively, you could wait until there are x log entries or x bytes worth of log data and write them all at once. This is one of the ways that log4net and log4J do it.
Overall, there are "First-In-First-Out" (FIFO) buffers, also known as queues; and there are "Latest*-In-First-Out" (LIFO) buffers, also known as stacks.
To implement FIFO, there are circular buffers, which are usually employed where a fixed-size byte array has been allocated. For example, a keyboard or serial I/O device driver might use this method. This is the usual type of buffer used when it is not possible to dynamically allocate memory (e.g., in a driver which is required for the operation of the Virtual Memory (VM) subsystem of the OS).
Where dynamic memory is available, FIFO can be implemented in many ways, particularly with linked-list derived data structures.
Also, binomial heaps implementing priority queues may be used for the FIFO buffer implementation.
A particular case of neither FIFO nor LIFO buffer is the TCP segment reassembly buffers. These may hold segments received out-of-order ("from the future") which are held pending the receipt of intermediate segments not-yet-arrived.
* My acronym is better, but most would call LIFO "Last In, First Out", not Latest.
Correct me if I'm wrong, but wouldn't using a mmap'd file for the log avoid both the overhead of small write syscalls and the possibility of data loss if the application (but not the OS) crashed? It seems like an ideal balance between performance and reliability to me.
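A rough sketch of what that could look like on POSIX (names are illustrative; the region is fixed-size, and growing the file, wrapping and full error handling are left out):
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LOG_SIZE (1 << 20)

static char  *log_base;
static size_t log_off;

static int log_open(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, LOG_SIZE) < 0)
        return -1;
    log_base = mmap(NULL, LOG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                               // the mapping stays valid after close
    return (log_base == MAP_FAILED) ? -1 : 0;
}

static void log_write(const char *msg, size_t n)
{
    if (log_off + n <= LOG_SIZE) {
        memcpy(log_base + log_off, msg, n);  // no write() syscall per message
        log_off += n;
    }
    // dirty pages are written back by the OS even if the application crashes
    // (though not if the kernel itself goes down before writeback)
}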
Platform: ARM9
Programming Language C
Requirements - plain C and no external libs and no boost.
OS - REX RTOS
I have two threads running in an embedded platform -
one is at driver level handling all the comms and data transfer with the hardware.
the second thread runs the application that uses the data to/from the hardware.
The idea is to decouple the app thread from the driver thread so we can change the hardware and implementation in the hardware driver thread but have minimal impact on the application thread.
My challenge is that the data received from the hardware may be dynamic i.e. we do not know upfront how much memory the application thread should set aside for each request to/from the hardware as this is determined at run-time.
I was thinking the driver thread could inform the application thread that there is so much data to read. The application thread then allocates the memory and requests the driver thread to read the data. It is then up to the application thread to process the data accordingly. This way, all the memory management is within the application thread.
A couple of options come to mind:
1) malloc the memory in the driver, free it in the app. But... we tend to avoid malloc use in anything that approaches a real time requirement. If you have access to malloc/free, and there are no "real time" concerns or memory fragmentation issues (i.e. your heap is large enough), then this is a fairly easy approach. The driver just sends the allocated pointer to the app thread via a message queue and the app free's the memory when done. Watch out for memory leaks.
2) Ring or circular buffers. The driver completely manages a fixed-size ring buffer and simply sends a message to the application when a buffer is ready. See here for some details: Circular buffer. Then the application marks the data "available" again via a driver API, which helps hide the ring buffer details from the app thread. We use this approach for one of our drivers that has a very similar set of requirements to what you describe. In this case, you need to be concerned with determining the "best" size for the ring buffer, overflow handling in the driver, etc.
good luck!
You don't specify an OS, but you somehow have "threads". Except that one of them is at driver level (interrupt handler) and the other sounds like an application (userland/kernel). But that doesn't match up either, because your driver and app are communicating before the data is even processed.
Your terminology is confusing and not encouraging. Is this a homebrew (RT)OS or not?
If you have a real OS, there are established methods for writing drivers and handing data to userland. Read the documentation or use one of the existing drivers as reference.
If this is a custom OS, you can still refer to other open source drivers for ideas, but you clearly won't have things set up as conveniently. Preallocate all the memory in the driver code, fill it with data as it arrives, and hand it off to the application code. The amount of memory will be a function of how fast your app can process data, the largest amount of data you plan to accept, and how much internal data queuing is needed to support your app.
This being C, I have ended up having to make the app register a callback with the driver. The purpose of the callback is to process the data after the driver reads it from the device. The driver manages the memory i.e. allocates memory, invokes the callback and finally frees memory. Additionally, the callback only has read permission on the memory. Therefore, the app should ideally just copy the contents of the buffer to its own memory and exit from the callback right away. Then it is free to process the data when and how it wishes.
I updated the documentation to make it clear to users of the app callback that it is assumed that when the callback returns, the memory should no longer be considered valid. If the callback is used any other way, the behavior is undefined.
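For illustration, that contract might look roughly like this (all names are made up; in the real driver the buffer is allocated before the call and freed immediately after it returns):
#include <stddef.h>
#include <string.h>

typedef void (*rx_callback_t)(const unsigned char *data, size_t len);

static rx_callback_t app_cb;

void driver_register_callback(rx_callback_t cb)
{
    app_cb = cb;
}

/* Driver side: owns the memory, invokes the callback, then releases the memory. */
static void driver_on_rx(const unsigned char *rx_data, size_t len)
{
    if (app_cb != NULL)
        app_cb(rx_data, len);     /* read-only view, valid only for the duration of the call */
}

/* Application side: copy out immediately, process later on its own thread. */
static unsigned char app_copy[512];
static size_t app_copy_len;

static void my_callback(const unsigned char *data, size_t len)
{
    if (len > sizeof(app_copy))
        len = sizeof(app_copy);
    memcpy(app_copy, data, len);  /* do not keep 'data'; it is invalid after return */
    app_copy_len = len;
}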
My first thought would be to use circular buffers. Here is some example code. Feel free to adapt this to your own uses. You probably wouldn't want global variables. And you might not want #defines:
#include <stdint.h>

typedef uint8_t uint8;

#define LENGTH (1024)       /* must be a power of two for the MASK trick below */
#define MASK   (LENGTH-1)

uint8 circularBuffer[ LENGTH ];
int circularBuffer_add = 0;
int circularBuffer_rmv = 0;

void copyIn( uint8 * circularBuffer, uint8 * inputBuffer, int n ) {
    int i;
    for( i = 0; i < n; i++ ) {
        circularBuffer[ circularBuffer_add ] = inputBuffer[ i ];
        circularBuffer_add = ( circularBuffer_add + 1 ) & MASK;
    }
}

void copyOut( uint8 * circularBuffer, uint8 * outputBuffer, int n ) {
    int i;
    for( i = 0; i < n; i++ ) {
        outputBuffer[ i ] = circularBuffer[ circularBuffer_rmv ];
        circularBuffer_rmv = ( circularBuffer_rmv + 1 ) & MASK;
    }
}
Also, the above code assumes that your unit of data is the datatype "uint8". You can change it to use some other datatype, or you can even make it generic and use memcpy() to copy into the circularBuffer.
The main feature of this code is how it handles the add and rmv pointers.
Once you get things working with the above code, I suggest at some point switching all your reads from the hardware over to your platform's direct-memory-access API.
It is important to switch to direct memory access because the above code burns a lot of CPU cycles, whereas DMA uses almost none.