Sending large amount of data from ISR using queues in RTOS - c

I am working on an STM32F401 MCU for audio acquisition, and I am trying to send the audio data (exactly 384 bytes) from an ISR to a task using queues. The ISR frequency is too high, so I believe some data is dropped because the queue is full. The audio recorded when running the code is noisy. Is there an easier way to send large amounts of data from an ISR to a task?
The RTOS used is FreeRTOS and the ISR is the DMA callback from the I2S mic peripheral.

The general approach in these cases is:
Down-sample the raw data received in the ISR (e.g., save only 1 out of 4 samples)
Accumulate a certain number of samples before sending them in a message to the task
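A minimal sketch of both steps combined, in portable C. The names (accumulator_t, accumulator_push), the 1-in-4 ratio and the batch size of 8 are illustrative assumptions, not taken from the question. Note that simply discarding samples without low-pass filtering first can alias, so treat this as an illustration of the bookkeeping only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DECIMATION    4u  /* keep 1 sample out of every 4 (assumed ratio) */
#define BATCH_SAMPLES 8u  /* samples per message (assumed, small for clarity) */

typedef struct {
    int16_t  batch[BATCH_SAMPLES]; /* kept samples collected so far */
    size_t   count;                /* number of valid entries in batch */
    uint32_t phase;                /* position within the decimation cycle */
} accumulator_t;

/* Feed one raw sample. Returns true when a batch has just been completed
 * and copied to `out` (BATCH_SAMPLES entries), ready to be sent as a
 * single message; returns false otherwise. */
bool accumulator_push(accumulator_t *acc, int16_t sample, int16_t *out)
{
    assert(acc != NULL && out != NULL);

    bool keep = (acc->phase == 0u);
    acc->phase = (acc->phase + 1u) % DECIMATION;
    if (!keep) {
        return false;               /* sample discarded by decimation */
    }

    acc->batch[acc->count++] = sample;
    if (acc->count < BATCH_SAMPLES) {
        return false;               /* batch not full yet */
    }

    memcpy(out, acc->batch, sizeof acc->batch);
    acc->count = 0;
    return true;
}
```

In the ISR you would call accumulator_push() for each incoming sample and post a single queue message only when it returns true, which cuts the number of queue operations by a factor of DECIMATION * BATCH_SAMPLES.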

You can implement a "zero copy" queue by creating a queue of pointers to memory blocks rather than copying the memory itself. Have the audio data written directly to a block (by DMA for example), then when full, enqueue a pointer to the block, and switch to the next available block in the pool. The receiving task can then operate directly on the memory block without needing to copy the data into or out of the queue - the only thing copied is the pointer.
The receiving task when done, returns the block back to the pool. The pool should have the same number of blocks as queue length.
To create a memory pool you would start with a static array:
tAudioSample block[QUEUE_LENGTH][BLOCK_SIZE] ;
Then fill a block_pool queue with pointers to each block element - pseudocode:
for( int i = 0; i < QUEUE_LENGTH; i++ )
{
queue_send( block_pool, block[i] ) ;
}
Then to get an "available" block, you simply take a pointer from the queue, fill it, and then send to your audio stream queue, and the receiver when finished with the block posts the pointer back to the block_pool.
Some RTOS provide a fixed block allocator that does exactly what I described above with the block_pool queue. If you are using the CMSIS RTOS API rather than native FreeRTOS API, that provides a memory pool API.
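To make the flow concrete, here is a small self-contained sketch of the block pool. The ptr_queue_t type is a hypothetical stand-in for an RTOS queue so the example runs on a desktop; it is not ISR-safe. Under FreeRTOS I would expect the two queues to be created with xQueueCreate() using an item size of sizeof(tAudioSample *), with the ISR side calling xQueueReceiveFromISR()/xQueueSendFromISR() and the task side calling xQueueReceive()/xQueueSend(). BLOCK_SIZE and the function names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_LENGTH 4u
#define BLOCK_SIZE   192u            /* samples per block (assumed) */

typedef int16_t tAudioSample;

/* Minimal pointer FIFO standing in for an RTOS queue (not ISR-safe). */
typedef struct {
    tAudioSample *slot[QUEUE_LENGTH];
    size_t head, count;
} ptr_queue_t;

static bool queue_send(ptr_queue_t *q, tAudioSample *p)
{
    if (q->count == QUEUE_LENGTH) return false;          /* queue full */
    q->slot[(q->head + q->count) % QUEUE_LENGTH] = p;
    q->count++;
    return true;
}

static bool queue_receive(ptr_queue_t *q, tAudioSample **p)
{
    if (q->count == 0) return false;                     /* queue empty */
    *p = q->slot[q->head];
    q->head = (q->head + 1) % QUEUE_LENGTH;
    q->count--;
    return true;
}

static tAudioSample block[QUEUE_LENGTH][BLOCK_SIZE];
static ptr_queue_t block_pool;    /* free blocks */
static ptr_queue_t audio_stream;  /* filled blocks waiting for the task */

/* Put a pointer to every block into the free pool. */
void pool_init(void)
{
    block_pool.head = block_pool.count = 0;
    audio_stream.head = audio_stream.count = 0;
    for (size_t i = 0; i < QUEUE_LENGTH; i++) {
        bool ok = queue_send(&block_pool, block[i]);
        assert(ok);
        (void)ok;
    }
}

/* Producer (ISR) side: take a free block, or NULL if none is left. */
tAudioSample *producer_get_block(void)
{
    tAudioSample *b = NULL;
    return queue_receive(&block_pool, &b) ? b : NULL;
}

/* Producer side: hand a filled block to the task; only the pointer moves. */
bool producer_submit(tAudioSample *b) { return queue_send(&audio_stream, b); }

/* Consumer (task) side: get the next filled block, or NULL if none. */
tAudioSample *consumer_get_block(void)
{
    tAudioSample *b = NULL;
    return queue_receive(&audio_stream, &b) ? b : NULL;
}

/* Consumer side: return a processed block to the free pool. */
void consumer_release(tAudioSample *b)
{
    bool ok = queue_send(&block_pool, b);
    assert(ok);
    (void)ok;
}
```

If producer_get_block() returns NULL, the consumer has fallen behind by a full pool's worth of blocks, which is a useful thing to count while diagnosing dropped audio.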
However, it sounds like this may be an X-Y problem - you have presented your diagnosis, which may or may not be correct, and decided on a solution which you are then asking for help with. But what if it is the wrong solution, or not the optimum one? Better to include some code showing how the data is generated and consumed, and provide concrete information such as where this data is coming from, how often the ISR is generated, sample rates, the platform it is running on, the priority and scheduling of the receiving task, and what other tasks are running that might delay it.
On most platforms 384 bytes is not a large amount of data, and to cause this problem the interrupt rate would have to be extraordinarily high, or the receiving task would have to be excessively delayed (i.e. not real time) or doing excessive or non-deterministic work. It may not be the ISR frequency that is the problem, but rather the performance and schedulability of the receiving task.
It is not clear whether your 384 bytes result in a single interrupt, in 384 interrupts, or in something else.
That is to say that it may be a more holistic design issue rather than simply how to pass data more efficiently - though that can't be a bad thing.

If the thread receiving the data is called at periodic intervals, the queue should be sized sufficiently to hold all data that may be received in that interval. It would probably be a good idea to make sure the queue is large enough to hold data for at least two intervals.
If the thread receiving the data is simply unable to keep up with the incoming data, then one could consider increasing its priority.
There is some overhead processing associated with each push to and pull from the queue, since FreeRTOS will check to determine whether a higher priority task should wake up in response to the action. When writing or reading multiple items to or from the queue at the same time, it may help to suspend the scheduler while the transfer is taking place.
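Another solution would be to implement a circular buffer and place it into shared memory. This will basically perform the same function as a queue, but without the extra overhead. You may need to protect the buffer against simultaneous access, depending on how the circular buffer is implemented. Note that a FreeRTOS mutex cannot be taken from an ISR, so if the ISR is one of the parties, the protection has to be a short critical section or a lock-free design rather than a mutex.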
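As an illustration, here is a sketch of a circular buffer that needs no lock as long as there is exactly one producer (the ISR) and one consumer (the task). It uses C11 atomics; the names (ring_t, ring_put, ring_get) and the capacity are assumptions, and whether your toolchain provides <stdatomic.h> for the target is something you would need to check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_CAPACITY 8u   /* must be a power of two (assumed size) */

typedef struct {
    int16_t       data[RING_CAPACITY];
    atomic_size_t head;    /* free-running; written only by the producer */
    atomic_size_t tail;    /* free-running; written only by the consumer */
} ring_t;

/* Producer side (e.g. the ISR). Returns false if the buffer is full. */
bool ring_put(ring_t *r, int16_t v)
{
    assert(r != NULL);
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_CAPACITY) {
        return false;                       /* full: caller decides what to drop */
    }
    r->data[head % RING_CAPACITY] = v;
    /* Publish the sample only after it has been written. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side (the task). Returns false if the buffer is empty. */
bool ring_get(ring_t *r, int16_t *v)
{
    assert(r != NULL && v != NULL);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail) {
        return false;                       /* empty */
    }
    *v = r->data[tail % RING_CAPACITY];
    /* Release the slot only after the sample has been read. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

The task still needs some way to sleep until data arrives (a task notification or semaphore given from the ISR, for instance); the ring only replaces the data copy through the queue.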

Related

Accumulate data in receive buffer in order to prevent busy-waiting when epoll_waiting on slow connections

Clients sending sufficient large amount of data with sufficient slow internet connection are causing me to busy-wait in a classic non-blocking server-client setup in C with sockets.
The busy-waiting is caused in detail by this procedure
1. I install EPOLLIN for the client (monitor for receiving data).
2. The client sends data.
3. epoll_wait signals that there is data to be read (EPOLLIN).
4. The coroutine is resumed and the data is consumed, but more data is needed in order to finish this client: EWOULDBLOCK, and back to step 1.
This above procedure is being repeated for minutes (due to the slow internet connection and large data). It's basically just a useless hopping around without doing anything meaningful other than consuming cpu time. Additionally it's kind of killing the purpose of epoll_wait.
So, I wanted to avoid this busy-waiting by some mechanism which does accumulate the data in receive buffer until either a minimum size has been reached or a maximal timeout has passed since the first byte arrived and only then epoll_wait should wake me up with EPOLLIN for this client.
I first looked into tcp(7), I was hoping for something like TCP_CORK but for the receive buffer, but could not find anything.
Then I looked into unix(7) and tried to implement it myself via SIOCINQ right after step 3. The problem is that I end up busy-waiting again because step 3. is immediately going to return because data is available for read. Alternatively I could deregister the client right after 3., but this would block this specific client until epoll_wait returns from a different client.
Is it a stalemate, or is there any solution to the above problem to accumulate data inside receive buffer upon a min size or max time without busy-waiting?
@ezgoing and I chatted at length about this, and I'm convinced this is a non-problem (as @user207421 noted as well).
When I first read the question, I thought perhaps they were worried about tiny amounts (say, 16 bytes at a time), and that would have been worth investigating, but once it turns out that it's 4KiB at a time, it's so routine that this is not worth looking into.
Interestingly, the serial I/O module does support this, with a mode that wakes up only after so many characters are available or so much time has passed, but no such thing with the network module.
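For reference, I am assuming the serial mode meant here is POSIX termios non-canonical input, where c_cc[VMIN] and c_cc[VTIME] set the thresholds. A small sketch (the function name is made up for illustration) that only fills in the struct:

```c
#include <assert.h>
#include <stddef.h>
#include <termios.h>

/* Prepare (but do not apply) non-canonical read thresholds.
 * min_bytes:       read() returns once this many bytes are available...
 * gap_deciseconds: ...or once this much time (in 0.1 s units) passes
 *                  without a further byte arriving after the first one. */
void serial_set_read_threshold(struct termios *tio,
                               unsigned char min_bytes,
                               unsigned char gap_deciseconds)
{
    assert(tio != NULL);
    tio->c_lflag &= ~(tcflag_t)ICANON;   /* non-canonical (raw-ish) input */
    tio->c_cc[VMIN]  = min_bytes;
    tio->c_cc[VTIME] = gap_deciseconds;
}
```

On a real port you would fetch the current settings with tcgetattr(), call this, and apply the result with tcsetattr(fd, TCSANOW, &tio). With both values non-zero the timer is an inter-byte timer that starts after the first byte, so it is "minimum size or maximum gap" rather than a timeout measured from the start of the read.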
The only time this would be worth addressing is if there is actual evidence that it's impacting the application's responsiveness in a meaningful way, not a hypothetical concern for packet rates.

Tricky queue use between two CPUs

cliff notes version
The TI F28377S has two CPUs, a main and a secondary CPU (CLA, which can only perform one task at a time, with uninterrupted tasks) -they share message areas of RAM. When quickly feeding a queue about 15 bytes (of a max 32 queue length) that the CLA will send out, sometimes a few bytes will never be transmitted. I think there is some issue with the CPU interrupts that is causing single bytes to occasionally get "lost" while handing them over to the buffer.
full version
(This is using the TI F28377S which has a main CPU clocked at 200 MHz, and a secondary independent CLA that runs at the same speed, but can only execute one task at a time. They can share one-way writeable variables).
I'm a little stumped on how to do this more complex task, involving the CLA and a queue.
Some quick background: I have two main CLA tasks, the first (Task1) is triggered by the ADC end of conversion (which itself is triggered by Timer0 at 100 kHz), and the second (Task2) is triggered by Timer0 itself (this was arrived at after much experimenting, and tweaking, as whenever I had Task2 running more often than the ADC task, the ADC task would never start - so I set them both up to use the same interval, only staggered). Task1 works perfectly, storing the ADC results in a simplistic ring buffer, and performing a simple calculation in the Task1 after-completion ISR. The second mostly works.
Task2 is used to toggle some GPIO pins for communicating with an external device. Because the total length of the codes are on the order of 100's of microseconds, instead of delaying, I use a simple case structure on each trigger to determine if it should: do nothing, turn on the code pins, turn on the strobe pin, turn off the strobe pin, turn off the code pins. This way each time the task is called, it completes nearly instantaneously, with the output codes being the proper length for the external device. The task works on one code per time, and once it is done, attempts to grab another from a queue. If none, it just continues passing right through.
Now, the tricky part. I have two requirements: 1) that I can add bytes to the end of the queue faster than the task will consume them (pretty easy in theory and practice) and 2) that I can add a byte to the front of the queue (not replacing the currently transmitting byte, just the front of the queue). The first ability is to send medium-short messages (2-20 characters). The second ability is necessary to send a single byte about any external interrupts that come in - as quickly as possible, and even in the middle of transmitting a message. I've set it up so that the Task sends exactly 1 byte per 500 microseconds (~300 "on" and ~200 "off"). This way, if an interrupt message comes in, it will be guaranteed to be received less than 1 ms after occurring.
What currently works is this: a function on the CPU that takes incoming bytes (one at a time) and adds them to a CPU2CLA buffer and increments a CPU2CLA length counter. Each time Task2 is run, it checks this queue and grabs one byte from the front of a CLAonly buffer, increases its own buffer length, and flags that a byte was consumed. When the Task2 after-task ISR is run, it will check if a byte was consumed, and remove the first-most byte from the CPU2CLA buffer. Currently this double buffer system doesn't have a flag for adding to front, so it doesn't take care of the interrupt case.
What I tried previously was to have a Task3 which took one byte that was passed CPU2CLA and run it from the CPU with a Task3andWait. Although this method should in theory take care of both requirements, about half of the time a byte or two of a message would never get transmitted (a single byte always got sent).
A CLA task can never be interrupted, but a CPU task can. This is why I tried to have all modifications of the queue occur only in the CLA, so that way there was never an indeterminate state that could interrupt a queue modification.
It sounds like splitting the high-priority and normal priority items into separate buffers would be a near-optimal solution here.
It would also ensure that if a high-priority item, a normal-priority item, and another high-priority item are produced before anything is consumed, the high-priority items will be consumed before the normal-priority items.
(Using a single buffer, that case leads to the normal-priority item being consumed before the second high-priority item. I suspect that is highly undesirable.)
If there is an item in the high-priority buffer, that will be consumed next. Otherwise, an item in the normal-priority buffer will be consumed.
Both buffers have a single producer and a single consumer (thus, SPSC type), and are handled in a simple first-in-first-out manner; therefore, a lockless circular buffer implementation (for each buffer) should work just fine here.
(If only 32 bytes are available for the two buffers, consider trying a 8:24 split first.)

Ownership of frame in buffer - C programming

I am programming an interface between HW and SW. I know what result I should get, and now I am thinking about how to achieve it most efficiently. I have a sort of circular FIFO buffer into which the operating system writes data and from which the HW reads it. So basically I have a read pointer and a write pointer: the read pointer advances when the DMAC (DMA controller) reads data from memory, and the write pointer advances when my program writes to memory. The basic blocks in this circular FIFO buffer are what I call frames, so I am always reading and writing whole frames. Now I am wondering whether it is possible to indicate who owns a frame (HW or SW). My idea is to put a flag at the beginning of every frame to indicate whether the frame is owned by HW or SW, but I do not know whether I should do it that way or whether there is a better way to do it in C. For example, at the beginning all frames in the buffer are owned by the OS (SW); when my program finishes writing the first frame, I pass ownership to the HW (my DMA controller). When the DMA controller finishes reading from memory, ownership of the frame passes back to the OS. So I have one way to do this, with flags at the beginning of every frame, but is there a better way?
Thank you in advance on answers :)
What I did in the past was to pass the pointer to the DMA driver whenever it's done. The driver switches to the new pointer on the next clock cycle.
The DMA driver is tied to a display sync signal at 60Hz, while the application only updates the pointer at about 10Hz, but it doesn't hurt to display the old image while waiting for a new one.
I'm not sure if this fits your problem.
What I usually do is queue pointers or frame indices, rather than actual I/O data. An index into a frame array only needs to be one byte, and so is easily queued/manipulated. I put most of the indices onto a user-state pool queue at startup and the rest into an 'rxPool' queue for the rx driver/s to draw from for new rx data. Each frame has a 2-bit status field that indicates its current usage state (in user pool, in rxPool, holding tx data, holding rx data).
I queue the indices into the DMA/interrupt driver and fire it off (if not already running). When the DMA/whatever is done, I queue the tx/rx indices back to a 'scavenge' queue and signal a semaphore. The I/O driver thread (which waits on the semaphore) then signals the I/O originator thread (new rx frame), releases the index back to a 'pool' queue, ready for re-use (used tx frame), or tops up the rxPool queue if it is not full.
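Roughly, the per-frame status field could look like the sketch below. The names and sizes are made up for illustration, and the index queues themselves are left out; this only shows the 2-bit state and a checked hand-over, so that a frame cannot be claimed by two owners at once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FRAME_COUNT 4u    /* assumed; indices fit easily in one byte */
#define FRAME_BYTES 64u   /* assumed payload size */

/* The four usage states described above; they fit in two bits. */
typedef enum {
    FRAME_IN_USER_POOL = 0,  /* free, owned by software */
    FRAME_IN_RX_POOL   = 1,  /* free, reserved for the rx driver */
    FRAME_HOLDS_TX     = 2,  /* filled by software, owned by the DMA/driver */
    FRAME_HOLDS_RX     = 3   /* filled by the driver, owned by software */
} frame_state_t;

typedef struct {
    unsigned state : 2;             /* one of frame_state_t */
    uint8_t  payload[FRAME_BYTES];
} frame_t;

/* Static storage is zero-initialised, so every frame starts in the user pool. */
static frame_t frames[FRAME_COUNT];

/* Move a frame from state `from` to state `to`. Returns false, leaving the
 * frame untouched, if it is not currently in `from`. */
bool frame_transition(uint8_t index, frame_state_t from, frame_state_t to)
{
    assert(index < FRAME_COUNT);
    if (frames[index].state != (unsigned)from) {
        return false;
    }
    frames[index].state = (unsigned)to;
    return true;
}
```

In a real driver the state change and the matching queue operation would have to happen together (for example with interrupts masked), and the field would need to live in memory that both the CPU and the DMA completion handler see coherently.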

Overlapping communications with computations in MPI (mvapich2) for large messages

I have a very simple code, a data decomposition problem in which, at each cycle of a loop, each process sends two large messages to the ranks before and after itself. I run this code on a cluster of SMP nodes (AMD Magny-Cours, 32 cores per node, 8 cores per socket). I have been trying to optimize this code for a while. I have used pgprof and tau for profiling, and it looks to me that the bottleneck is the communication. I have tried to overlap the communication with the computations in my code, however it looks like the actual communication starts only when the computations finish :(
I use persistent communication in ready mode (MPI_Rsend_init), and the bulk of the computation is done between MPI_Startall and MPI_Waitall. The code looks like this:
int main(int argc, char *argv[])
{
    some definitions;
    some initializations;
    MPI_Init(&argc, &argv);
    MPI_Rsend_init( channel to the rank before );
    MPI_Rsend_init( channel to the rank after );
    MPI_Recv_init( channel to the rank before );
    MPI_Recv_init( channel to the rank after );
    for (timestep=0; timestep<Time; timestep++)
    {
        prepare data for send;
        MPI_Startall();
        do computations;
        MPI_Waitall();
        do work on the received data;
    }
    MPI_Finalize();
    return 0;
}
Unfortunately the actual data transfer does not start until the computations are done, and I don't understand why. The network uses a QDR InfiniBand interconnect and mvapich2. Each message is 23 MB (46 MB is sent in total). I tried to change the message passing to eager mode, since the memory in the system is large enough. I use the following flags in my job script:
MV2_SMP_EAGERSIZE=46M
MV2_CPU_BINDING_LEVEL=socket
MV2_CPU_BINDING_POLICY=bunch
Which gives me an improvement of about 8%, probably because of better placement of the ranks inside the SMP nodes however still the problem with communication remains. My question is why can't I effectively overlap the communications with the computations? Is there any flag that I should use and I'm missing it? I know something is wrong, but whatever I have done has not been enough.
By the order of ranks inside the SMP nodes the actual message sizes between the nodes is also 46MB (2x23MB) and the ranks are in a loop. Can you please help me? To see the flags that other users use I have checked /etc/mvapich2.conf however it is empty.
Is there any other method that I should use? Do you think one-sided communication would give better performance? I feel there is a flag or something that I'm not aware of.
Thanks a lot.
There is something called progression of operations in MPI. The standard allows for non-blocking operations to only be progressed to completion once the proper testing/waiting call was made:
A nonblocking send start call initiates the send operation, but does not complete it. The send start call can return before the message was copied out of the send buffer. A separate send complete call is needed to complete the communication, i.e., to verify that the data has been copied out of the send buffer. With suitable hardware, the transfer of data out of the sender memory may proceed concurrently with computations done at the sender after the send was initiated and before it completed. Similarly, a nonblocking receive start call initiates the receive operation, but does not complete it. The call can return before a message is stored into the receive buffer. A separate receive complete call is needed to complete the receive operation and verify that the data has been received into the receive buffer. With suitable hardware, the transfer of data into the receiver memory may proceed concurrently with computations done after the receive was initiated and before it completed.
(words in bold are also bolded in the standard text; emphasis added by me)
Although this text comes from the section about non-blocking communication (§3.7 of MPI-3.0; the text is exactly the same in MPI-2.2), it also applies to persistent communication requests.
I haven't used MVAPICH2, but I am able to speak about how things are implemented in Open MPI. Whenever a non-blocking operation is initiated or a persistent communication request is started, the operation is added to a queue of pending operations and is then progressed in one of the two possible ways:
if Open MPI was compiled without an asynchronous progression thread, outstanding operations are progressed on each call to a send/receive or to some of the wait/test operations;
if Open MPI was compiled with an asynchronous progression thread, operations are progressed in the background even if no further communication calls are made.
The default behaviour is not to enable the asynchronous progression thread, as doing so somewhat increases the latency of the operations.
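Without a progression thread, one common workaround that follows from the first case is to split the computation into chunks and make a test call between chunks so the library gets a chance to progress the transfers. Below is a library-independent sketch of that pattern; progress_fn, compute_with_progress and countdown_poll are hypothetical names. In an MPI program the hook would wrap MPI_Testall() on the started requests, and MPI_Waitall() would still be called after the loop. Whether this actually helps with MVAPICH2 for 23 MB messages is something I have not tested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical progress hook: returns true once all pending transfers have
 * completed. In an MPI code this would call MPI_Testall() on the requests. */
typedef bool (*progress_fn)(void *ctx);

/* Run `total` units of work in chunks of `chunk`, calling the hook after
 * each chunk until it reports completion. Returns how many times the hook
 * was called. */
size_t compute_with_progress(double *data, size_t total, size_t chunk,
                             progress_fn poll, void *ctx)
{
    assert(data != NULL && chunk > 0 && poll != NULL);
    size_t polls = 0;
    bool done = false;

    for (size_t start = 0; start < total; start += chunk) {
        size_t end = (start + chunk < total) ? start + chunk : total;
        for (size_t i = start; i < end; i++) {
            data[i] = data[i] * 2.0 + 1.0;   /* stand-in for the real computation */
        }
        if (!done) {                          /* stop polling once transfers finish */
            done = poll(ctx);
            polls++;
        }
    }
    return polls;
}

/* Demo hook for trying the function without MPI: pretends the transfer
 * completes after a fixed number of polls (counter passed via ctx). */
bool countdown_poll(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
    }
    return *remaining == 0;
}
```

The chunk size is a trade-off: smaller chunks give the library more opportunities to progress the transfer but add call overhead to the compute loop.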
The MVAPICH site is unreachable at the moment from here, but earlier I saw a mention of asynchronous progress in the features list. Probably that's where you should start from - search for ways to enable it.
Also note that MV2_SMP_EAGERSIZE controls the shared memory protocol eager message size and does not affect the InfiniBand protocol, i.e. it can only improve the communication between processes that reside on the same cluster node.
By the way, there is no guarantee that the receive operations would be started before the ready send operations in the neighbouring ranks, so they might not function as expected as the ordering in time is very important there.
For MPICH, you can set the MPICH_ASYNC_PROGRESS=1 environment variable when running mpiexec/mpirun. This will spawn a background thread which does the "asynchronous progress" work.
MPICH_ASYNC_PROGRESS - Initiates a spare thread to provide
asynchronous progress. This improves progress semantics for
all MPI operations including point-to-point, collective,
one-sided operations and I/O. Setting this variable would
increase the thread-safety level to
MPI_THREAD_MULTIPLE. While this improves the progress
semantics, it might cause a small amount of performance
overhead for regular MPI operations.
from MPICH Environment Variables
I have tested on my cluster with MPICH-3.1.4, it worked! I believe MVAPICH will also work.

Suggestion for callbacks oriented library in C

I'm making a small library for controlling various embedded devices using the C language. I'm using UDP sockets to communicate with each of the devices. The devices send me various interesting data, alarms and notifications, and at the same time they send some data that is used internally by the library but may not be interesting to users. So I've implemented a callback approach, where the user can register a callback function for events of interest on each of the devices. Right now, the overall design of this library is something like this:
I've two threads running.
In one of the threads, there is an infinite while event loop that uses select and non-blocking sockets to maintain the communication with each of the devices.
Basically, every time I receive a packet from any of the devices, I strip off the header, which is 20 bytes of some useless information, and add my own header containing DEVICE_ID, REQUEST_TIME (time the request was sent to retrieve that packet), RETRIEVAL_TIME (time when the packet actually arrived), REQUEST_ID and REQUEST_TYPE (alarm, data, notification etc.).
Now, this thread (the one with the infinite loop) puts the packet with the new header into a ring buffer and then notifies the other thread (thread #2) to parse this information.
In thread #2, when the notification is received, it locks the buffer, pops the packet and starts parsing it.
Every message contains some information that user may not be interested, so I'm providing user call back approach to act upon data which is useful to user.
Basically, I'm doing something like this in thread 2:-
THREAD #2
wait(data_put_in_buffer_cond)
lock(buffer_mutex)
packet_t* packet = pop_packet_from_buffer(buf);
unlock(buffer_mutex)
/* parsing the package... */
parsed_packet_t* parsed_packet = parse_and_change_endianess(packet->data);
/* header for put by thread #1 with host byte order only not parsing necessary */
header_t* header = get_header(packet);
/* thread 1 sets free callback for kind of packet it puts in buffer
* This not a critical section section of buffer, so fine without locks
*/
buffer.free_callback(packet);
foreach attribute in parsed_packet->attribute_list {
register_info_t* rinfo = USER_REGISTRED_EVENT_TABLE[header->device_id][attribute.attr_id];
/*user is register with this attribute ID on this device ID */
if(rinfo != NULL) {
rinfo->callback(packet);
}
// Do some other stuff with this attribute..
}
free(parsed_packet);
Now, my concern is: what will happen if the callback function that the user implements takes some time to complete, and meanwhile I drop some packets because the ring buffer is in overwriting mode? I've tested my API with 3 to 4 devices, and I don't see many drops even if the callback function waits a decent amount of time. I suspect that this approach may not be the best.
Would it be a better design, if I use some sort of thread-pool to run user callback functions? In that case I would need to make explicit copy of packet before I send it to user callback? Each packet is about 500 to 700 bytes, I get around 2 packets per second from each device. Any suggestions or comments on improving the current design or solving this issues would be appreciated.
Getting 500-700 bytes per device is not a problem at all, especially if you only have 3-4 devices. Even if you had, let's say, 100 devices, it should not be a problem. The copy overhead would most probably be negligible. So, my suggestion would be: do not try to optimize beforehand until you are certain that buffer copying is your bottleneck.
About losing packets, as you say in your question, you are already using a ring buffer (I assume that is something like a circular queue, right?). If the queue becomes full, then you just need to make thread #1 wait until there is some available space in the queue. Clearly, more events from different devices may arrive in the meantime, but that should not be a problem. Once you have space again, select will let you know that you have data available from different devices, so you will just need to process all that data. Of course, in order to have a balanced system, you can set the size of the queue to a value that reduces as much as possible the number of times that the queue is full and thread #1 therefore needs to wait.
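As a sketch of the "do not overwrite when full" idea: the queue below refuses a push instead of overwriting the oldest packet, so thread #1 can wait and retry while the unread datagrams stay in the kernel socket buffer. The type and function names, the slot count and the 700-byte limit are assumptions for illustration. The queue itself is not thread-safe; it is meant to be called with your existing buffer_mutex held, and the waiting could be done with a second condition variable (e.g. pthread_cond_wait on a "space available" condition signalled by thread #2 after each pop).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define QUEUE_SLOTS 4u     /* assumed; size it from packet rate and callback latency */
#define PACKET_MAX  700u   /* packets are about 500 to 700 bytes */

typedef struct {
    size_t        len;
    unsigned char data[PACKET_MAX];
} packet_slot_t;

typedef struct {
    packet_slot_t slot[QUEUE_SLOTS];
    size_t        head;    /* index of the oldest packet */
    size_t        count;   /* number of packets stored */
} packet_queue_t;

/* Copy a packet into the queue. Returns false (and stores nothing) if the
 * queue is full or the packet is too large; nothing is ever overwritten. */
bool packet_queue_try_push(packet_queue_t *q, const void *data, size_t len)
{
    assert(q != NULL && data != NULL);
    if (len > PACKET_MAX || q->count == QUEUE_SLOTS) {
        return false;
    }
    packet_slot_t *s = &q->slot[(q->head + q->count) % QUEUE_SLOTS];
    memcpy(s->data, data, len);
    s->len = len;
    q->count++;
    return true;
}

/* Copy the oldest packet out (`out` must hold PACKET_MAX bytes).
 * Returns false if the queue is empty. */
bool packet_queue_try_pop(packet_queue_t *q, void *out, size_t *len)
{
    assert(q != NULL && out != NULL && len != NULL);
    if (q->count == 0) {
        return false;
    }
    packet_slot_t *s = &q->slot[q->head];
    memcpy(out, s->data, s->len);
    *len = s->len;
    q->head = (q->head + 1) % QUEUE_SLOTS;
    q->count--;
    return true;
}
```

Since the packet is copied out on pop, thread #2 can release the mutex before running user callbacks, so a slow callback delays parsing but does not corrupt or overwrite queued packets.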
