Cliff notes version
The TI F28377S has two CPUs, a main CPU and a secondary CPU (the CLA), which can only perform one task at a time and whose tasks cannot be interrupted; the two share message areas of RAM. When I quickly feed about 15 bytes into a queue (max queue length 32) that the CLA transmits, sometimes a few bytes are never sent. I think some issue with the CPU interrupts is causing single bytes to occasionally get "lost" while being handed over to the buffer.
Full version
(This is using the TI F28377S, which has a main CPU clocked at 200 MHz and a secondary independent CLA that runs at the same speed but can only execute one task at a time. They can share one-way-writeable variables.)
I'm a little stumped on how to do this more complex task, involving the CLA and a queue.
Some quick background: I have two main CLA tasks. The first (Task1) is triggered by the ADC end of conversion (which is itself triggered by Timer0 at 100 kHz), and the second (Task2) is triggered by Timer0 directly. (I arrived at this after much experimenting and tweaking; whenever Task2 ran more often than the ADC task, the ADC task would never start, so I set them both to the same interval, only staggered.) Task1 works perfectly, storing the ADC results in a simple ring buffer and performing a simple calculation in the Task1 after-completion ISR. Task2 mostly works.
Task2 is used to toggle some GPIO pins for communicating with an external device. Because the total length of the codes is on the order of hundreds of microseconds, instead of delaying I use a simple case structure on each trigger to determine whether it should: do nothing, turn on the code pins, turn on the strobe pin, turn off the strobe pin, or turn off the code pins. This way each time the task is called it completes nearly instantaneously, with the output codes being the proper length for the external device. The task works on one code at a time, and once it is done, it attempts to grab another from a queue. If there is none, it simply passes straight through.
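Roughly, the case structure looks something like this (a sketch only: the helper names, tick offsets, and queue accessor below are illustrative placeholders, not the actual code or registers):

/* Sketch only: helpers below are hypothetical placeholders. */
extern void SET_CODE_PINS(unsigned char b);
extern void CLEAR_CODE_PINS(void);
extern void STROBE_ON(void);
extern void STROBE_OFF(void);
extern int  DequeueNextByte(unsigned char *b);   /* returns 1 if a byte was available */

static unsigned int  tick = 0;
static int           haveByte = 0;
static unsigned char currentByte;

void Task2_Step(void)            /* runs on every Timer0 trigger (every 10 us at 100 kHz) */
{
    switch (tick)
    {
        case 0:  if (haveByte) SET_CODE_PINS(currentByte); break;
        case 1:  if (haveByte) STROBE_ON();                break;
        case 30: if (haveByte) STROBE_OFF();               break;  /* ~300 us "on" (placeholder offset) */
        case 32: if (haveByte) CLEAR_CODE_PINS();          break;
        default: break;          /* do nothing on the other ticks */
    }
    if (++tick >= 50)            /* 50 ticks x 10 us = one byte per 500 us */
    {
        tick = 0;
        haveByte = DequeueNextByte(&currentByte);
    }
}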
Now, the tricky part. I have two requirements: 1) that I can add bytes to the end of the queue faster than the task will consume them (pretty easy in theory and practice), and 2) that I can add a byte to the front of the queue (not replacing the currently transmitting byte, just the front of the queue). The first ability is for sending medium-short messages (2-20 characters). The second ability is necessary to send a single byte about any external interrupts that come in, as quickly as possible, even in the middle of transmitting a message. I've set it up so that the task sends exactly 1 byte per 500 microseconds (~300 us "on" and ~200 us "off"). This way, if an interrupt message comes in, it is guaranteed to be received less than 1 ms after occurring.
What currently works is this: a function on the CPU takes incoming bytes (one at a time), adds them to a CPU2CLA buffer, and increments a CPU2CLA length counter. Each time Task2 runs, it checks this queue, grabs one byte from the front into a CLA-only buffer, increments its own buffer length, and flags that a byte was consumed. When the Task2 after-task ISR runs, it checks whether a byte was consumed and, if so, removes the front-most byte from the CPU2CLA buffer. Currently this double-buffer system doesn't have a flag for adding to the front, so it doesn't handle the interrupt case.
What I tried previously was to have a Task3 that took one byte passed CPU2CLA and to run it from the CPU with a Task3andWait. Although this method should in theory take care of both requirements, about half of the time a byte or two of a message would never get transmitted (a single byte always got sent).
A CLA task can never be interrupted, but a CPU task can. This is why I tried to have all modifications of the queue occur only in the CLA, so that there was never an indeterminate state in which an interrupt could land in the middle of a queue modification.
It sounds like splitting the high-priority and normal-priority items into separate buffers would be a near-optimal solution here.
It would also ensure that if a high-priority item, a normal-priority item, and another high-priority item are produced before anything is consumed, the high-priority items will be consumed before the normal-priority items.
(Using a single buffer, that case leads to the normal-priority item being consumed before the second high-priority item. I suspect that is highly undesirable.)
If there is an item in the high-priority buffer, that will be consumed next. Otherwise, an item in the normal-priority buffer will be consumed.
Both buffers have a single producer and a single consumer (thus, SPSC type), and are handled in a simple first-in-first-out manner; therefore, a lockless circular buffer implementation (for each buffer) should work just fine here.
(If only 32 bytes are available for the two buffers, consider trying an 8:24 split first.)
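To make this concrete, here is a minimal sketch of one such lockless SPSC ring buffer plus the two-buffer consume policy. Names, sizes, and types are illustrative, not taken from your code. The key property is that the producer only ever writes head, the consumer only ever writes tail, and the data byte is stored before head is advanced:

#include <stdint.h>

typedef struct
{
    volatile uint16_t head;          /* written only by the producer (CPU) */
    volatile uint16_t tail;          /* written only by the consumer (CLA) */
    uint16_t          size;          /* capacity; one slot is kept empty   */
    volatile uint8_t *buf;
} Spsc;

/* Producer side: returns 1 on success, 0 if the buffer is full. */
static int spsc_put(Spsc *q, uint8_t b)
{
    uint16_t next = q->head + 1u;
    if (next == q->size) next = 0u;
    if (next == q->tail) return 0;           /* full */
    q->buf[q->head] = b;                     /* store the byte first...    */
    q->head = next;                          /* ...then publish it         */
    return 1;
}

/* Consumer side: returns 1 and a byte, or 0 if the buffer is empty. */
static int spsc_get(Spsc *q, uint8_t *b)
{
    uint16_t next;
    if (q->tail == q->head) return 0;        /* empty */
    *b = q->buf[q->tail];
    next = q->tail + 1u;
    if (next == q->size) next = 0u;
    q->tail = next;
    return 1;
}

/* Consumer policy: always drain the high-priority buffer first. */
static int next_byte(Spsc *hi, Spsc *lo, uint8_t *b)
{
    if (spsc_get(hi, b)) return 1;
    return spsc_get(lo, b);
}

One slot is deliberately left unused so that "full" (head + 1 == tail) and "empty" (head == tail) can be distinguished without a shared length counter that both sides would have to modify. On the F28377S, the head and data buffer would naturally live in the CPU-to-CLA message RAM and the tail in the CLA-to-CPU message RAM, so each side only ever writes memory it owns.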
Related
I have an iMX8 module running Linux on my PCB, and I would like some tips or pointers on how to modify the UART driver so that my user-space C application can detect the end of a frame very quickly (less than 2 ms). The UART frame does not have any specific ending character or frame length. The standard VTIME of 100 ms is much too long.
I am reading from a SIM card; I have no control over the data, and no control over the size or content of the data. I just need to detect the end of the frame very quickly. The frame could be 3 bytes or 500. The SIM card reacts to data that it receives: typically I send it a couple of bytes and then it responds a couple of ms later with an uninterrupted string of bytes of unknown length. I am using an iMX8MP.
I thought about using the IDLE interrupt to detect the frame end: turn it on when any byte is received and off once the idle interrupt fires. How can I propagate this signal back to user space? Or is there an existing method to do this?
Waiting for an "idle" is a poor way to do this.
Use termios to set raw mode with VTIME of 0 and VMIN of 1. This will allow the userspace app to get control as soon as a single byte arrives (a minimal configuration sketch follows the links). See:
How to read serial with interrupt serial?
How do I use termios.h to configure a serial port to pass raw bytes?
How to open a tty device in noncanonical mode on Linux using .NET Core
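A minimal configuration sketch (the device path and baud rate are placeholders; error handling kept short):

#include <fcntl.h>
#include <stdio.h>
#include <termios.h>
#include <unistd.h>

/* Open a UART in raw mode so read() returns as soon as one byte arrives. */
int open_uart_raw(const char *path)
{
    struct termios tio;

    int fd = open(path, O_RDWR | O_NOCTTY);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    if (tcgetattr(fd, &tio) < 0) {
        perror("tcgetattr");
        close(fd);
        return -1;
    }
    cfmakeraw(&tio);                 /* non-canonical, no echo, no CR/NL mapping */
    cfsetispeed(&tio, B115200);      /* assumed baud rate */
    cfsetospeed(&tio, B115200);
    tio.c_cc[VMIN]  = 1;             /* block until at least 1 byte is available */
    tio.c_cc[VTIME] = 0;             /* no inter-byte timer */
    if (tcsetattr(fd, TCSANOW, &tio) < 0) {
        perror("tcsetattr");
        close(fd);
        return -1;
    }
    return fd;
}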
But you need a "protocol" of sorts, so you can know how much to read to get a complete packet. You prefix all data with a struct that has (e.g.) a type and a payload length. Then you send "payload length" bytes. The receiver gets/reads that fixed-length struct and then reads the payload, which is "payload length" bytes long. This struct is always sent (in both directions).
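A sketch of that framing (this is not the xrecv() from the linked answer, just the same idea; the header layout here is invented for illustration):

#include <errno.h>
#include <stdint.h>
#include <unistd.h>

struct hdr {
    uint16_t type;                   /* message type */
    uint16_t length;                 /* number of payload bytes that follow */
};

/* Read exactly len bytes, looping over partial reads. */
static int read_full(int fd, void *buf, size_t len)
{
    uint8_t *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;               /* device closed / EOF */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

static int read_packet(int fd, struct hdr *h, uint8_t *payload, size_t max)
{
    if (read_full(fd, h, sizeof *h) < 0)
        return -1;
    if (h->length > max)
        return -1;                   /* oversized or malformed frame */
    return read_full(fd, payload, h->length);
}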
See my answer to "thread function doesn't terminate until Enter is pressed" for a working example.
What you have/need is similar to doing socket programming using a stream socket except that the lower level is the UART rather than an actual socket.
My example code uses sockets, but if you change the low level to open your uart in raw mode (as above), it will be very similar.
UPDATE:
How quickly after the frame finished would I have the data at the application level? When I try to read my random-length frames, currently reading in 512-byte chunks, it will sometimes read the whole frame in one go; other times it reads the frame broken up into chunks. – Engo
In my link, in the last code block, there is an xrecv function. It shows how to read partial data that comes in chunks.
That is what you'll need to do.
Things missing from your post:
You didn't post which imx8 board/configuration you have. And, which SIM card you have (the protocols are card specific).
And, you didn't post your other code [or any code] that drives the device and illustrates the problem.
How much time must pass without receiving a byte before the [uart] device is "idle"? That is, (e.g.) the device sends 100 bytes and is then finished. How many byte times does one wait before considering the device to be "idle"?
What speed is the UART running at?
A thorough description of the device, its capabilities, and how you intend to use it.
A UART device doesn't have an "idle" interrupt. From some iMX8 docs, the DMA device may have an "idle" interrupt, and the UART can be driven by the DMA controller.
But I looked at some of the Linux kernel iMX8 device drivers, and, AFAICT, the idle interrupt isn't supported.
I need to read everything in one go and get this data within a few hundred microseconds.
Based on the scheduling granularity, it may not be possible to guarantee that a process runs in a given amount of time.
It is possible to help this a bit. You can change the process to use the R/T scheduler (e.g. SCHED_FIFO). Also, you can use sched_setaffinity to lock the process to a given CPU core. There is a corresponding call to lock IRQ interrupts to a given CPU core.
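For example (a sketch; the core number and priority are arbitrary, and both calls need root or CAP_SYS_NICE):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to one core and switch it to SCHED_FIFO. */
static int go_realtime(int core, int prio)
{
    cpu_set_t set;
    struct sched_param sp = { .sched_priority = prio };

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0) {   /* pid 0 = this process */
        perror("sched_setaffinity");
        return -1;
    }
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}

IRQ affinity is set separately, e.g. by writing a CPU mask to /proc/irq/<irq>/smp_affinity.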
I assume that the SIM card acts like a [passive] device (like a disk). That is, you send it a command, and it sends back a response or does a transfer.
Based on what command you give it, you should know how many bytes it will send back. Or, it should tell you how many optional bytes it will send (similar to the struct in my link).
The method you've described (e.g.) wait for idle, then "race" to get/process the data [for which you don't know the length] is fraught with problems.
Even if you could get it to work, it will be unreliable. At some point, system activity will be just high enough to delay wakeup of your process and you'll miss the window.
If you're reading data, why must you process the data within a fixed period of time (e.g. 100 us)? What happens if you don't? Does the device catch fire?
Without more specific information, there are probably other ways to do this.
I've programmed such systems before that relied on data races. They were unreliable. Either missing data. Or, for some motor control applications, device lockup. The remedy was to redesign things so that there was some positive/definitive way to communicate that was tolerant of delays.
Otherwise, I think you've "fallen in love" with "idle interrupt" idea, making this an XY problem: https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
I am working on an STM32F401 MCU for audio acquisition and I am trying to send the audio data (384 bytes exactly) from an ISR to a task using queues. The frequency of the ISR is too high, and hence I believe some data is dropped due to the queue being full. The audio recorded from running the code is noisy. Is there any easier way to send large amounts of data from an ISR to a task?
The RTOS used is FreeRTOS and the ISR is the DMA callback from the I2S mic peripheral.
The general approach in these cases is:
Down-sample the raw data received in the ISR (e.g., save only 1 out of 4 samples)
Accumulate a certain number of samples before sending them in a message to the task
You can implement a "zero copy" queue by creating a queue of pointers to memory blocks rather than copying the memory itself. Have the audio data written directly to a block (by DMA for example), then when full, enqueue a pointer to the block and switch to the next available block in the pool. The receiving task can then operate directly on the memory block without needing to copy the data either into or out of the queue - the only thing copied is the pointer.
The receiving task, when done, returns the block to the pool. The pool should have the same number of blocks as the queue length.
To create a memory pool you would start with a static array:
tAudioSample block[QUEUE_LENGTH][BLOCK_SIZE] ;
Then fill a block_pool queue with pointers to each block element - pseudocode:
for( int i = 0; i < QUEUE_LENGTH; i++ )
{
queue_send( block_pool, block[i] ) ;
}
Then to get an "available" block, you simply take a pointer from the queue, fill the block it points to, and send that pointer to your audio stream queue; the receiver, when finished with the block, posts the pointer back to the block_pool.
Some RTOS provide a fixed block allocator that does exactly what I described above with the block_pool queue. If you are using the CMSIS RTOS API rather than native FreeRTOS API, that provides a memory pool API.
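For reference, here is a sketch of the block-pool idea using the native FreeRTOS API (the block size, queue length, and ISR hook are placeholders; your DMA/I2S setup will differ):

#include <stdint.h>
#include "FreeRTOS.h"
#include "queue.h"
#include "task.h"

#define QUEUE_LENGTH  4
#define BLOCK_SIZE    192                    /* e.g. 384 bytes of 16-bit samples */
typedef int16_t tAudioSample;

static tAudioSample  block[QUEUE_LENGTH][BLOCK_SIZE];
static QueueHandle_t block_pool;             /* free blocks   */
static QueueHandle_t audio_stream;           /* filled blocks */

void buffers_init(void)
{
    block_pool   = xQueueCreate(QUEUE_LENGTH, sizeof(tAudioSample *));
    audio_stream = xQueueCreate(QUEUE_LENGTH, sizeof(tAudioSample *));
    for (int i = 0; i < QUEUE_LENGTH; i++) {
        tAudioSample *p = block[i];
        xQueueSend(block_pool, &p, 0);       /* only the pointer is copied */
    }
}

/* ISR side: called when a block has been filled (e.g. by the I2S/DMA callback). */
void audio_block_ready_isr(tAudioSample *full)
{
    BaseType_t woken = pdFALSE;
    tAudioSample *next;

    if (xQueueReceiveFromISR(block_pool, &next, &woken) == pdTRUE) {
        xQueueSendFromISR(audio_stream, &full, &woken);  /* hand the filled block over */
    } else {
        next = full;                         /* pool empty: drop this block, refill it */
    }
    /* ...point the DMA at `next` for the following block here... */
    portYIELD_FROM_ISR(woken);
}

/* Task side: process a block, then return it to the pool. */
void audio_task(void *arg)
{
    tAudioSample *p;
    (void)arg;
    for (;;) {
        if (xQueueReceive(audio_stream, &p, portMAX_DELAY) == pdTRUE) {
            /* ...process BLOCK_SIZE samples at p... */
            xQueueSend(block_pool, &p, 0);
        }
    }
}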
However, it sounds like this may be an X-Y problem: you have presented your diagnosis, which may or may not be correct, and decided on a solution which you are then asking for help with. But what if it is the wrong or not the optimum solution? Better to include some code showing how the data is generated and consumed, and to provide concrete information such as where this data is coming from, how often the ISR fires, sample rates, the platform it is running on, the priority and scheduling of the receiving task, and what other tasks are running that might delay it.
On most platforms 384 bytes is not a large amount of data, and the interrupt rate would have to be extraordinarily high, or the receiving task excessively delayed (i.e. not real-time) or doing excessive or non-deterministic work, to cause this problem. It may not be the ISR frequency that is the problem, but rather the performance and schedulability of the receiving task.
It is not clear whether your 384 bytes result in a single interrupt, or 384 interrupts, or something else.
That is to say that it may be a more holistic design issue rather than simply how to pass data more efficiently - though that can't be a bad thing.
If the thread receiving the data is called at periodic intervals, the queue should be sized sufficiently to hold all data that may be received in that interval. It would probably be a good idea to make sure the queue is large enough to hold data for at least two intervals.
If the thread receiving the data is simply unable to keep up with the incoming data, then one could consider increasing its priority.
There is some overhead processing associated with each push to and pull from the queue, since FreeRTOS will check to determine whether a higher priority task should wake up in response to the action. When writing or reading multiple items to or from the queue at the same time, it may help to suspend the scheduler while the transfer is taking place.
Another solution would be to implement a circular buffer and place it into shared memory. This will basically perform the same function as a queue, but without the extra overhead. You may need to use a mutex to block simultaneous access to the buffer, depending on how the circular buffer is implemented.
I have a segment of code below as a FreeRTOS task running on an STM32F411RE microcontroller:
static void TaskADCPWM(void *argument)
{
    /* Variables used by FreeRTOS to set delays of 50ms periodically */
    const TickType_t DelayFrequency = pdMS_TO_TICKS(50);
    TickType_t LastActiveTime;

    /* Update the variable RawAdcValue through DMA */
    HAL_ADC_Start_DMA(&hadc1, (uint32_t*)&RawAdcValue, 1);

#if PWM_DMA_ON
    /* Initialize PWM CHANNEL2 with DMA, to automatically change TIMx->CCR by updating a variable */
    HAL_TIM_PWM_Start_DMA(&htim3, TIM_CHANNEL_2, (uint32_t*)&RawPWMThresh, 1);
#else
    /* If DMA is not used, user must update TIMx->CCRy manually to alter duty cycle */
    HAL_TIM_PWM_Start(&htim3, TIM_CHANNEL_2);
#endif

    while(1)
    {
        /* Record last wakeup time and use it to perform blocking delay the next 50ms */
        LastActiveTime = xTaskGetTickCount();
        vTaskDelayUntil(&LastActiveTime, DelayFrequency);

        /* Perform scaling conversion based on ADC input, and feed value into PWM CCR register */
#if PWM_DMA_ON
        RawPWMThresh = (uint16_t)((RawAdcValue * MAX_TIM3_PWM_VALUE)/MAX_ADC_12BIT_VALUE);
#else
        TIM3->CCR2 = (uint16_t)((RawAdcValue * MAX_TIM3_PWM_VALUE)/MAX_ADC_12BIT_VALUE);
#endif
    }
}
The task above uses the RawAdcValue value to update the TIM3->CCR2 register, either through DMA or manually. RawAdcValue gets updated periodically through DMA, and the value stored in this variable is 12 bits wide.
I understand how using DMA could benefit reading the ADC samples above as the CPU will not need to poll/wait for the ADC samples, or using the DMA to transfer long streams of data through I2C or SPI. But, is there a significant performance advantage to using DMA to update the TIM3->CCR2 register instead of manually modifying the TIM3->CCR2 register through:
TIM3->CCR2 &= ~0xFFFF;
TIM3->CCR2 |= SomeValue;
What would be the main differences between updating the CCR register through DMA or non-DMA?
Let's start by assuming you need to achieve "N samples per second". E.g. for audio this might be 44100 samples per second.
For PWM, you need to change the state of the output multiple times per sample. For example; for audio this might mean writing to the CCR around four times per sample, or "4*44100 = 176400" times per second.
Now look at what vTaskDelayUntil() does - most likely it sets up a timer and does a task switch, then (when the timer expires) you get an IRQ followed by a second task switch. It might add up to a total overhead of 500 CPU cycles each time you change the CCR. You can convert this into a percentage. E.g. (continuing the audio example), "176400 CCR updates per second * 500 cycles per update = about 88.2 million cycles per second of overhead", then, for 100 MHz CPU, you can do "88.2 million / 100 million = 88.2% of all CPU time wasted because you didn't use DMA".
The next step is to figure out where the CPU time comes from. There are 2 possibilities:
a) If your task is the highest priority task in the system (including being higher priority than all IRQs, etc); then every other task will become victims of your time consumption. In this case you've single-handedly ruined any point of bothering with a real time OS (probably better to just use a faster/more efficient non-real-time OS that optimizes "average case" instead of optimizing "worst case", and using DMA, and using a less powerful/cheaper CPU, to get a much better end result at a reduced "cost in $").
b) If your task isn't the highest priority task in the system, then the code shown above is broken. Specifically, an IRQ (and possibly a task switch/preemption) can occur immediately after the vTaskDelayUntil(&LastActiveTime, DelayFrequency);, causing the TIM3->CCR2 = (uint16_t)((RawAdcValue * MAX_TIM3_PWM_VALUE)/MAX_ADC_12BIT_VALUE); to occur at the wrong time (much later than intended). In pathological cases (e.g. where some other event like disk or network just happens to occur at a similar related frequency - e.g. at half your "CCR update frequency") this can easily become completely unusable (e.g. because turning the output on is often delayed more than intended and turning the output off is not).
However...
All of this depends on how many samples per second (or better, how many CCR updates per second) you actually need. For some purposes (e.g. controlling an electric motor's speed in a system that changes the angle of a solar panel to track the position of the sun throughout the day); maybe you only need 1 sample per minute and all the problems caused by using CPU disappear. For other purposes (e.g. AM radio transmissions) DMA probably won't be good enough either.
WARNING
Unfortunately, I can't/didn't find any documentation for HAL_ADC_Start_DMA(), HAL_TIM_PWM_Start() or HAL_TIM_PWM_Start_DMA() online, and don't know what the parameters are or how the DMA is actually being used. When I first wrote this answer I simply relied on a "likely assumption" that may have been a false assumption.
Typically, for DMA you have a block of many pieces of data (e.g. for audio, maybe you have a block of 176400 values - enough for a whole second of sound at "4 values per sample, 44100 samples per second"); and while that transfer is happening the CPU is free to do other work (and not wasted). For continuous operation, the CPU might prepare the next block of data while the DMA transfer is happening, and when the DMA transfer completes the hardware would generate an IRQ and the IRQ handler will start the next DMA transfer for the next block of values (alternatively, the DMA channel could be configured for "auto-repeat" and the block of data might be a circular buffer). In that way, the "88.2% of all CPU time wasted because you didn't use DMA" would be "almost zero CPU time used because DMA controller is doing almost everything"; and the whole thing would be immune to most timing problems (an IRQ or higher priority task preempting can not influence the DMA controller's timing).
This is what I assumed the code is doing when it uses DMA. Specifically, I assumed that every "N nanoseconds" the DMA would take the next raw value from a large block of raw values and use that next raw value (representing the width of the pulse) to set a timer's threshold to a value from 0 to N nanoseconds.
In hindsight; it's possibly more likely that the code sets up the DMA transfer for "1 value per transfer, with continual auto-repeat". In that case the DMA controller would be continually pumping whatever value happens to be in RawPWMThresh to the timer at a (possibly high) frequency, and then the code in the while(1) loop would be changing the value in RawPWMThresh at a (possibly much lower) frequency. For example (continuing the audio example); it could be like doing "16 values per sample (via. the DMA controller), with 44100 samples per second (via. the while(1) loop)". In that case; if something (an unrelated IRQ, etc) causes an unexpected extra delay after the vTaskDelayUntil(); then it's not a huge catastrophe (the DMA controller simply repeats the existing value for a little longer).
If that is the case; then the real difference could be "X values per sample with 20 samples per second" (with DMA) vs. "1 value per sample with 20 samples per second" (without DMA); where the overhead is the same regardless, but the quality of the output is much better with DMA.
However; without knowing what the code actually does (e.g. without knowing the frequency of the DMA channel and how things like the timer's prescaler are configured) it's also technically possible that when using DMA the "X values per sample with 20 samples per second" is actually "1 value per sample with 20 samples per second" (with X == 1). In that case, using DMA would be almost pointless (none of the performance benefits I originally assumed; and almost none of the "output quality" benefits I'm tempted to assume in hindsight, except for the "repeat old value if there's unexpected extra delay after the vTaskDelayUntil()").
First, remember that premature optimization is the cause of uncountably many problems. The question you need to ask is "what ELSE does the processor need to do?". If the processor has nothing better to do, then just poll and save yourself some programming effort.
If the processor does have something better to do (or you are running from batteries and want to save power) then you need to time how long the processor spends waiting between each thing that it needs to do.
In your case, you are using an operating system context switch in place of "waiting". You can time the cost of the switch-write-to-pwm-switch-back cycle by measuring the performance of some other thread.
Set up a system with two threads. Perform some task that you know the performance of in one thread, eg, some fixed computation or processor benchmark. Now set up the other thread to do your timer business above. Measure the performance of the first thread.
Next set up a similar system with only the first thread plus DMA doing the PWM. Measure the performance change, and you have your answer.
Obviously this all depends very much on your exact system. There is no general answer that can be given. The closer your test is to your real system the more accurate the answer you will get.
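A sketch of that measurement setup (task names, priorities, the unit of work, and the reporting mechanism are all arbitrary):

#include <stdint.h>
#include <stdio.h>
#include "FreeRTOS.h"
#include "task.h"

static volatile uint32_t work_done;

/* Low-priority task: performs a fixed unit of work over and over. */
static void BenchmarkTask(void *arg)
{
    (void)arg;
    for (;;) {
        volatile uint32_t x = 0;
        for (int i = 0; i < 1000; i++)
            x += i;                  /* arbitrary fixed workload */
        work_done++;
    }
}

/* Higher-priority task: reports how much work got done each second. */
static void ReportTask(void *arg)
{
    (void)arg;
    for (;;) {
        uint32_t before = work_done;
        vTaskDelay(pdMS_TO_TICKS(1000));
        printf("work units/s: %lu\n", (unsigned long)(work_done - before));
    }
}

void start_benchmark(void)
{
    xTaskCreate(BenchmarkTask, "bench",  256, NULL, tskIDLE_PRIORITY + 1, NULL);
    xTaskCreate(ReportTask,    "report", 256, NULL, tskIDLE_PRIORITY + 3, NULL);
    /* ...plus the ADC/PWM task from the question (or its DMA variant)... */
}

Run it once with the vTaskDelayUntil()-driven PWM task and once with the DMA variant; the difference in reported work units per second is the CPU cost of not using DMA.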
PS: Your PWM will glitch using the above code. Replace the two writes with a single one:
TIM3->CCR2 &= ~0xFFFF;
TIM3->CCR2 |= SomeValue;
should be:
TIM3->CCR2 = ((TIM3->CCR2 & ~0xFFFF) | SomeValue);
I am currently working on a project where I have USART input and SAI(Serial Audio Interface, similar to SPI) output on an STM32 system.
I created a circular buffer which acts as a ping-pong (double) buffer structure. The input samples received from USART are stored in this buffer at the head pointer. When the SAI peripheral requests new data, data is pulled from the buffer's tail pointer.
At the start of my code I wait until half the buffer is filled, then activate SAI. SAI outputs at a constant rate of 40 kHz. Input samples are received from the external device's USART at approximately the same rate, 40 kHz.
Ideally, I expect the difference between my head and tail pointer to be constant.
I also implemented a protection mechanism: when the two pointers end up pointing at the same location, the tail pointer waits (repeating the last sample on SAI) until half of the buffer has filled again.
The code works at the start. The problem is that after some time passes, approximately 2 minutes, we see the head and tail pointers pointing at the same location, which creates a discontinuity in our samples. That means one pointer is slower or faster than expected. I am sure the SAI output runs at a constant 40 kHz (I checked it with a scope). However, I am not so sure about the accuracy of the USART's timing. I cannot modify the external USART device's code, and I cannot change the output rate; it must be 40 kHz.
Is there another way (maybe other than the ping-pong buffer method) to handle asynchronous input and synchronous output?
If what you are saying is that you are receiving (continuous) serial data from some external device and then forwarding it out some interface of your own at some rate based on your own clock: even if the data is the same format and the clocks are "the same", a buffer overflow is expected somewhere.
The same thing happens with Ethernet or any other source if 1) the stream is continuous at line rate, 2) the input and the output are driven by different clocks, and 3) there are (as there are guaranteed to be) differences between the reference clocks. If the input source clock is a little faster, then as long as the stream stays continuous and at line rate, you will overflow eventually. For example, a 100 ppm mismatch at 40 kHz amounts to about 4 samples per second of drift, so a margin of a few hundred samples is used up in a minute or two.
The clocks change with temperature and voltage so the delta can change.
It is even possible to reduce the percentage of the input data that you output and still overflow if the input is continuous. It depends on whether your output is also at line rate, or whether you have margin and the margin can overcome the difference in the clocks.
Also remember that UARTs hardly ever run at the exact nominal rate; they use clock dividers and only get close. You can have two computers using UARTs at the same nominal rate and the delta can be relatively large, so the overflow can happen very soon. For a UART to work, the clock only has to be good enough to get through one character, so it can be several percent off (if not more), even if the oscillator is very good, no PLLs are involved, and both sides use the same reference clock (but not the same UART, clocking system, etc.).
If you increase your output rate, or reduce the data being output so that it is not at line rate, then the problem may go away, or it may take hours or days before it happens...
If I have misunderstood the problem, forgive me; I will delete this answer.
I'm working on an interrupt handler with a hardware design group and we're trying to figure out where a bug is. I'm reading a chip over the SPI bus at 5 kHz. The chip loads 4 bytes and triggers a data-ready pin.
My interrupt handler wakes up, reads 4 bytes off the SPI bus, and stores the data in a buffer. Strangely enough, though, every 17th read gives 4 bytes of all 0's, which is not right. One of the options we're exploring is that the chip isn't always actually ready when it asserts the data-ready signal.
So, I know I can't sleep in an interrupt handler, but I'd like to try to introduce a delay of 10 or 20 microseconds. Right now I have a for loop which counts to 100,000 and then processes the interrupt. I haven't seen any changes, so I thought I might see if someone has a better technique for busy waiting. Or at least a better way of figuring out how many loop iterations I should go through, as I'm not sure how long this takes, or whether the compiler is simply optimizing out the whole thing.
I don't know if you have access to any pseudorandom number generation libraries on your embedded device, but doing large-number multiplication followed by mod will definitely take some cycles. Instead of simply adding 1 (which is very fast at the hardware level, and the compiler can optimize it to shifting since you're doing it a static number of times), use a random number seed (does the system have access to a time clock?) if available, and do large-number multiplication, modulus, or factorial operations; negative-number division also takes forever. Remember, division takes the longest at the hardware level. Use that to your advantage.
I assume your compiler will strip out a simple loop.
You should use volatile.
volatile unsigned long i;

for (i = 0; i < 1000000; i++)
    continue;
I assume also that this will not remove the problem or help you.
I can't believe that an SPI peripheral has such a bug.
But it's possible that you are reading the data from the SPI FIFO too slowly,
so some of the received data gets dropped.
You should check the error flags of the SPI module, and check its RX-empty and RX-full flags.