How to handle asynchronous input and synchronous output? - c

I am currently working on a project where I have USART input and SAI(Serial Audio Interface, similar to SPI) output on an STM32 system.
I created a circular buffer which acts as a ping-pong (double) buffer. Input samples received from the USART are stored in this buffer at the head pointer. When the SAI peripheral requests new data, it is pulled from the buffer's tail pointer.
At the start of my code I wait until half the buffer is filled, then activate the SAI. The SAI outputs at a constant rate of 40 kHz. Input samples are received from the external device's USART at approximately the same rate, 40 kHz.
Ideally, I expect the difference between my head and tail pointers to be constant.
I also implemented a protection mechanism: when the two pointers point at the same location, the tail pointer waits, and the SAI repeats the last sample, until half of the buffer has filled again.
The code works at the start. The problem is that after some time, approximately 2 minutes, the head and tail pointers point at the same location, which creates a discontinuity in our samples. This means one pointer is slower or faster than expected. I am sure the SAI outputs at a constant 40 kHz (I checked it with a scope). However, I am not so sure about the accuracy of the USART's timing. I cannot modify the external USART device's code, and I cannot change the 40 kHz output rate; it must be this value.
Is there another way (maybe other than the ping-pong buffer method) to handle asynchronous input and synchronous output?

If what you are saying is that you are receiving continuous serial data from some external device and forwarding it out of some interface of your own at a rate based on your own clock, then even if the data is in the same format and the clocks are "the same", a buffer overflow is expected somewhere.
The same thing happens with Ethernet or any other source if 1) the stream is continuous at line rate, 2) the input and the output are driven by different clocks, and 3) the reference clocks differ, which they are guaranteed to do. If the input source clock is a little faster, then as long as the stream stays continuous and at line rate, you will overflow eventually.
The clocks change with temperature and voltage, so the delta can change.
It is even possible to output only a percentage of the input data and still overflow if the input is continuous. It depends on whether your output is also at line rate or whether you have margin, and whether that margin can overcome the difference in the clocks.
Also remember that UARTs hardly ever run at exactly the nominal rate; they use clock dividers and get close. You can have two computers using UARTs at the same nominal rate where the delta is relatively large and the overflow happens very soon. For a UART to work, the clock only has to be good enough to get through one character, so it can be several percent off, if not more, even if the oscillator is very good, there are no PLLs, and both sides use the same reference clock (but not the same UART, clocking system, etc.).
If you increase your output rate, or reduce the data being output so that it is not at line rate, then the problem may go away, or it may take hours or days before it happens...
If I have misunderstood the problem, forgive me; I will delete this answer.
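To put a rough number on the clock-mismatch argument above, here is a minimal sketch in C. The function name and all input values are hypothetical; in particular, the buffer margin and the mismatch in parts per million are assumptions, since neither is stated in the question.

```c
/* Seconds until the head/tail gap has drifted by margin_samples, given a
 * nominal sample rate and a relative clock mismatch in parts per million.
 * Returns -1.0 if there is no mismatch (the pointers never meet).
 * All inputs are illustrative assumptions, not measured values. */
double seconds_until_slip(double margin_samples, double sample_rate_hz,
                          double mismatch_ppm)
{
    if (mismatch_ppm < 0.0)
        mismatch_ppm = -mismatch_ppm;   /* direction does not matter here */

    /* Net samples gained or lost per second because the rates differ. */
    double drift_per_sec = sample_rate_hz * mismatch_ppm * 1e-6;
    if (drift_per_sec <= 0.0)
        return -1.0;

    return margin_samples / drift_per_sec;
}
```

With an assumed margin of 512 samples, 40 kHz and a 100 ppm mismatch, the drift is about 4 samples per second, so the margin is used up after roughly two minutes. That is the same order of magnitude as the failure time reported in the question, although it does not prove that this is the cause.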

Performance benefit when using DMA for PWM

I have a segment of code below as a FreeRTOS task running on an STM32F411RE microcontroller:
static void TaskADCPWM(void *argument)
{
    /* Variables used by FreeRTOS to set delays of 50ms periodically */
    const TickType_t DelayFrequency = pdMS_TO_TICKS(50);
    TickType_t LastActiveTime;
    /* Update the variable RawAdcValue through DMA */
    HAL_ADC_Start_DMA(&hadc1, (uint32_t*)&RawAdcValue, 1);
#if PWM_DMA_ON
    /* Initialize PWM CHANNEL2 with DMA, to automatically change TIMx->CCR by updating a variable */
    HAL_TIM_PWM_Start_DMA(&htim3, TIM_CHANNEL_2, (uint32_t*)&RawPWMThresh, 1);
#else
    /* If DMA is not used, user must update TIMx->CCRy manually to alter duty cycle */
    HAL_TIM_PWM_Start(&htim3, TIM_CHANNEL_2);
#endif
    while(1)
    {
        /* Record last wakeup time and use it to perform blocking delay the next 50ms */
        LastActiveTime = xTaskGetTickCount();
        vTaskDelayUntil(&LastActiveTime, DelayFrequency);
        /* Perform scaling conversion based on ADC input, and feed value into PWM CCR register */
#if PWM_DMA_ON
        RawPWMThresh = (uint16_t)((RawAdcValue * MAX_TIM3_PWM_VALUE)/MAX_ADC_12BIT_VALUE);
#else
        TIM3->CCR2 = (uint16_t)((RawAdcValue * MAX_TIM3_PWM_VALUE)/MAX_ADC_12BIT_VALUE);
#endif
    }
}
The task above uses the RawAdcValue value to update the TIM3->CCR2 register, either through DMA or manually. RawAdcValue gets updated periodically through DMA, and the value stored in this variable is 12 bits wide.
I understand how using DMA could benefit reading the ADC samples above as the CPU will not need to poll/wait for the ADC samples, or using the DMA to transfer long streams of data through I2C or SPI. But, is there a significant performance advantage to using DMA to update the TIM3->CCR2 register instead of manually modifying the TIM3->CCR2 register through:
TIM3->CCR2 &= ~0xFFFF;
TIM3->CCR2 |= SomeValue;
What would be the main differences between updating the CCR register through DMA or non-DMA?
Let's start by assuming you need to achieve "N samples per second". E.g. for audio this might be 44100 samples per second.
For PWM, you need to change the state of the output multiple times per sample. For example; for audio this might mean writing to the CCR around four times per sample, or "4*44100 = 176400" times per second.
Now look at what vTaskDelayUntil() does - most likely it sets up a timer and does a task switch, then (when the timer expires) you get an IRQ followed by a second task switch. It might add up to a total overhead of 500 CPU cycles each time you change the CCR. You can convert this into a percentage. E.g. (continuing the audio example), "176400 CCR updates per second * 500 cycles per update = about 88.2 million cycles per second of overhead", then, for 100 MHz CPU, you can do "88.2 million / 100 million = 88.2% of all CPU time wasted because you didn't use DMA".
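The overhead estimate above is simple enough to write down as a helper. This is only a sketch of the arithmetic: the function name is hypothetical, and the 500 cycles per update is the answer's own guess, not a measured figure.

```c
/* Fraction of CPU time spent on per-update overhead.
 * cycles_per_update is an assumed cost (context switches plus IRQ),
 * not something measured on real hardware. */
double cpu_overhead_fraction(double updates_per_sec,
                             double cycles_per_update,
                             double cpu_hz)
{
    return (updates_per_sec * cycles_per_update) / cpu_hz;
}
```

For the audio example, `cpu_overhead_fraction(176400.0, 500.0, 100e6)` comes to about 0.882. For comparison, a task that wakes once every 50 ms, as in the posted code, makes 20 updates per second, which under the same 500-cycle assumption is about 0.0001, or 0.01% of the CPU.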
The next step is to figure out where the CPU time comes from. There's 2 possibilities:
a) If your task is the highest priority task in the system (including being higher priority than all IRQs, etc); then every other task will become victims of your time consumption. In this case you've single-handedly ruined any point of bothering with a real time OS (probably better to just use a faster/more efficient non-real-time OS that optimizes "average case" instead of optimizing "worst case", and using DMA, and using a less powerful/cheaper CPU, to get a much better end result at a reduced "cost in $").
b) If your task isn't the highest priority task in the system, then the code shown above is broken. Specifically, an IRQ (and possibly a task switch/preemption) can occur immediately after the vTaskDelayUntil(&LastActiveTime, DelayFrequency); call, causing the TIM3->CCR2 = (uint16_t)((RawAdcValue * MAX_TIM3_PWM_VALUE)/MAX_ADC_12BIT_VALUE); assignment to occur at the wrong time (much later than intended). In pathological cases (e.g. where some other event like disk or network just happens to occur at a similar related frequency - e.g. at half your "CCR update frequency") this can easily become completely unusable (e.g. because turning the output on is often delayed more than intended and turning the output off is not).
However...
All of this depends on how many samples per second (or better, how many CCR updates per second) you actually need. For some purposes (e.g. controlling an electric motor's speed in a system that changes the angle of a solar panel to track the position of the sun throughout the day); maybe you only need 1 sample per minute and all the problems caused by using CPU disappear. For other purposes (e.g. AM radio transmissions) DMA probably won't be good enough either.
WARNING
Unfortunately, I can't/didn't find any documentation for HAL_ADC_Start_DMA(), HAL_TIM_PWM_Start() or HAL_TIM_PWM_Start_DMA() online, and don't know what the parameters are or how the DMA is actually being used. When I first wrote this answer I simply relied on a "likely assumption" that may have been a false assumption.
Typically, for DMA you have a block of many pieces of data (e.g. for audio, maybe you have a block of 176400 values - enough for a whole second of sound at "4 values per sample, 44100 samples per second"); and while that transfer is happening the CPU is free to do other work (and not wasted). For continuous operation, the CPU might prepare the next block of data while the DMA transfer is happening, and when the DMA transfer completes the hardware would generate an IRQ and the IRQ handler will start the next DMA transfer for the next block of values (alternatively, the DMA channel could be configured for "auto-repeat" and the block of data might be a circular buffer). In that way, the "88.2% of all CPU time wasted because you didn't use DMA" would be "almost zero CPU time used because DMA controller is doing almost everything"; and the whole thing would be immune to most timing problems (an IRQ or higher priority task preempting cannot influence the DMA controller's timing).
This is what I assumed the code is doing when it uses DMA. Specifically, I assumed that every "N nanoseconds" the DMA would take the next raw value from a large block of raw values and use that next raw value (representing the width of the pulse) to set a timer's threshold to a value from 0 to N nanoseconds.
In hindsight; it's possibly more likely that the code sets up the DMA transfer for "1 value per transfer, with continual auto-repeat". In that case the DMA controller would be continually pumping whatever value happens to be in RawPWMThresh to the timer at a (possibly high) frequency, and then the code in the while(1) loop would be changing the value in RawPWMThresh at a (possibly much lower) frequency. For example (continuing the audio example); it could be like doing "16 values per sample (via. the DMA controller), with 44100 samples per second (via. the while(1) loop)". In that case; if something (an unrelated IRQ, etc) causes an unexpected extra delay after the vTaskDelayUntil(); then it's not a huge catastrophe (the DMA controller simply repeats the existing value for a little longer).
If that is the case; then the real difference could be "X values per sample with 20 samples per second" (with DMA) vs. "1 value per sample with 20 samples per second" (without DMA); where the overhead is the same regardless, but the quality of the output is much better with DMA.
However; without knowing what the code actually does (e.g. without knowing the frequency of the DMA channel and how things like the timer's prescaler are configured) it's also technically possible that when using DMA the "X values per sample with 20 samples per second" is actually "1 value per sample with 20 samples per second" (with X == 1). In that case, using DMA would be almost pointless (none of the performance benefits I originally assumed; and almost none of the "output quality" benefits I'm tempted to assume in hindsight, except for the "repeat old value if there's unexpected extra delay after the vTaskDelayUntil()").
First, remember that premature optimization is the cause of uncountably many problems. The question you need to ask is "what ELSE does the processor need to do?". If the processor has nothing better to do, then just poll and save yourself some programming effort.
If the processor does have something better to do (or you are running from batteries and want to save power) then you need to time how long the processor spends waiting between each thing that it needs to do.
In your case, you are using an operating system context switch in place of "waiting". You can time the cost of the switch-write-to-pwm-switch-back cycle by measuring the performance of some other thread.
Set up a system with two threads. Perform some task that you know the performance of in one thread, eg, some fixed computation or processor benchmark. Now set up the other thread to do your timer business above. Measure the performance of the first thread.
Next, set up a similar system with only the first thread plus DMA doing the PWM. Measure the performance change, and you have your answer.
Obviously this all depends very much on your exact system. There is no general answer that can be given. The closer your test is to your real system the more accurate the answer you will get.
PS: Your PWM will glitch using the above code. Replace the two writes with a single one:
TIM3->CCR2 &= ~0xFFFF;
TIM3->CCR2 |= SomeValue;
should be:
TIM3->CCR2 = ((TIM3->CCR2 & ~0xFFFF) | SomeValue);

Log data from MPU6050 through serial (UART) fails (data loss)

Here is the problem I am facing. I have interfaced my ATmega328P with a 6-axis IMU (MPU6050 with the GY521 breakout board). I can read data through the TWI interface (Atmel's I2C) and send it to my PC (running Ubuntu) via the UART. I am using custom-built libraries for both these communication protocols, but they are pretty standard and seem to work just fine. The goal of the project is to compute orientation data from the IMU readings in real time, say at 100 Hz.
The main problem is that I cannot log data from the device at 100 Hz (not even at 50 Hz). The orientation filter I am using (here) requires quite a high frequency, and 100 Hz turned out to work fine (tested offline on data acquired from another device).
Right now, I am using the 16-bit timer of the ATmega328P to sample data at 100 Hz, and this seems to work: I added a line to the ISR to toggle the built-in LED, and it looks to me like it is blinking at 100 Hz (I can barely see it turning on and off). In the same ISR, I read the values from the inertial sensor and, just to log them, send these values through the serial port. Every 10 ms (maximum), I send 9 floats (36 bytes) at a baud rate of 115200. If I use the Arduino IDE's Serial Monitor to visualize this data stream, I notice something very weird, as in the following screenshot.
https://imgur.com/zTBdkhv
As you can see from the timestamps, there is a common 33 ms delay every 2 or 3 sets of samples received. Moreover, I get only roughly 60% of the data. For example, an acquisition of 10 seconds gets me fewer than 600 samples (per variable) instead of 1000. I also ran the same test sending only one variable through the UART (i.e. a single float, 4 bytes), and this results in the same behavior!
By the way, I am exploiting the following to send each byte (char) via the UART interface.
void writeCharUART(char c) {
loop_until_bit_is_set(UCSR0A, UDRE0);
UDR0 = c;
}
Even though my ISR runs at 100 Hz (the LED blinking seems to confirm that), data loss may occur at the level of the TWI transmission. To rule that out, I modified the ISR to send just a plain char (T) instead of data from the MPU, and I got similar behavior. Something like this:
00:10:05.203 -> T
00:10:05.203 -> T
00:10:05.236 -> T
00:10:05.236 -> T
00:10:05.236 -> T
00:10:05.236 -> T
00:10:05.269 -> T
So, I guess there is something wrong with the UART library and I actually sample at 100 Hz, but the logging frequency is much lower (and not constant). How can I solve this issue and/or debug the UART library? Do you see other reasons to justify this issue?
EDIT 1
As pointed out in the comments, it seems to be a problem of the receiving software that limits the frequency to ~30 Hz by some sort of buffering. To confirm that, I programmed the ATmega328P with the following code (this time using the IDE).
void loop() {
Serial.println("T");
}
At first, I thought there was no delay this time, but I could find it after 208 samples. So, there are ~200 samples received at the same timestamp and another bunch of samples after 33 ms. This may be proof that the receiving software introduces this delay.
I also tested a simple serial monitor that I had developed in C and, even though it has no timestamp functionality, I am also losing samples if I fix the duration of the acquisition while sampling at 100 Hz. My serial monitor is based on the termios.h library, but I could not find any documentation about the way it buffers incoming data.
There are two issues here:
1. You are missing messages. You checked the sample rate just with your eyes and told us that you can still see a very fast blinking. Depending on the colour of your LED, the ambient light, your physical state, and your eyes, this could mean anything from 30 Hz to 100 Hz.
I would not trust my eyes to estimate this, and would rather use an oscilloscope or a frequency counter to measure it.
You could reduce the frequency of the LED blinking to 1 Hz or even lower by dividing in software. Such a low frequency can be measured by hand with a stopwatch. For example, count 30 blinks and check the time needed for this.
Add a counter to the message and increment it with each message. You will see right away if you're losing data.
2. The timestamps seem to indicate that the messages are "clustered" at about 30 Hz.
I'm guessing that the source of the timestamp is running at 30 Hz, so it cannot give you more accurate values.
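The suggestion to add a counter to each message can be checked on the receiving side with a few lines of C. This is a minimal sketch assuming a one-byte counter that wraps from 255 to 0; the type and function names are hypothetical.

```c
#include <stdint.h>

/* Tracks an 8-bit message counter and accumulates how many messages
 * were skipped. Zero-initialise the struct before first use. */
typedef struct {
    uint8_t  expected;  /* counter value the next message should carry */
    int      started;   /* 0 until the first message has been seen */
    uint32_t lost;      /* total number of messages that never arrived */
} seq_tracker;

void seq_tracker_update(seq_tracker *t, uint8_t received)
{
    if (t->started) {
        /* Unsigned 8-bit subtraction handles the 255 -> 0 wrap. */
        t->lost += (uint8_t)(received - t->expected);
    }
    t->started  = 1;
    t->expected = (uint8_t)(received + 1);
}
```

A gap of 256 or more consecutive messages would go unnoticed with a one-byte counter, so a wider counter may be preferable if long dropouts are possible.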
I kind of solved my issues! First of all, thanks to the comments, I have checked that my ISR was correctly running at 100 Hz. Doing so, I could be sure that the problem was somewhere else, namely in the UART communication.
I found this very helpful: Linux, serial port, non-buffering mode
Apparently, the Serial Monitor provided by the Arduino IDE uses the termios.h library with its default settings. I also checked the user manual and switched to the polling-read mode. Quoting from the user manual:
If data is available, read(2) returns immediately, with the lesser of the number of bytes available, or the number of bytes requested. If no data is available, read(2) returns 0.
Hence, I switched back to my serial monitor code and changed the initPort() function adding the following lines of code.
struct termios options;
(...)
options.c_cc[VTIME] = 0;
options.c_cc[VMIN] = 0;
I noticed right away a much higher data frequency in the terminal. I kept the 1 Hz LED blinking in the ISR and there is no period stretching. Moreover, an acquisition of 10 seconds this time gave me roughly 1000 samples per variable, consistent with a sampling rate of 100 Hz.
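For reference, the polling-read settings can be wrapped in a small helper. This is a sketch, not the original initPort(): the elided lines of that function are not reproduced here, the function name is hypothetical, and clearing ICANON is included only because VMIN and VTIME take effect in non-canonical mode.

```c
#include <termios.h>

/* Adjust an already-populated termios struct so that read() returns
 * immediately with whatever bytes are available (possibly zero). */
void set_polling_read(struct termios *options)
{
    options->c_lflag &= ~(tcflag_t)ICANON;  /* non-canonical input mode */
    options->c_cc[VMIN]  = 0;               /* no minimum byte count */
    options->c_cc[VTIME] = 0;               /* no inter-byte timeout */
}
```

A typical use would be to call tcgetattr() on the open port, pass the struct through this helper, and apply it with tcsetattr(fd, TCSANOW, &options). Note that a loop around a polling read() spins the CPU when no data is available.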
On the AVR side, I also changed the way I send data through the UART. Before, I was sending 9 floats like this:
sprintf(buffer, "%f, %f, %f", value1_x, value1_y, value1_z);
serial_print(buffer); // no "\n" sent here
sprintf(buffer, "%f, %f, %f", value2_x, value2_y, value2_z);
serial_print(buffer); // again, no "\n" sent
sprintf(buffer, "%f, %f, %f", roll, pitch, yaw);
serial_println(buffer); // "\n" is sent here once the last data byte is sent
Now, I replaced all this with a single call to the function serial_println() and I write only 6 floats to the buffer.
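A rough load estimate helps to judge how much headroom the link has after such a change. The sketch below assumes 10 bits on the wire per byte (8 data bits, 1 start bit, 1 stop bit, which is an assumption about the configuration); the function name is hypothetical.

```c
/* Fraction of the UART's byte capacity used by a periodic message.
 * bits_per_frame is the number of bits on the wire per byte,
 * e.g. 10 for 8N1 (assumed here, not confirmed by the post). */
double uart_load(unsigned long baud, unsigned bits_per_frame,
                 unsigned bytes_per_msg, unsigned msgs_per_sec)
{
    double capacity_bytes_per_sec = (double)baud / bits_per_frame;
    return ((double)bytes_per_msg * msgs_per_sec) / capacity_bytes_per_sec;
}
```

Nine floats sent as 36 raw bytes at 100 Hz would use about 31% of a 115200 baud link. Formatted with "%f" as text, each value takes several characters plus separators, so the real figure is higher; with a hypothetical 100 characters per sample it would be close to 87%, which leaves little margin when the bytes are written from inside the ISR.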

Speed up / modify tcdrain() function

I'll skirt round the long and tedious story of how we got where we are, but the situation is this:
We are using half-duplex RS485 serial comms and (by necessity) driving the TX/RX flag "manually" via GPIO pin toggling. In order to make this work we're using tcdrain() to wait until the Tx buffer is empty before flipping back to Rx mode.
The problem is that tcdrain() seems to wait (block) for quite a while after the last character has been transmitted, which causes us a bit of a bottleneck.
I've seen suggestions that the default tcdrain() code just multiplies the baud rate by the (maximum) size of the serial buffer, sleep()s for that time period and then returns, and I could easily believe that.
So, can anyone suggest ways to either:
Speed up tcdrain() perhaps by shortening the serial buffer
Modify tcdrain() (or related code/parameters) to actually wait for the last character to be sent by the hardware, or wait for a period more closely related to the buffer contents
I've grepped our (embedded) kernel (2.6.x) code and can't see any references other than a single header file (termios.h).
Edit to add: As per this post, if for example we could reduce the serial Tx buffer to 1 byte using an IOCTL I assume the write() call would/could block while chars were written, then return, which would allow us to avoid relying on tcdrain() and just use a very short usleep() before toggling the Tx/Rx pin. I will experiment when I get a moment, in the meantime any suggestions/examples welcome.
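For the option of waiting "for a period more closely related to the buffer contents", the time the bytes spend on the wire can be computed directly. This is a sketch under assumptions: the function name is hypothetical, and it only accounts for wire time from the moment transmission starts, not for any latency in the kernel or in a hardware FIFO.

```c
#include <stdint.h>

/* Microseconds needed to shift n_bytes out of a UART, rounded up so a
 * caller never releases the RS485 driver-enable line too early.
 * bits_per_frame is e.g. 10 for 8N1. Returns 0 if baud is 0. */
uint64_t tx_time_us(uint32_t n_bytes, uint32_t baud, uint32_t bits_per_frame)
{
    if (baud == 0)
        return 0;

    uint64_t total_bits = (uint64_t)n_bytes * bits_per_frame;
    return (total_bits * 1000000ULL + baud - 1) / baud;
}
```

A caller could pass the result to usleep() after write() returns and before toggling the Tx/Rx pin. Whether that is accurate enough depends on how much the driver buffers, which would need to be verified on the target.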

Calculating CAN bus speed

I need to validate and characterize CAN bus traffic for our product (call it the Unit Under Test, UUT). I have a machine that sends a specified number of can frames to our product. Our product is running a Linux based custom kernel. The CAN frames are pre-built in software on the sender machine using a specific algorithm. The UUT uses the algorithm to verify the received frames.
Also, and here is where my questions lie, I am trying to calculate some timing data in the UUT software. So I basically do a read loop as fast as possible. I have a pre-allocated buffer to store the frames, so I just call read and increment the pointer to the buffer:
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, timespec_start_ptr);
while ((frames_left--) > 0)
    read(can_sock_fd, frame_mem_ptr++, sizeof(struct can_frame));
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, timespec_stop_ptr);
My question has to do with the times I get when I calculate the difference between these two timespecs (the calculation I use is correct; I have verified it, and it is GNU's algorithm).
Also, running the program under the time utility agrees with my times. For example, my program is called tcan, so I might run
[prompt]$ time ./tcan can1 -nf 10000
to run on can1 socket with 10000 frames. (This is FlexCAN, socket based interface, BTW)
Then, I use the time difference to calculate the data transfer speed I obtained. I received num_frames in the time span, so I calculate the frames/sec and the bits/sec
I am getting bus speeds that are 10 times the CAN bus speed of 250000 bits per sec. How can this be? I only get 2.5% CPU utilization according to both my program and the time program (and the top utility as well).
Are the values I am calculating meaningful? Is there something better I could do? I am assuming that since time reports real times that are much greater than user+sys, there must be some time-accounting lost somewhere. Another possibility is that maybe it's correct, I don't know, it's puzzling.
This is kind of a long shot, but what if read() is returning early because otherwise it would have to wait for incoming data? The fastest data to read is none at all :)
It would mess up the timings, but have you tried doing this loop whilst error checking? Or implement the loop via a recv() which should block unless you have asked it not to?
Hopefully this helps.
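The suggestion to add error checking could look like the sketch below, which counts only complete frames and stops on an error, end of file, or a short read. The function names are hypothetical, and the generic frame_size parameter stands in for sizeof(struct can_frame). Note also that CLOCK_PROCESS_CPUTIME_ID counts CPU time consumed by the process rather than elapsed time, so a loop that spends most of its time blocked in read() will report far less than the wall clock; using CLOCK_MONOTONIC for the two timestamps is an assumption about what was intended.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/* Read up to n fixed-size frames from fd into buf.
 * Returns the number of complete frames actually received. */
long read_frames_checked(int fd, void *buf, size_t frame_size, long n)
{
    char *p = buf;
    long got = 0;

    while (got < n) {
        ssize_t r = read(fd, p, frame_size);
        if (r < 0 && errno == EINTR)
            continue;               /* interrupted by a signal: retry */
        if (r != (ssize_t)frame_size)
            break;                  /* error, end of file, or short read */
        p += frame_size;
        got++;
    }
    return got;
}

/* Difference between two timespecs in seconds, e.g. two readings of
 * clock_gettime(CLOCK_MONOTONIC, ...) taken around the loop. */
double elapsed_seconds(const struct timespec *start,
                       const struct timespec *stop)
{
    return (double)(stop->tv_sec - start->tv_sec)
         + (double)(stop->tv_nsec - start->tv_nsec) / 1e9;
}
```

Dividing the returned frame count, rather than the requested count, by the elapsed wall-clock time avoids counting reads that returned early.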

Data structure for storing serial port data in firmware

I am sending data from a linux application through serial port to an embedded device.
In the current implementation, a byte circular buffer is used in the firmware (nothing but an array with a read and a write pointer).
As the bytes come in, they are written to the circular buffer.
Now the PC application appears to be sending the data too fast for the firmware to handle. Bytes are missed, resulting in the firmware returning WRONG_INPUT too many times.
I think baud rate (115200) is not the issue. A more efficient data structure at the firmware side might help. Any suggestions on choice of data structure?
A circular buffer is the best answer. It is the easiest way to model a hardware FIFO in pure software.
The real issue is likely to be either the way you are collecting bytes from the UART to put in the buffer, or overflow of that buffer.
At 115200 baud with the usual 1 start bit, 1 stop bit and 8 data bits, you can see as many as 11520 bytes per second arrive at that port. That gives you an average of just about 86.8 µs per byte to work with. In a PC, that will seem like a lot of time, but in a small microprocessor, it might not be all that many total instructions or in some cases very many I/O register accesses. If you overfill your buffer because bytes are arriving on average faster than you can consume them, then you will have errors.
Some general advice:
Don't do polled I/O.
Do use a Rx Ready interrupt.
Enable the receive FIFO, if available.
Empty the FIFO completely in the interrupt handler.
Make the ring buffer large enough.
Consider flow control.
Sizing your ring buffer large enough to hold a complete message is important. If your protocol has known limits on the message size, then you can use the higher levels of your protocol to do flow control and survive without the pains of getting XON/XOFF flow to work right in all of the edge cases, or RTS/CTS to work as expected in both ends of the wire which can be nearly as hairy.
If you can't make the ring buffer that large, then you will need some kind of flow control.
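The advice above can be sketched as a single-producer, single-consumer byte ring buffer, where the receive interrupt calls rb_put() and the main loop calls rb_get(). The names, the buffer size and the overrun counter are assumptions for illustration; the hardware-specific interrupt handler is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define RB_SIZE 256u  /* assumed size; must be a power of two */

typedef struct {
    volatile uint16_t head;      /* written only by the producer (ISR) */
    volatile uint16_t tail;      /* written only by the consumer (main loop) */
    volatile uint32_t overruns;  /* bytes dropped because the buffer was full */
    uint8_t data[RB_SIZE];
} ring_buffer;

/* Called from the receive interrupt. Returns false if the byte was dropped. */
bool rb_put(ring_buffer *rb, uint8_t byte)
{
    uint16_t next = (uint16_t)((rb->head + 1u) & (RB_SIZE - 1u));
    if (next == rb->tail) {      /* full: one slot is always left empty */
        rb->overruns++;
        return false;
    }
    rb->data[rb->head] = byte;
    rb->head = next;
    return true;
}

/* Called from the main loop. Returns false if the buffer is empty. */
bool rb_get(ring_buffer *rb, uint8_t *byte)
{
    if (rb->tail == rb->head)
        return false;
    *byte = rb->data[rb->tail];
    rb->tail = (uint16_t)((rb->tail + 1u) & (RB_SIZE - 1u));
    return true;
}
```

Watching the overruns counter during testing shows whether lost bytes come from a full buffer or from somewhere earlier in the receive path. On an 8-bit MCU, 16-bit index reads are not atomic, so the indices may need to be uint8_t or the accesses guarded.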
There is nothing better than a circular buffer.
You could use a slower baud rate or speed up the application in the firmware so that it can handle data coming at full speed.
If the output of the PC is in bursts it may help to make the buffer big enough to handle one burst.
The last option is to implement some form of flow control.
What do you mean by embedded device? I think most current DSPs and processors can easily handle this kind of load. The problem is not with the circular buffer, but with how you collect bytes from the serial port.
Does your UART have a hardware FIFO? If yes, then you should enable it. If you have an interrupt per byte, you can quickly get into trouble, especially if you are working with an OS or with virtual memory, where the IRQ cost can be quite high.
If your receiving firmware is very simple (no multitasking), and you don't have a hardware FIFO, polled mode can be a better solution than interrupt-driven mode, because then your processor is doing only UART data reception, and you have no interrupt overhead.
Another problem might be with the transfer protocol. For example, if you have a long packet of data that you have to checksum, and you do the whole checksum at the end of the packet, then all the processing time of the packet is at the end of it, and that is why you may miss the beginning of the next packet.
So the circular buffer is fine, and you have two ways to improve:
- The way you interact with the hardware
- The protocol (packet length, acknowledgment etc ...)
Before trying to solve the problem, first you need to establish what the problem really is. Otherwise you might waste time trying to fix something that isn't actually broken.
Without knowing more about your set-up it's hard to give more specific advice. But you should investigate further to establish what exactly the hardware and software is currently doing when the bytes come in, and then what is the weak point where they're going missing.
A circular buffer with Interrupt driven IO will work on the smallest and slowest of embedded targets.
First try it at the lowest baud rate and only then try at high speeds.
Using a circular buffer in conjunction with an IRQ is an excellent suggestion. If your processor generates an interrupt each time a byte is received, take that byte and store it in the buffer. How you decide to empty that buffer depends on whether you are processing a stream of data or data packets. If you are processing a stream, simply have your background process remove the bytes from the buffer and process them first-in-first-out. If you are processing packets, then just keep filling the buffer until you have a complete packet. I've used the packet method successfully many times in the past. I would implement some type of flow control as well, to signal to the PC if something went wrong, like a full buffer, or, if packet-processing time is long, to indicate to the PC when the device is ready for the next packet.
You could implement something like IP datagram which contains data length, id, and checksum.
Edit:
Then you could hard-code some fixed length for the packets, for example 1024 byte or whatever that makes sense for the device. PC side would then check if the queue is full at the device every time it writes in a packet. Firmware side would run checksum to see if all data is valid, and read up till the data length.
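A packet carrying an id, a data length and a checksum could be framed as in the sketch below. The layout (one id byte, one length byte, payload, one checksum byte), the 64-byte payload limit and the simple additive checksum are all assumptions chosen for illustration; a CRC would detect more errors.

```c
#include <stddef.h>
#include <stdint.h>

#define PKT_MAX_PAYLOAD 64u  /* hypothetical limit for this sketch */

/* Two's-complement additive checksum: the sum of all packet bytes,
 * including the checksum itself, is 0 modulo 256. */
static uint8_t pkt_checksum(const uint8_t *bytes, size_t n)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = (uint8_t)(sum + bytes[i]);
    return (uint8_t)(0u - sum);
}

/* Build [id][len][payload...][checksum] in out.
 * Returns the packet size, or 0 if it does not fit. */
size_t pkt_encode(uint8_t id, const uint8_t *payload, uint8_t len,
                  uint8_t *out, size_t out_size)
{
    size_t total = (size_t)len + 3u;
    if (len > PKT_MAX_PAYLOAD || out_size < total)
        return 0;

    out[0] = id;
    out[1] = len;
    for (size_t i = 0; i < len; i++)
        out[2u + i] = payload[i];
    out[2u + len] = pkt_checksum(out, 2u + (size_t)len);
    return total;
}

/* Returns 1 if the length field matches n and the checksum is correct. */
int pkt_is_valid(const uint8_t *pkt, size_t n)
{
    if (n < 3u)
        return 0;
    uint8_t len = pkt[1];
    if (len > PKT_MAX_PAYLOAD || n != (size_t)len + 3u)
        return 0;
    return pkt_checksum(pkt, n - 1u) == pkt[n - 1u];
}
```

The firmware would collect bytes from its receive buffer until the length field is satisfied, call pkt_is_valid(), and reply with an acknowledgment or a retry request so the PC knows when to send the next packet.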
