Do I need buffering in this case, if so, how?

Do I need buffering in this case, if so, how? - c

I have an issue where the rate the data is sent to serial port (50Hz) is way faster than the serial port can receive (8Hz).
Basically, what I have is, data is sent from fpga vhdl block to NIos II system, Nios II system sends the data to serial port, Matlab access the serial port to real time plot the graph.
I am using Quartus 12.1 sp1, vhdl and Nios II programmed in C code for DE0-Nano. In Qsys, I am using a few cores such as timer, sdram, uart, pios etc.
I dont know if this is feasible. What I am tempted to do, since the rate I can plot a graph is slower than the rate the incoming signal is read.
If I know the beginning of every incoming signal let say this point is indicated by a count=0, everytime I detect count=0, I know it is the beginning of each incoming signal, I keep track of how many cycles, let say, I only want to plot the graph every 10 cycles of the incoming signals, I plot cycle no.1, skip 10 cycles, plot cycle no. 11, skip cycle no.12 to 20, next plot cycle no. 21 etc ...in other words, it means I will lose some cycles in between which I am okay with this.
Do you think this method make any sense?
Do I still need buffering in this case?
How do I know how many cycles I need to skip? I assume it would depend on the incoming signal rate (50 Hz) and plot/receive rate (say 8 Hz), But I am not entirely clear about how to calculate...
The incoming signal (50 Hz) is from fpga vhdl, NIos system can read at 50 Mhz, but serial port receive rate or Matlab plot rate is 8 Hz.
Appreciate any input...thank you

Related

lwip to send data bigger than 64kb

i'm trying to send over lwip a RT data (4 bytes) sampled at 100kHz for 10 channels.
I've understood that lwip has a timer which loops every 250ms and it cannot be changed.
In this case i'm saving the RT over RAM at 100kHz and every 250ms sending the sampled data over TCP.
The problem is that i cannot go over 65535 bytes every 250ms because i get the MEM_ERR.
i already increased the buffer up to 65535 but when i try to increase it more i get several error during compiling.
So my doubt is: can lwip manage buffer bigger than 16bits?
Thanks,
Marco

Better to focus on throughput.
You neglected to show any code, describe which Xilinx board/system you're using, or which OS you're using (e.g. FreeRTOS, linux, etc.).
Your RT data is: 4 bytes * 10 channels * 100kHz --> 400,000 bytes / second.
From your lwip description, you have 65536 byte packets * 4 packets/sec --> 256,000 bytes / second.
This is too slow. And, it much slower than what a typical TCP / Ethernet link can process, so I think your understanding of the maximum rate is off a bit.
You probably can't increase the packet size.
But, you probably can increase the number of packets sent.
You say that the 250ms interval for lwip can't be changed. I believe that it can.
From: https://www.xilinx.com/support/documentation/application_notes/xapp1026.pdf we have the section: Creating an lwIP Application Using the RAW API:
Set up a timer to interrupt at a constant interval. Usually, the interval is around 250 ms. In the timer interrupt, update necessary flags to invoke the lwIP TCP APIs tcp_fasttmr and tcp_slowtmr from the main application loop explained previously
The "usually" seems to me to imply that it's a default and not a maximum.
But, you may not need to increase the timer rate as I don't think it dictates the packet rate, just the servicing/completion rate [in software].
A few guesses ...
Once a packet is queued to the NIC, normally, other packets may be queued asynchronously. Modern NIC hardware often has a hardware queue. That is, the NIC H/W supports multiple pending requests. It can service those at line speed without CPU intervention.
The 250ms may just be a timer interrupt that retires packet descriptors of packets completed by the NIC hardware.
That is, more than one packet can be processed/completed per interrupt. If that were not the case, then only 4 packets / second could be sent and that would be ridiculously low.
Generating an interrupt from the NIC for each packet incurs an overhead. My guess is that interrupts from the NIC are disabled. And, the NIC is run in a "polled" mode. The polling occurs in the timer ISR.
The timer interrupt will occur 4 times per second. But, will process any packets it sees that are completed. So, the ISR overhead is only 4 interrupts / second.
This increases throughput because the ISR overhead is reduced.
UPDATE:
Thanks for the reply, indeed is 4 bytes * 10 channels * 100kHz --> 4,000,000 bytes / second but I agree that we are quite far from the 100Mbit/s.
Caveat: I don't know lwip, so most of what I'm saying is based on my experience with other network stacks (e.g. linux), but it appears that lwip should be similar.
Of course, lwip will provide a way to achieve full line speed.
Regarding the changing of the 250ms timer period, to achieve what i want it should be lowered more than 10times which seems it is too much and it can compromise the stability of the protocol.
When you say that, did you actually try that?
And, again, you didn't post your code or describe your target system, so it's difficult to provide help.
Issues could be because of the capabilities [or lack thereof] of your target system and its NIC.
Or, because of the way your code is structured, you're not making use of the features that can make it fast.
So basically your suggestion is to enable the interrupt on each message? In this case i can send the remaining data in the ACK callback if I understood correctly. – Marco Martines
No.
The mode for interrupt on every packet is useful for low data rates where the use of the link is sparse/sporadic.
If you have an interrupt for every packet, the overhead of entering/exiting the ISR (the ISR prolog/epilog) code will become significant [and possibly untenable] at higher data rates.
That's why the timer based callback is there. To accumulate finished request blocks and [quickly] loop on the queue/chain and complete/reclaim them periodically. If you wish to understand the concept, look at NAPI: https://en.wikipedia.org/wiki/New_API
Loosely, on most systems, when you do a send, a request block is created with all info related to the given buffer. This block is then queued to the transmit queue. If the transmitter is idle, it is started. For subsequent blocks, the block is appended to the queue. The driver [or NIC card] will, after completing a previous request, immediately start a new/fresh one from the front of the queue.
This allows you to queue multiple/many blocks quickly [and return immediately]. They will be sent in order at line speed.
What actually happens depends on system H/W, NIC controller, and OS and what lwip modes you use.

How to determine MCU Clock speed requirements

Overview:
I spent a while trying to think of how to formulate this question. To narrow the scope, I wanted to provide my initial HW requirements in the form of a ‘real life’ example application.
I understand that clock speed is probably relative, in the sense that it is a case by case basis. For example, your requirement for a certain speed may be impacted on by the on-chip peripherals offered by the MCU. As an example, you may spend (n) cycles servicing an ISR for an encoder, or, you could pick an MCU that has a QEI input to do it for you (to some degree), which in turn, may loosen your requirement?
I am not an expert, and am very much still learning, so please call me out if I use an incorrect term, or completely misinterpret something. I assure you; the feedback is welcome!
Example Application:
This application is relatively simple. It can be thought of as a non-blocking state machine, where each ‘iteration’ of the machine must complete within 20ms. A single iteration of this machine has 4 main tasks:
Decode a serial payload, consisting of 32 bytes. The length is fixed at 32 bytes, payload is dynamic, baud is 115200bps (See Task #2 below)
Read 4 incremental shaft encoder signals, which are coupled with 4 DC Motors, 1 encoder for each motor (See Task #1 Below)
Determine the position of 4 limit switches. ISR driven, trigger on rising edge for each switch.
Based on the 3 categories of inputs above, the MCU will output 4 separate PWM signals # 50Hz (20ms) to a motor controller for its next set of movements. (See Task #3 below)
From an IO perspective, I know that the MCU is on the hook for reading 8 digital signals (4 quadrature encoders, 4 limit switches), and decoding a serial frame of 32 bytes over UART.
Based on that data, the MCU will output 4 independent PWM signals, with a pulse width of [1000usec -3200usec], per motor, to the motor controller.
The Question:
After all is said and done, I am trying to think through how I can map my requirements into MCU selection, solely from a speed point of view.
It’s easy for me to look through the datasheet and say, this chip meets my requirements because it has (n) UARTS, (n) ISR input pins, (n) PWM outputs etc. But my projects are so small that I always assume the processor is ‘fast enough’. Aside from my immediate peripheral needs, I never really look into the actual MCU speed, which is an issue on my end.
To resolve that, I am trying to understand what goes into selecting a particular clock speed, based on the needs of a given application. Or, another way to say it, which is probably wrong, but how to you quantify the theoretical load on the processor for that specific application?
Additional Information
Task #1: Encoder:
Each of the 4 motors have different tasks within the system, but regardless, they are the same brand/model motor, and have a maximum RPM of 230. My assumption is, if at its worst case, one of the motors is spinning at 230 RPM, that would mean, at full quadrature resolution (count rising/falling for channel A/B) the 1000PPR encoder would generate 4K interrupts per revolution. As such, the MCU would have to service those interrupts, potentially creating a bottleneck for the system. For example, if (n) number of clock cycles are required to service the ISR, and for 1 revolution of 1 motor, we expect 4K interrupts, that would be … 230(RPM) * 4K (ISR per rev) == 920,000 interrupts per minute? Yikes! And then I guess you could just extrapolate and say, again, at it’s worst case, where each of the 4 motors are spinning at 230 RPM, there’s a potential that, if the encoders are full resolution, the system would have to endure 920K interrupts per minute for each encoder. So 920K * 4 motors == 3,680,000 interrupts per minute? I am 100% sure I am doing something wrong, so please, feel free to set me straight.
Task #2: Serial Decoding
The MCU will require a dedicated HW serial port to decode a packet of 32 bytes, which repeats, with different values, every 7ms. Baud rate will be set to 115200bps.
Task #3: PWM Output
Based on the information from tasks 1 and 2, the MCU will write to 4 separate PWM outputs. The pulse for each output will be between 1000-3200usec with a frequency of 50Hz.

You need to separate real-time critical parts from the rest of the application. For example, the actual reception of an UART frame is somewhat time-critical if you do so interrupt-based. But the protocol decoding is not critical at all unless you are expected to respond within a certain time.
Decode a serial payload, consisting of 32 bytes.
You can either do this the old school way with interrupts filling up a buffer, or you could look for a part with DMA, which is fairly common nowadays. DMA means that you won't have to consider some annoying, relatively low frequency UART interrupt disrupting other tasks.
Read 4 incremental shaft encoder signals
I haven't worked with such encoders so I can't tell how time-critical they are. If you have to catch every single interrupt and your calculations are correct, then 3,680,000 interrupts per minute is still not that bad. 60*60/3680000 = 978us. So roughly one interrupt every millisecond, that's not a "hard real-time" requirement. If that's the only time-critical thing you need to do, then any shabby 8-bitter running at 8MHz could keep up.
Determine the position of 4 limit switches
You don't mention timing here but I assume this is something that could be polled cyclically by a low priority cyclic timer.
the MCU will output 4 separate PWM signals
Not a problem, just pick one with a decent PWM hardware peripheral. You should just need to update some PWM duty cycle registers now and then.
Overall, this doesn't sound all that real-time critical. I've done much worse real-time projects with icky 8 and 16 bitters. However, each time I did, I always regret not picking a faster MCU, because you always come up with stuff to add as the project/product goes on.
It sounds like your average mainstream Cortex M0+ would be a good candidate for this project. Clock it at ~48MHz and you'll have plenty of CPU power. Cortex M4 or larger if you actually expect floating point math (I don't quite see why you'd need that though).
Given the current component crisis, be careful with which brand you pick though! In particular stay clear of STM32, since ST can't produce them right now and you might end up waiting over a year until you get parts.

The answer to the question is "experience". But intuitively your example is not particularly taxing - although there are plenty of ways you could mess it up. I once worked on a project that ran on a 200MHz C5502 DSP at near 100% CPU load. The application now runs on a 72MHz Cortex-M3 at only 60% with additional functionality and I/O not present in the original implementation..
Your application is I/O bound; depending on data rates (and critically interrupt rates), I/O seldom constitutes the highest CPU load, and DMA, hardware FIFOs, input capture timer/counters, and hardware PWM etc. can be used to minimise the I/O impact. I shan't go into it in detail; #Lundin has already done that.
Note also that raw processor speed is important for data or signal processing and number crunching - but what I/O generally requires is deterministic real-time response, and that is seldom simply a matter of MHz or MIPS - you will get more deterministic and possibly faster response from an 8bit AVR running at a few MHz than you can guarantee from a 500MHz application processor running Linux - and it won't take 30 seconds to boot!

Log data from MPU6050 through serial (UART) fails (data loss)

here is the problem I am facing. I have interfaced my ATmega328P with a 6-axis IMU (MPU6050 with the GY521 breakout board). I can read data through the TWI interface (Atmel's I2C) and send it to my PC (running Ubuntu) via the UART. I am using custom-built libraries for both these communication protocols, but they are pretty standard and seem to work just fine. The goal of the project is to compute orientation data from the IMU readings in real-time, say at 100 Hz.
The main problem is that I cannot log data from the device at 100 Hz (not even at 50 Hz). The orientation filter I am using (here) requires a quite high frequency and 100 Hz turned out to work fine (tested offline acquiring data from another device).
Right now, I am using the 16-bit timer of the ATmega328P to sample data at 100 Hz and this seem to work, as I have added to the ISR a line to toggle the built-in LED and it looks to me that it is blinking at 100 Hz (I can barely see it turning on and off). In the same ISR, I read the values from the inertial sensor and, just to log them, send these values through the serial port. Every 10 ms (maximum), I send 9 floats (36 bytes) with a baud rate of 115200. If I use the Arduino IDE's Serial Monitor to visualize this data stream, I notice something very weird, as in the following screenshot.
https://imgur.com/zTBdkhv
As you notice taking a look at the timestamps, there is a common 33 ms delay every 2 or 3 sets of samples received. Moreover, I get roughly the 60% of the data. For example, an acquisition of 10 seconds only gets me less than 600 samples (per each variable) instead of 1000. Moreover, I tested the same sending only one variable through the UART (i.e. only a single float, 4 bytes) and this results in the same behavior!
By the way, I am exploiting the following to send each byte (char) via the UART interface.
void writeCharUART(char c) {
loop_until_bit_is_set(UCSR0A, UDRE0);
UDR0 = c;
}
Even though my ISR runs at 100 Hz (LED blinking seem to confirm that), data loss may occur at the level of the TWI transmission. To prove that, I modified the code of the ISR to send just a normal char (T) instead of data from the MPU and I got a similar behavior. Something like this:
00:10:05.203 -> T
00:10:05.203 -> T
00:10:05.236 -> T
00:10:05.236 -> T
00:10:05.236 -> T
00:10:05.236 -> T
00:10:05.269 -> T
So, I guess there is something wrong with the UART library and I actually sample at 100 Hz, but the logging frequency is much lower (and not constant). How can I solve this issue and/or debug the UART library? Do you see other reasons to justify this issue?
EDIT 1
As pointed out in the comments, it seems to be a problem of the receiving software that limits the frequency to ~30 Hz by some sort of buffering. To confirm that, I programmed the ATmega328P with the following code (this time using the IDE).
void loop() {
Serial.println("T");
}
At first, I thought there was no delay this time, but I could find it after 208 samples. So, there are ~200 samples received at the same timestamp and another bunch of samples after 33 ms. This may be proof that the receiving software introduces this delay.
I also tested a simple serial monitor that I had developed in C and, even though there is no timestamp functionality, I am also loosing samples if I fix the duration of the acquisition sampling at 100 Hz. My serial monitor is based on the termios.h library, but I could not find any documentation about its way of buffering incoming data.

There are two issues here:
You are missing messages. You checked the sample rate just with your eyes and told us that you can still see a very fast blinking. Depending on the colour of your LED, the ambient light, your physical state, and your eyes this could mean anything from 30 Hz to 100 Hz.
I would not trust my eyes to estimate and rather use an oscilloscope or a frequency counter to measure.
You could reduce the frequency of the LED blinking to 1Hz or even lower by dividing in software. Such a low frequency can be measured by hand via a stop watch. For example count 30 blinks and check the time needed for this.
Add a counter to the message and increment it with each message. You will see it right away if you're losing data.
The timestamps seem to indicate that the messages are "clustered" at about 30 Hz.
I'm guessing that the source of the timestamp in running at 30 Hz. So it can not give you more accurate values.

I kind of solved my issues! First of all, thanks to the comments I have checked that my ISR was correctly running at 100 Hz. Doing so, I could be sure that the problem where somewhere else, namely in the UART communication.
I found this very helpful: Linux, serial port, non-buffering mode
Apparently, the Serial Monitor provided by the Arduino IDE uses exploits the termios.h library and uses its default settings. I checked also the user manual and switched to the polling-read mode. Quoting from the user manual
If data is available, read(2) returns immediately, with the lesser of the number of bytes available, or the number of bytes requested. If no data is available, read(2) returns 0.
Hence, I switched back to my serial monitor code and changed the initPort() function adding the following lines of code.
struct termios options;
(...)
options.c_cc[VTIME] = 0;
options.c_cc[VMIN] = 0;
I noticed right away a much higher data frequency in the terminal. I kept the 1 Hz LED blinking in the ISR and there is no period stretching. Moreover, an acquisition of 10 seconds this time gave me roughly 1000 samples per variable, consistent with a sampling rate of 100 Hz.
On the AVR side, I also changed the way I send data through the UART. Before, I was sending 9 floats like this:
sprintf(buffer, "%f, %f, %f", value1_x, value1_y, value1_z);
serial_print(buffer); // no "\n" sent here
sprintf(buffer, "%f, %f, %f", value2_x, value2_y, value2_z);
serial_print(buffer); // again, no "\n" sent
sprintf(buffer, "%f, %f, %f", roll, pitch, yaw);
serial_println(buffer); // "\n" is sent here once the last data byte is sent
Now, I replaced all this with a single call to the function serial_println() and I write only 6 floats to the buffer.

Increase Beaglebone Black ADC sampling rate?

I'm working on a project that requires the use of a microcontroller, and for this reason, I decided to use the Beaglebone Black. I'm still new to the Beaglebone world and I'm facing some problems that I hope you guys can help me with.
In my project I will have to continuously read from all the 7 analog read pins and do some processing accordingly. My question is, what will be the fastest programming language to do so (I must read as much samples as possible and in a very short time!) and how to increase the sampling rate from KHz to MHz?
I tried the following codes:
Javascript Code:
var b = require('bonescript');//this variable is to refer to my beaglebone
time = new Date();
b.analogRead("P9_39");
console.log(new Date() - time);
this code will simply perform one analog read and will print out the time needed to perform the read. Surprisingly, the result was 111ms!! which means that my sampling rate is 10 if I'm not wrong.
An alternative was to use pyhton:
import Adafruit_BBIO.ADC as ADC
import time
ADC.setup()
millis = int(round(time.time() * 1000))
ADC.read_raw("P9_39")
millis = millis = int(round(time.time() * 1000)) - millis
print millis
this code took less time (4ms) but still, if I wanted to read form the 7 analog input pins, I will only be able to read around 35 samples from each.
Using the terminal:
echo cape-bone-iio > /sys/devices/bone_capemgr.*/slots
time cat /sys/devices/ocp.3/helper.15/AIN0
############OR############
time cat /sys/devices/ocp.3/44e0d000.tscadc/tiadc/iio\:device0/in_voltage0_raw
and this took 50ms.
I want my sampling rate to be something in MHz. How can I do so? I know that the Beaglebone Black is capable of that but I could not find a clear way to do so. Any help is appreciated.
Thanks in advance.

Sampling rate of AM335x ADC is 200K (link). This means you won't get into MHz range with stock BeagleBone Black ADC.
To get something working with a latency of 5 µs in non-real-time OS like Linux is impossible. You will be at a mercy of OS to schedule your execution thread. Other kernel threads will take priority and will preempt your thread, even if you assign it the highest scheduling priority.
From my experience with digital IO on BeagleBone Black, I stated seeing missed frames starting around 1K samples per second. Now, it will depend on your level of tolerance to missing samples -- if you only need working semi-reliably you can probably squeeze out 10 K samples per second by switching to C/C++ and increasing priority of your process with nice --10 ... command. However if you cannot tolerate missed frames, you have to do one of these:
Bypass OS entirely and write C program for naked AM335x processor (no OS).
Use another hardware -- an ADC with a buffer to accumulate samples while your program is preempted.
Use PRUSS processors on BBB. They run at 200 MHz, so if you have a tight loop with e.g. 20 assembly instructions you will get reliable sampling rate of 10 MHz. That is if you had a faster ADC in the first place, and of course it would handle the stock 200 KHz ADC easily.
I personally went with option #3 and was happy to see my device perform sub-millisecond GPIO operations extremely reliably.

Use 127 beaglebone blacks plugged into 127 usb hub ports and breakout visual basic and write a usb program to automatically sequencially fire 127 beagle bones 1 after the other and read the data in a textbox...You will get around 16 mhz / msps consective adcs per fast cpu with say windows 10....lyj2021
You may have over lapping data...But you can track this with each fire of each beagle bone black...consecutively...

Scheduling routines in C and timing requirements

I'm working on a C program that transmits samples over USB3 for a set period of time (1-10 us), and then receives samples for 100-1000 us. I have a rudimentary pthread implementation where the TX and RX routines are each handled as a thread. The reason for this is that in order to test the actual TX routine, the RX needs to run and sample before the transmitter is activated.
Note that I have very little C experience outside of embedded applications and this is my first time dabbling with pthread.
My question is, since I know exactly how many samples I need to transmit and receive, how can I e.g. start the RX thread once the TX thread is done executing and vice versa? How can I ensure that the timing stays consistent? Sampling at 10 MHz causes some harsh timing requirements.
Thanks!
EDIT:
To provide a little more detail, my device is a bladeRF x40 SDR, and communication to the device is handled by a FX3 microcontroller, which occurs over a USB3 connection. I'm running Xubuntu 14.04. Processing, scheduling and configuration however is handled by a C program which runs on the PC.

You don't say anything about your platform, except that it supports pthreads.
So, assuming Linux, you're going to have to realize that in general Linux is not a real-time operating system, and what you're doing sure sounds as if has real-time timing requirements.
There are real-time variants of Linux, I'm not sure how they'd suit your needs. You might also be able to achieve better performance by doing the work in a kernel driver, but then you won't have access to pthreads so you're going to have to be a bit more low-level.

Thought I'd post my solution.
While the next build of the bladeRF firmware and FPGA image will include the option to add metadata (timestamps) to the synchronous interface, until then there's no real way in which I can know at which time instants certain events occurred.
What I do know is my sampling rate, and exactly how many samples I need to transmit and receive at which times relative to each other. Therefore, by using conditional variables (with pthread), I can signal my receiver to start receiving samples at the desired instant. Since TX and RX operations happen in a very specific sequence, I can calculate delays by counting the number of samples and multiplying by the sampling rate, which has proven to be within 95-98% accurate.
This obviously means that since my TX and RX threads are running simultaneously, there are chunks of data within the received set of samples that will be useless, and I have another routine in place to discard those samples.