I'm working on a C program that transmits samples over USB3 for a set period of time (1-10 us), and then receives samples for 100-1000 us. I have a rudimentary pthread implementation where the TX and RX routines are each handled as a thread. The reason for this is that in order to test the actual TX routine, the RX needs to run and sample before the transmitter is activated.
Note that I have very little C experience outside of embedded applications and this is my first time dabbling with pthread.
My question is, since I know exactly how many samples I need to transmit and receive, how can I e.g. start the RX thread once the TX thread is done executing and vice versa? How can I ensure that the timing stays consistent? Sampling at 10 MHz causes some harsh timing requirements.
Thanks!
EDIT:
To provide a little more detail, my device is a bladeRF x40 SDR, and communication to the device is handled by a FX3 microcontroller, which occurs over a USB3 connection. I'm running Xubuntu 14.04. Processing, scheduling and configuration however is handled by a C program which runs on the PC.
You don't say anything about your platform, except that it supports pthreads.
So, assuming Linux, you're going to have to realize that in general Linux is not a real-time operating system, and what you're doing sure sounds as if has real-time timing requirements.
There are real-time variants of Linux, I'm not sure how they'd suit your needs. You might also be able to achieve better performance by doing the work in a kernel driver, but then you won't have access to pthreads so you're going to have to be a bit more low-level.
Thought I'd post my solution.
While the next build of the bladeRF firmware and FPGA image will include the option to add metadata (timestamps) to the synchronous interface, until then there's no real way in which I can know at which time instants certain events occurred.
What I do know is my sampling rate, and exactly how many samples I need to transmit and receive at which times relative to each other. Therefore, by using conditional variables (with pthread), I can signal my receiver to start receiving samples at the desired instant. Since TX and RX operations happen in a very specific sequence, I can calculate delays by counting the number of samples and multiplying by the sampling rate, which has proven to be within 95-98% accurate.
This obviously means that since my TX and RX threads are running simultaneously, there are chunks of data within the received set of samples that will be useless, and I have another routine in place to discard those samples.
Related
Overview:
I spent a while trying to think of how to formulate this question. To narrow the scope, I wanted to provide my initial HW requirements in the form of a ‘real life’ example application.
I understand that clock speed is probably relative, in the sense that it is a case by case basis. For example, your requirement for a certain speed may be impacted on by the on-chip peripherals offered by the MCU. As an example, you may spend (n) cycles servicing an ISR for an encoder, or, you could pick an MCU that has a QEI input to do it for you (to some degree), which in turn, may loosen your requirement?
I am not an expert, and am very much still learning, so please call me out if I use an incorrect term, or completely misinterpret something. I assure you; the feedback is welcome!
Example Application:
This application is relatively simple. It can be thought of as a non-blocking state machine, where each ‘iteration’ of the machine must complete within 20ms. A single iteration of this machine has 4 main tasks:
Decode a serial payload, consisting of 32 bytes. The length is fixed at 32 bytes, payload is dynamic, baud is 115200bps (See Task #2 below)
Read 4 incremental shaft encoder signals, which are coupled with 4 DC Motors, 1 encoder for each motor (See Task #1 Below)
Determine the position of 4 limit switches. ISR driven, trigger on rising edge for each switch.
Based on the 3 categories of inputs above, the MCU will output 4 separate PWM signals # 50Hz (20ms) to a motor controller for its next set of movements. (See Task #3 below)
From an IO perspective, I know that the MCU is on the hook for reading 8 digital signals (4 quadrature encoders, 4 limit switches), and decoding a serial frame of 32 bytes over UART.
Based on that data, the MCU will output 4 independent PWM signals, with a pulse width of [1000usec -3200usec], per motor, to the motor controller.
The Question:
After all is said and done, I am trying to think through how I can map my requirements into MCU selection, solely from a speed point of view.
It’s easy for me to look through the datasheet and say, this chip meets my requirements because it has (n) UARTS, (n) ISR input pins, (n) PWM outputs etc. But my projects are so small that I always assume the processor is ‘fast enough’. Aside from my immediate peripheral needs, I never really look into the actual MCU speed, which is an issue on my end.
To resolve that, I am trying to understand what goes into selecting a particular clock speed, based on the needs of a given application. Or, another way to say it, which is probably wrong, but how to you quantify the theoretical load on the processor for that specific application?
Additional Information
Task #1: Encoder:
Each of the 4 motors have different tasks within the system, but regardless, they are the same brand/model motor, and have a maximum RPM of 230. My assumption is, if at its worst case, one of the motors is spinning at 230 RPM, that would mean, at full quadrature resolution (count rising/falling for channel A/B) the 1000PPR encoder would generate 4K interrupts per revolution. As such, the MCU would have to service those interrupts, potentially creating a bottleneck for the system. For example, if (n) number of clock cycles are required to service the ISR, and for 1 revolution of 1 motor, we expect 4K interrupts, that would be … 230(RPM) * 4K (ISR per rev) == 920,000 interrupts per minute? Yikes! And then I guess you could just extrapolate and say, again, at it’s worst case, where each of the 4 motors are spinning at 230 RPM, there’s a potential that, if the encoders are full resolution, the system would have to endure 920K interrupts per minute for each encoder. So 920K * 4 motors == 3,680,000 interrupts per minute? I am 100% sure I am doing something wrong, so please, feel free to set me straight.
Task #2: Serial Decoding
The MCU will require a dedicated HW serial port to decode a packet of 32 bytes, which repeats, with different values, every 7ms. Baud rate will be set to 115200bps.
Task #3: PWM Output
Based on the information from tasks 1 and 2, the MCU will write to 4 separate PWM outputs. The pulse for each output will be between 1000-3200usec with a frequency of 50Hz.
You need to separate real-time critical parts from the rest of the application. For example, the actual reception of an UART frame is somewhat time-critical if you do so interrupt-based. But the protocol decoding is not critical at all unless you are expected to respond within a certain time.
Decode a serial payload, consisting of 32 bytes.
You can either do this the old school way with interrupts filling up a buffer, or you could look for a part with DMA, which is fairly common nowadays. DMA means that you won't have to consider some annoying, relatively low frequency UART interrupt disrupting other tasks.
Read 4 incremental shaft encoder signals
I haven't worked with such encoders so I can't tell how time-critical they are. If you have to catch every single interrupt and your calculations are correct, then 3,680,000 interrupts per minute is still not that bad. 60*60/3680000 = 978us. So roughly one interrupt every millisecond, that's not a "hard real-time" requirement. If that's the only time-critical thing you need to do, then any shabby 8-bitter running at 8MHz could keep up.
Determine the position of 4 limit switches
You don't mention timing here but I assume this is something that could be polled cyclically by a low priority cyclic timer.
the MCU will output 4 separate PWM signals
Not a problem, just pick one with a decent PWM hardware peripheral. You should just need to update some PWM duty cycle registers now and then.
Overall, this doesn't sound all that real-time critical. I've done much worse real-time projects with icky 8 and 16 bitters. However, each time I did, I always regret not picking a faster MCU, because you always come up with stuff to add as the project/product goes on.
It sounds like your average mainstream Cortex M0+ would be a good candidate for this project. Clock it at ~48MHz and you'll have plenty of CPU power. Cortex M4 or larger if you actually expect floating point math (I don't quite see why you'd need that though).
Given the current component crisis, be careful with which brand you pick though! In particular stay clear of STM32, since ST can't produce them right now and you might end up waiting over a year until you get parts.
The answer to the question is "experience". But intuitively your example is not particularly taxing - although there are plenty of ways you could mess it up. I once worked on a project that ran on a 200MHz C5502 DSP at near 100% CPU load. The application now runs on a 72MHz Cortex-M3 at only 60% with additional functionality and I/O not present in the original implementation..
Your application is I/O bound; depending on data rates (and critically interrupt rates), I/O seldom constitutes the highest CPU load, and DMA, hardware FIFOs, input capture timer/counters, and hardware PWM etc. can be used to minimise the I/O impact. I shan't go into it in detail; #Lundin has already done that.
Note also that raw processor speed is important for data or signal processing and number crunching - but what I/O generally requires is deterministic real-time response, and that is seldom simply a matter of MHz or MIPS - you will get more deterministic and possibly faster response from an 8bit AVR running at a few MHz than you can guarantee from a 500MHz application processor running Linux - and it won't take 30 seconds to boot!
I am designing the controller and data acquisition unit for a rocket engine test stand. This system needs to control a number of actuators on the test stand and also be able to transmit collected data back to the host computer where the team will be watching live data/camera feeds from safety.
The overall design requirements are as follows:
Acquire data from ~15 analog sensors at 1KHz
Control the actuators on the test stand including valves and ignition switches
Transmit data back to the host computer in our shelter in real time
Accept control from the host computer for things like manual valve actuation, test sequence modification, sequence abortion, etc.
I am not exactly sure where to begin when laying out the software for this system. I am considering using an STM32 ARM Cortex-M4 processor running at 180 MHz. I am having trouble figuring how I should approach the problem. I have considered using an RTOS system but based on what I have seen those generate large overheads as you run them faster as the scheduler has to run each tick. The other idea I'm bouncing around is a state machine combined with some timer-based interrupts for reading and then sending data back out to the PC. Any advice as to how to approach this problem to minimize code complexity would be greatly appreciated. Thanks.
EDIT:
I have been told to clarify a number of things concerning the technical specs of the system.
My actuators consist of:
6 solenoids (controlled digitally through relays/MOSFET, and switched around once a second)
2 DC motors (driven with PWM outputs in a PID loop, need to be able to ramp position controllably)
One igniter, again controlled through a relay/MOSFET
My sensors consist of:
8 pressure transducers (analog voltages)
4 thermocouples (analog voltages)
2 motor encoders (quadrature encoders)
1 light sensor (analog voltage)
1 Load cell (analog voltage)
Ideally all of the collected data (all of the above sensors) plus some additional data (timestamps, motor set positions, solenoid positions) is streamed back to the host computer at in real time.
Given the motor control with PWM & PID, you need to specify a desired resolution, either in PWM timer ticks or ADC reads. This is the most critical part. It doesn't hurt if the ADC has greater resolution than your specified resolution either. The PCB has to be designed accordingly, with sufficient resolution on resistors etc.
After you've done this, find MCU with sufficiently accurate ADC. I would imagine that 12 bit resolution is enough for most applications, but I don't know your specific case.
Next, you need to decide how fast you want the PID to be. Should an output on the PWM result in a read on the ADC in the next cycle, or could you settle for slower response? The realtime bottleneck here will be the ADC conversion clock, not the CPU.
The rest of the system doesn't seem time critical at all - you just have to ensure that everything is read/set synchronously. The data transmission to/from the host should preferably be done over CAN since it comes with hard real-time characteristics. Doesn't seem that you need a whole lot of bandwidth.
I have designed systems very similar to this using bare metal 16 bit MCUs running on 16MHz. Processing speed is really not a big concern, but meeting real-time deadlines is. That means you can forget about using Linux toys like Rasp PI, it's completely out of the question. And a RTOS is likely overkill since it mostly adds additional complexity.
A bare metal Cortex M with sufficient ADC resolution and CAN seems like a good choice. If you can stay away from floating point, that's nice too - depends on how advanced math you need. If you need nothing more advanced than PID, it can be implemented with fixed point just fine. (Or PI rather, since that usually works best for fast motor control systems.)
I am new in embedded development and few times ago I red some code about a PIC24xxxx.
void i2c_Write(char data) {
while (I2C2STATbits.TBF) {};
IFS3bits.MI2C2IF = 0;
I2C2TRN = data;
while (I2C2STATbits.TRSTAT) {};
Nop();
Nop();
}
What do you think about the while condition? Does the microchip not using a lot of CPU for that?
I asked myself this question and surprisingly saw a lot of similar code in internet.
Is there not a better way to do it?
What about the Nop() too, why two of them?
Generally, in order to interact with hardware, there are 2 ways:
Busy wait
Interrupt base
In your case, in order to interact with the I2C device, your software is waiting first that the TBF bit is cleared which means the I2C device is ready to accept a byte to send.
Then your software is actually writing the byte into the device and waits that the TRSTAT bit is cleared, meaning that the data has been correctly processed by your I2C device.
The code your are showing is written with busy wait loops, meaning that the CPU is actively waiting the HW. This is indeed waste of resources, but in some case (e.g. your I2C interrupt line is not connected or not available) this is the only way to do.
If you would use interrupt, you would ask the hardware to tell you whenever a given event is happening. For instance, TBF bit is cleared, etc...
The advantage of that is that, while the HW is doing its stuff, you can continue doing other. Or just sleep to save battery.
I'm not an expert in I2C so the interrupt event I have described is most likely not accurate, but that gives you an idea why you get 2 while loop.
Now regarding pro and cons of interrupt base implementation and busy wait implementation I would say that interrupt based implementation is more efficient but more difficult to write since you have to process asynchronous event coming from HW. Busy wait implementation is easy to write but is slower; But this might still be fast enough for you.
Eventually, I got no idea why the 2 NoP are needed there. Most likely a tweak which is needed because somehow, the CPU would still go too fast.
when doing these kinds of transactions (i2c/spi) you find yourself in one of two situations, bit bang, or some form of hardware assist. bit bang is easier to implement and read and debug, and is often quite portable from one chip/family to the next. But burns a lot of cpu. But microcontrollers are mostly there to be custom hardware like a cpld or fpga that is easier to program. They are there to burn cpu cycles pretending to be hardware designs. with i2c or spi you are trying to create a specific waveform on some number of I/O pins on the device and at times latching the inputs. The bus has a spec and sometimes is slower than your cpu. Sometimes not, sometimes when you add the software and compiler overhead you might end up not needing a timer for delays you might be just slow enough. But ideally you look at the waveform and you simply create it, raise pin X delay n ms, raise pin Y delay n ms, drop pin Y delay 2*n ms, and so on. Those delays can come from tuned loops (count from 0 to 1341) or polling a timer until it gets to Z number of ticks of some clock. Massive cpu waste, but the point is you are really just being programmable hardware and hardware would be burning time waiting as well.
When you have a peripheral in your mcu that assists it might do much/most of the timing for you but maybe not all of it, perhaps you have to assert/deassert chip select and then the spi logic does the clock and data timing in and out for you. And these peripherals are generally very specific to one family of one chip vendor perhaps common across a chip vendor but never vendor to vendor so very not portable and there is a learning curve. And perhaps in your case if the cpu is fast enough it might be possible for you to do the next thing in a way that it violates the bus timing, so you would have to kill more time (maybe why you have those Nops()).
Think of an mcu as a software programmable CPLD or FPGA and this waste makes a lot more sense. Unfortunately unlike a CPLD or FPGA you are single threaded so you cant be doing several trivial things in parallel with clock accurate timing (exactly this many clocks task a switches state and changes output). Interrupts help but not quite the same, change one line of code and your timing changes.
In this case, esp with the nops, you should probably be using a scope anyway to see the i2c bus and since/when you have it on the scope you can try with and without those calls to see how it affects the waveform. It could also be a case of a bug in the peripheral or a feature maybe you cant hit some register too fast otherwise the peripheral breaks. or it could be a bug in a chip from 5 years ago and the code was written for that the bug is long gone, but they just kept re-using the code, you will see that a lot in vendor libraries.
What do you think about the while condition? Does the microchip not using a lot of CPU for that?
No, since the transmit buffer won't stay full for very long.
I asked myself this question and surprisingly saw a lot of similar code in internet.
What would you suggest instead?
Is there not a better way to do it? (I hate crazy loops :D)
Not that I, you, or apparently anyone else knows of. In what way do you think it could be any better? The transmit buffer won't stay full long enough to make it useful to retask the CPU.
What about the Nop() too, why two of them?
The Nop's ensure that the signal remains stable long enough. This makes this code safe to call under all conditions. Without it, it would only be safe to call this code if you didn't mess with the i2c bus immediately after calling it. But in most cases, this code would be called in a loop anyway, so it makes much more sense to make it inherently safe.
I have a small program running on Linux (on an embedded PC, dual-core Intel Atom 1.6GHz with Debian 6 running Linux 2.6.32-5) which communicates with external hardware via an FTDI USB-to-serial converter (using the ftdi_sio kernel module and a /dev/ttyUSB* device). Essentially, in my main loop I run
clock_gettime() using CLOCK_MONOTONIC
select() with a timeout of 8 ms
clock_gettime() as before
Output the time difference of the two clock_gettime() calls
To have some level of "soft" real-time guarantees, this thread runs as SCHED_FIFO with maximum priority (showing up as "RT" in top). It is the only thread in the system running at this priority, no other process has such priorities. My process has one other SCHED_FIFO thread with a lower priority, while everything else is at SCHED_OTHER. The two "real-time" threads are not CPU bound and do very little apart from waiting for I/O and passing on data.
The kernel I am using has no RT_PREEMPT patches (I might switch to that patch in the future). I know that if I want "proper" realtime, I need to switch to RT_PREEMPT or, better, Xenomai or the like. But nevertheless I would like to know what is behind the following timing anomalies on a "vanilla" kernel:
Roughly 0.03% of all select() calls are timed at over 10 ms (remember, the timeout was 8 ms).
The three worst cases (out of over 12 million calls) were 31.7 ms, 46.8 ms and 64.4 ms.
All of the above happened within 20 seconds of each other, and I think some cron job may have been interfering (although the system logs are low on information apart from the fact that cron.daily was being executed at the time).
So, my question is: What factors can be involved in such extreme cases? Is this just something that can happen inside the Linux kernel itself, i.e. would I have to switch to RT_PREEMPT, or even a non-USB interface and Xenomai, to get more reliable guarantees? Could /proc/sys/kernel/sched_rt_runtime_us be biting me? Are there any other factors I may have missed?
Another way to put this question is, what else can I do to reduce these latency anomalies without switching to a "harder" realtime environment?
Update: I have observed a new, "worse worst case" of about 118.4 ms (once over a total of around 25 million select() calls). Even when I am not using a kernel with any sort of realtime extension, I am somewhat worried by the fact that a deadline can apparently be missed by over a tenth of a second.
Without more information it is difficult to point to something specific, so I am just guessing here:
Interrupts and code that is triggered by interrupts take so much time in the kernel that your real time thread is significantly delayed. This depends on the frequency of interrupts, which interrupt handlers are involved, etc.
A thread with lower priority will not be interrupted inside the kernel until it yields the cpu or leaves the kernel.
As pointed out in this SO answer, CPU System Management Interrupts and Thermal Management can also cause significant time delays (up to 300ms were observed by the poster).
118ms seems quite a lot for a 1.6GHz CPU. But one driver that accidently locks the cpu for some time would be enough. If you can, try to disable some drivers or use different driver/hardware combinations.
sched_rt_period_us and sched_rt_period_us should not be a problem if they are set to reasonable values and your code behaves as you expect. Still, I would remove the limit for RT threads and see what happens.
What else can you do? Write a device driver! It's not that difficult and interrupt handlers get a higher priority than realtime threads. It may be easier to switch to a real time kernel but YMMV.
I have a program that generates packets to send to a receiver. I need an efficient method of introducing a small delay between the sending of each packet so as not to overrun the receiver. I've tried usleep() and nanosleep() but they seem to be too slow. I've implemented a busy wait loop and had more success, but it's not the most efficient method, I know. I'm interested in anyone's experiences in trying to do what I'm doing. Do others find usleep() and nanosleep() to function well for this type of application?
Thanks,
Danny Llewallyn
The behaviour of the sleep functions for very small intervals is heavily dependent on the kernel version and configuration.
If you have a "tickless" kernel (CONFIG_NO_HZ) and high resolution timers, then you can expect the sleeps to be quite close to what you ask for.
Otherwise, you'll generally end up sleeping at the granularity of the timer interrupt. The timer interrupt interval is configurable (CONFIG_HZ) - 10ms, 4ms, 3.3ms and 1ms are the common choices.
Assuming that the higher level approaches other commenters have mentioned are not available to you, then a common approach in embedded/microcontroller land is to create a NOP-loop of the required length.
A NOP operation takes one CPU cycle and in an embedded environment you typically know exactly what clock speed your processor is running at so you can just use a simple for-loop conatining _NOP() or if only a very short delay is required then don't bother with a loop, just add in the required number of nops.
regTX = 0xFF; // Transmit FF on special register
// Wait three clock cycles
_NOP();
_NOP();
_NOP();
regTX = 0x00; // Transmit 00
This seems like a bad design. Ideally the receiver would queue any extra data it receives , and then do its message processing separate thread. In that way, it can handle bursts of data without relying on the sender to throttle its requests.
But perhaps such an approach is not practical if (for example) you do not have control of the receiver's code, or if this is an embedded application.
I can speak for Solaris here, in that it uses an OS timer to wake up sleep calls. By default the minimum wait time will be 10ms, regardless of what you specify in your usleep. However, you can use the parameters hires_tick = 1 (1ms wakeups) and hires_hz = in the /etc/system configuration file to increase the frequency of timer wake up calls.
Instead of doing things at the packet level, where you need to worry about such things as overrunning the reciever. Why not use a TCP stream to transmit the data? Let TCP handle things like flow rate control and packet retransmission.
If you've already got a lot invested in the packetized approach, you can always use a layer on top of TCP to extract the original packets of data from the TCP stream and feed these into your existing functions.