ALSA capture causes high CPU usage - c

I wrote a full-duplex ALSA program and run it on a Linux-based embedded system.
Its sound configurations are:
Sample rate: 16 kHz
Channels: 1 (mono)
Format: S16_LE
min avail: 160 (frames)
For a real-time application I need to capture sound every 10 ms, so I set min avail to 160.
My problem is that while the program is running, CPU usage is very high, up to 99.9% according to the top command. Sometimes the CPU load is low, but once it reaches 99.9% it never drops back to low CPU usage.
I found that it might be a configuration problem. In my asound.conf file (shown below), I created an asym-type device named "asym0" so that playback and capture can use two different slave devices.
Originally I used "primary" as the capture device, but that causes the high CPU usage. Then I created a rate-type device named "rate0" and set it as the capture device. CPU usage dropped and now floats between 20% and 60%, but the captured sound is bad: I hear a "po po po" noise in my voice when I test the mic (capturing).
So:
If I choose "primary", CPU usage is high, but there is no "po po po" sound.
If I choose "rate0", CPU usage is lower, but there is a "po po po" sound.
What is the difference between "type hw" and "type rate"?
Is the effect caused by a different interrupt frequency?
asound.conf file:
pcm.primary {
type hw
card mycard
}
pcm.rate0 {
type rate
slave {
pcm "primary"
rate 16000
}
}
pcm.asym0 {
type asym
playback.pcm "primary"
capture.pcm "primary" or "rate0"
}
Please anyone help me to solve this problem. Thank you!!!

Sound capture should be a very trivial task for the CPU, because most of it happens in hardware and the CPU only occasionally needs to wake a thread to handle the input audio. Typically, if your periods or buffers are very small, capture will require more CPU attention and is more likely to suffer overruns. Overruns may be where your signal dropouts are occurring.
If your sample rate is 16 kHz, and you capture every 10ms, that is indeed 160 frames.
Some things to look at are whether your period is smaller than 10 ms, and whether you are doing very heavy processing in your capture thread.
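To make the period arithmetic concrete, here is a minimal sketch in C. The function names are made up for illustration and are not part of ALSA; the numbers are the ones from the question (16 kHz, mono, S16_LE, 10 ms).

```c
#include <assert.h>
#include <stddef.h>

/* Frames that arrive during one period of period_ms milliseconds. */
unsigned long frames_per_period(unsigned long rate_hz, unsigned long period_ms)
{
    assert(period_ms > 0);
    return rate_hz * period_ms / 1000UL;
}

/* Bytes in one period of interleaved PCM data. */
size_t period_bytes(unsigned long frames, unsigned channels, unsigned bytes_per_sample)
{
    return (size_t)frames * channels * bytes_per_sample;
}

/* Wakeups per second if the application is woken once per period. */
unsigned long wakeups_per_second(unsigned long rate_hz, unsigned long frames)
{
    assert(frames > 0);
    return rate_hz / frames;
}
```

With 16000 Hz and 10 ms this gives 160 frames, 320 bytes per period for mono 16-bit samples, and 100 wakeups per second. If the capture thread wakes far more often than that, the effective period is probably smaller than intended.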
To help you, there is some code in gtkIOStream which implements a C++ OO ALSA hierarchy. You can look at this ALSAFullduplex.C test application as a reference and test it to see if it suffers the same problems you are suffering.
Information on building gtkIOStream is given in this email :
https://lists.audioinjector.net/pipermail/people/2020-March/000028.html


How to determine MCU Clock speed requirements

Overview:
I spent a while trying to think of how to formulate this question. To narrow the scope, I wanted to provide my initial HW requirements in the form of a ‘real life’ example application.
I understand that clock speed is probably relative, in the sense that it is a case by case basis. For example, your requirement for a certain speed may be impacted on by the on-chip peripherals offered by the MCU. As an example, you may spend (n) cycles servicing an ISR for an encoder, or, you could pick an MCU that has a QEI input to do it for you (to some degree), which in turn, may loosen your requirement?
I am not an expert, and am very much still learning, so please call me out if I use an incorrect term, or completely misinterpret something. I assure you; the feedback is welcome!
Example Application:
This application is relatively simple. It can be thought of as a non-blocking state machine, where each ‘iteration’ of the machine must complete within 20ms. A single iteration of this machine has 4 main tasks:
Decode a serial payload, consisting of 32 bytes. The length is fixed at 32 bytes, payload is dynamic, baud is 115200bps (See Task #2 below)
Read 4 incremental shaft encoder signals, which are coupled with 4 DC Motors, 1 encoder for each motor (See Task #1 Below)
Determine the position of 4 limit switches. ISR driven, trigger on rising edge for each switch.
Based on the 3 categories of inputs above, the MCU will output 4 separate PWM signals @ 50Hz (20ms) to a motor controller for its next set of movements. (See Task #3 below)
From an IO perspective, I know that the MCU is on the hook for reading 8 digital signals (4 quadrature encoders, 4 limit switches), and decoding a serial frame of 32 bytes over UART.
Based on that data, the MCU will output 4 independent PWM signals, with a pulse width of [1000usec -3200usec], per motor, to the motor controller.
The Question:
After all is said and done, I am trying to think through how I can map my requirements into MCU selection, solely from a speed point of view.
It’s easy for me to look through the datasheet and say, this chip meets my requirements because it has (n) UARTS, (n) ISR input pins, (n) PWM outputs etc. But my projects are so small that I always assume the processor is ‘fast enough’. Aside from my immediate peripheral needs, I never really look into the actual MCU speed, which is an issue on my end.
To resolve that, I am trying to understand what goes into selecting a particular clock speed, based on the needs of a given application. Or, another way to say it, which is probably wrong: how do you quantify the theoretical load on the processor for that specific application?
Additional Information
Task #1: Encoder:
Each of the 4 motors has a different task within the system, but they are all the same brand/model and have a maximum speed of 230 RPM. My assumption is that, in the worst case, one of the motors is spinning at 230 RPM, and at full quadrature resolution (counting rising/falling edges on channels A and B) the 1000PPR encoder would generate 4K interrupts per revolution. The MCU would have to service those interrupts, potentially creating a bottleneck for the system. For example, if (n) clock cycles are required to service the ISR, and we expect 4K interrupts per revolution of one motor, that would be 230 (RPM) * 4K (interrupts per rev) == 920,000 interrupts per minute? Yikes! Extrapolating to the worst case, where all 4 motors are spinning at 230 RPM with full-resolution encoders, the system would have to endure 920K interrupts per minute for each encoder, so 920K * 4 motors == 3,680,000 interrupts per minute? I am 100% sure I am doing something wrong, so please feel free to set me straight.
Task #2: Serial Decoding
The MCU will require a dedicated HW serial port to decode a packet of 32 bytes, which repeats, with different values, every 7ms. Baud rate will be set to 115200bps.
Task #3: PWM Output
Based on the information from tasks 1 and 2, the MCU will write to 4 separate PWM outputs. The pulse for each output will be between 1000-3200usec with a frequency of 50Hz.
You need to separate real-time critical parts from the rest of the application. For example, the actual reception of an UART frame is somewhat time-critical if you do so interrupt-based. But the protocol decoding is not critical at all unless you are expected to respond within a certain time.
Decode a serial payload, consisting of 32 bytes.
You can either do this the old school way with interrupts filling up a buffer, or you could look for a part with DMA, which is fairly common nowadays. DMA means that you won't have to consider some annoying, relatively low frequency UART interrupt disrupting other tasks.
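To put rough numbers on the UART load, here is a small sketch in C. It assumes 8N1 framing (10 bits on the wire per byte), which the question does not actually state, and the function names are only illustrative.

```c
#include <assert.h>

/* Time on the wire for n_bytes, in microseconds.
 * bits_per_byte = 10 assumes 8N1 framing (1 start + 8 data + 1 stop bit);
 * that framing is an assumption, not something given in the question. */
double uart_frame_time_us(unsigned n_bytes, unsigned bits_per_byte, unsigned long baud)
{
    assert(baud > 0);
    return (double)n_bytes * bits_per_byte * 1e6 / (double)baud;
}

/* Time between consecutive byte-received interrupts (no FIFO, no DMA). */
double uart_byte_interval_us(unsigned bits_per_byte, unsigned long baud)
{
    assert(baud > 0);
    return (double)bits_per_byte * 1e6 / (double)baud;
}
```

For 32 bytes at 115200 baud this gives roughly 2.8 ms on the wire per packet, which fits inside the 7 ms repeat interval, and about 87 us between byte interrupts if every byte is handled individually.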
Read 4 incremental shaft encoder signals
I haven't worked with such encoders so I can't tell how time-critical they are. If you have to catch every single interrupt and your calculations are correct, then 3,680,000 interrupts per minute is about 61,000 per second: 60/3680000 s = 16.3us. So roughly one interrupt every 16 microseconds in the worst case. On an 8-bitter running at 8MHz that leaves only about 130 clock cycles per interrupt, which is tight, so if every edge really has to be caught it is worth picking a part with a hardware quadrature decoder (QEI) or a timer encoder mode instead of servicing each edge in an ISR.
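The interrupt-rate estimate can be sanity-checked by converting the per-minute count into an average interval and a per-interrupt cycle budget. A minimal sketch in C, using the question's figures (230 RPM, 1000 PPR, 4 edges per pulse, 4 motors); the function names are hypothetical and the clock frequencies are just the examples mentioned in this answer.

```c
#include <assert.h>

/* Worst-case encoder edges per minute across all motors. */
unsigned long edges_per_minute(unsigned rpm, unsigned ppr,
                               unsigned edges_per_pulse, unsigned motors)
{
    return (unsigned long)rpm * ppr * edges_per_pulse * motors;
}

/* Average time between edges, in microseconds (60 s per minute). */
double edge_interval_us(unsigned long per_minute)
{
    assert(per_minute > 0);
    return 60.0 * 1e6 / (double)per_minute;
}

/* CPU clock cycles available per edge at a given core clock. */
double cycles_per_edge(unsigned long clock_hz, unsigned long per_minute)
{
    assert(per_minute > 0);
    return (double)clock_hz * 60.0 / (double)per_minute;
}
```

With these inputs the count is 3,680,000 edges per minute, which works out to about 16.3 us between edges on average, roughly 130 cycles per edge at 8 MHz and roughly 780 cycles per edge at 48 MHz.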
Determine the position of 4 limit switches
You don't mention timing here but I assume this is something that could be polled cyclically by a low priority cyclic timer.
the MCU will output 4 separate PWM signals
Not a problem, just pick one with a decent PWM hardware peripheral. You should just need to update some PWM duty cycle registers now and then.
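As a rough illustration of what "updating a duty cycle register" amounts to, here is a sketch in C that converts a pulse width into timer ticks. The 1000-3200 us limits and the 50 Hz rate come from the question; the timer clock and the function names are assumptions, and the actual register write is device-specific and left out.

```c
#include <assert.h>
#include <stdint.h>

#define PULSE_MIN_US 1000u  /* lower pulse limit, from the question */
#define PULSE_MAX_US 3200u  /* upper pulse limit, from the question */

/* Timer ticks in one PWM period (e.g. 50 Hz -> 20 ms). */
uint32_t pwm_period_ticks(uint32_t timer_hz, uint32_t pwm_hz)
{
    assert(pwm_hz > 0);
    return timer_hz / pwm_hz;
}

/* Clamp a requested pulse width and convert it to compare-register ticks. */
uint32_t pulse_us_to_ticks(uint32_t pulse_us, uint32_t timer_hz)
{
    if (pulse_us < PULSE_MIN_US) pulse_us = PULSE_MIN_US;
    if (pulse_us > PULSE_MAX_US) pulse_us = PULSE_MAX_US;
    /* 64-bit intermediate avoids overflow for fast timer clocks. */
    return (uint32_t)(((uint64_t)pulse_us * timer_hz) / 1000000u);
}
```

With a hypothetical 1 MHz timer clock the period is 20000 ticks and a 1500 us pulse is simply 1500 ticks, so the CPU work per update is a clamp and one multiplication per channel.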
Overall, this doesn't sound all that real-time critical. I've done much worse real-time projects with icky 8 and 16 bitters. However, each time I did, I always regret not picking a faster MCU, because you always come up with stuff to add as the project/product goes on.
It sounds like your average mainstream Cortex M0+ would be a good candidate for this project. Clock it at ~48MHz and you'll have plenty of CPU power. Cortex M4 or larger if you actually expect floating point math (I don't quite see why you'd need that though).
Given the current component crisis, be careful with which brand you pick though! In particular stay clear of STM32, since ST can't produce them right now and you might end up waiting over a year until you get parts.
The answer to the question is "experience". But intuitively your example is not particularly taxing - although there are plenty of ways you could mess it up. I once worked on a project that ran on a 200MHz C5502 DSP at near 100% CPU load. The application now runs on a 72MHz Cortex-M3 at only 60% with additional functionality and I/O not present in the original implementation.
Your application is I/O bound; depending on data rates (and critically interrupt rates), I/O seldom constitutes the highest CPU load, and DMA, hardware FIFOs, input capture timer/counters, and hardware PWM etc. can be used to minimise the I/O impact. I shan't go into it in detail; @Lundin has already done that.
Note also that raw processor speed is important for data or signal processing and number crunching - but what I/O generally requires is deterministic real-time response, and that is seldom simply a matter of MHz or MIPS - you will get more deterministic and possibly faster response from an 8bit AVR running at a few MHz than you can guarantee from a 500MHz application processor running Linux - and it won't take 30 seconds to boot!

System architecture to use for high speed micro controller test stand controller/daq

I am designing the controller and data acquisition unit for a rocket engine test stand. This system needs to control a number of actuators on the test stand and also be able to transmit collected data back to the host computer where the team will be watching live data/camera feeds from safety.
The overall design requirements are as follows:
Acquire data from ~15 analog sensors at 1KHz
Control the actuators on the test stand including valves and ignition switches
Transmit data back to the host computer in our shelter in real time
Accept control from the host computer for things like manual valve actuation, test sequence modification, sequence abortion, etc.
I am not exactly sure where to begin when laying out the software for this system. I am considering using an STM32 ARM Cortex-M4 processor running at 180 MHz. I am having trouble figuring how I should approach the problem. I have considered using an RTOS system but based on what I have seen those generate large overheads as you run them faster as the scheduler has to run each tick. The other idea I'm bouncing around is a state machine combined with some timer-based interrupts for reading and then sending data back out to the PC. Any advice as to how to approach this problem to minimize code complexity would be greatly appreciated. Thanks.
EDIT:
I have been told to clarify a number of things concerning the technical specs of the system.
My actuators consist of:
6 solenoids (controlled digitally through relays/MOSFET, and switched around once a second)
2 DC motors (driven with PWM outputs in a PID loop, need to be able to ramp position controllably)
One igniter, again controlled through a relay/MOSFET
My sensors consist of:
8 pressure transducers (analog voltages)
4 thermocouples (analog voltages)
2 motor encoders (quadrature encoders)
1 light sensor (analog voltage)
1 Load cell (analog voltage)
Ideally all of the collected data (all of the above sensors) plus some additional data (timestamps, motor set positions, solenoid positions) is streamed back to the host computer in real time.
Given the motor control with PWM & PID, you need to specify a desired resolution, either in PWM timer ticks or ADC reads. This is the most critical part. It doesn't hurt if the ADC has greater resolution than your specified resolution either. The PCB has to be designed accordingly, with sufficient resolution on resistors etc.
After you've done this, find MCU with sufficiently accurate ADC. I would imagine that 12 bit resolution is enough for most applications, but I don't know your specific case.
Next, you need to decide how fast you want the PID to be. Should an output on the PWM result in a read on the ADC in the next cycle, or could you settle for slower response? The realtime bottleneck here will be the ADC conversion clock, not the CPU.
The rest of the system doesn't seem time critical at all - you just have to ensure that everything is read/set synchronously. The data transmission to/from the host should preferably be done over CAN since it comes with hard real-time characteristics. Doesn't seem that you need a whole lot of bandwidth.
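A quick way to check the bandwidth and ADC timing claims is to work out the raw numbers. The sketch below is in C and uses the question's figures (about 15 analog channels at 1 kHz); the 2 bytes per sample and the single multiplexed ADC are assumptions for illustration, as are the function names.

```c
#include <assert.h>

/* Raw payload rate in bits per second, before any protocol overhead. */
unsigned long payload_bits_per_second(unsigned channels, unsigned sample_rate_hz,
                                      unsigned bytes_per_sample)
{
    return (unsigned long)channels * sample_rate_hz * bytes_per_sample * 8UL;
}

/* Time available per conversion, in microseconds, if a single ADC is
 * multiplexed across all channels (an assumption, not a given). */
double adc_conversion_budget_us(unsigned channels, unsigned sample_rate_hz)
{
    assert(channels > 0 && sample_rate_hz > 0);
    return 1e6 / ((double)channels * sample_rate_hz);
}
```

Under these assumptions the stream is about 240 kbit/s of raw payload, which whatever link you choose has to carry on top of its own framing overhead, and a single multiplexed ADC has roughly 67 us per conversion.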
I have designed systems very similar to this using bare metal 16 bit MCUs running on 16MHz. Processing speed is really not a big concern, but meeting real-time deadlines is. That means you can forget about using Linux toys like Rasp PI, it's completely out of the question. And a RTOS is likely overkill since it mostly adds additional complexity.
A bare metal Cortex M with sufficient ADC resolution and CAN seems like a good choice. If you can stay away from floating point, that's nice too - depends on how advanced math you need. If you need nothing more advanced than PID, it can be implemented with fixed point just fine. (Or PI rather, since that usually works best for fast motor control systems.)
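For illustration, a fixed-point PI controller can be as small as the sketch below. This is only one possible layout, assuming Q16.16 gains, integer setpoints and outputs (for example encoder counts in, PWM ticks out), and simple clamping as anti-windup; the type and function names are hypothetical and the gains would need tuning for a real motor.

```c
#include <assert.h>
#include <stdint.h>

#define Q 16                                   /* fractional bits (Q16.16) */
#define TO_Q(x) ((int32_t)((x) * (1 << Q)))    /* evaluated at compile time for constants */

typedef struct {
    int32_t kp;        /* proportional gain, Q16.16 */
    int32_t ki;        /* integral gain per control step, Q16.16 */
    int64_t integral;  /* accumulated ki * error, Q16.16 */
    int32_t out_min;   /* output limits in plain integer units */
    int32_t out_max;
} pi_ctrl;

/* One control step: returns the clamped controller output. */
int32_t pi_step(pi_ctrl *c, int32_t setpoint, int32_t measured)
{
    int32_t error = setpoint - measured;
    int64_t lo = (int64_t)c->out_min * (1 << Q);
    int64_t hi = (int64_t)c->out_max * (1 << Q);

    /* Integrate, then clamp the integrator so it cannot wind up. */
    c->integral += (int64_t)c->ki * error;
    if (c->integral > hi) c->integral = hi;
    if (c->integral < lo) c->integral = lo;

    /* Sum P and I terms in Q16.16, clamp, then drop the fraction. */
    int64_t out = (int64_t)c->kp * error + c->integral;
    if (out > hi) out = hi;
    if (out < lo) out = lo;
    return (int32_t)(out / (1 << Q));
}
```

With kp = 2, ki = 0.5 and a constant error of 10, the first two steps return 25 and 30, and a very large error saturates at out_max. No floating point is needed at run time, only 64-bit integer multiplies.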

How do I write audio data at a certain sample rate?

I am making a synthesizer by piping data into aplay (I know it's not ideal) and the sound is lagging behind the keypresses which alter the sound. I believe this is because aplay is going at a constant 8000 Hz, but the c program is going at an unstable rate. How do I get the for loop to go at 8000 Hz in C?
To generate audio samples at 8000 Hz (or any fixed rate) you don't want your loop to "run at" that rate. That would involve huge amounts of overhead (99.99% or more) spinning doing nothing until time to generate the next sample, and (especially if you sleep rather than spinning) would be unreliable in that your process might not wake-up/get-scheduled in time for some of the samples.
Instead, you just want to be producing samples at an overall rate matching what the consumer (aplay/the audio device) expects. You can compute the overall current sample number you should be generating up to as something like:
(current_time + buffer_depth - start_time) * sample_rate
then, after generating up to that sample, sleep for some period proportional to the buffer depth, but sufficiently less that you won't be in trouble if your process doesn't get scheduled again right away. The buffer depth you can use depends on what kind of latency you need. If you're making sounds for live/realtime events, you probably want a buffer depth of 1/50 sec (20 ms) or less. If not, you can happily use huge buffers like 5-10 seconds.
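The pacing idea above can be sketched in C roughly as follows. The function names are made up for illustration; times are in nanoseconds, and in a real program now_ns would typically come from clock_gettime(CLOCK_MONOTONIC, ...).

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_S 1000000000ull

/* Index of the sample we should have generated up to at time now_ns,
 * keeping buffer_ns worth of audio queued ahead of the consumer. */
uint64_t target_sample(uint64_t now_ns, uint64_t start_ns,
                       uint64_t buffer_ns, uint32_t rate_hz)
{
    assert(now_ns >= start_ns);
    uint64_t ahead_ns = now_ns - start_ns + buffer_ns;
    /* Split into whole seconds and remainder to avoid 64-bit overflow. */
    return (ahead_ns / NS_PER_S) * rate_hz
         + ((ahead_ns % NS_PER_S) * rate_hz) / NS_PER_S;
}

/* Number of new samples to produce on this wakeup. */
uint64_t samples_due(uint64_t target, uint64_t generated)
{
    return target > generated ? target - generated : 0;
}
```

For example, one second after start with a 20 ms buffer at 8000 Hz the target is sample 8160; if 8000 samples have been written so far, the loop generates 160 more and then sleeps for some fraction of the buffer depth before checking again.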
If you are piping data to aplay, you will not experience any problems with the sample rate (8 kHz, for example) because the kernel will block your program when you write() when the buffer is full. This will effectively limit your audio generation to 8 kHz with no work on your part.
However, this is far from ideal. Your application will only be throttled once the kernel buffer for the pipe is full, and the default size for pipe buffers on Linux is 64 kB. For stereo 16-bit data at 8 kHz, this is two full seconds of audio data, so you would expect your audio to lag at least two seconds from the user input. This is unacceptable for synthesizer applications.
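The two-second figure is just buffer size divided by byte rate. A tiny sketch in C (the function name is illustrative, and 65536 bytes is the 64 kB pipe buffer size quoted above):

```c
#include <assert.h>

/* Milliseconds of audio that fit in a buffer of buf_bytes. */
unsigned long buffered_audio_ms(unsigned long buf_bytes, unsigned long rate_hz,
                                unsigned channels, unsigned bytes_per_sample)
{
    unsigned long bytes_per_second = rate_hz * channels * bytes_per_sample;
    assert(bytes_per_second > 0);
    return buf_bytes * 1000UL / bytes_per_second;
}
```

For stereo 16-bit data at 8 kHz a 65536-byte buffer holds 2048 ms of audio; for mono 8-bit data at the same rate it would hold about 8 seconds, so the lag gets worse as the stream gets smaller.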
The only real solution is to use the ALSA library directly (or some alternative sound API). Using this API, you can send buffered audio data to your audio output device without accumulating excessive queued data in kernel buffers.
See A Guide Through The Linux Sound API Jungle for some tips.

Increase Beaglebone Black ADC sampling rate?

I'm working on a project that requires the use of a microcontroller, and for this reason, I decided to use the Beaglebone Black. I'm still new to the Beaglebone world and I'm facing some problems that I hope you guys can help me with.
In my project I will have to continuously read from all the 7 analog read pins and do some processing accordingly. My question is, what will be the fastest programming language to do so (I must read as much samples as possible and in a very short time!) and how to increase the sampling rate from KHz to MHz?
I tried the following codes:
Javascript Code:
var b = require('bonescript');//this variable is to refer to my beaglebone
time = new Date();
b.analogRead("P9_39");
console.log(new Date() - time);
This code simply performs one analog read and prints the time needed to perform it. Surprisingly, the result was 111ms, which means that my sampling rate is only about 9 samples per second, if I'm not wrong.
An alternative was to use Python:
import Adafruit_BBIO.ADC as ADC
import time
ADC.setup()
millis = int(round(time.time() * 1000))
ADC.read_raw("P9_39")
millis = int(round(time.time() * 1000)) - millis
print(millis)
This code took less time (4ms), but still, if I wanted to read from the 7 analog input pins, I would only be able to read around 35 samples per second from each.
Using the terminal:
echo cape-bone-iio > /sys/devices/bone_capemgr.*/slots
time cat /sys/devices/ocp.3/helper.15/AIN0
############OR############
time cat /sys/devices/ocp.3/44e0d000.tscadc/tiadc/iio\:device0/in_voltage0_raw
and this took 50ms.
I want my sampling rate to be something in MHz. How can I do so? I know that the Beaglebone Black is capable of that but I could not find a clear way to do so. Any help is appreciated.
Thanks in advance.
The sampling rate of the AM335x ADC is 200 kSPS, so you won't get into the MHz range with the stock BeagleBone Black ADC.
To get something working with a latency of 5 µs in a non-real-time OS like Linux is impossible. You will be at the mercy of the OS to schedule your execution thread. Other kernel threads will take priority and will preempt your thread, even if you assign it the highest scheduling priority.
From my experience with digital IO on the BeagleBone Black, I started seeing missed frames at around 1K samples per second. How much that matters depends on your tolerance for missing samples: if it only needs to work semi-reliably, you can probably squeeze out 10K samples per second by switching to C/C++ and increasing the priority of your process with the nice --10 ... command. However, if you cannot tolerate missed frames, you have to do one of these:
Bypass OS entirely and write C program for naked AM335x processor (no OS).
Use another hardware -- an ADC with a buffer to accumulate samples while your program is preempted.
Use PRUSS processors on BBB. They run at 200 MHz, so if you have a tight loop with e.g. 20 assembly instructions you will get reliable sampling rate of 10 MHz. That is if you had a faster ADC in the first place, and of course it would handle the stock 200 KHz ADC easily.
I personally went with option #3 and was happy to see my device perform sub-millisecond GPIO operations extremely reliably.
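The PRU numbers in option #3 follow from a simple division, shown here as a C sketch. It assumes one instruction per clock cycle, as the 200 MHz / 20 instructions / 10 MHz figures above imply, and the function names are hypothetical.

```c
#include <assert.h>

/* Loop iterations per second, assuming one instruction per clock cycle. */
unsigned long loop_rate_hz(unsigned long core_hz, unsigned instructions_per_iteration)
{
    assert(instructions_per_iteration > 0);
    return core_hz / instructions_per_iteration;
}

/* Clock cycles available per sample at a given sampling rate. */
unsigned long cycles_per_sample(unsigned long core_hz, unsigned long sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return core_hz / sample_rate_hz;
}
```

A 20-instruction loop at 200 MHz runs 10 million times per second, and at the stock ADC's 200 kHz rate there are 1000 cycles available per sample, which is why the PRU handles it comfortably.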
Use 127 BeagleBone Blacks plugged into 127 USB hub ports, then break out Visual Basic and write a USB program that automatically fires the 127 boards sequentially, one after the other, and reads the data into a textbox. You will get around 16 MHz / MSPS of consecutive ADC reads per fast CPU running, say, Windows 10. lyj2021
You may have overlapping data, but you can track this with each firing of each BeagleBone Black, consecutively.

Scheduling routines in C and timing requirements

I'm working on a C program that transmits samples over USB3 for a set period of time (1-10 us), and then receives samples for 100-1000 us. I have a rudimentary pthread implementation where the TX and RX routines are each handled as a thread. The reason for this is that in order to test the actual TX routine, the RX needs to run and sample before the transmitter is activated.
Note that I have very little C experience outside of embedded applications and this is my first time dabbling with pthread.
My question is, since I know exactly how many samples I need to transmit and receive, how can I e.g. start the RX thread once the TX thread is done executing and vice versa? How can I ensure that the timing stays consistent? Sampling at 10 MHz causes some harsh timing requirements.
Thanks!
EDIT:
To provide a little more detail, my device is a bladeRF x40 SDR, and communication to the device is handled by a FX3 microcontroller, which occurs over a USB3 connection. I'm running Xubuntu 14.04. Processing, scheduling and configuration however is handled by a C program which runs on the PC.
You don't say anything about your platform, except that it supports pthreads.
So, assuming Linux, you're going to have to realize that in general Linux is not a real-time operating system, and what you're doing sure sounds as if it has real-time timing requirements.
There are real-time variants of Linux, I'm not sure how they'd suit your needs. You might also be able to achieve better performance by doing the work in a kernel driver, but then you won't have access to pthreads so you're going to have to be a bit more low-level.
Thought I'd post my solution.
While the next build of the bladeRF firmware and FPGA image will include the option to add metadata (timestamps) to the synchronous interface, until then there's no real way in which I can know at which time instants certain events occurred.
What I do know is my sampling rate, and exactly how many samples I need to transmit and receive at which times relative to each other. Therefore, by using condition variables (with pthread), I can signal my receiver to start receiving samples at the desired instant. Since TX and RX operations happen in a very specific sequence, I can calculate delays by counting the number of samples and dividing by the sampling rate, which has proven to be 95-98% accurate.
This obviously means that since my TX and RX threads are running simultaneously, there are chunks of data within the received set of samples that will be useless, and I have another routine in place to discard those samples.
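For readers who want to see the general shape of a condition-variable handoff, here is a minimal sketch in C. It is not the code from this project: the thread bodies are placeholders, the names are hypothetical, and it only shows how an RX thread can be held back until a TX thread reports that it is done.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  tx_finished = PTHREAD_COND_INITIALIZER;
static bool tx_done = false;
static int  step = 0;            /* records the order in which work ran */
static int  tx_step, rx_step;

static void *tx_thread(void *arg)
{
    (void)arg;
    /* ... transmit the known number of samples here (placeholder) ... */
    pthread_mutex_lock(&lock);
    tx_step = ++step;
    tx_done = true;
    pthread_cond_signal(&tx_finished);     /* wake the receiver */
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *rx_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!tx_done)                       /* loop guards against spurious wakeups */
        pthread_cond_wait(&tx_finished, &lock);
    rx_step = ++step;
    pthread_mutex_unlock(&lock);
    /* ... receive the known number of samples here (placeholder) ... */
    return NULL;
}

/* Runs one TX-then-RX cycle; returns true if RX started after TX finished. */
bool run_tx_then_rx(void)
{
    pthread_t tx, rx;
    tx_done = false;
    step = 0;
    if (pthread_create(&rx, NULL, rx_thread, NULL) != 0)
        return false;
    if (pthread_create(&tx, NULL, tx_thread, NULL) != 0) {
        pthread_mutex_lock(&lock);
        tx_done = true;                    /* release the waiting receiver */
        pthread_cond_signal(&tx_finished);
        pthread_mutex_unlock(&lock);
        pthread_join(rx, NULL);
        return false;
    }
    pthread_join(tx, NULL);
    pthread_join(rx, NULL);
    return tx_step == 1 && rx_step == 2;
}
```

The receiver is started first and simply blocks until the flag is set, so the ordering does not depend on which thread the scheduler happens to run first. It guarantees ordering only, not timing; wakeup latency on a stock Linux kernel is still not bounded.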
