I inherited some ALSA code that runs on a Linux embedded platform.
The existing implementation does blocking reads and writes using snd_pcm_readi() and snd_pcm_writei().
I have been tasked with making this run on an ARM processor, but I find that the blocking interleaved reads push the CPU to 99%, so I am exploring non-blocking reads and writes.
I open the device as can be expected:
snd_pcm_t *handle;
const char* hwname = "plughw:0"; // example name
snd_pcm_open(&handle, hwname, SND_PCM_STREAM_CAPTURE, SND_PCM_NONBLOCK);
Other ALSA stuff then happens which I can supply on request.
It is worth noting at this point that:
we set a sampling rate of 48,000 [Hz]
the sample type is signed 32 bit integer
the device always overrides our requested period size to 1024 frames
Reading the stream like so:
int32* buffer; // buffer set up to hold #period_size samples
int actual = snd_pcm_readi(handle, buffer, period_size);
This call takes approx 15 [ms] to complete in blocking mode. Obviously, variable actual will read 1024 on return.
The problem is that in non-blocking mode this function also takes 15 ms to complete, and actual again always reads 1024 on return.
I would expect that the function would return immediately, with actual being <=1024 and quite possibly reading "EAGAIN" (-11).
In between read attempts I plan to put the thread to sleep for a specific amount of time, yielding CPU time to other processes.
Am I misunderstanding the ALSA API? Or could it be that my code is missing a vital step?
If the function returns a value of 1024, then at least 1024 frames were available at the time of the call.
(It's possible that the 15 ms is time needed by the driver to actually start the device.)
Anyway, blocking or non-blocking mode does not make any difference regarding CPU usage. To reduce CPU usage, replace the default device with plughw or hw, but then you lose features like device sharing or sample rate/format conversion.
I solved my problem by wrapping snd_pcm_readi() as follows:
/*
** Read interleaved stream in non-blocking mode
*/
template <typename SampleType>
snd_pcm_sframes_t snd_pcm_readi_nb(snd_pcm_t* pcm, SampleType* buffer, snd_pcm_uframes_t size, unsigned samplerate)
{
const snd_pcm_sframes_t avail = ::snd_pcm_avail(pcm);
if (avail < 0) {
return avail;
}
if (avail < size) {
snd_pcm_uframes_t remain = size - avail;
unsigned long msec = (remain * 1000) / samplerate;
static const unsigned long SLEEP_THRESHOLD_MS = 1;
if (msec > SLEEP_THRESHOLD_MS) {
msec -= SLEEP_THRESHOLD_MS;
// exercise for the reader: sleep for msec
}
}
return ::snd_pcm_readi(pcm, buffer, size);
}
This works quite well for me. My audio process now 'only' takes 19% CPU time.
And it matters not if the PCM interface was opened using SND_PCM_NONBLOCK or 0.
Going to perform callgrind analysis to see if more CPU cycles can be saved elsewhere in the code.
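The sleep that the wrapper above leaves as an exercise boils down to a small calculation. Here is a minimal sketch in plain C, assuming the number of available frames has already been obtained (for example from snd_pcm_avail()). The function name sleep_ms_before_read and its margin_ms parameter are made up for illustration and are not part of ALSA:

```c
/* Hypothetical helper (not an ALSA function): how many milliseconds to
 * sleep before `wanted` frames can be expected, given `avail` frames now.
 * margin_ms is subtracted so the caller wakes up slightly early. */
unsigned long sleep_ms_before_read(long avail, unsigned long wanted,
                                   unsigned rate_hz, unsigned long margin_ms)
{
    if (avail < 0 || rate_hz == 0 || (unsigned long)avail >= wanted)
        return 0; /* error code, bad rate, or enough frames already */

    unsigned long remain = wanted - (unsigned long)avail;
    unsigned long msec = (remain * 1000UL) / rate_hz;

    return msec > margin_ms ? msec - margin_ms : 0;
}
```

With the values from the question (1024 frames, 48,000 Hz, a 1 ms margin) and an empty buffer, this gives 20 ms, which could then be handed to usleep() or nanosleep() before calling snd_pcm_readi().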
I seem to be having a bit of trouble in waiting for the completion of serial data transmissions.
My interpretation of the relevant MSDN article is that the EV_TXEMPTY event is the correct signal, which it describes as:
EV_TXEMPTY - The last character in the output buffer was sent.
However, in my tests the event always fires as soon as the data has been submitted to the buffer, long before the final character has actually reached the wire. See the repro code below, where the measured elapsed time is always zero.
Have I made an error in the implementation, am I misunderstanding the purpose of the flag, or is this feature simply not supported by modern drivers? In the latter case is there a viable workaround, say some form of synchronous line state request?
For the record the tests were conducted with FTDI USB-RS485 and TTL-232R devices in a Windows 10 system, a USB-SERIAL CH340 interface on a Windows 7 system, as well as the on-board serial interface of a 2005-vintage Windows XP machine. In the FTDI case sniffing the USB bus reveals only bulk out transactions and no obvious interrupt notification of the completion.
#include <stdio.h>
#include <windows.h>
static int fatal(void) {
fprintf(stderr, "Error: I/O error\n");
return 1;
}
int main(int argc, const char *argv[]) {
static const char payload[] = "Hello, World!";
// Use a suitably low bitrate to maximize the delay
enum { BAUDRATE = 300 };
// Ask for the port name on the command line
if(argc != 2) {
fprintf(stderr, "Syntax: %s {COMx}\n", argv[0]);
return 1;
}
char path[MAX_PATH];
snprintf(path, sizeof path, "\\\\.\\%s", argv[1]);
// Open and configure the serial device
HANDLE handle = CreateFileA(path, GENERIC_WRITE, 0, NULL,
OPEN_EXISTING, 0, NULL);
if(handle == INVALID_HANDLE_VALUE)
return fatal();
DCB dcb = {
.DCBlength = sizeof dcb,
.BaudRate = BAUDRATE,
.fBinary = TRUE,
.ByteSize = DATABITS_8,
.Parity = NOPARITY,
.StopBits = ONESTOPBIT
};
if(!SetCommState(handle, &dcb))
return fatal();
if(!SetCommMask(handle, EV_TXEMPTY))
return fatal();
// Fire off a write request
DWORD written;
unsigned int elapsed = GetTickCount();
if(!WriteFile(handle, payload, sizeof payload, &written, NULL) ||
written != sizeof payload)
return fatal();
// Wait for transmit completion and measure time elapsed
DWORD event;
if(!WaitCommEvent(handle, &event, NULL))
return fatal();
if(!(event & EV_TXEMPTY))
return fatal();
elapsed = GetTickCount() - elapsed;
// Display the final result
const unsigned int expected_time =
(sizeof payload * 1000 /* ms */ * 10 /* bits/char */) / BAUDRATE;
printf("Completed in %ums, expected %ums\n", elapsed, expected_time);
return 0;
}
The background is that this is part of a Modbus RTU protocol test suite where I am attempting to inject >3.5 character idle delays between characters on the wire to validate device response.
Admittedly, an embedded realtime system would have been far more suitable for the task, but for various reasons I would prefer to stick to a Windows environment while controlling the timing as best as possible.
According to the comments by @Hans Passant and @RbMm, the output buffer referred to in the EV_TXEMPTY documentation is an intermediate buffer, and the event indicates that the data has been forwarded to the driver. No equivalent notification event is defined which encompasses the full chain down to the final device buffers.
No general workaround is presently clear to me short of a manual delay based upon the bitrate and adding a significant worst-case margin for any remaining buffer layers to be traversed, inter-character gaps, clock skew, etc.
I would therefore very much appreciate answers with better alternate solutions.
Nevertheless, for my specific application I have implemented a viable workaround.
The target hardware is a half-duplex bus with a FTDI RS485 interface. This particular device offers an optional local-echo mode in which data actively transmitted onto the bus is not actively filtered from the reception.
After each transmission I am therefore able to wait for the expected echo to appear as a round-trip confirmation. In addition, this serves to detect certain faults such as a short-circuited bus.
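For the manual, bitrate-based delay mentioned above, the nominal time on the wire is simple arithmetic. Here is a rough sketch in portable C. The function name tx_time_ms is hypothetical, and the result covers only the ideal line time, so any worst-case margin for driver and USB buffering still has to be added on top:

```c
#include <stdint.h>

/* Hypothetical helper: nominal time in milliseconds (rounded up) needed to
 * shift n_chars characters of bits_per_char bits each onto the wire. */
uint32_t tx_time_ms(uint32_t n_chars, uint32_t bits_per_char, uint32_t baud)
{
    if (baud == 0)
        return 0; /* avoid dividing by zero on a bad setting */

    uint64_t bits = (uint64_t)n_chars * bits_per_char;
    return (uint32_t)((bits * 1000u + baud - 1) / baud);
}
```

For the repro program's 14-byte payload at 300 baud with 10 bits per character, this yields 467 ms.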
I am working on a device driver for a data acquisition system. There is a PCI device that provides input and output data at the same time at regular intervals. The Linux kernel module then manages the data in circular buffers that are read from and written to through file operations.
The data throughput of the system is relatively low: it receives just over 750,000 bytes per second and transmits just over 150,000 bytes per second.
There is a small user space utility that writes and reads data in a loop for testing purposes.
Here is a section of the driver code. (All the code related to the circular buffers has been omitted for simplicity's sake. PCI device initialization is taken care of elsewhere, and pci_interrupt is not the real entry point for the interrupt handler.)
#include <linux/sched.h>
#include <linux/wait.h>
static DECLARE_WAIT_QUEUE_HEAD(wq_head);
static ssize_t read(struct file *filp, char __user *buf, size_t count, loff_t *f_pos)
{
DECLARE_WAITQUEUE(wq, current);
if(count == 0)
return 0;
add_wait_queue(&wq_head, &wq);
do
{
set_current_state(TASK_INTERRUPTIBLE);
if(/*There is any data in the receive buffer*/)
{
/*Copy Data from the receive buffer into user space*/
break;
}
schedule();
} while(1);
set_current_state(TASK_RUNNING);
remove_wait_queue(&wq_head, &wq);
return count;
}
static ssize_t write(struct file *filp, const char __user *buf, size_t count, loff_t *f_pos) {
/* Copy data from userspace into the transmit buffer*/
}
/* This procedure gets called in real time roughly once every 5 milliseconds.
It writes 4k to the receiving buffer and reads 1k from the transmit buffer. */
static void pci_interrupt(void) {
/*Copy data from PCI dma buffer to receiving buffer*/
if(/*There is enough data in the transmit buffer to fill the PCI dma buffer*/) {
/*Copy from the transmit buffer to the PCI device*/
} else {
/*Copy zero's to the PCI device*/
printk(KERN_ALERT DEVICE_NAME ": Data Underflow. Writing 0's\n");
}
wake_up_interruptible(&wq_head);
}
The above code works well for long periods of time; however, every 12-18 hours there is a data underflow error, resulting in zeros being written.
My first thought was that, because the userspace application is not truly real-time, the delay between its read and write operations occasionally got too large, causing the failure. However, I tried changing the size of the reads and writes in userspace and changing the niceness of the userspace application; this had no effect on the frequency of the error.
Due to the error's nature I believe there is some form of race condition in the three functions above. I am not sure how Linux kernel wait queues work.
Is there a decent alternative to the above method for blocking reads, or is there something else wrong that could cause this behavior?
System Information:
Linux Version: Ubuntu 16.10
Linux Kernel: linux-4.8.0-lowlatency
Chipset: Intel Celeron N3150/N3160 Quad Core 2.08 GHz SoC
TL;DR: The above code hits underflow errors every 12-18 hours. Is there a better way to do blocking I/O, or is there some race condition in the code?
One standard approach used in Linux also applies to your case.
User space test program:
1. Open the file in blocking mode (the default in Linux unless you specify the O_NONBLOCK flag).
2. Call select() to block on the file descriptor.
Kernel driver:
1. Register an interrupt handler which gets invoked whenever there is data available.
2. The handler takes a lock to protect the buffer shared between reads/writes and the transfer of data.
Take a look at these links for source code from the LDD3 book (test and driver).
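The user space half of the steps above could look roughly like this. It is only a sketch: the helper name read_when_ready is made up, and it assumes the driver implements the poll file operation so that select() can actually block on the device, which the driver snippet in the question does not show.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Block in select() until fd is readable or timeout_ms elapses, then read.
 * Returns the number of bytes read, 0 on timeout (or end of file),
 * and -1 on error with errno set. */
ssize_t read_when_ready(int fd, void *buf, size_t len, int timeout_ms)
{
    fd_set readfds;
    struct timeval tv;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int ready = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (ready <= 0)
        return ready; /* 0: timed out, -1: select() failed */

    return read(fd, buf, len);
}
```

The test loop would call this with the descriptor returned by open() on the device node and treat a timeout as a sign that the interrupt handler has stopped delivering data.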
I am writing a seek routine for analog FM radio using rtl_sdr with a generic DVB-T stick (tuner is a FC0013). Code is mostly taken from rtl_power.c and rtl_fm.c.
My approach is:
Tune to the new frequency
Gather a few samples
Measure RSSI and store it
Do the same for the next frequency
Upon detecting a local peak which is above a certain threshold, tune to the frequency at which it was detected.
The issue is that I can’t reliably map samples to the frequency at which they were gathered. Here’s the relevant (pseudo) code snippet:
/* freq is the new target frequency */
rtlsdr_cancel_async(dongle.dev);
optimal_settings(freq, demod.rate_in);
fprintf(stderr, "\nSeek: currently at %d Hz (optimized to %d).\n", freq, dongle.freq);
rtlsdr_set_center_freq(dongle.dev, dongle.freq);
/* get two bursts of samples to measure RSSI */
if (rtlsdr_read_sync(dongle.dev, samples, samplesSize, &samplesRead) < 0)
fprintf(stderr, "\nSeek: rtlsdr_read_sync failed\n");
/* rssi = getRssiFromSamples(samples, samplesRead) */
fprintf(stderr, "\nSeek: rssi=%.2f", rssi);
if (rtlsdr_read_sync(dongle.dev, samples, samplesSize, &samplesRead) < 0)
fprintf(stderr, "\nSeek: rtlsdr_read_sync failed\n");
/* rssi = getRssiFromSamples(samples, samplesRead) */
fprintf(stderr, "\nSeek: rssi=%.2f\n", rssi);
When I scan the FM band with that snippet of code, I see that the two RSSI measurements typically differ significantly. In particular, the first measurement is usually in the neighborhood of the second measurement taken from the previous frequency, indicating that some of the samples were taken while still tuned into the old frequency.
I’ve also tried inserting a call to rtlsdr_reset_buffer() before gathering the samples, in an effort to flush any samples still stuck in the pipe, with no noticeable effect. Even a combination of
usleep(500000);
rtlsdr_cancel_async(dongle.dev);
rtlsdr_reset_buffer(dongle.dev);
does not change the picture, other than the usleep() slowing down the seek operation considerably. (Buffer size is 16384 samples, at a sample rate of 2 million, thus the usleep() delay is well above the time it takes to get one burst of samples.)
How can I ensure the samples I take were obtained after tuning into the new frequency?
Are there any buffers for samples which I would need to flush after tuning into a different frequency?
Can I rely on tuning being completed by the time rtlsdr_set_center_freq() returns, or does the tuner need some time to stabilize after that? In the latter case, how can I reliably tell when the frequency change is complete?
Anything else I might have missed?
Going through the code of rtl_power.c again, I found this function:
void retune(rtlsdr_dev_t *d, int freq)
{
uint8_t dump[BUFFER_DUMP];
int n_read;
rtlsdr_set_center_freq(d, (uint32_t)freq);
/* wait for settling and flush buffer */
usleep(5000);
rtlsdr_read_sync(d, &dump, BUFFER_DUMP, &n_read);
if (n_read != BUFFER_DUMP) {
fprintf(stderr, "Error: bad retune.\n");
}
}
Essentially, the tuner needs to settle, with no apparent indicator of when this process is complete.
rtl_power.c solves this by waiting for 5 milliseconds, then discarding a few samples (BUFFER_DUMP is defined as 4096, at sample rates between 1–2.8M).
I found 4096 samples to be insufficient, so I went for the maximum of 16384. Results look a lot more stable this way, though even this does not always seem sufficient for the tuner to stabilize.
For a band scan, an alternative approach would be to have a loop acquiring samples and determining their RSSI until the RSSI values begin to stabilize, i.e. until changes are no longer monotonic or fall below a certain threshold.
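The stabilization loop described above might be sketched as follows. This covers only the threshold criterion, not the monotonicity check, and every name in it (rssi_fn, settle_rssi, replay_rssi) is hypothetical. The measurement callback stands in for "read a burst with rtlsdr_read_sync() and compute its RSSI"; the replay stub just feeds canned values so the loop can be tried without a dongle.

```c
#include <stddef.h>

/* Hypothetical callback: take one burst of samples and return its RSSI. */
typedef double (*rssi_fn)(void *ctx);

/* Stand-in for real hardware: replays a fixed list of RSSI values. */
struct replay { const double *values; size_t count; size_t next; };

static double replay_rssi(void *ctx)
{
    struct replay *r = ctx;
    double v = r->values[r->next];
    if (r->next + 1 < r->count)
        r->next++;
    return v;
}

/* Keep measuring until two consecutive readings differ by less than
 * threshold_db. Returns the number of readings taken, or -1 if the
 * signal did not settle within max_reads. *settled gets the last reading. */
int settle_rssi(rssi_fn measure, void *ctx, double threshold_db,
                int max_reads, double *settled)
{
    double prev = measure(ctx);

    for (int i = 1; i < max_reads; i++) {
        double cur = measure(ctx);
        double delta = cur > prev ? cur - prev : prev - cur;
        if (delta < threshold_db) {
            *settled = cur;
            return i + 1;
        }
        prev = cur;
    }
    *settled = prev;
    return -1;
}
```

The max_reads cap matters in practice, since a noisy channel may never meet the threshold and the seek should move on rather than hang.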
static const int MAX_BUFFER_LEN = 1024*12; // in bytes
char *bff = new char[MAX_BUFFER_LEN];
int fileflag = O_CREAT | O_WRONLY | O_NONBLOCK;
int fl = open(filename, fileflag, 0666);
if(fl < 0)
{
printf("can not open file! \n");
return -1;
}
do
{
///begin one loop
struct timeval bef;
struct timeval aft;
gettimeofday(&bef, NULL);
write(fl, bff, MAX_BUFFER_LEN);
gettimeofday(&aft, NULL);
if(aft.tv_usec - bef.tv_usec > 20000) //ignore second condition
{
printf(" cost too long:%ld \n", (long)(aft.tv_usec - bef.tv_usec));
}
//end one loop
//sleep
usleep(30*1000); //sleep 30ms
}while(1);
When I run the program on Linux (Ubuntu, kernel 2.6.32-24-generic), the "cost too long" message is printed 1-2 times a minute. I tried writing both to a USB disk and to a hard disk. I also tried the program on an ARM platform, and the same thing happened. I thought that 3.2 Mbps might be too high for a low-speed I/O device, so I reduced it to 0.4 Mbps, which significantly reduced how often the message is printed. Is there any way to bound the time a write takes?
Does write() just copy the data to a kernel buffer and return immediately, or does it wait for the disk I/O to complete? Is it possible that the kernel I/O buffer is full and the call has to wait for a flush? But then why do only a few writes take so long?
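One detail of the measurement itself: the loop above compares only tv_usec, so a write that straddles a second boundary produces a negative difference and is never reported. A small sketch of a full difference follows; the helper name elapsed_usec is made up for illustration:

```c
#include <sys/time.h>

/* Hypothetical helper: microseconds elapsed between two gettimeofday()
 * readings, taking both the seconds and the microseconds fields into
 * account. */
long long elapsed_usec(const struct timeval *start, const struct timeval *end)
{
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(end->tv_usec - start->tv_usec);
}
```

The check in the loop would then read `if (elapsed_usec(&bef, &aft) > 20000)`.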
You can't accelerate the disk, but you can do other stuff while the disk is working. You needn't wait for it to be done.
This is, however, highly non-trivial to do in C. You would need nonblocking I/O, multithreading or multiprocessing. Try searching for these keywords and for how to use the different techniques (you are already using a nonblocking fd up there).
Your disk I/O performance is being negatively impacted by the code around each write to measure the time (and measuring time at this granularity is going to have occasional spikes as the computer does other things).
Instead, measure the performance of the code to write the entire data - start/end times outside the loop (with the loop properly bounded, of course).
If you are calling a file write which you think is going to take a lot of time, then have your process run two threads: while one does the main task, the other writes to disk.
I have a little code below. I use this code to output some 1s and 0s (unsigned output[38]) from a GPIO of an embedded board.
My question: the time between two output values (1, 0 or 0, 1) should be 416 microseconds, as defined in the clock_nanosleep() call in the code below. I also used sched_setscheduler() with the maximum SCHED_FIFO priority for better time resolution. However, an oscilloscope measurement (pic below) shows that the time between two output values is 770 usec. Why is there that much inaccuracy between the signals?
PS. The board (BeagleBoard) runs Linux 3.2.0-23-omap #36-Ubuntu Tue Apr 10 20:24:21 UTC 2012 armv7l armv7l armv7l GNU/Linux and has a 750 MHz CPU. top shows almost no CPU (~1%) or memory (~0.5%) usage before I run my code. I use an electronic oscilloscope which has no calibration problem.
#include <stdio.h>
#include <stdlib.h> //exit();
#include <sched.h>
#include <time.h>
void msg_send();
struct sched_param sp;
int main(void){
sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
sched_setscheduler(0, SCHED_FIFO, &sp);
msg_send();
return 0;
}
void msg_send(){
unsigned output[38] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,1};
FILE *fp8;
if ((fp8 = fopen("/sys/class/gpio/export", "w")) == NULL){ //echo 139 > export
fprintf(stderr,"Cannot open export file: line844.\n"); exit(1); // fp8 is NULL here, so there is nothing to fclose()
}
fprintf(fp8, "%d", 139); //pin 3
fclose(fp8);
if ((fp8 = fopen("/sys/class/gpio/gpio139/direction", "rb+")) == NULL){
fprintf(stderr,"Cannot open direction file - GPIO139 - line851.\n"); exit(1);
}
fprintf(fp8, "out");
fclose(fp8);
if((fp8 = fopen("/sys/class/gpio/gpio139/value", "w")) == NULL) {
fprintf(stderr,"error in opening value\n"); exit(1);
}
struct timespec req = { .tv_sec=0, .tv_nsec = 416000 }; //416 usec
/* here is the part that my question focus*/
while(1){
for(int i=0;i<38;i++){
rewind(fp8);
fprintf(fp8, "%d", output[i]);
clock_nanosleep(CLOCK_MONOTONIC ,0, &req, NULL);
}
}
}
EDIT: I have been reading for days that clock_nanosleep() and the other sleep functions (nanosleep, usleep, etc.) do not guarantee waking up on time. They put the code to sleep for the defined time, but when the process actually wakes up depends on the CPU. What I found is that absolute time (the TIMER_ABSTIME flag) provides better resolution. I arrived at the same solution as Maxime suggests. However, I still have a glitch in my signal when the for loop finishes. In my opinion, sleep functions are not a good way to generate PWM or data output on an embedded platform. It is worth spending some time learning the CPU timers that the platform provides in order to generate PWM or data output with good accuracy.
I can't figure out how a call to clock_getres() can solve your problem. According to the man page, it only reads the resolution of the clock.
As Geoff said, sleeping until an absolute time should be a better solution. This avoids the unexpected timing delay from other code.
struct timespec Time;
clock_gettime(CLOCK_REALTIME, &(Time));
while(1){
Time.tv_nsec += 416000;
if(Time.tv_nsec > 999999999){
(Time.tv_sec)++;
Time.tv_nsec -= 1000000000;
}
clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &(Time), NULL);
//Do something
}
I am using this in a few programs that generate regular messages on an Ethernet network, and it works fine.
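The deadline arithmetic in the loop above can be pulled into a helper so that it is easy to check on its own. This is only a sketch: the names timespec_add_ns and sleep_until_next are made up, and it uses CLOCK_MONOTONIC instead of the CLOCK_REALTIME used above, on the assumption that wall-clock adjustments should not disturb the period.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Advance *t by ns nanoseconds (assumed 0 <= ns < 1 s), keeping
 * tv_nsec in the range [0, 1e9). */
void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    if (t->tv_nsec >= NSEC_PER_SEC) {
        t->tv_sec++;
        t->tv_nsec -= NSEC_PER_SEC;
    }
}

/* Move the deadline forward by one period and sleep until that absolute
 * time. Returns 0 on success or an error number such as EINTR. */
int sleep_until_next(struct timespec *deadline, long period_ns)
{
    timespec_add_ns(deadline, period_ns);
    return clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, deadline, NULL);
}
```

The caller would initialise the deadline once with clock_gettime(CLOCK_MONOTONIC, &deadline) and then call sleep_until_next(&deadline, 416000) at the end of every iteration, so the time spent in the loop body does not accumulate as drift.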
If you are doing time sensitive I/O, you probably shouldn't use the stuff in stdio.h but instead the I/O system calls because of the buffering done by stdio. It looks like you might be getting the worst effect of the buffering too because your program does these steps:
fill the buffer
sleep
rewind, which I believe will flush the buffer
What you want is for the kernel to service the write while you are sleeping, instead the buffer is flushed after you sleep and you have to wait for the kernel to process it.
I think your best bet is to use open("/sys/class/gpio/gpio139/value", O_WRONLY|O_DIRECT) to minimize delays due to caching.
If you still need to flush buffers to force the write through, you probably want to use clock_gettime to compute the time spent flushing the data and subtract that from the sleep time. Alternatively, add the desired interval to the result of clock_gettime, pass that to clock_nanosleep, and use the TIMER_ABSTIME flag to wait for that absolute time to occur.
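Writing through the system calls instead of stdio might look like the sketch below. The helper names gpio_open_value and gpio_write_bit are made up, and the file is opened with plain O_WRONLY; O_DIRECT is left out here because I have not verified how sysfs handles that flag. The lseek() plays the role that rewind() has in the question's code.

```c
#include <fcntl.h>
#include <unistd.h>

/* Open a sysfs-style value file for unbuffered writing.
 * Returns a file descriptor, or -1 on failure. */
int gpio_open_value(const char *path)
{
    return open(path, O_WRONLY);
}

/* Write '0' or '1' at offset 0 with a single write() system call, so the
 * kernel sees the new value before the caller goes to sleep.
 * Returns 0 on success, -1 on failure. */
int gpio_write_bit(int fd, int value)
{
    const char c = value ? '1' : '0';

    if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
        return -1;
    return write(fd, &c, 1) == 1 ? 0 : -1;
}
```

In the question's loop this would replace the rewind()/fprintf() pair, with the descriptor opened once on /sys/class/gpio/gpio139/value before the loop starts.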
I would guess that the problem is that clock_nanosleep is sleeping for 416 microseconds, and that the other commands in the loop, as well as the overhead of the loop and of clock_nanosleep itself, are taking 354 microseconds. The OS may also be making demands.
What interval do you get if you set the sleep = 0?
Are you running this on a computer or a PLC?
Response to Comment
Seems like you have something somewhere in the hardware/software that is doing something unexpected; it could be a bugger to find.
I have 2 suggestions depending on how critical the period is:
Low criticality - put a figure in your program that causes the loop to take the time you want. However, if this is a transient or time/temperature dependent effect you will need to check for drift periodically.
High criticality - Build a temperature stable oscillator in hardware. These can be bought off the shelf.