How to avoid write() taking too long when writing data to disk - C

static const int MAX_BUFFER_LEN = 1024*12; //in bytes
char *bff = new char[MAX_BUFFER_LEN];
int fileflag = O_CREAT | O_WRONLY | O_NONBLOCK;
int fl = open(filename, fileflag, 0666);
if(fl < 0)
{
printf("can not open file! \n");
return -1;
}
do
{
///begin one loop
struct timeval bef;
struct timeval aft;
gettimeofday(&bef, NULL);
write(fl, bff, MAX_BUFFER_LEN);
gettimeofday(&aft, NULL);
if(aft.tv_usec - bef.tv_usec > 20000) //ignore second condition
{
printf(" cost too long:%ld \n", (long)(aft.tv_usec - bef.tv_usec));
}
//end one loop
//sleep
usleep(30*1000); //sleep 30ms
}while(1);
When I run the program on Linux (Ubuntu, kernel 2.6.32-24-generic), the "cost too long" message is printed 1-2 times a minute. I tried writing to both a USB disk and a hard disk. I also tried the program on an ARM platform, and the same thing happened. I thought that 3.2 Mbps might be too high for a low-speed I/O device, so I reduced the rate to 0.4 Mbps, which significantly reduced how often the message is printed. Is there any way to bound the time each write takes?
Does write() just copy the data to a kernel buffer and return immediately, or does it wait for the disk I/O to complete? Is it possible that the kernel I/O buffer is full and write() must wait for it to be flushed? If so, why do only a few writes take so long?

You can't accelerate the disk, but you can do other stuff while the disk is working. You needn't wait for it to be done.
This is, however, highly non-trivial to do in C. You would need nonblocking I/O, multithreading or multiprocessing. Try googling up these keywords and how to use the different techniques (you are already using a nonblocking fd up there).

Your disk I/O performance is being negatively impacted by the code around each write to measure the time (and measuring time at this granularity is going to have occasional spikes as the computer does other things).
Instead, measure the performance of the code to write the entire data - start/end times outside the loop (with the loop properly bounded, of course).
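A minimal sketch of that suggestion, assuming a bounded number of iterations. The helper names elapsed_us and timed_write_loop are hypothetical, not from the question. The elapsed time uses both tv_sec and tv_usec, because comparing tv_usec alone gives a wrong (even negative) value whenever the second rolls over between the two samples:

```c
#include <stddef.h>
#include <sys/time.h>
#include <unistd.h>

/* Microseconds from *start to *end, using both tv_sec and tv_usec. */
long long elapsed_us(const struct timeval *start, const struct timeval *end)
{
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(end->tv_usec - start->tv_usec);
}

/* Write `count` chunks of `len` bytes to fd and time the whole run.
 * Returns the total elapsed microseconds, or -1 if a write fails. */
long long timed_write_loop(int fd, const char *buf, size_t len, int count)
{
    struct timeval start, end;

    gettimeofday(&start, NULL);
    for (int i = 0; i < count; i++) {
        if (write(fd, buf, len) < 0)
            return -1;
    }
    gettimeofday(&end, NULL);
    return elapsed_us(&start, &end);
}
```

Dividing the result by count gives an average cost per write, which is less sensitive to a single spike than a per-iteration threshold.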

If you are calling a file write that you expect to take a long time, run two threads in your process: one does the main task while the other writes to disk.
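One way the two-thread idea could look, as a sketch only. It assumes POSIX threads and a small fixed-size queue of chunks; the names write_queue, wq_init, wq_push, wq_close and wq_writer, and the queue depth of 8, are all made up for illustration. The main loop hands chunks to wq_push, which never touches the disk, and the writer thread performs the write() calls:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

#define CHUNK_SIZE  (12 * 1024)  /* same chunk size as the question */
#define QUEUE_SLOTS 8            /* assumed depth; tune for the device */

typedef struct {
    char   data[QUEUE_SLOTS][CHUNK_SIZE];
    size_t len[QUEUE_SLOTS];
    size_t head, count;          /* head = oldest queued chunk */
    bool   closed;
    int    fd;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
} write_queue;

void wq_init(write_queue *q, int fd)
{
    q->head = q->count = 0;
    q->closed = false;
    q->fd = fd;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

/* Called by the main loop. Never waits for the disk: returns false
 * if the chunk is too large or the queue is currently full. */
bool wq_push(write_queue *q, const void *buf, size_t len)
{
    bool ok = false;

    if (len > CHUNK_SIZE)
        return false;
    pthread_mutex_lock(&q->lock);
    if (q->count < QUEUE_SLOTS) {
        size_t tail = (q->head + q->count) % QUEUE_SLOTS;
        memcpy(q->data[tail], buf, len);
        q->len[tail] = len;
        q->count++;
        ok = true;
        pthread_cond_signal(&q->not_empty);
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Ask the writer to exit once everything queued has been written. */
void wq_close(write_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->closed = true;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Thread entry point: performs the potentially slow write() calls. */
void *wq_writer(void *arg)
{
    write_queue *q = arg;

    pthread_mutex_lock(&q->lock);
    for (;;) {
        while (q->count == 0 && !q->closed)
            pthread_cond_wait(&q->not_empty, &q->lock);
        if (q->count == 0)
            break;                       /* closed and fully drained */
        size_t slot = q->head;
        const char *p = q->data[slot];
        size_t left = q->len[slot];
        pthread_mutex_unlock(&q->lock);  /* never hold the lock during I/O */
        while (left > 0) {
            ssize_t n = write(q->fd, p, left);
            if (n <= 0)
                break;                   /* sketch: drop the chunk on error */
            p += n;
            left -= (size_t)n;
        }
        pthread_mutex_lock(&q->lock);
        q->head = (q->head + 1) % QUEUE_SLOTS;
        q->count--;
    }
    pthread_mutex_unlock(&q->lock);
    return NULL;
}
```

The writer is started with pthread_create(&tid, NULL, wq_writer, &queue). When wq_push returns false the disk has fallen behind by a full queue, and the caller has to decide whether to drop the chunk or wait. Because write_queue is about 96 KB, it is better declared static or allocated on the heap than placed on a small stack.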

Related

ALSA - Non blocking (interleaved) read

I inherited some ALSA code that runs on a Linux embedded platform.
The existing implementation does blocking reads and writes using snd_pcm_readi() and snd_pcm_writei().
I am tasked to make this run on an ARM processor, but I find that the blocked interleaved reads push the CPU to 99%, so I am exploring non-blocking reads and writes.
I open the device as can be expected:
snd_pcm_t *handle;
const char* hwname = "plughw:0"; // example name
snd_pcm_open(&handle, hwname, SND_PCM_STREAM_CAPTURE, SND_PCM_NONBLOCK);
Other ALSA stuff then happens which I can supply on request.
Worth noting at this point:
we set a sampling rate of 48,000 [Hz]
the sample type is signed 32 bit integer
the device always overrides our requested period size to 1024 frames
Reading the stream like so:
int32* buffer; // buffer set up to hold #period_size samples
int actual = snd_pcm_readi(handle, buffer, period_size);
This call takes approx 15 [ms] to complete in blocking mode. Obviously, variable actual will read 1024 on return.
The problem is; in non-blocking mode, this function also takes 15 msec to complete and actual also always reads 1024 on return.
I would expect that the function would return immediately, with actual being <=1024 and quite possibly reading "EAGAIN" (-11).
In between read attempts I plan to put the thread to sleep for a specific amount of time, yielding CPU time to other processes.
Am I misunderstanding the ALSA API? Or could it be that my code is missing a vital step?
If the function returns a value of 1024, then at least 1024 frames were available at the time of the call.
(It's possible that the 15 ms is time needed by the driver to actually start the device.)
Anyway, blocking or non-blocking mode does not make any difference regarding CPU usage. To reduce CPU usage, replace the default device with plughw or hw, but then you lose features like device sharing or sample rate/format conversion.
I solved my problem by wrapping snd_pcm_readi() as follows:
/*
** Read interleaved stream in non-blocking mode
*/
template <typename SampleType>
snd_pcm_sframes_t snd_pcm_readi_nb(snd_pcm_t* pcm, SampleType* buffer, snd_pcm_uframes_t size, unsigned samplerate)
{
const snd_pcm_sframes_t avail = ::snd_pcm_avail(pcm);
if (avail < 0) {
return avail;
}
if (avail < size) {
snd_pcm_uframes_t remain = size - avail;
unsigned long msec = (remain * 1000) / samplerate;
static const unsigned long SLEEP_THRESHOLD_MS = 1;
if (msec > SLEEP_THRESHOLD_MS) {
msec -= SLEEP_THRESHOLD_MS;
// exercise for the reader: sleep for msec
}
}
return ::snd_pcm_readi(pcm, buffer, size);
}
This works quite well for me. My audio process now 'only' takes 19% CPU time.
And it matters not if the PCM interface was opened using SND_PCM_NONBLOCK or 0.
Going to perform callgrind analysis to see if more CPU cycles can be saved elsewhere in the code.
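The sleep left as an exercise in the wrapper could be filled in along these lines. This is a sketch assuming POSIX nanosleep() is available; wait_ms_for_frames and sleep_ms are hypothetical helper names, and the first one only repeats the wrapper's own arithmetic, including its 1 ms margin:

```c
#include <errno.h>
#include <time.h>

/* Milliseconds to sleep before `size` frames can be expected, given
 * `avail` frames already buffered. Mirrors the wrapper's arithmetic,
 * including its 1 ms safety margin. Returns 0 if no sleep is needed. */
unsigned long wait_ms_for_frames(unsigned long avail, unsigned long size,
                                 unsigned samplerate)
{
    if (avail >= size || samplerate == 0)
        return 0;
    unsigned long msec = ((size - avail) * 1000UL) / samplerate;
    return msec > 1 ? msec - 1 : 0;
}

/* Sleep for msec milliseconds, resuming if a signal interrupts. */
void sleep_ms(unsigned long msec)
{
    struct timespec ts;

    ts.tv_sec  = (time_t)(msec / 1000);
    ts.tv_nsec = (long)(msec % 1000) * 1000000L;
    while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
        ;   /* ts now holds the time still remaining */
}
```

With the numbers from the question (period of 1024 frames at 48,000 Hz and an empty buffer), the computed sleep is 20 ms, after which snd_pcm_readi() should find most of the period already available.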

What is the best way to read input of unpredictable and indeterminate (ie no EOF) size from stdin in C?

This must be a stupid question because this should be a very common and simple problem, but I haven't been able to find an answer anywhere, so I'll bite the bullet and ask.
How on earth should I go about reading from the standard input when there is no way of determining the size of the data? Obviously if the data ends in some kind of terminator like a NUL or EOF then this is quite trivial, but my data does not. This is simple IPC: the two programs need to talk back and forth and ending the file streams with EOF would break everything.
I thought this should be fairly simple. Clearly programs talk to each other over pipes all the time without needing any arcane tricks, so I hope there is a simple answer that I'm too stupid to have thought of. Nothing I've tried has worked.
Something obvious like (ignoring necessary realloc's for brevity):
int size = 0, max = 8192;
unsigned char *buf = malloc(max);
while (fread((buf + size), 1, 1, stdin) == 1)
++size;
won't work since fread() blocks and waits for data, so this loop won't terminate. As far as I know nothing in stdio allows nonblocking input, so I didn't even try any such function. Something like this is the best I could come up with:
struct mydata {
unsigned char *data;
int slen; /* size of data */
int mlen; /* maximum allocated size */
};
...
struct mydata *buf = xmalloc(sizeof *buf);
buf->data = xmalloc((buf->mlen = 8192));
buf->slen = 0;
int nread = read(0, buf->data, 1);
if (nread == (-1))
err(1, "read error");
buf->slen += nread;
fcntl(0, F_SETFL, oflags | O_NONBLOCK);
do {
if (buf->slen >= (buf->mlen - 32))
buf->data = xrealloc(buf->data, (buf->mlen *= 2));
nread = read(0, (buf->data + buf->slen), 1);
if (nread > 0)
buf->slen += nread;
} while (nread == 1);
fcntl(0, F_SETFL, oflags);
where oflags is a global variable containing the original flags for stdin (cached at the start of the program, just in case). This dumb way of doing it works as long as all of the data is present immediately, but fails otherwise. Because this sets read() to be non-blocking, it just returns -1 if there is no data. The program communicating with mine generally sends responses whenever it feels like it, and not all at once, so if the data is at all large this exits too early and fails.
How on earth should I go about reading from the standard input when there is no way of determining the size of the data?
There always has to be a way to determine the size. Otherwise, the program would require infinite memory and would thus be impossible to run on a physical computer.
Think about it this way: even in the case of a never-ending stream of data, there must be some chunks or points where you have to process it. For instance, a live-streamed video has to decode a portion of it (e.g. a frame). Likewise, a video game processes messages one by one, even though the total length of the game is undetermined.
This holds true regardless of the type of I/O you decide to use (blocking/non-blocking, synchronous/asynchronous...). For instance, if you want to use typical blocking synchronous I/O, what you have to do is process the data in a loop: each iteration, you read as much data as is available, and process as much as you can. Whatever you can not process (because you have not received enough yet), you keep for the next iteration. Then, the rest of the loop is the rest of the logic of the program.
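The "process what you can, keep the rest" step of that loop might look like this. It is a sketch that assumes messages end with a newline, which the question does not state; drain_messages and message_fn are hypothetical names:

```c
#include <stddef.h>
#include <string.h>

/* Called once per complete message (without the trailing '\n'). */
typedef void (*message_fn)(const char *msg, size_t len, void *ctx);

/* Hand every complete '\n'-terminated message in buf[0..*used) to
 * `handle` (which may be NULL to just count), then move the unfinished
 * tail to the front of buf so the next read() can append to it.
 * Returns the number of complete messages found. */
size_t drain_messages(char *buf, size_t *used, message_fn handle, void *ctx)
{
    size_t start = 0, count = 0;
    char *nl;

    while ((nl = memchr(buf + start, '\n', *used - start)) != NULL) {
        size_t len = (size_t)(nl - (buf + start));
        if (handle != NULL)
            handle(buf + start, len, ctx);
        start += len + 1;   /* skip the message and its delimiter */
        count++;
    }
    memmove(buf, buf + start, *used - start);
    *used -= start;
    return count;
}
```

Each iteration of the main loop would call read(0, buf + used, capacity - used), add the returned byte count to used, and then call drain_messages. The read blocks only while no data at all is available, and a partial message simply stays in the buffer until its delimiter arrives.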
In the end, regardless of what you do, you (or someone else, e.g. a library, the operating system, the hardware buffers...) have to buffer incoming data until it can be processed.
Basically, you have two choices -- synchronous or asynchronous -- and both have their advantages and disadvantages.
For synchronous, you need either delimiters or a length field embedded in the record (or fixed length records, but that is pretty inflexible). This works best for synchronous protocols like synchronous rpc or simplex client-server interactions where only one side talks at a time while the other side waits. For ASCII/text based protocols, it is common to use a control-character delimiter like NL/EOL or NUL or ETX to mark the end of messages. Binary protocols more commonly use an embedded length field -- the receiver first reads the length and then reads the full amount of (expected) data.
For asynchronous, you use non-blocking mode. It IS possible to use non-blocking mode with stdio streams, it just requires some care. Out-of-data conditions show up to stdio like error conditions, so you need to use ferror and clearerr on the FILE * as appropriate.
It's possible for both to be used -- for example in client-server interactions, the clients may use synchronous (they send a request and wait for a reply) while the server uses asynchronous (to be robust in the presence of misbehaving clients).
When reading a pipe or socket, the read API on Linux or the ReadFile API on Windows returns as soon as some data is available; it does not wait for the specified number of bytes to fill the buffer. It then returns the number of bytes actually read.
This means, when reading from a pipe, you set a buffer size, read as much as is returned and then process it. You then read the next bit. The only time you are blocked is if there is no data available at all.
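For the embedded length field variant, a receiver could be sketched as follows. The wire format here (a 4-byte big-endian length followed by the payload) is an assumption chosen for illustration, and read_full and read_message are hypothetical helper names:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>   /* ntohl */

/* Read exactly len bytes, looping over short reads.
 * Returns 0 on success, -1 on error or premature EOF. */
int read_full(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;           /* interrupted by a signal: retry */
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read one message framed as a 4-byte big-endian length followed by
 * that many payload bytes (an assumed format). On success returns a
 * malloc'd payload and stores its size in *len_out; otherwise NULL. */
unsigned char *read_message(int fd, uint32_t max_len, uint32_t *len_out)
{
    uint32_t be_len;

    if (read_full(fd, &be_len, sizeof be_len) != 0)
        return NULL;
    uint32_t len = ntohl(be_len);
    if (len > max_len)
        return NULL;            /* refuse implausibly large messages */
    unsigned char *payload = malloc(len ? len : 1);
    if (payload == NULL)
        return NULL;
    if (read_full(fd, payload, len) != 0) {
        free(payload);
        return NULL;
    }
    *len_out = len;
    return payload;
}
```

Because the receiver always knows how many bytes are still owed, blocking reads are safe here: the call returns as soon as the message is complete and never waits for an EOF.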
This differs from fread which only returns once the desired number of bytes are returned or the stream determines doing so is impossible (like eof).

rtl_sdr: reliably detect frequency changes, discard samples obtained prior

I am writing a seek routine for analog FM radio using rtl_sdr with a generic DVB-T stick (tuner is a FC0013). Code is mostly taken from rtl_power.c and rtl_fm.c.
My approach is:
Tune to the new frequency
Gather a few samples
Measure RSSI and store it
Do the same for the next frequency
Upon detecting a local peak which is above a certain threshold, tune to the frequency at which it was detected.
The issue is that I can’t reliably map samples to the frequency at which they were gathered. Here’s the relevant (pseudo) code snippet:
/* freq is the new target frequency */
rtlsdr_cancel_async(dongle.dev);
optimal_settings(freq, demod.rate_in);
fprintf(stderr, "\nSeek: currently at %d Hz (optimized to %d).\n", freq, dongle.freq);
rtlsdr_set_center_freq(dongle.dev, dongle.freq);
/* get two bursts of samples to measure RSSI */
if (rtlsdr_read_sync(dongle.dev, samples, samplesSize, &samplesRead) < 0)
fprintf(stderr, "\nSeek: rtlsdr_read_sync failed\n");
/* rssi = getRssiFromSamples(samples, samplesRead) */
fprintf(stderr, "\nSeek: rssi=%.2f", rssi);
if (rtlsdr_read_sync(dongle.dev, samples, samplesSize, &samplesRead) < 0)
fprintf(stderr, "\nSeek: rtlsdr_read_sync failed\n");
/* rssi = getRssiFromSamples(samples, samplesRead) */
fprintf(stderr, "\nSeek: rssi=%.2f\n", rssi);
When I scan the FM band with that snippet of code, I see that the two RSSI measurements typically differ significantly. In particular, the first measurement is usually in the neighborhood of the second measurement taken from the previous frequency, indicating that some of the samples were taken while still tuned into the old frequency.
I’ve also tried inserting a call to rtlsdr_reset_buffer() before gathering the samples, in an effort to flush any samples still stuck in the pipe, with no noticeable effect. Even a combination of
usleep(500000);
rtlsdr_cancel_async(dongle.dev);
rtlsdr_reset_buffer(dongle.dev);
does not change the picture, other than the usleep() slowing down the seek operation considerably. (Buffer size is 16384 samples, at a sample rate of 2 million, thus the usleep() delay is well above the time it takes to get one burst of samples.)
How can I ensure the samples I take were obtained after tuning into the new frequency?
Are there any buffers for samples which I would need to flush after tuning into a different frequency?
Can I rely on tuning being completed by the time rtlsdr_set_center_freq() returns, or does the tuner need some time to stabilize after that? In the latter case, how can I reliably tell when the frequency change is complete?
Anything else I might have missed?
Going through the code of rtl_power.c again, I found this function:
void retune(rtlsdr_dev_t *d, int freq)
{
uint8_t dump[BUFFER_DUMP];
int n_read;
rtlsdr_set_center_freq(d, (uint32_t)freq);
/* wait for settling and flush buffer */
usleep(5000);
rtlsdr_read_sync(d, &dump, BUFFER_DUMP, &n_read);
if (n_read != BUFFER_DUMP) {
fprintf(stderr, "Error: bad retune.\n");}
}
Essentially, the tuner needs to settle, with no apparent indicator of when this process is complete.
rtl_power.c solves this by waiting for 5 milliseconds, then discarding a few samples (BUFFER_DUMP is defined as 4096, at sample rates between 1–2.8M).
I found 4096 samples to be insufficient, so I went for the maximum of 16384. Results look a lot more stable this way, though even this does not always seem sufficient for the tuner to stabilize.
For a band scan, an alternative approach would be to have a loop acquiring samples and determining their RSSI until RSSI values begin to stabilize, i.e. changes are no longer monotonic or fall below a certain threshold.
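That stabilization test could be expressed roughly like this. It is a sketch operating on RSSI values that have already been computed, one per burst; first_settled_index is a hypothetical name and the threshold is something to tune by experiment:

```c
#include <stddef.h>

/* Given consecutive RSSI readings taken after a retune, return the
 * index of the first reading considered settled: either the change
 * from the previous reading is smaller than `threshold`, or the
 * direction of change has reversed (no longer monotonic).
 * Returns n if the readings never settle. */
size_t first_settled_index(const double *rssi, size_t n, double threshold)
{
    for (size_t i = 1; i < n; i++) {
        double step = rssi[i] - rssi[i - 1];
        double magnitude = step < 0 ? -step : step;

        if (magnitude < threshold)
            return i;
        if (i >= 2) {
            double prev_step = rssi[i - 1] - rssi[i - 2];
            if ((step > 0) != (prev_step > 0))
                return i;   /* direction reversed */
        }
    }
    return n;
}
```

In a seek loop the same check would be applied incrementally, reading one burst at a time with rtlsdr_read_sync() and stopping at the first settled reading or after a maximum number of bursts.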

What is the most reliable way to measure the number of cycles of my program in C?

I am familiar with two approaches, but both of them have their limitations.
The first one is to use the instruction RDTSC. However, the problem is that it doesn't count the number of cycles of my program in isolation and is therefore sensitive to noise due to concurrent processes.
The second option is to use the clock library function. I thought that this approach is reliable, since I expected it to count the number of cycles for my program only (what I intend to achieve). However, it turns out that in my case it measures the elapsed time and then multiplies it by CLOCKS_PER_SEC. This is not only unreliable, but also wrong, since CLOCKS_PER_SEC is set to 1,000,000 which does not correspond to the actual frequency of my processor.
Given the limitation of the proposed approaches, is there a better and more reliable alternative to produce consistent results?
A lot here depends on how large an amount of time you're trying to measure.
RDTSC can be (almost) 100% reliable when used correctly. It is, however, of use primarily for measuring truly microscopic pieces of code. If you want to measure two sequences of, say, a few dozen or so instructions apiece, there's probably nothing else that can do the job nearly as well.
Using it correctly is somewhat challenging though. Generally speaking, to get good measurements you want to do at least the following:
Set the code to only run on one specific core.
Set the code to execute at maximum priority so nothing preempts it.
Use CPUID liberally to ensure serialization where needed.
If, on the other hand, you're trying to measure something that takes anywhere from, say, 100 ms on up, RDTSC is pointless. It's like trying to measure the distance between cities with a micrometer. For this, it's generally best to assure that the code in question takes (at least) the better part of a second or so. clock isn't particularly precise, but for a length of time on this general order, the fact that it might only be accurate to, say, 10 ms or so, is more or less irrelevant.
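For the longer-running case, a measurement with clock() might be sketched as below. Note that clock() reports processor time in units of 1/CLOCKS_PER_SEC seconds, not CPU cycles. The names cpu_seconds and busy_sum are hypothetical, and busy_sum is only a stand-in workload:

```c
#include <time.h>

/* Stand-in workload: sums the integers below *(unsigned long *)arg. */
void busy_sum(void *arg)
{
    unsigned long n = *(unsigned long *)arg;
    volatile unsigned long acc = 0;   /* volatile keeps the loop alive */

    for (unsigned long i = 0; i < n; i++)
        acc += i;
}

/* Processor time, in seconds, used by calling work(arg) `repeats`
 * times, as reported by clock(). Returns -1.0 if clock() fails. */
double cpu_seconds(void (*work)(void *), void *arg, long repeats)
{
    clock_t start = clock();
    if (start == (clock_t)-1)
        return -1.0;
    for (long i = 0; i < repeats; i++)
        work(arg);
    clock_t end = clock();
    if (end == (clock_t)-1)
        return -1.0;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Choosing repeats so that the total runs for the better part of a second, then dividing by repeats, keeps the coarse resolution of clock() from dominating the result.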
Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES
This system call has explicit controls for:
process PID selection
whether to consider kernel/hypervisor instructions or not
and it will therefore count the cycles properly even when multiple processes are running concurrently.
See this answer for more details: How to get the CPU cycle count in x86_64 from C++?
perf_event_open.c
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
int cpu, int group_fd, unsigned long flags)
{
int ret;
ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
group_fd, flags);
return ret;
}
int
main(int argc, char **argv)
{
struct perf_event_attr pe;
long long count;
int fd;
uint64_t n;
if (argc > 1) {
n = strtoll(argv[1], NULL, 0);
} else {
n = 10000;
}
memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(struct perf_event_attr);
pe.config = PERF_COUNT_HW_CPU_CYCLES;
pe.disabled = 1;
pe.exclude_kernel = 1;
// Don't count hypervisor events.
pe.exclude_hv = 1;
fd = perf_event_open(&pe, 0, -1, -1, 0);
if (fd == -1) {
fprintf(stderr, "Error opening leader %llx\n", pe.config);
exit(EXIT_FAILURE);
}
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
/* Loop n times, should be good enough for -O0. */
__asm__ (
"1:;\n"
"sub $1, %[n];\n"
"jne 1b;\n"
: [n] "+r" (n)
:
:
);
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
read(fd, &count, sizeof(long long));
printf("%lld\n", count);
close(fd);
}
RDTSC is the most accurate way of counting program execution cycles. If you are looking to measure execution performance over time scales where it matters if your thread has been preempted, then you would probably be better served with a profiler (VTune, for instance).
CLOCKS_PER_SEC/clock() is a poor (high-overhead) way of getting the time compared to RDTSC, which has almost no overhead.
If you have a specific issue with RDTSC, I may be able to assist.
re: Comments
Intel Performance Counter Monitor: This is mainly for measuring metrics outside of the processor, such as Memory bandwidth, power usage, PCIe utilization. It does also happen to measure CPU frequency, but it typically is not useful for processor bound application performance.
RDTSC portability: RDTSC is an Intel CPU instruction supported by all modern Intel CPUs. On modern CPUs it is based on the uncore frequency of your CPU and somewhat similar across CPU cores, although it is not appropriate if your application is frequently being preempted to different cores (and especially to different sockets). If that is the case you really want to look at a profiler.
Out of order Execution: Yes, things get executed out of order, so this can affect performance slightly, but it still takes time to execute instructions and RDTSC is the best way of measuring that time. It excels in the normal use case of executing Non-IO bound instructions on the same core, and this is really how it is meant to be used. If you have a more complicated use case you really should be using a different tool, but that doesn't negate that rdtsc() can be very useful in analyzing program execution.

precise timing in C

I have a little code below. I use this code to output some 1s and 0s (unsigned output[38]) from a GPIO of an embedded board.
My question: the time between two output values (1, 0 or 0, 1) should be 416 microseconds, as set in the clock_nanosleep() call in the code below. I also set the scheduling priority (sched_priority) for better time resolution. However, an oscilloscope measurement (pic below) shows that the time between two output values is 770 usec. Why is there that much inaccuracy between the signals?
PS. the board(beagleboard) has Linux 3.2.0-23-omap #36-Ubuntu Tue Apr 10 20:24:21 UTC 2012 armv7l armv7l armv7l GNU/Linux kernel, and it has 750 MHz CPU, top shows almost no CPU(~1%) and memory(~0.5%) is consumed before I run my code. I use an electronic oscilloscope which has no calibration problem.
#include <stdio.h>
#include <stdlib.h> //exit();
#include <sched.h>
#include <time.h>
void msg_send();
struct sched_param sp;
int main(void){
sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
sched_setscheduler(0, SCHED_FIFO, &sp);
msg_send();
return 0;
}
void msg_send(){
unsigned output[38] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,1};
FILE *fp8;
if ((fp8 = fopen("/sys/class/gpio/export", "w")) == NULL){ //echo 139 > export
fprintf(stderr,"Cannot open export file: line844.\n"); exit(1); /* fp8 is NULL here, so do not fclose() it */
}
fprintf(fp8, "%d", 139); //pin 3
fclose(fp8);
if ((fp8 = fopen("/sys/class/gpio/gpio139/direction", "rb+")) == NULL){
fprintf(stderr,"Cannot open direction file - GPIO139 - line851.\n"); exit(1);
}
fprintf(fp8, "out");
fclose(fp8);
if((fp8 = fopen("/sys/class/gpio/gpio139/value", "w")) == NULL) {
fprintf(stderr,"error in opening value\n"); exit(1);
}
struct timespec req = { .tv_sec=0, .tv_nsec = 416000 }; //416 usec
/* here is the part that my question focus*/
while(1){
for(int i=0;i<38;i++){
rewind(fp8);
fprintf(fp8, "%d", output[i]);
clock_nanosleep(CLOCK_MONOTONIC ,0, &req, NULL);
}
}
}
EDIT: I have been reading for days that clock_nanosleep() and the other sleep functions (nanosleep, usleep, etc.) do not guarantee waking up on time. They put the process to sleep for the requested time, but when the process actually wakes up depends on the CPU. What I found is that sleeping until an absolute time (the TIMER_ABSTIME flag) gives better accuracy. I arrived at the same solution Maxime suggests. However, I get a glitch in my signal when the for loop finishes. In my opinion, sleep functions are not a good way to generate PWM or data output on an embedded platform. It is worth spending some time learning the CPU timers the platform provides in order to generate PWM or data output with good accuracy.
I can't figure out how a call to clock_getres() can solve your problem. According to the man page, it only reads the resolution of the clock.
As Geoff said, sleeping until an absolute time should be a better solution. This avoids the unexpected timing delay introduced by other code.
struct timespec Time;
clock_gettime(CLOCK_REALTIME, &(Time));
while(1){
Time.tv_nsec += 416000;
if(Time.tv_nsec > 999999999){
(Time.tv_sec)++;
Time.tv_nsec -= 1000000000;
}
clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &(Time), NULL);
//Do something
}
I am using this in a few programs that generate regular messages on an Ethernet network, and it works fine.
If you are doing time sensitive I/O, you probably shouldn't use the stuff in stdio.h but instead the I/O system calls because of the buffering done by stdio. It looks like you might be getting the worst effect of the buffering too because your program does these steps:
fill the buffer
sleep
rewind, which I believe will flush the buffer
What you want is for the kernel to service the write while you are sleeping, instead the buffer is flushed after you sleep and you have to wait for the kernel to process it.
I think your best bet is to use open("/sys/class/gpio/gpio139/value", O_WRONLY|O_DIRECT) to minimize delays due to caching.
If you still need to flush buffers to force the write through, you probably want to use clock_gettime to compute the time spent flushing the data and subtract that from the sleep time. Alternatively, add the desired interval to the result of clock_gettime and pass that to clock_nanosleep with the TIMER_ABSTIME flag to wait for that absolute time to occur.
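The first option, subtracting the time spent on the write from the sleep, comes down to a little timespec arithmetic. A sketch, with hypothetical helper names timespec_diff_ns and remaining_ns:

```c
#include <time.h>

/* Nanoseconds from *start to *end. */
long long timespec_diff_ns(const struct timespec *start,
                           const struct timespec *end)
{
    return (long long)(end->tv_sec - start->tv_sec) * 1000000000LL
         + (long long)(end->tv_nsec - start->tv_nsec);
}

/* Time still to sleep in a period of period_ns when spent_ns of it
 * has already been used by the write; 0 if the period was overrun. */
long long remaining_ns(long long period_ns, long long spent_ns)
{
    return spent_ns < period_ns ? period_ns - spent_ns : 0;
}
```

In the loop, clock_gettime(CLOCK_MONOTONIC, ...) would be called before and after the write and flush, and the relative sleep would use remaining_ns(416000, timespec_diff_ns(&before, &after)) in place of the fixed 416000 ns. The sleep call's own overhead is not compensated this way, which is one reason the absolute-time variant tends to drift less.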
I would guess that the problem is that the clock_nanosleep is sleeping for 416 microsec and that the other commands in the loop as well as the loop and clock_nanosleep architecture itself are taking 354 microsec. The OS may also be making demands.
What interval do you get if you set the sleep = 0?
Are you running this on a computer or a PLC?
Response to Comment
Seems like you have something somewhere in the hardware/software that is doing something unexpected - it could be a bugger to find.
I have 2 suggestions depending on how critical the period is:
Low criticality - put a figure in your program that causes the loop to take the time you want. However, if this is a transient or time/temperature dependent effect you will need to check for drift periodically.
High criticality - Build a temperature stable oscillator in hardware. These can be bought off the shelf.

Resources