Precise timing in C

I have a small piece of code, shown below, that I use to output a sequence of 1s and 0s (unsigned output[38]) on a GPIO pin of an embedded board.
My question: the time between two consecutive output values (1, 0 or 0, 1) should be 416 microseconds, as I set in the clock_nanosleep call in the code below; I also set SCHED_FIFO with the maximum priority for better timing resolution. However, an oscilloscope measurement (pic below) shows that the time between two output values is about 770 usec. Why is there that much inaccuracy between the signals?
PS. The board (BeagleBoard) runs the Linux 3.2.0-23-omap #36-Ubuntu Tue Apr 10 20:24:21 UTC 2012 armv7l GNU/Linux kernel and has a 750 MHz CPU; top shows almost no CPU (~1%) or memory (~0.5%) consumed before I run my code. I use an electronic oscilloscope which has no calibration problem.
#include <stdio.h>
#include <stdlib.h>   // exit()
#include <sched.h>
#include <time.h>

void msg_send(void);
struct sched_param sp;

int main(void){
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    sched_setscheduler(0, SCHED_FIFO, &sp);
    msg_send();
    return 0;
}

void msg_send(void){
    unsigned output[38] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,1};
    int i;
    FILE *fp8;

    if ((fp8 = fopen("/sys/class/gpio/export", "w")) == NULL){   // echo 139 > export
        fprintf(stderr, "Cannot open export file.\n");
        exit(1);
    }
    fprintf(fp8, "%d", 139);   // pin 3
    fclose(fp8);

    if ((fp8 = fopen("/sys/class/gpio/gpio139/direction", "rb+")) == NULL){
        fprintf(stderr, "Cannot open direction file - GPIO139.\n");
        exit(1);
    }
    fprintf(fp8, "out");
    fclose(fp8);

    if ((fp8 = fopen("/sys/class/gpio/gpio139/value", "w")) == NULL){
        fprintf(stderr, "Error opening value file.\n");
        exit(1);
    }

    struct timespec req = { .tv_sec = 0, .tv_nsec = 416000 };   // 416 usec

    /* here is the part that my question focuses on */
    while(1){
        for(i = 0; i < 38; i++){
            rewind(fp8);
            fprintf(fp8, "%d", output[i]);
            clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL);
        }
    }
}
EDIT: I have been reading for days that clock_nanosleep() and the other sleep functions (nanosleep, usleep, etc.) do not guarantee waking up on time. They put the process to sleep for at least the requested time, but the actual wake-up depends on the CPU and the scheduler. What I found is that sleeping until an absolute time (the TIMER_ABSTIME flag) gives better resolution; I arrived at the same solution Maxime suggests. However, I still get a glitch on my signal when the for loop wraps around. In my opinion, sleep functions are not a good way to generate a PWM or data output on an embedded platform. It is better to spend some time learning the hardware timers the platform provides and use them to generate the PWM or data output with good accuracy.

I can't figure out how a call to clock_getres() would solve your problem. According to the man page, it only reads the resolution of the clock.
As Geoff said, sleeping until an absolute clock time should be a better solution. It avoids unexpected timing delays caused by the other code.
struct timespec Time;
clock_gettime(CLOCK_REALTIME, &Time);
while(1){
    Time.tv_nsec += 416000;
    if(Time.tv_nsec > 999999999){
        Time.tv_sec++;
        Time.tv_nsec -= 1000000000;
    }
    clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &Time, NULL);
    //Do something
}
I am using this in a few programs of mine to generate regular messages on an Ethernet network, and it's working fine.

If you are doing time-sensitive I/O, you probably shouldn't use the facilities in stdio.h but rather the I/O system calls, because of the buffering done by stdio. It also looks like you might be getting the worst effect of the buffering, because your program does these steps:
fill the buffer
sleep
rewind, which I believe will flush the buffer
What you want is for the kernel to service the write while you are sleeping; instead, the buffer is flushed after you sleep and you then have to wait for the kernel to process it.
I think your best bet is to use open("/sys/class/gpio/gpio139/value", O_WRONLY|O_DIRECT) to minimize delays due to caching.
If you still need to flush buffers to force the write through, you probably want to use clock_gettime() to compute the time spent flushing the data and subtract that from the sleep time. Alternatively, add the desired interval to the result of clock_gettime(), pass that to clock_nanosleep(), and use the TIMER_ABSTIME flag to wait for that absolute time to occur.
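Putting those two ideas together, a minimal sketch might look like this (assuming GPIO 139 has already been exported and set to "out" as in the question; error handling is trimmed):
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    unsigned output[38] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,1};
    int fd = open("/sys/class/gpio/gpio139/value", O_WRONLY);   // unbuffered, no stdio
    if (fd < 0)
        return 1;

    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (int i = 0; i < 38; i++) {
        (void)write(fd, output[i] ? "1" : "0", 1);   // kernel sees the value immediately

        next.tv_nsec += 416000;                      // absolute time of the next edge
        if (next.tv_nsec >= 1000000000L) {
            next.tv_sec++;
            next.tv_nsec -= 1000000000L;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
    close(fd);
    return 0;
}
Because each wake-up target is absolute, the time spent in write() is absorbed into the sleep instead of accumulating across iterations.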

I would guess that the problem is that clock_nanosleep is indeed sleeping for 416 microseconds, and that the other statements in the loop, plus the loop and clock_nanosleep overhead themselves, are taking 354 microseconds. The OS may also be making demands.
What interval do you get if you set the sleep to 0?
Are you running this on a computer or a PLC?
Response to comment
It seems like something somewhere in the hardware/software is doing something unexpected - it could be a bugger to find.
I have two suggestions, depending on how critical the period is:
Low criticality - put a fudge factor into your program that makes the loop take the time you want. However, if this is a transient or time/temperature-dependent effect, you will need to check for drift periodically.
High criticality - build a temperature-stable oscillator in hardware. These can be bought off the shelf.

Related

ALSA - Non blocking (interleaved) read

I inherited some ALSA code that runs on a Linux embedded platform.
The existing implementation does blocking reads and writes using snd_pcm_readi() and snd_pcm_writei().
I am tasked with making this run on an ARM processor, but I find that the blocking interleaved reads push the CPU to 99%, so I am exploring non-blocking reads and writes.
I open the device as can be expected:
snd_pcm_t *handle;
const char* hwname = "plughw:0"; // example name
snd_pcm_open(&handle, hwname, SND_PCM_STREAM_CAPTURE, SND_PCM_NONBLOCK);
Other ALSA stuff then happens which I can supply on request.
It is worth mentioning at this point that:
we set a sampling rate of 48,000 [Hz]
the sample type is signed 32 bit integer
the device always overrides our requested period size to 1024 frames
Reading the stream like so:
int32_t* buffer; // buffer set up to hold period_size samples
int actual = snd_pcm_readi(handle, buffer, period_size);
This call takes approximately 15 ms to complete in blocking mode. Obviously, the variable actual reads 1024 on return.
The problem is that in non-blocking mode, this function also takes 15 ms to complete, and actual still always reads 1024 on return.
I would expect the function to return immediately, with actual being <= 1024 and quite possibly -EAGAIN (-11).
In between read attempts I plan to put the thread to sleep for a specific amount of time, yielding CPU time to other processes.
Am I misunderstanding the ALSA API? Or could it be that my code is missing a vital step?
If the function returns a value of 1024, then at least 1024 frames were available at the time of the call.
(It's possible that the 15 ms is the time needed by the driver to actually start the device.)
Anyway, blocking or non-blocking mode does not make any difference regarding CPU usage. To reduce CPU usage, replace the default device with plughw or hw, but then you lose features like device sharing or sample rate/format conversion.
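For example (a sketch; the card/device numbers "0,0" are an assumption and depend on your hardware):
#include <stdio.h>
#include <alsa/asoundlib.h>

/* Sketch: open a hardware device directly instead of "default".
   "plughw:0,0" keeps rate/format conversion in software; "hw:0,0" drops
   it entirely. The card/device numbers here are assumptions. */
static snd_pcm_t *open_capture_direct(void)
{
    snd_pcm_t *handle = NULL;
    int err = snd_pcm_open(&handle, "plughw:0,0", SND_PCM_STREAM_CAPTURE, 0);
    if (err < 0) {
        fprintf(stderr, "snd_pcm_open: %s\n", snd_strerror(err));
        return NULL;
    }
    return handle;
}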
I solved my problem by wrapping snd_pcm_readi() as follows:
/*
** Read interleaved stream in non-blocking mode
*/
template <typename SampleType>
snd_pcm_sframes_t snd_pcm_readi_nb(snd_pcm_t* pcm, SampleType* buffer, snd_pcm_uframes_t size, unsigned samplerate)
{
    const snd_pcm_sframes_t avail = ::snd_pcm_avail(pcm);
    if (avail < 0) {
        return avail;
    }
    if (avail < size) {
        snd_pcm_uframes_t remain = size - avail;
        unsigned long msec = (remain * 1000) / samplerate;
        static const unsigned long SLEEP_THRESHOLD_MS = 1;
        if (msec > SLEEP_THRESHOLD_MS) {
            msec -= SLEEP_THRESHOLD_MS;
            // exercise for the reader: sleep for msec
        }
    }
    return ::snd_pcm_readi(pcm, buffer, size);
}
This works quite well for me. My audio process now 'only' takes 19% CPU time.
And it does not matter whether the PCM interface was opened with SND_PCM_NONBLOCK or 0.
Going to perform callgrind analysis to see if more CPU cycles can be saved elsewhere in the code.
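For reference, one possible way to fill in the sleep left as an exercise above (a sketch using plain POSIX nanosleep; any sleep that accepts milliseconds would do):
#include <time.h>

/* Sketch: sleep for roughly msec milliseconds; the 1 ms held back by the
   wrapper above acts as a safety margin. EINTR handling is omitted. */
static void sleep_ms(unsigned long msec)
{
    struct timespec req;
    req.tv_sec  = msec / 1000;
    req.tv_nsec = (long)(msec % 1000) * 1000000L;
    nanosleep(&req, NULL);
}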

rtl_sdr: reliably detect frequency changes, discard samples obtained prior

I am writing a seek routine for analog FM radio using rtl_sdr with a generic DVB-T stick (tuner is a FC0013). Code is mostly taken from rtl_power.c and rtl_fm.c.
My approach is:
Tune to the new frequency
Gather a few samples
Measure RSSI and store it
Do the same for the next frequency
Upon detecting a local peak which is above a certain threshold, tune to the frequency at which it was detected.
The issue is that I can’t reliably map samples to the frequency at which they were gathered. Here’s the relevant (pseudo) code snippet:
/* freq is the new target frequency */
rtlsdr_cancel_async(dongle.dev);
optimal_settings(freq, demod.rate_in);
fprintf(stderr, "\nSeek: currently at %d Hz (optimized to %d).\n", freq, dongle.freq);
rtlsdr_set_center_freq(dongle.dev, dongle.freq);

/* get two bursts of samples to measure RSSI */
if (rtlsdr_read_sync(dongle.dev, samples, samplesSize, &samplesRead) < 0)
    fprintf(stderr, "\nSeek: rtlsdr_read_sync failed\n");
/* rssi = getRssiFromSamples(samples, samplesRead) */
fprintf(stderr, "\nSeek: rssi=%.2f", rssi);

if (rtlsdr_read_sync(dongle.dev, samples, samplesSize, &samplesRead) < 0)
    fprintf(stderr, "\nSeek: rtlsdr_read_sync failed\n");
/* rssi = getRssiFromSamples(samples, samplesRead) */
fprintf(stderr, "\nSeek: rssi=%.2f\n", rssi);
When I scan the FM band with that snippet of code, I see that the two RSSI measurements typically differ significantly. In particular, the first measurement is usually in the neighborhood of the second measurement taken from the previous frequency, indicating that some of the samples were taken while still tuned into the old frequency.
I’ve also tried inserting a call to rtlsdr_reset_buffer() before gathering the samples, in an effort to flush any samples still stuck in the pipe, with no noticeable effect. Even a combination of
usleep(500000);
rtlsdr_cancel_async(dongle.dev);
rtlsdr_reset_buffer(dongle.dev)
does not change the picture, other than the usleep() slowing down the seek operation considerably. (Buffer size is 16384 samples, at a sample rate of 2 million, thus the usleep() delay is well above the time it takes to get one burst of samples.)
How can I ensure the samples I take were obtained after tuning into the new frequency?
Are there any buffers for samples which I would need to flush after tuning into a different frequency?
Can I rely on tuning being completed by the time rtlsdr_set_center_freq() returns, or does the tuner need some time to stabilize after that? In the latter case, how can I reliably tell when the frequency change is complete?
Anything else I might have missed?
Going through the code of rtl_power.c again, I found this function:
void retune(rtlsdr_dev_t *d, int freq)
{
    uint8_t dump[BUFFER_DUMP];
    int n_read;
    rtlsdr_set_center_freq(d, (uint32_t)freq);
    /* wait for settling and flush buffer */
    usleep(5000);
    rtlsdr_read_sync(d, &dump, BUFFER_DUMP, &n_read);
    if (n_read != BUFFER_DUMP) {
        fprintf(stderr, "Error: bad retune.\n");
    }
}
Essentially, the tuner needs to settle, with no apparent indicator of when this process is complete.
rtl_power.c solves this by waiting for 5 milliseconds, then discarding a few samples (BUFFER_DUMP is defined as 4096, at sample rates between 1–2.8M).
I found 4096 samples to be insufficient, so I went for the maximum of 16384. Results look a lot more stable this way, though even this does not always seem sufficient for the tuner to stabilize.
For a band scan, an alternative approach would be to have a loop acquiring samples and determining their RSSI until RSSI values begin to stabilize, i.e. changes are no longer monotonic or below a certain threshold.
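A sketch of that stabilization loop, reusing the variables and the getRssiFromSamples() placeholder from the question (the threshold and burst limit are arbitrary choices; fabs() needs <math.h>):
/* Sketch: keep reading bursts until RSSI stops changing by more than a
   small threshold. RSSI_SETTLE_DB and MAX_BURSTS are illustrative values. */
#define RSSI_SETTLE_DB 1.0
#define MAX_BURSTS     8

double rssi = -1000.0, rssi_prev;
for (int burst = 0; burst < MAX_BURSTS; burst++) {
    if (rtlsdr_read_sync(dongle.dev, samples, samplesSize, &samplesRead) < 0)
        break;
    rssi_prev = rssi;
    rssi = getRssiFromSamples(samples, samplesRead);
    if (burst > 0 && fabs(rssi - rssi_prev) < RSSI_SETTLE_DB)
        break;   /* tuner looks settled; rssi is now usable for the scan */
}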

What is the most reliable way to measure the number of cycles of my program in C?

I am familiar with two approaches, but both of them have their limitations.
The first one is to use the instruction RDTSC. However, the problem is that it doesn't count the number of cycles of my program in isolation and is therefore sensitive to noise due to concurrent processes.
The second option is to use the clock library function. I thought that this approach is reliable, since I expected it to count the number of cycles for my program only (what I intend to achieve). However, it turns out that in my case it measures the elapsed time and then multiplies it by CLOCKS_PER_SEC. This is not only unreliable, but also wrong, since CLOCKS_PER_SEC is set to 1,000,000 which does not correspond to the actual frequency of my processor.
Given the limitation of the proposed approaches, is there a better and more reliable alternative to produce consistent results?
A lot here depends on how large an amount of time you're trying to measure.
RDTSC can be (almost) 100% reliable when used correctly. It is, however, of use primarily for measuring truly microscopic pieces of code. If you want to measure two sequences of, say, a few dozen or so instructions apiece, there's probably nothing else that can do the job nearly as well.
Using it correctly is somewhat challenging though. Generally speaking, to get good measurements you want to do at least the following:
Set the code to only run on one specific core.
Set the code to execute at maximum priority so nothing preempts it.
Use CPUID liberally to ensure serialization where needed.
If, on the other hand, you're trying to measure something that takes anywhere from, say, 100 ms on up, RDTSC is pointless. It's like trying to measure the distance between cities with a micrometer. For this, it's generally best to assure that the code in question takes (at least) the better part of a second or so. clock isn't particularly precise, but for a length of time on this general order, the fact that it might only be accurate to, say, 10 ms or so, is more or less irrelevant.
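For the microscopic case, a minimal sketch of serialized timestamp reads (x86, GCC/Clang intrinsics; this uses LFENCE via _mm_lfence() instead of CPUID for serialization, and the core pinning / priority from the list above are left out for brevity):
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc(), _mm_lfence() on GCC/Clang */

/* The fences keep the measured code from drifting across the timestamp reads. */
static inline uint64_t tsc_now(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

int main(void)
{
    volatile unsigned sink = 0;
    uint64_t start = tsc_now();
    for (unsigned i = 0; i < 1000; i++)   /* code under test */
        sink += i;
    uint64_t end = tsc_now();
    printf("%llu cycles\n", (unsigned long long)(end - start));
    return 0;
}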
Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES
This system call has explicit controls for:
process PID selection
whether to consider kernel/hypervisor instructions or not
and it will therefore count the cycles properly even when multiple processes are running concurrently.
See this answer for more details: How to get the CPU cycle count in x86_64 from C++?
perf_event_open.c
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;
    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                  group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;
    uint64_t n;

    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 10000;
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_CPU_CYCLES;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    // Don't count hypervisor events.
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Loop n times, should be good enough for -O0. */
    __asm__ (
        "1:;\n"
        "sub $1, %[n];\n"
        "jne 1b;\n"
        : [n] "+r" (n)
        :
        :
    );

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));
    printf("%lld\n", count);

    close(fd);
}
RDTSC is the most accurate way of counting program execution cycles. If you are looking to measure execution performance over time scales where it matters if your thread has been preempted, then you would probably be better served with a profiler (VTune, for instance).
clock()/CLOCKS_PER_SEC is a pretty poor (high-overhead, low-resolution) way of getting time compared to RDTSC, which has almost no overhead.
If you have a specific issue with RDTSC, I may be able to assist.
re: Comments
Intel Performance Counter Monitor: this is mainly for measuring metrics outside the processor core, such as memory bandwidth, power usage, and PCIe utilization. It does also measure CPU frequency, but it is typically not useful for processor-bound application performance.
RDTSC portability: RDTSC is an x86 instruction supported by all modern Intel CPUs. On modern CPUs it is based on the uncore frequency and roughly consistent across CPU cores, although it is not appropriate if your application is frequently being migrated to different cores (and especially to different sockets). If that is the case, you really want to look at a profiler.
Out-of-order execution: yes, things get executed out of order, so this can affect measurements slightly, but it still takes time to execute instructions and RDTSC is the best way of measuring that time. It excels in the normal use case of executing non-I/O-bound instructions on the same core, and that is really how it is meant to be used. If you have a more complicated use case you really should use a different tool, but that doesn't negate that RDTSC can be very useful for analyzing program execution.

How to avoid occasional long delays when writing data to disk

static const int MAX_BUFFER_LEN = 1024*12; // in bytes
char *bff = new char[MAX_BUFFER_LEN];
int fileflag = O_CREAT | O_WRONLY | O_NONBLOCK;

fl = open(filename, fileflag, 0666);
if(fl < 0)
{
    printf("can not open file! \n");
    return -1;
}

do
{
    // begin one loop
    struct timeval bef;
    struct timeval aft;
    gettimeofday(&bef, NULL);
    write(fl, bff, MAX_BUFFER_LEN);
    gettimeofday(&aft, NULL);
    if(aft.tv_usec - bef.tv_usec > 20000)   // ignoring the tv_sec part
    {
        printf("cost too long: %ld\n", (long)(aft.tv_usec - bef.tv_usec));
    }
    // end one loop

    usleep(30*1000);   // sleep 30 ms
} while(1);
When I run the program on Ubuntu Linux (kernel 2.6.32-24-generic), I find that the "cost too long" message is printed once or twice per minute. I tried both a USB disk and a hard disk, and I also tried the program on an ARM platform; the same thing happens. I thought 3.2 Mbps might be too high for a slow I/O device, so I reduced it to 0.4 Mbps, which significantly reduced how often the message is printed. Is there any way to control how long each write costs?
Is write() just copying the data to a kernel buffer and returning immediately, or does it wait for the disk I/O to complete? Is it possible that the kernel I/O buffer fills up and the call has to wait for a flush? But then why does the cost spike only a few times?
You can't make the disk faster, but you can do other work while the disk is busy; you needn't wait for it to finish.
This is, however, highly non-trivial to do in C. You would need non-blocking I/O, multithreading or multiprocessing. Try searching for these keywords and how to use the different techniques (you are already using a non-blocking fd up there).
Your disk I/O performance is being negatively impacted by the code around each write that measures the time (and measuring time at this granularity will show occasional spikes as the computer does other things).
Instead, measure the performance of the code that writes the entire data set - take the start/end times outside the loop (with the loop properly bounded, of course), as sketched below.
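Something along these lines, reusing the variables from the question (N_WRITES is an arbitrary bound; needs <time.h>):
#define N_WRITES 1000   /* arbitrary bound for the measurement */

struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
for (int i = 0; i < N_WRITES; i++) {
    write(fl, bff, MAX_BUFFER_LEN);
    usleep(30 * 1000);
}
clock_gettime(CLOCK_MONOTONIC, &t1);

double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
printf("wrote %d buffers of %d bytes in %.3f s\n", N_WRITES, MAX_BUFFER_LEN, elapsed);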
If a file write is going to take a long time, have your process run two threads: one does the main task while the other writes the data to disk, as in the sketch below.
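A rough sketch of that split using POSIX threads (the double-buffer scheme, buffer size, and names here are illustrative assumptions, not the only way to do it):
#include <pthread.h>
#include <unistd.h>

/* Sketch: the main thread fills one buffer while a writer thread pushes the
   other one to disk, so write() latency no longer stalls the main loop.
   If the main loop outruns the writer, a submitted buffer may be dropped. */
#define BUF_LEN (1024 * 12)

static char bufs[2][BUF_LEN];
static int ready = -1;                      /* index of buffer to write, -1 = none */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
static int out_fd;                          /* set by the main code after open() */

static void *writer(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (ready < 0)
            pthread_cond_wait(&cv, &lock);
        int idx = ready;
        ready = -1;
        pthread_mutex_unlock(&lock);
        write(out_fd, bufs[idx], BUF_LEN);  /* the slow part happens here */
    }
    return NULL;
}

static void submit(int idx)                 /* called from the main loop */
{
    pthread_mutex_lock(&lock);
    ready = idx;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&lock);
}
The main loop would alternate between bufs[0] and bufs[1], calling submit() after filling each one, with the writer started once via pthread_create(&tid, NULL, writer, NULL).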

Network receipt timer to ms resolution

My scenario: I'm collecting network packets, and if a packet matches a network filter I want to record the time difference between it and the previous matching packet. This last part is the part that doesn't work. My problem is that I can't get accurate sub-second measurements no matter which C timer function I use. I've tried gettimeofday(), clock_gettime(), and clock().
I'm looking for assistance to figure out why my timing code isn't working properly.
I'm running on a cygwin environment.
Compile Options: gcc -Wall capture.c -o capture -lwpcap -lrt
Code snippet:
/* globals */
int first_time = 0;
struct timespec start, end;
double sec_diff = 0;

int main() {
    pcap_t *adhandle;
    struct pcap_pkthdr header;
    const u_char *packet;
    int sockfd = socket(PF_INET, SOCK_STREAM, 0);

    /* .... (previously I create the socket / connect - works fine) */

    save_attr = tty_set_raw();

    while (1) {
        packet = pcap_next(adhandle, &header);  // Receive a packet? Process it
        if (packet != NULL) {
            got_packet(&header, packet, adhandle);
        }
        if (linux_kbhit()) {            // User types message to channel
            kb_char = linux_getch();    // Get user-supplied character
            if (kb_char == 0x03)        // Stop loop (exit channel) if user hits Ctrl+C
                break;
        }
    }

    tty_restore(save_attr);
    close(sockfd);
    pcap_close(adhandle);
    printf("\nCapture complete.\n");
}
In got_packet:
void got_packet(const struct pcap_pkthdr *header, const u_char *packet, pcap_t *p)
{
    /* ... do some packet filtering to only handle my packets, set match = 1 ... */
    if (match == 1) {
        if (first_time == 0) {
            clock_gettime(CLOCK_MONOTONIC, &start);
            first_time++;
        }
        else {
            clock_gettime(CLOCK_MONOTONIC, &end);
            // Packet difference in seconds
            sec_diff = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1000000000.0;
            printf("sec_diff: %ld,\tstart_nsec: %ld,\tend_nsec: %ld\n",
                   (long)(end.tv_sec - start.tv_sec), start.tv_nsec, end.tv_nsec);
            printf("sec_diffcalc: %.9f,\tstart_sec: %ld,\tend_sec: %ld\n",
                   sec_diff, (long)start.tv_sec, (long)end.tv_sec);
            start = end;    // Use the current time as the start for the next match
        }
    }
}
I record all packets with Wireshark to compare, so I expect the difference in my timer to be the same as Wireshark's, however that is never the case. My output for tv_sec will be correct, however tv_nsec is not even close. Say there is a 0.5 second difference in wireshark, my timer will say there is a 1.999989728 second difference.
Basically, you will want to use a timer with a higher resolution.
Also, I did not check in libpcap, but I am pretty sure libpcap can give you the time at which each packet was received. If so, that is the closest you can get to what Wireshark displays.
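For example, libpcap records a capture timestamp in struct pcap_pkthdr. A sketch that could replace the clock_gettime() calls in got_packet() (needs <pcap.h> and <stdio.h>):
/* Sketch: use the capture timestamp libpcap already stores in header->ts
   instead of reading the clock ourselves when the packet is processed. */
static struct timeval prev_ts;
static int have_prev = 0;

static void record_gap(const struct pcap_pkthdr *header)
{
    if (have_prev) {
        double diff = (header->ts.tv_sec - prev_ts.tv_sec)
                    + (header->ts.tv_usec - prev_ts.tv_usec) / 1e6;
        printf("inter-packet gap: %.6f s\n", diff);
    }
    prev_ts = header->ts;
    have_prev = 1;
}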
I don't think the clocks are your problem, but rather the way you are waiting for new data. You should use a polling function to see when you have new data from either the socket or the keyboard (see the sketch at the end of this answer). This lets your program sleep when there is no new data to process, which makes the operating system treat it more kindly and schedule it sooner when data does arrive. It also lets you quit the program without having to wait for the next packet to come in. Alternatively, you could try running your program at really high or real-time priority.
You should consider reading the current time immediately after you get a packet, in case the filtering takes a long time. You may also want to consider multiple threads for this program if you are trying to capture data on a fast, busy network, especially if you have more than one processor, and since you are doing some printfs which may block. I noticed you have a function to set a tty to raw mode, which I assume is the standard output tty. If you are actually using a serial terminal that could slow things down a lot, but standard out to an xterm can also be slow. You may want to consider setting stdout to fully buffered rather than line buffered; this should speed up the output (man setvbuf).
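A sketch of the polling idea from the first paragraph (assumes the platform provides pcap_get_selectable_fd(), which may not be the case under Cygwin/WinPcap):
#include <poll.h>
#include <pcap/pcap.h>
#include <unistd.h>

/* Sketch: block in poll() until either the capture fd or stdin is readable,
   instead of spinning on pcap_next()/linux_kbhit(). Returns 1 for keyboard
   input, 0 for pending packets, -1 on error. */
static int wait_for_input(pcap_t *adhandle)
{
    struct pollfd fds[2];
    fds[0].fd = pcap_get_selectable_fd(adhandle);
    fds[0].events = POLLIN;
    fds[1].fd = STDIN_FILENO;
    fds[1].events = POLLIN;

    if (poll(fds, 2, -1) < 0)
        return -1;
    return (fds[1].revents & POLLIN) ? 1 : 0;
}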
