Why is this C program reporting more throughput than nload? - c

I run the following C program between two machines with 10GibE; the program reports 12Gib/s whereas nload reports a (more believable) 9.2Gib/s. Can anyone tell me what I'm doing wrong in the program?
.
.
#define BUFFSZ (4*1024)
char buffer[BUFFSZ];
.
.
start = clock();
while (1) {
n = write(sockfd, buffer, BUFFSZ);
if (n < 0)
error("ERROR writing to socket");
if (++blocks % (1024*1024) == 0)
{
blocks = 0;
printf("32Gib at %6.2lf Gib/s\n", 32.0/(((double) (clock() - start)) / CLOCKS_PER_SEC));
start = clock();
}
}
This is CentOs 6.0 on Linux 2.6.32; nload 0.7.3, gcc 4.4.4.

Firstly, clock() returns an estimate of the CPU time used by the program, not the wall-clock time - so your calculation indicates that you are transferring 12GiB per second of CPU time used. Instead, use clock_gettime() with the clock ID CLOCK_MONOTONIC to measure wall-clock time.
Secondly, after write() returns the data hasn't necessarily been sent to the network yet - merely copied into the kernel buffers for sending. This will give you a higher reported transfer rate at the start of the connection.

Check the return value from read() n might be shorter than BUFFSZ.
EDIT: oops, that should have been write().

Related

ALSA - Non blocking (interleaved) read

I inherited some ALSA code that runs on a Linux embedded platform.
The existing implementation does blocking reads and writes using snd_pcm_readi() and snd_pcm_writei().
I am tasked to make this run on an ARM processor, but I find that the blocked interleaved reads push the CPU to 99%, so I am exploring non-blocking reads and writes.
I open the device as can be expected:
snd_pcm_handle *handle;
const char* hwname = "plughw:0"; // example name
snd_pcm_open(&handle, hwname, SND_PCM_STREAM_CAPTURE, SND_PCM_NONBLOCK);
Other ALSA stuff then happens which I can supply on request.
Noteworthy to mention at this point that:
we set a sampling rate of 48,000 [Hz]
the sample type is signed 32 bit integer
the device always overrides our requested period size to 1024 frames
Reading the stream like so:
int32* buffer; // buffer set up to hold #period_size samples
int actual = snd_pcm_readi(handle, buffer, period_size);
This call takes approx 15 [ms] to complete in blocking mode. Obviously, variable actual will read 1024 on return.
The problem is; in non-blocking mode, this function also takes 15 msec to complete and actual also always reads 1024 on return.
I would expect that the function would return immediately, with actual being <=1024 and quite possibly reading "EAGAIN" (-11).
In between read attempts I plan to put the thread to sleep for a specific amount of time, yielding CPU time to other processes.
Am I misunderstanding the ALSA API? Or could it be that my code is missing a vital step?
If the function returns a value of 1024, then at least 1024 frames were available at the time of the call.
(It's possible that the 15 ms is time needed by the driver to actually start the device.)
Anyway, blocking or non-blocking mode does not make any difference regarding CPU usage. To reduce CPU usage, replace the default device with plughw or hw, but then you lose features like device sharing or sample rate/format conversion.
I solved my problem by wrapping snd_pcm_readi() as follows:
/*
** Read interleaved stream in non-blocking mode
*/
template <typename SampleType>
snd_pcm_sframes_t snd_pcm_readi_nb(snd_pcm_t* pcm, SampleType* buffer, snd_pcm_uframes_t size, unsigned samplerate)
{
const snd_pcm_sframes_t avail = ::snd_pcm_avail(pcm);
if (avail < 0) {
return avail;
}
if (avail < size) {
snd_pcm_uframes_t remain = size - avail;
unsigned long msec = (remain * 1000) / samplerate;
static const unsigned long SLEEP_THRESHOLD_MS = 1;
if (msec > SLEEP_THRESHOLD_MS) {
msec -= SLEEP_THRESHOLD_MS;
// exercise for the reader: sleep for msec
}
}
return ::snd_pcm_readi(pcm, buffer, size);
}
This works quite well for me. My audio process now 'only' takes 19% CPU time.
And it matters not if the PCM interface was opened using SND_PCM_NONBLOCK or 0.
Going to perform callgrind analysis to see if more CPU cycles can be saved elsewhere in the code.

Multithreaded C Program on the Raspberry PI

Currently, I am having a hard time to discover what the problem with my multithreading C program on the RPi is. I have written an application relying on two pthreads, one of them reading data from a gps device and writing it to a text file and the second one is doing exactly the same but with a temperature sensor. On my laptop (Intel® Core™ i3-380M, 2.53GHz) I am have the program nicely working and writing to my files up to the frequencies at which both of the devices send information (10 Hz and 500 Hz respectively).
The real problem emerges when I compile and execute my C program to run on the RPi; The performance of my program running on RPi considerably decreases, having my GPS log file written with a frequency of 3 Hz and the temperature log file at a frequency of 17 Hz (17 measurements written per second)..
I do not really know why I am getting those performance problems with my code running on the PI. Is it because of the RPi has only a 700 MHz ARM Processor and it can not process such a Multithreaded application? Or is it because my two threads routines are perturbing the nice work normally carried out by the PI? Thanks a lot in advance Guys....!!!
Here my code. I am posting just one thread function because I tested the performance with just one thread and it is still writing at a very low frequency (~4 Hz). At first, the main function:
int main(int argc, char *argv[]) {
int s1_hand = 0;
pthread_t routines[2];
printf("Creating Thread -> Main Thread Busy!\n");
s1_hand = pthread_create(&(routines[1]), NULL, thread_2, (void *)&(routines[1]));
if (s1_hand != 0){
printf("Not possible to create threads:[%s]\n", strerror(s1_hand));
exit(EXIT_FAILURE);
}
pthread_join(routines[1], NULL);
void* result;
if ((pthread_join(routines[1], &result)) == -1) {
perror("Cannot join thread 2");
exit(EXIT_FAILURE);
}
pthread_exit(NULL);
return 0;
}
Now, thread number 2 function:
void *thread_2(void *parameters) {
printf("Thread 2 starting...\n");
int fd, chars, parsing, c_1, parse, p_parse = 1;
double array[3];
fd = open("dev/ttyUSB0", O_RDONLY | O_NOCTTY | O_SYNC);
if (fd < 0){
perror("Unable to open the fd!");
exit (EXIT_FAILURE);
}
FILE *stream_a, *stream_b;
stream_a = fdopen(fd, "r");
stream_b = fopen (FILE_I, "w+");
if (stream_a == NULL || stream_b == NULL){
perror("IMPOSSIBLE TO CREATE STREAMS");
exit(EXIT_FAILURE);
}
c_1 = fgetc(stream_a);
parse = findit(p_parse, c_1, array);
printf("First Parse Done -> (%i)\n", parse);
while ((chars = fgetc(stream_a)) != EOF){
parsing = findit(0, (uint8_t)chars, array);
if (parsing == 1){
printf("MESSAGE FOUND AND SAVED -> (%i)\n", parsing);
fprintf(stream_b,"%.6f %.3f %.3f %.3f\n", time_stamp(), array[0], array[1], array[2]);
}
}
fflush(stream_b);
fclose(stream_b);
fclose(stream_a);
close(fd);
pthread_exit(NULL);
return 0;
}
Note that on my thread 2 function I am using findit(), function which returns 0 or 1 in case of having found and parsed a message from the gps, writing the parsed info in my array (0 no found, 1 found and parsed). The function time_stamp() just call the clock_gettime(CLOCK_MONOTONIC, &time_stamp) function in order to have a time reference on each written event. Hope with this information you guys can help me. Thank you!
Obviously the processor is capable of running 20 things a second. I'd first check your filesystem performance.
Write a small program that simulates the writes just the way you're doing them and see what the performance is like.
Beyond that, I'd suggest it's the task swapping that's causing delays. Try without one of the threads. What type of performance do you get?
I'd guess it's the filesystem, though. Try buffering your writes into memory and do large (4k+) writes every few seconds and I bet that will make your system a lot happier.
Also, post your code. Otherwise all we can do is guess.

How to avoid costing too long for writing data to disk

static const int MAX_BUFFER_LEN = 1024*12; //in byets
char *bff = new char[MAX_BUFFER_LEN];
int fileflag = O_CREAT | O_WRONLY | O_NONBLOCK;
fl = open(filename, fileflag, 0666);
if(fl < 0)
{
printf("can not open file! \n");
return -1;
}
do
{
///begin one loop
struct timeval bef;
struct timeval aft;
gettimeofday(&bef, NULL);
write(fl, bff, MAX_BUFFER_LEN);
gettimeofday(&aft, NULL);
if(aft.tv_usec - bef.tv_usec > 20000) //ignore second condition
{
printf(" cost too long:%d \n", aft.tv_usec - bef.tv_usec);
}
//end one loop
//sleep
usleep(30*1000); //sleep 30ms
}while(1);
When I run the program on Linux ubuntu 2.6.32-24-generic, I find that the COST TOO LONG printing shows 1~2 times in a minutes. I tried both to USB disk and hard disk.I also tried this program in arm platform .This condition also happened. I think that 3.2Mbps is too high for low speed IO device. So I reduce to 0.4Mbps.It significantly reduce the printing frequency. Is any solution to control the time cost ?
Is write() just copying the data to kenal buffer and returning immediately or waiting fo disk IO complete? Is it possible that kenal IO buffer is full and must be waiting for flush but why only several times cost so long?
You can't accelerate the disk, but you can do other stuff while the disk is working. You needn't wait for it to be done.
This is, however, highly non-trivial to do in C. You would need nonblocking I/O, multithreading or multiprocessing. Try googling up these keywords and how to use the different techniques (you are already using a nonblocking fd up there).
Your disk I/O performance is being negatively impacted by the code around each write to measure the time (and measuring time at this granularity is going to have occasional spikes as the computer does other things).
Instead, measure the performance of the code to write the entire data - start/end times outside the loop (with the loop properly bounded, of course).
If you are calling a file write which you think is going to take a lot of time, then make your process to run two threads, while one is doing the main task let the other write to disk.

precise timing in C

I have a little code below. I use this code to output some 1s and 0s (unsigned output[38]) from a GPIO of an embedded board.
My Question: the time between two output values (1, 0 or 0, 1) should be 416 microseconds as I define on clock_nanosleep below code, I also used sched_priority() for a better time resolution. However, an oscilloscope (pic below) measurement shows that the time between the two output values are 770 usec . I wonder why do I have that much inaccuracy between the signals?
PS. the board(beagleboard) has Linux 3.2.0-23-omap #36-Ubuntu Tue Apr 10 20:24:21 UTC 2012 armv7l armv7l armv7l GNU/Linux kernel, and it has 750 MHz CPU, top shows almost no CPU(~1%) and memory(~0.5%) is consumed before I run my code. I use an electronic oscilloscope which has no calibration problem.
#include <stdio.h>
#include <stdlib.h> //exit();
#include <sched.h>
#include <time.h>
void msg_send();
struct sched_param sp;
int main(void){
sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
sched_setscheduler(0, SCHED_FIFO, &sp);
msg_send();
return 0;
}
void msg_send(){
unsigned output[38] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,1};
FILE *fp8;
if ((fp8 = fopen("/sys/class/gpio/export", "w")) == NULL){ //echo 139 > export
fprintf(stderr,"Cannot open export file: line844.\n"); fclose(fp8);exit(1);
}
fprintf(fp8, "%d", 139); //pin 3
fclose(fp8);
if ((fp8 = fopen("/sys/class/gpio/gpio139/direction", "rb+")) == NULL){
fprintf(stderr,"Cannot open direction file - GPIO139 - line851.\n");fclose(fp8); exit(1);
}
fprintf(fp8, "out");
fclose(fp8);
if((fp8 = fopen("/sys/class/gpio/gpio139/value", "w")) == NULL) {
fprintf(stderr,"error in openning value\n"); fclose(fp8); exit(1);
}
struct timespec req = { .tv_sec=0, .tv_nsec = 416000 }; //416 usec
/* here is the part that my question focus*/
while(1){
for(i=0;i<38;i++){
rewind(fp8);
fprintf(fp8, "%d", output[i]);
clock_nanosleep(CLOCK_MONOTONIC ,0, &req, NULL);
}
}
}
EDIT: I have been reading for days that clock_nanosleep() or other nanosleep, usleep etc. does not guarantee the waking up on time. they usually provide to sleep the code for the defined time, but waking up the process depends on the CPU. what I found is that absolute time provides a better resolution (TIMER_ABSTIME flag). I found the same solution as Maxime suggests. however, I have a glitch on my signal when for loop is finalized. In my opinion, it is not good to any sleep functions to create a PWM or data output on an embedded platform. It is good to spend some time to learn CPU timers that platforms provide to generate the PWM or data out that has good accuracy.
I can't figure out how a call to clock_getres() can solve your problem. In the man page, it's said that only read the resolution of the clock.
As Geoff said, using absolute sleeping clock should be a better solution. This can avoid the unespected timing delay from other code.
struct timespec Time;
clock_gettime(CLOCK_REALTIME, &(Time));
while(1){
Time.tv_nsec += 416000;
if(Time.tv_nsec > 999999999){
(Time.tv_sec)++;
Time.tv_nsec -= 1000000000;
}
clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &(Time), NULL);
//Do something
}
I am using this on fews programs I have for generating some regular message on ethernet network. And it's working fine.
If you are doing time sensitive I/O, you probably shouldn't use the stuff in stdio.h but instead the I/O system calls because of the buffering done by stdio. It looks like you might be getting the worst effect of the buffering too because your program does these steps:
fill the buffer
sleep
rewind, which I believe will flush the buffer
What you want is for the kernel to service the write while you are sleeping, instead the buffer is flushed after you sleep and you have to wait for the kernel to process it.
I think your best bet is to use open("/sys/class/gpio/gpio139/value", O_WRONLY|O_DIRECT) to minimize delays due to caching.
if you still need to flush buffers to force the write through you probably want to use clock_gettime to compute the time spent flushing the data and subtract that from the sleep time. Alternatively add the desired interval to the result of clock_gettime and pass that to clock_nanosleep and use the TIMER_ABSTIME flag to wait for that absolute time to occur.
I would guess that the problem is that the clock_nanosleep is sleeping for 416 microsec
and that the other commands in the loop as well as the loop and clock_nanosleep architecture itself are taking 354 microsec. The OS may also be making demands.
What interval do you get if you set the sleep = 0?
Are you running this on a computer or a PLC?
Response to Comment
Seems like you have something somewher in the hardware/software that is doing something unexpected - it could be a bugger to find.
I have 2 suggestions depending on how critical the period is:
Low criticality - put a figure in your program that causes the loop to take the time you want. However, if this is a transient or time/temperature dependant effect you will need to check for drift periodically.
High criticality - Build a temperature stable oscilator in hardware. These can be bought off the shelf.

Network receipt timer to ms resolution

My scenario, I'm collecting network packets and if packets match a network filter I want to record the time difference between consecutive packets, this last part is the part that doesn't work. My problem is that I cant get accurate sub-second measurements no matter what C timer function I use. I've tried: gettimeofday(), clock_gettime(), and clock().
I'm looking for assistance to figure out why my timing code isn't working properly.
I'm running on a cygwin environment.
Compile Options: gcc -Wall capture.c -o capture -lwpcap -lrt
Code snippet :
/*globals*/
int first_time = 0;
struct timespec start, end;
double sec_diff = 0;
main() {
pcap_t *adhandle;
const struct pcap_pkthdr header;
const u_char *packet;
int sockfd = socket(PF_INET, SOCK_STREAM, 0);
.... (previous I create socket/connect - works fine)
save_attr = tty_set_raw();
while (1) {
packet = pcap_next(adhandle, &header); // Receive a packet? Process it
if (packet != NULL) {
got_packet(&header, packet, adhandle);
}
if (linux_kbhit()) { // User types message to channel
kb_char = linux_getch(); // Get user-supplied character
if (kb_char == 0x03) // Stop loop (exit channel) if user hits Ctrl+C
break;
}
}
tty_restore(save_attr);
close(sockfd);
pcap_close(adhandle);
printf("\nCapture complete.\n");
}
In got_packet:
got_packet(const struct pcap_pkthdr *header, const u_char *packet, pcap_t * p){ ... {
....do some packet filtering to only handle my packets, set match = 1
if (match == 1) {
if (first_time == 0) {
clock_gettime( CLOCK_MONOTONIC, &start );
first_time++;
}
else {
clock_gettime( CLOCK_MONOTONIC, &end );
sec_diff = (end.tv_sec - start.tv_sec) + ((end.tv_nsec - start.tv_nsec)/1000000000.0); // Packet difference in seconds
printf("sec_diff: %ld,\tstart_nsec: %ld,\tend_nsec: %ld\n", (end.tv_sec - start.tv_sec), start.tv_nsec, end.tv_nsec);
printf("sec_diffcalc: %ld,\tstart_sec: %ld,\tend_sec: %ld\n", sec_diff, start.tv_sec, end.tv_sec);
start = end; // Set the current to the start for next match
}
}
}
I record all packets with Wireshark to compare, so I expect the difference in my timer to be the same as Wireshark's, however that is never the case. My output for tv_sec will be correct, however tv_nsec is not even close. Say there is a 0.5 second difference in wireshark, my timer will say there is a 1.999989728 second difference.
Basically, you will want to use a timer with a higher resolution
Also, I did not check in libpcap, but I am pretty sure that libpcap can give you the time at which each packet was received. In which case, it will be closest that you can get to what Wireshark displays.
I don't think that it is the clocks that are your problem, but the way that you are waiting on new data. You should use a polling function to see when you have new data from either the socket or from the keyboard. This will allow your program to sleep when there is no new data for it to process. This is likely to make the operating system be nicer to your program when it does have data to process and schedule it quicker. This also allows you to quit the program without having to wait for the next packet to come in. Alternately you could attempt to run your program at really high or real time priority.
You should consider getting the current time at the first instance after you get a packet if the filtering can take very long. You may also want to consider multiple threads for this program if you are trying to capture data on a fast and busy network. Especially if you have more than one processor, but since you are doing some pritnfs which may block. I noticed you had a function to set a tty to raw mode, which I assume is the standard output tty. If you are actually using a serial terminal that could slow things down a lot, but standard out to a xterm can also be slow. You may want to consider setting stdout to fully buffered rather than line buffered. This should speed up the output. (man setvbuf)

Resources