I have a couple of questions about this piece of code, running on a Jetson Nano:
#include "stdio.h"
#include "unistd.h"
#include "stdlib.h"
float gputemp = 0;
float cputemp = 0;
int count = 0;
int main() {
char* cpu;
char* gpu;
cpu = (char*)malloc(sizeof(char)*6);
gpu = (char*)malloc(sizeof(char)*6);
while (1) {
FILE* fcputemp = fopen("/sys/devices/virtual/thermal/thermal_zone1/temp", "r");
FILE* fgputemp = fopen("/sys/devices/virtual/thermal/thermal_zone2/temp","r");
if (!fcputemp || !fgputemp ) {
printf("Something went wrong\n");
exit(EXIT_FAILURE);
}
cputemp = atoi(fgets(cpu, 6, fcputemp))/1000;
gputemp = atoi(fgets(gpu, 6, fgputemp))/1000;
printf("\rCpu : %.2f, Gpu : %.2f. Elapsed time : %d", cputemp, gputemp, count);
fflush(stdout);
fclose(fcputemp);
fclose(fgputemp);
count++;
sleep(1);
}
}
Here I have to open the files, read the temperatures, and then close them on each loop iteration in order to get valid data (and not segfault).
My concern here is the number of (expensive) kernel switches needed to do this.
I know that premature optimization is evil, but is there another way (or maybe the RIGHT way) to do this, opening the file only once?
And why can't the sensor interface (the file) update itself while I have it open?
P.S.: Yes, I know, I didn't free the cpu or gpu buffers; this is only "demo" code (just look at how I measure the elapsed time, lol).
I'm not sure you can do this opening the files once and once only. You could try rewinding, but sysfs isn't a "real" filesystem and those aren't real files. If you rewind you might get the same data over and over, especially when using buffered calls like fopen().
The open operation is what prepares that data for reading. Since this is all managed by the kernel it should have very little overhead, and no actual disk activity. Consider that programs like top read thousands of these every second and it's no big deal.
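If you want to experiment with opening the files only once, the unbuffered syscall interface avoids the stdio caching concern mentioned above. This is an untested sketch, and whether a single open() plus pread() from offset 0 actually returns fresh values depends on the thermal driver, so keep the reopen-per-iteration version as a fallback:

/* Sketch: keep the fd open and re-read from offset 0 each second. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int fd = open("/sys/devices/virtual/thermal/thermal_zone1/temp", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }
    char buf[16];
    while (1) {
        ssize_t n = pread(fd, buf, sizeof(buf) - 1, 0); /* always read from the start */
        if (n <= 0) {
            perror("pread");
            break;
        }
        buf[n] = '\0';
        printf("\rCpu : %.2f", atoi(buf) / 1000.0);
        fflush(stdout);
        sleep(1);
    }
    close(fd);
    return 0;
}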
I have a C application, one of whose jobs is to call an executable file. That executable has performance measurement routines inserted during compilation, at the level of intermediate code. It can measure time or L1/L2/L3 cache misses. In other words, I have modified the LLVM compiler to insert a call to that measurement function and print the result to stdout for any compiled program.
Now, as I mentioned at the beginning, I would like to execute that program (with the result printed to stdout) from a separate C application and save the result. The way I'm doing it right now is:
void executeProgram(const char* filename, char* time) {
    printf("Executing selected program %s...\n", filename);
    char filePath[100] = "/home/michal/thesis/Drafts/output/";
    strcat(filePath, filename);

    FILE *fp;
    fp = popen(filePath, "r");
    char str[30];
    if (fp == NULL) {
        printf("Failed to run command\n");
        exit(1);
    }
    while (fgets(str, sizeof(str) - 1, fp) != NULL) {
        strcat(time, str);
    }
    pclose(fp);
}
where filename is the name of the compiled executable to run. The result is saved to the time string.
The problem is that the results I'm getting are quite different and unstable compared to those returned by simply running the executable 'by hand' from the command line (./test16). They look like:
231425
229958
230450
228534
230033
230566
231059
232016
230733
236017
213179
90515
229775
213351
229316
231642
230875
So they're mostly around 230000 us, with some occasional drops. The same executable, run from within the other application, produces:
97097
88706
91418
97970
97972
94597
95846
95139
91070
95918
107006
89988
90882
91986
90997
88824
129136
94976
102191
94400
95215
95061
92115
96319
114091
95230
114500
95533
102294
108473
105730
Note that it is the same executable that's being called. Yet the measured time it returns is different. The program that is being measured consists of a function call to a simple nested loop, accessing array elements. Here is the code:
#include "test.h"
#include <stdio.h>
float data[1000][1000] = {0};
void test(void)
{
int i0, i1;
int N = 80;
float mean[1000];
for (i0 = 0; i0 < N; i0++)
{
mean[i0] = 0.0;
for (i1 = 0; i1 < N; i1++) {
mean[i0] += data[i0][i1];
}
mean[i0] /= 1000;
}
}
I suspect there is something wrong with the way the program is invoked in my code; maybe the process should be forked or something? Any ideas?
You didn't specify where exactly your time-measuring subroutines are inserted, so all I can really offer is guesswork.
The results seem to hint at the exact opposite: running the application from the shell is slower, so I wouldn't worry about the way you're starting the process from the C code. My guess would be that when you run your program from the shell, it's the terminal that's slowing you down. When you run the process from your C code, you pipe the output back to your 'starter' application, which is already waiting for input on the pipe.
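A quick way to test that guess is to take the terminal out of the measurement when running by hand, e.g. ./test16 > /dev/null. If the shell-run timings then drop to roughly the numbers you get through popen, the terminal output was indeed the difference.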
As a side note, consider switching from strcat to something safer, like strncat.
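A minimal sketch of what that could look like inside the read loop, assuming the caller's time buffer has some fixed capacity, here called TIME_BUF_LEN (a hypothetical constant; the original code does not say how large the buffer is):

/* Sketch only: bound each append so long program output cannot overflow `time`.
   TIME_BUF_LEN is an assumed capacity of the caller's buffer. */
size_t used = strlen(time);
while (fgets(str, sizeof(str), fp) != NULL && used + 1 < TIME_BUF_LEN) {
    strncat(time, str, TIME_BUF_LEN - used - 1);
    used = strlen(time);
}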
The idea behind this program is simply to access the RAM and dump its data to a txt file.
Later I'll convert the txt file to a jpeg and hopefully it will be readable.
However, when I try to read from the RAM using NEW[], it takes waaaaaay too long to actually copy all the values into the file.
Isn't it supposed to be really fast? I mean, I save pictures every day and it doesn't even take a second.
Is there some other method I can use to dump memory to a file?
#include <stdio.h>
#include <stdlib.h>
#include <hw/pci.h>
#include <hw/inout.h>
#include <sys/mman.h>
main()
{
FILE *fp;
fp = fopen ("test.txt","w+d");
int NumberOfPciCards = 3;
struct pci_dev_info info[NumberOfPciCards];
void *PciDeviceHandler1,*PciDeviceHandler2,*PciDeviceHandler3;
uint32_t *Buffer;
int *BusNumb; //int Buffer;
uint32_t counter =0;
int i;
int r;
int y;
volatile uint32_t *NEW,*NEW2;
uintptr_t iobase;
volatile uint32_t *regbase;
NEW = (uint32_t *)malloc(sizeof(uint32_t));
NEW2 = (uint32_t *)malloc(sizeof(uint32_t));
Buffer = (uint32_t *)malloc(sizeof(uint32_t));
BusNumb = (int*)malloc(sizeof(int));
printf ("\n 1");
for (r=0;r<NumberOfPciCards;r++)
{
memset(&info[r], 0, sizeof(info[r]));
}
printf ("\n 2");
//Here the attach takes place.
for (r=0;r<NumberOfPciCards;r++)
{
(pci_attach(r) < 0) ? FuncPrint(1,r) : FuncPrint(0,r);
}
printf ("\n 3");
info[0].VendorId = 0x8086; //Wont be using this one
info[0].DeviceId = 0x3582; //Or this one
info[1].VendorId = 0x10B5; //WIll only be using this one PLX 9054 chip
info[1].DeviceId = 0x9054; //Also PLX 9054
info[2].VendorId = 0x8086; //Not used
info[2].DeviceId = 0x24cb; //Not used
printf ("\n 4");
//I attached the device and give it a handler and set some setting.
if ((PciDeviceHandler1 = pci_attach_device(0,PCI_SHARE|PCI_INIT_ALL, 0, &info[1])) == 0)
{
perror("pci_attach_device fail");
exit(EXIT_FAILURE);
}
for (i = 0; i < 6; i++)
//This just prints out some details of the card.
{
if (info[1].BaseAddressSize[i] > 0)
printf("Aperture %d: "
"Base 0x%llx Length %d bytes Type %s\n", i,
PCI_IS_MEM(info[1].CpuBaseAddress[i]) ? PCI_MEM_ADDR(info[1].CpuBaseAddress[i]) : PCI_IO_ADDR(info[1].CpuBaseAddress[i]),
info[1].BaseAddressSize[i],PCI_IS_MEM(info[1].CpuBaseAddress[i]) ? "MEM" : "IO");
}
printf("\nEnd of Device random info dump---\n");
printf("\nNEWs Address : %d\n",*(int*)NEW);
//Not sure if this is a legitimate way of memory allocation but I cant see to read the ram any other way.
NEW = mmap_device_memory(NULL, info[1].BaseAddressSize[3],PROT_READ|PROT_WRITE|PROT_NOCACHE, 0,info[1].CpuBaseAddress[3]);
//Here is where things are starting to get messy and REALLY long to just run through all the ram and dump it.
//Is there some other way I can dump the data in the ram into a file?
while (counter!=info[1].BaseAddressSize[3])
{
fprintf(fp, "%x",NEW[counter]);
counter++;
}
fclose(fp);
printf("0x%x",*Buffer);
}
A few issues that I can see:
You are writing blocks of 4 bytes - that's quite inefficient. The stream buffering in the C library may help with that to a degree, but using larger blocks would still be more efficient.
Even worse, you are writing out the memory dump in hexadecimal notation, rather than the bytes themselves. That conversion is very CPU-intensive, not to mention that the size of the output is essentially doubled. You would be better off writing raw binary data using e.g. fwrite() (a sketch follows this answer).
Depending on the specifics of your system (is this on QNX?), reading from I/O-mapped memory may be slower than reading directly from physical memory, especially if your PCI device has to act as a relay. What exactly is it that you are doing?
In any case I would suggest using a profiler to actually find out where your program is spending most of its time. Even a rudimentary system monitor would allow you to determine if your program is CPU-bound or I/O-bound.
As it is, "waaaaaay to long" is hardly a valid measurement. How much data is being copied? How long does it take? Where is the output file located?
P.S.: I also have some concerns w.r.t. what you are trying to do, but that is slightly off-topic for this question...
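To make the fwrite() point concrete, here is a rough sketch of dumping the mapped region as raw binary in a single buffered call instead of one hex value per fprintf(). It assumes BaseAddressSize[3] is a byte count (the question's loop and the answer below treat the size differently, so double-check that) and that NEW points at the mapped aperture, as in the question; it is illustrative only and not tested on QNX:

/* Sketch: dump the mapped region as raw bytes in a single call. */
FILE *out = fopen("dump.bin", "wb");
if (out == NULL) {
    perror("fopen");
    exit(EXIT_FAILURE);
}
size_t len = info[1].BaseAddressSize[3];      /* assumed to be in bytes */
size_t written = fwrite((const void *)NEW, 1, len, out);
if (written != len) {
    fprintf(stderr, "short write: %zu of %zu bytes\n", written, len);
}
fclose(out);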
For fastest speed: write the data in binary form and use the open() / write() / close() APIs. Since your data is already available in a contiguous block of (virtual) memory, it is a waste to copy it to a temporary buffer (as the fwrite(), fprintf(), etc. APIs do).
The code using write() will be similar to:
int fd = open("filename.bin", O_RDWR|O_CREAT, S_IRWXU);
write(fd, (void*)NEW, 4*info[1].BaseAddressSize[3]);
close(fd);
You will need to add error handling and make sure that the buffer size is specified correctly (a sketch with error handling follows below).
To reiterate, you get the speed-up from:
avoiding the conversion from binary to ASCII (as pointed out by others above)
avoiding many calls to libc
reducing the number of system-calls (from inside libc)
eliminating the overhead of copying data to a temporary buffer inside the fwrite()/fprintf() and related functions (buffering would be useful if your data arrived in small chunks, including the case of converting to ASCII in 4 byte units)
I intentionally ignore commenting on other parts of your code as it is apparently not intended to be production quality yet and your question is focused on how to speed up writing data to a file.
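Here is roughly what the snippet above looks like once error handling is added. The 4* factor mirrors the snippet's assumption that BaseAddressSize[3] counts 32-bit words rather than bytes; verify that against your hardware before trusting the dump. It needs <fcntl.h>, <unistd.h> and <sys/stat.h>:

/* Sketch: raw dump with basic error handling; write() may return short counts. */
int fd = open("filename.bin", O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
if (fd < 0) {
    perror("open");
    exit(EXIT_FAILURE);
}
size_t len = 4 * (size_t)info[1].BaseAddressSize[3];  /* assumed size in bytes */
const char *p = (const char *)NEW;
while (len > 0) {
    ssize_t n = write(fd, p, len);
    if (n < 0) {
        perror("write");
        break;
    }
    p += n;
    len -= (size_t)n;
}
if (close(fd) < 0)
    perror("close");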
I have subtitle /srt files with data being generated every second. I am trying to extract the data from them and save it to text files in a while(1) loop; it's a simple text read and write operation.
But I have found that using fgets to traverse the file gives this simple operation very high CPU usage (per the top command): the more fgets calls, the higher the CPU usage.
The code bit:
int convert_to_txt(FILE *f1, FILE *f2) {
    static int str_len = 0;
    char cap_nxt_tmp[128];
    while (fgets(cap_nxt_tmp, 128, f1) != NULL) {
        fprintf(f2, "%s\n", cap_nxt_tmp);
    }
    return 0;
}

int main(int argc, char* argv[]) {
    FILE *inputFilePtr;
    FILE *outputFilePtr;
    char inputPath[1024] = "storage/sample.srt";
    char outputPath[1024] = "storage/sample.txt";

    while (1) {
        outputFilePtr = fopen(outputPath, "w");
        inputFilePtr = fopen(inputPath, "r");
        if (inputFilePtr == NULL || outputFilePtr == NULL) {
            perror("Error");
        }
        convert_to_txt(inputFilePtr, outputFilePtr);
        fclose(inputFilePtr);
        fclose(outputFilePtr);
        // there's a break on an end condition; let's say the minimum run time is an hour.
    }
    return 0;
}
I can't understand what's wrong with the way I am using fgets to read the srt/caption files (1-2 MB per file); a simple text read/write operation should not consume this much CPU.
The same basic file I/O in a while(1) loop in a language like Python shows only 2-3% CPU usage, but this in C shows about 30% CPU usage.
How can I reduce the CPU usage without using sleep? Is there a less CPU-intensive alternative to fgets?
I don't see anything wrong or strange here.
Your main while(1) loop never stops, which means you're converting your files over and over again. Files get cached by the OS, so you're not really accessing your physical drives. The program therefore spends almost all the CPU time doing strlen(cap_nxt_tmp), and you observe very high CPU load.
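One way to avoid burning CPU on unchanged input (a suggestion going slightly beyond the answer above, and the stat()-based check is only one option) is to skip the conversion entirely when the input file has not changed since the last pass, for example by comparing modification times. A rough, untested sketch using the question's own variables:

#include <sys/stat.h>

/* Sketch: only reconvert when the input file's mtime has advanced. */
time_t last_mtime = 0;
while (1) {
    struct stat st;
    if (stat(inputPath, &st) == 0 && st.st_mtime != last_mtime) {
        last_mtime = st.st_mtime;
        outputFilePtr = fopen(outputPath, "w");
        inputFilePtr = fopen(inputPath, "r");
        if (inputFilePtr != NULL && outputFilePtr != NULL) {
            convert_to_txt(inputFilePtr, outputFilePtr);
        }
        if (inputFilePtr) fclose(inputFilePtr);
        if (outputFilePtr) fclose(outputFilePtr);
    }
    /* Without some form of waiting (sleep, or an inotify watch on Linux)
       this loop still spins on stat(); it just avoids re-reading and
       re-writing megabytes of unchanged data on every pass. */
}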
Currently I'm working on a small program that reads large files and sorts them. After some benchmarking I stumbled upon a weird performance issue: when the input file gets too large, writing the output file takes longer than the actual sorting. So I went deeper into the code and finally realized that the fputs function might be the problem. So I wrote this little benchmarking program.
#include "stdio.h"
#include "ctime"
int main()
{
int i;
const int linecount = 50000000;
//Test Line with 184 byte
const char* dummyline = "THIS IS A LONG TEST LINE JUST TO SHOW THAT THE WRITER IS GUILTY OF GETTING SLOW AFTER A CERTAIN AMOUNT OF DATA THAT HAS BEEN WRITTEN. hkgjhkdsfjhgk jhksjdhfkjh skdjfhk jshdkfjhksjdhf\r\n";
clock_t start = clock();
clock_t last = start;
FILE* fp1 = fopen("D:\\largeTestFile.txt", "w");
for(i=0; i<linecount; i++){
fputs(dummyline, fp1);
if(i%100000==0){
printf("%i Lines written.\r", i);
if(i%1000000 == 0){
clock_t ms = clock()-last;
printf("Writting of %i Lines took %i ms\n", i, ms);
last = clock();
}
}
}
printf("%i Lines written.\n", i);
fclose(fp1);
clock_t ms = clock()-start;
printf("Writting of %i Lines took %i ms\n", i, ms);
}
When you execute the program, you can see a clear drop in performance after about 14 to 15 million lines, which is about 2.5 GB of data. The writing takes about 3 times as long as before. The 2 GB threshold suggests a 64-bit issue, but I haven't found anything about that on the web. I also tested whether there is a difference between binary and character mode (e.g. "wb" and "w"), but there is none. I also tried to preallocate the file size (to avoid file fragmentation) by seeking to the expected end and writing a zero byte, but that had little to no effect.
I'm running a Windows 7 64-bit machine, but I've tested it on a Windows Server 2008 64-bit R1 machine as well. Currently I'm testing on an NTFS filesystem with more than 200 GB of free space. My system has 16 GB of RAM, so that shouldn't be a problem either. The test program only uses about 700 KB. The page faults, which I suspected earlier, are also very low (~400 page faults during the whole runtime).
I know that for such large data the fwrite() function would suit the task better, but at the moment I'm interested in whether there is another workaround and why this is happening. Any help would be highly appreciated.
The main reason for all of this is the Windows disk cache: your program eventually fills all available RAM with it, then swapping begins, and everything slows down. To fight this you need to:
1) Open the file in commit mode, using the c flag:
FILE* fp1 = fopen("D:\\largeTestFile.txt", "wc");
2) Periodically write the buffer to disk using fflush():
if(i%1000000 == 0)
{
    // write content to disk
    fflush(fp1);
    clock_t ms = clock()-last;
    printf("Writting of %i Lines took %i ms\n", i, ms);
    last = clock();
}
This way you will use a reasonable amount of disk cache, and the speed will basically be limited by the speed of your hard drive.
I have a C application which generates a lot of output and for which speed is critical. The program is basically a loop over a large (8-12 GB) binary input file which must be read sequentially. In each iteration the read bytes are processed and output is generated and written to multiple files, but never to multiple files at the same time. So if you are at the point where output is generated and there are 4 output files, you write to either file 0, 1, 2, or 3. At the end of the iteration I now write the output using fwrite(), thereby waiting for the write operation to finish. The total number of output operations is large, up to 4 million per file, and the output file sizes range from 100 MB to 3.5 GB. The program runs on a basic multicore processor.
I want to write output in a separate thread and I know this can be done with
Asynchronous I/O
Creating threads
I/O completion ports
I have 2 types of questions, namely conceptual and code-specific.
Conceptual Question
What would be the best approach? Note that the application should be portable to Linux; however, I don't see how that would be very important for my choice between 1-3, since I would write a wrapper around anything kernel/API-specific. For me the most important criterion is speed. I have read that option 1 is not that likely to increase the performance of the program, and that the kernel in any case creates new threads for the I/O operation, so then why not use option 2 immediately, with the advantage that it seems easier to program (also since I did not succeed with option 1, see code issues below)?
Note that I read https://stackoverflow.com/questions/3689759/how-can-i-run-a-specific-function-of-thread-asynchronously-in-c-c, but I don't see a motivation for what to use based on the nature of the application. So I hope somebody could provide me with some advice on what would be best in my situation. Also, from the book "Windows System Programming" by Johnson M. Hart, I know that the recommendation is using threads, mainly because of the simplicity. However, will it also be fastest?
Code Question
This question involves the attempts I made so far to make asynchronous I/O work. I understand that it's a big piece of code, so it's not that easy to look into. In any case, I would really appreciate any attempt.
To decrease execution time I try to write the output from a new thread using the WINAPI, via CreateFile() with FILE_FLAG_OVERLAPPED and an OVERLAPPED structure. I have created a sample program in which I try to get this to work. However, I encountered 2 problems:
The file is only opened in overlapped mode when I delete an already existing file (I have tried using CreateFile in different modes (CREATE_ALWAYS, CREATE_NEW, OPEN_EXISTING), but this does not help).
Only the first WriteFile is executed asynchronously; the remaining WriteFile calls are synchronous. For this problem I already consulted http://support.microsoft.com/kb/156932. It seems that the problem I have is related to the fact that "any write operation to a file that extends its length will be synchronous". I've already tried to solve this by increasing the file size/valid data size (commented region in the code). However, I still do not get it to work. I'm aware that to get the most out of asynchronous I/O I should probably use CreateFile with FILE_FLAG_NO_BUFFERING, however I cannot get this to work either.
Please note that the program creates a file of about 120 MB in the path of execution. Also note that the "not ok" print statements are not desirable; I would like to see "can do work in background" appear on my screen... What goes wrong here?
#include <windows.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ASYNC // remove this definition to run synchronously (i.e. using fwrite)
#ifdef ASYNC
struct _OVERLAPPED *pOverlapped;
HANDLE *pEventH;
HANDLE *pFile;
#else
FILE *pFile;
#endif
#define DIM_X 100
#define DIM_Y 150000
#define _PRINTERROR(msgs)\
{printf("file: %s, line: %d, %s",__FILE__,__LINE__,msgs);\
fflush(stdout);\
return 0;} \
#define _PRINTF(msgs)\
{printf(msgs);\
fflush(stdout);} \
#define _START_TIMER \
time_t time1,time2; \
clock_t clock1; \
time(&time1); \
printf("start time: %s",ctime(&time1)); \
fflush(stdout);
#define _END_TIMER\
time(&time2);\
clock1 = clock();\
printf("end time: %s",ctime(&time2));\
printf("elapsed processor time: %.2f\n",(((float)clock1)/CLOCKS_PER_SEC));\
fflush(stdout);
double aio_dat[DIM_Y] = {0};
double do_compute(double A,double B, int arr_len);
int main()
{
_START_TIMER;
const char *pName = "test1.bin";
DWORD dwBytesToWrite;
BOOL bErrorFlag = FALSE;
int j=0;
int i=0;
int fOverlapped=0;
#ifdef ASYNC
// create / open the file
pFile=CreateFile(pName,
GENERIC_WRITE, // open for writing
0, // share write access
NULL, // default security
CREATE_ALWAYS, // create new/overwrite existing
FILE_FLAG_OVERLAPPED, // | FILE_FLAG_NO_BUFFERING, // overlapped file
NULL); // no attr. template
// check whether file opening was ok
if(pFile==INVALID_HANDLE_VALUE){
printf("%x\n",GetLastError());
_PRINTERROR("file not opened properly\n");
}
// make the overlapped structure
pOverlapped = calloc(1,sizeof(struct _OVERLAPPED));
pOverlapped->Offset = 0;
pOverlapped->OffsetHigh = 0;
// put event handle in overlapped structure
if(!(pOverlapped->hEvent = CreateEvent(NULL,TRUE,FALSE,NULL))){
printf("%x\n",GetLastError());
_PRINTERROR("error in createevent\n");
}
#else
pFile = fopen(pName,"wb");
#endif
// create some output
for(j=0;j<DIM_Y;j++){
aio_dat[j] = do_compute(i, j, DIM_X);
}
// determine how many bytes should be written
dwBytesToWrite = (DWORD)sizeof(aio_dat);
for(i=0;i<DIM_X;i++){ // do this DIM_X times
#ifdef ASYNC
//if(i>0){
//SetFilePointer(pFile,dwBytesToWrite,NULL,FILE_CURRENT);
//if(!(SetEndOfFile(pFile))){
// printf("%i\n",pFile);
// _PRINTERROR("error in set end of file\n");
//}
//SetFilePointer(pFile,-dwBytesToWrite,NULL,FILE_CURRENT);
//}
// write the bytes
if(!(bErrorFlag = WriteFile(pFile,aio_dat,dwBytesToWrite,NULL,pOverlapped))){
// check whether io pending or some other error
if(GetLastError()!=ERROR_IO_PENDING){
printf("lasterror: %x\n",GetLastError());
_PRINTERROR("error while writing file\n");
}
else{
fOverlapped=1;
}
}
else{
// if you get here output got immediately written; bad!
fOverlapped=0;
}
if(fOverlapped){
// do background, this msgs is what I want to see
for(j=0;j<DIM_Y;j++){
aio_dat[j] = do_compute(i, j, DIM_X);
}
for(j=0;j<DIM_Y;j++){
aio_dat[j] = do_compute(i, j, DIM_X);
}
_PRINTF("can do work in background\n");
}
else{
// not overlapped, this message is bad
_PRINTF("not ok\n");
}
// wait to continue
if((WaitForSingleObject(pOverlapped->hEvent,INFINITE))!=WAIT_OBJECT_0){
_PRINTERROR("waiting did not succeed\n");
}
// reset event structure
if(!(ResetEvent(pOverlapped->hEvent))){
printf("%x\n",GetLastError());
_PRINTERROR("error in resetevent\n");
}
pOverlapped->Offset+=dwBytesToWrite;
#else
fwrite(aio_dat,sizeof(double),DIM_Y,pFile);
for(j=0;j<DIM_Y;j++){
aio_dat[j] = do_compute(i, j, DIM_X);
}
for(j=0;j<DIM_Y;j++){
aio_dat[j] = do_compute(i, j, DIM_X);
}
#endif
}
#ifdef ASYNC
CloseHandle(pFile);
free(pOverlapped);
#else
fclose(pFile);
#endif
_END_TIMER;
return 1;
}
double do_compute(double A,double B, int arr_len)
{
int i;
double res = 0;
double *xA = malloc(arr_len * sizeof(double));
double *xB = malloc(arr_len * sizeof(double));
if ( !xA || !xB )
abort();
for (i = 0; i < arr_len; i++) {
xA[i] = sin(A);
xB[i] = cos(B);
res = res + xA[i]*xA[i];
}
free(xA);
free(xB);
return res;
}
Useful links
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/cref_cls/common/cppref_asynchioC_aio_read_write_eg.htm
http://www.ibm.com/developerworks/linux/library/l-async/?ca=dgr-lnxw02aUsingPOISIXAIOAPI
http://www.flounder.com/asynchexplorer.htm#Asynchronous%20I/O
I know this is a big question and I would like to thank everybody in advance who takes the trouble reading it and perhaps even respond!
You should be able to get this to work using the OVERLAPPED structure.
You're on the right track: the system is preventing you from writing asynchronously because every WriteFile extends the size of the file. However, you're doing the file size extension wrong. Simply calling SetEndOfFile will not actually reserve space in the MFT. Use the SetFileValidData function. This will allocate clusters for your file (note that they will contain whatever garbage the disk had there) and you should be able to execute WriteFile and your computation in parallel (a sketch follows below).
I would stay away from FILE_FLAG_NO_BUFFERING. You're after more performance with parallelism I presume? Don't prevent the cache from doing its job.
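A minimal sketch of that preallocation step, written in terms of the question's variables (pFile, dwBytesToWrite, DIM_X). It is untested; note the caveats in the comments:

/* Sketch (untested): pre-extend the file and mark the data valid so that
 * subsequent WriteFile calls within that range do not extend the file and
 * can therefore complete asynchronously. SetFileValidData requires the
 * SE_MANAGE_VOLUME_NAME privilege to be enabled, and the "valid" region
 * exposes whatever stale data was on disk until it is overwritten. */
LARGE_INTEGER totalSize;
totalSize.QuadPart = (LONGLONG)DIM_X * dwBytesToWrite;   /* final file size */

if (!SetFilePointerEx(pFile, totalSize, NULL, FILE_BEGIN) ||
    !SetEndOfFile(pFile) ||
    !SetFileValidData(pFile, totalSize.QuadPart)) {
    printf("preallocation failed: %lx\n", GetLastError());
}

/* rewind before issuing the overlapped writes at explicit offsets */
LARGE_INTEGER zero = {0};
SetFilePointerEx(pFile, zero, NULL, FILE_BEGIN);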
Another option that you did not consider is a memory mapped file. Those are available on Windows and Linux. There is a handy Boost abstraction that you could use.
With a memory mapped file, every thread in your process could write its output to the file on its own time, assuming that the record sizes are known and each thread has its own output area.
The operating system will take care of writing the mapped pages to disk when needed, when it gets around to it, or when you close the file. Now that I think about it, some operating systems may require that you call msync to guarantee it.
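For the Linux side, here is a rough sketch of what a memory-mapped output file can look like with plain POSIX mmap (rather than the Boost wrapper mentioned above); the file size and record layout are made up for illustration and error handling is kept minimal:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* Sketch: create an output file of known size, map it, and copy records
       into disjoint regions; the kernel writes the pages back for you. */
    size_t file_size = 4096 * 1024;                  /* assumed, known up front */
    int fd = open("out.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
    ftruncate(fd, (off_t)file_size);                 /* reserve the full size   */
    char *base = mmap(NULL, file_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* each writer copies its record to its own offset; no locking is needed
       as long as the regions do not overlap */
    const char record[] = "some output record";
    memcpy(base + 0, record, sizeof(record));

    msync(base, file_size, MS_SYNC);                 /* force the pages to disk */
    munmap(base, file_size);
    close(fd);
    return 0;
}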
I don't see why you would want to write asynchronously. Doing things in parallel does not make them faster in all cases: if you write two files at the same time to the same disk, it will almost always be a lot slower. If that is your case, just write them one after another.
If you have some fancy drive like an SSD or a virtual RAM drive, parallel writing could be faster. You would have to create the file at its full size first and then do your parallel magic.
Asynchronous writing is nice, but it is done by any OS anyway (writes are buffered and flushed in the background). The potential gain for you is that you can do other things than writing to disk, like displaying a progress bar. This is where multi-threading can help you.
So imho you should use serial writing or parallel writing to multiple disks.
hth