Does lseek() trigger an actual mechanical disk seeking movement?

Does lseek() trigger an actual mechanical disk seeking movement? - c

Consider the following code:
lseek(fd, 100, 0); /* Seek to the 100th byte in the file fd. */
write(fd, buf, n); /* Write from that position. */
lseek(fd, 0, 0); /* Is this necessary? Will it trigger a actual disk movement? */
I'd like to lseek back to the beginning of the file, in case another line of code continues writing from that position thinking that it starts at the beginning of the file. First, is this good practice? Second...
I'd like to know if an lseek does trigger an actual disk movement. Or, is the disk movement triggered only in the event of an actual reading or writing.
Disk seeking is a huge performance hit, and I'd like to know the tradeoffs between such defensive coding practices and performance.

Assuming that this is a Windows or Unix type system, a regular file and you did nothing fancy with file open flags, none of those functions will trigger a disk seek.
It is likely that in 5 seconds or so, the buffer containing that new file data will be written to disk, along with everything else that happened.
Also, the file position that lseek sets is an entirely imaginary property of a file. It controls where data will read or write to in the file by default, but there are many functions that simply override file position.
As to if it is good practice I don't think it matters much. However, I've gotten out of the habit of using seek functions when writing to files because of multithreading. You might want to use pread and pwrite by preference.

Related

What's the fastest way to read a file from start to finish?

I ran my code through a profiler and saw most of the time (60%) is spent reading files. It only takes a few milliseconds to run but I was wondering if I can make it any faster. My code has a list of 20 files (from 1k to 1M). It opens one, reads the entire file into ram, process it (sequential, reads everything once), then repeats open/read/process/close for the rest of the files reusing the same buffer
I was wondering if there's a way to make anything faster? I tried using posix_fadvise with POSIX_FADV_SEQUENTIAL and POSIX_FADV_WILLNEED, using the file offset len as 0, 0 and 0, st_size. It didn't seem to make a difference. I haven't yet written code to open all the files before reading. Would that make a difference? Should I be using posix_fadvise on all of them? Should I be using POSIX_FADV_SEQUENTIAL or POSIX_FADV_WILLNEED?

fadvise only really helps if:
It's issued well before you begin reading, or
You're reading the file piecemeal (ideally with some processing gaps between the reads).
If you're just slurping the whole file into RAM up front immediately after opening it, there's not much to be optimized; the file has to be read from beginning to end, and you haven't given the OS enough warning to cache it. Things to consider:
Opening and fadviseing file n+1 just before you begin reading from file n (so the OS is caching the next file in while you're processing the current file)
Using mmap+madvise(WILLNEED) to avoid the need to copy the file from kernel to user buffers all at once before you can begin processing; if the processing of the file is expensive enough, the subsequent pages may be read in by the time you've finished processing the early pages in the file.
Given these are small files, I'd just stick with WILLNEED; SEQUENTIAL enlarges the read-ahead buffer, but you're going to read the whole file anyway (possibly in bulk, where SEQUENTIAL won't help much) so you may as well cache the whole thing in as quickly as possible.

In my opinion, this code, extracted from the first edition of "The C programming language" its quite difficult to superate:
#include <stdio.h>
main()
{
int c;
while((c = getchar()) != EOF)
putchar(c);
}

Reading directly from a FILE buffer

The core of my app looks approximately as follows:
size_t bufsize;
char* buf1;
size_t r1;
FILE* f1=fopen("/path/to/file","rb");
...
do{
r1=fread(buf1, 1, bufsize, f1);
processChunk(buf1,r1);
} while (!feof(f1));
...
(In reality, I have multiple FILE*'s and multiple bufN's.) Now, I hear that FILE is quite ready to manage a buffer (referred to as a "stream buffer") all by itself, and this behavior appears to be quite tweakable: https://www.gnu.org/software/libc/manual/html_mono/libc.html#Controlling-Buffering .
How can I refactor the above piece of code to ditch the buf1 buffer and use f1's internal stream buffer instead (while setting it to bufsize)?

If you don't want opaquely buffered I/O, don't use FILE *. Use lower-level APIs that let you manage all the application-side buffering yourself, such as plain POSIX open() and read() for instance.

So I've read a little bit of the C standard and run some benchmarks and here are my findings:
1) Doing it as in the above example does involve unnecessary in-memory copying, which increases the user time of simple cmp program based on the above example about twice. Nevertheless user-time is insignificant for most IO-heavy programs, unless the source of the file is extremely fast.
On in-memory file-sources (/dev/shm on Linux), however, turning off FILE buffering (setvbuf(f1, NULL, _IONBF, 0);) does yield a nice and consistent speed increase of about 10–15% on my machine when using buffsizes close to BUFSIZ (again, measured on the IO-heavy cmp utility based on the above snippet, which I've already mentioned, which I've tested on 2 identical 700MB files 100 times).
2) Whereas there is an API for setting the FILE buffer, I haven't found any standardized API for reading it, so I'm going to stick with the true and tested way of doing, but with the FILE buffer off (setvbuf(f1, NULL, _IONBF, 0);)
(But I guess I could solve my question by setting my own buffer as the FILE stream buffer with the _IONBF mode option (=turn off buffering), and then I could just access it via some unstandardized pointer in the FILE struct.)

Understanding the need for fflush() and problems associated with it

Below is sample code for using fflush():
#include <string.h>
#include <stdio.h>
#include <conio.h>
#include <io.h>
void flush(FILE *stream);
int main(void)
{
FILE *stream;
char msg[] = "This is a test";
/* create a file */
stream = fopen("DUMMY.FIL", "w");
/* write some data to the file */
fwrite(msg, strlen(msg), 1, stream);
clrscr();
printf("Press any key to flush DUMMY.FIL:");
getch();
/* flush the data to DUMMY.FIL without closing it */
flush(stream);
printf("\nFile was flushed, Press any key to quit:");
getch();
return 0;
}
void flush(FILE *stream)
{
int duphandle;
/* flush the stream's internal buffer */
fflush(stream);
/* make a duplicate file handle */
duphandle = dup(fileno(stream));
/* close the duplicate handle to flush the DOS buffer */
close(duphandle);
}
All I know about fflush() is that it is a library function used to flush an output buffer. I want to know what is the basic purpose of using fflush(), and where can I use it. And mainly I am interested in knowing what problems can there be with using fflush().

It's a little hard to say what "can be problems with" (excessive?) use of fflush. All kinds of things can be, or become, problems, depending on your goals and approaches. Probably a better way to look at this is what the intent of fflush is.
The first thing to consider is that fflush is defined only on output streams. An output stream collects "things to write to a file" into a large(ish) buffer, and then writes that buffer to the file. The point of this collecting-up-and-writing-later is to improve speed/efficiency, in two ways:
On modern OSes, there's some penalty for crossing the user/kernel protection boundary (the system has to change some protection information in the CPU, etc). If you make a large number of OS-level write calls, you pay that penalty for each one. If you collect up, say, 8192 or so individual writes into one large buffer and then make one call, you remove most of that overhead.
On many modern OSes, each OS write call will try to optimize file performance in some way, e.g., by discovering that you've extended a short file to a longer one, and it would be good to move the disk block from point A on the disk to point B on the disk, so that the longer data can fit contiguously. (On older OSes, this is a separate "defragmentation" step you might run manually. You can think of this as the modern OS doing dynamic, instantaneous defragmentation.) If you were to write, say, 500 bytes, and then another 200, and then 700, and so on, it will do a lot of this work; but if you make one big call with, say, 8192 bytes, the OS can allocate a large block once, and put everything there and not have to re-defragment later.
So, the folks who provide your C library and its stdio stream implementation do whatever is appropriate on your OS to find a "reasonably optimal" block size, and to collect up all output into chunk of that size. (The numbers 4096, 8192, 16384, and 65536 often, today, tend to be good ones, but it really depends on the OS, and sometimes the underlying file system as well. Note that "bigger" is not always "better": streaming data in chunks of four gigabytes at a time will probably perform worse than doing it in chunks of 64 Kbytes, for instance.)
But this creates a problem. Suppose you're writing to a file, such as a log file with date-and-time stamps and messages, and your code is going to keep writing to that file later, but right now, it wants to suspend for a while and let a log-analyzer read the current contents of the log file. One option is to use fclose to close the log file, then fopen to open it again in order to append more data later. It's more efficient, though, to push any pending log messages to the underlying OS file, but keep the file open. That's what fflush does.
Buffering also creates another problem. Suppose your code has some bug, and it sometimes crashes but you're not sure if it's about to crash. And suppose you've written something and it's very important that this data get out to the underlying file system. You can call fflush to push the data through to the OS, before calling your potentially-bad code that might crash. (Sometimes this is good for debugging.)
Or, suppose you're on a Unix-like system, and have a fork system call. This call duplicates the entire user-space (makes a clone of the original process). The stdio buffers are in user space, so the clone has the same buffered-up-but-not-yet-written data that the original process had, at the time of the fork call. Here again, one way to solve the problem is to use fflush to push buffered data out just before doing the fork. If everything is out before the fork, there's nothing to duplicate; the fresh clone won't ever attempt to write the buffered-up data, as it no longer exists.
The more fflush-es you add, the more you're defeating the original idea of collecting up large chunks of data. That is, you are making a tradeoff: large chunks are more efficient, but are causing some other problem, so you make the decision: "be less efficient here, to solve a problem more important than mere efficiency". You call fflush.
Sometimes the problem is simply "debug the software". In that case, instead of repeatedly calling fflush, you can use functions like setbuf and setvbuf to alter the buffering behavior of a stdio stream. This is more convenient (fewer, or even no, code changes required—you can control the set-buffering call with a flag) than adding a lot of fflush calls, so that could be considered a "problem with use (or excessive-use) of fflush".

Well, #torek's answer is almost perfect, but there's one point which is not so accurate.
The first thing to consider is that fflush is defined only on output
streams.
According to man fflush, fflush can also be used in input streams:
For output streams, fflush() forces a write of all user-space
buffered data for the given output or update stream via the stream's
underlying write function. For
input streams, fflush() discards any buffered data that has been fetched from the underlying file, but has not been consumed by
the application. The open status of
the stream is unaffected.
So, when used in input, fflush just discard it.
Here is a demo to illustrate it:
#include<stdio.h>
#define MAXLINE 1024
int main(void) {
char buf[MAXLINE];
printf("prompt: ");
while (fgets(buf, MAXLINE, stdin) != NULL)
fflush(stdin);
if (fputs(buf, stdout) == EOF)
printf("output err");
exit(0);
}

fflush() empties the buffers related to the stream. if you e.g. let a user input some data in a very shot timespan (milliseconds) and write some stuff into a file, the writing and reading buffers may have some "reststuff" remaining in themselves. you call fflush() then to empty all the buffers and force standard outputs to be sure the next input you get is what the user pressed then.
reference: http://www.cplusplus.com/reference/cstdio/fflush/

How to implement a circular buffer using a file?

My application (C program) opens two file handles to the same file (one in write and one in read mode). Two separate threads in the app read from and write to the file. This works fine.
Since my app runs on embedded device with a limited ram disk size, I would like write FileHandle to wrap to beginning of file on reaching max size and the read FileHandle to follow like a circular buffer. I understand from answers to this question that this should work. However as soon as I do fseek of write FileHandle to beginning of file, fread returns error. Will the EOF get reset on doing fseek to beginning of file? If so, which function should be used to cause write file position to get set to 0 without causing EOF to be reset.
EDIT/UPDATE:
I tried couple of things:
Based on #neodelphi I used pipes this works. However my usecase requires I write to a file. I receive multiple channels of live video surveilance stream that needs to be stored to harddisk and also read back decoded and displayed on monitor.
Thanks to #Clement suggestions on doing ftell I fixed a couple of bugs in my code and wrap works for the reader however, the data read appears to be stale data since write are still buffered but reader reads stale content from hard disk. I cant avoid buffering due to performance considerations (I get 32Mbps of live data that needs to be written to harddisk). I have tried things like flushing writes only in the interval from when write wraps to when read wraps and truncating the file (ftruncate) after read wraps but this doesnt solve the stale data problem.
I am trying to use two files in ping-pong fashion to see if this solves the issue but want to know if there is a better solution

You should have something like that :
// Write
if(ftell(WriteHandle)>BUFFER_MAX) rewind (WriteHandle);
fwrite(WriteHandle,/* ... */);
// Read (assuming binary)
readSize = fread (buffer,1,READ_CHUNK_SIZE,ReadHandle);
if(readSize!=READ_CHUNK_SIZE){
rewind (ReadHandle);
if(fread (buffer+readSize,1,READ_CHUNK_SIZE-readSize,ReadHandle)!=READ_CHUNK_SIZE-readSize)
;// ERROR !
}
Not tested, but it gives an idea. The write should also handle the case BUFFER_MAX is not modulo WRITE_CHUNK_SIZE.
Also, you may read only if you are sure that the data has already been written. But I guess you already do that.

You could mmap the file into you're virtual memory and then just create a normal circular buffer with the pointer returned.
int fd = open(path, O_RDWR);
volatile void * mem = mmap(NULL, max_size, PROT_WRITE, MAP_SHARED, fd, 0);
volatile char * c_mem = (volatile char *)mem;
c_mem[index % max_size] = 'a'; // This line will now write to the offset index in the file
// Now doing
Can also probably be stricter on permissions depending on on exact case.

Read a line of input faster than fgets?

I'm writing a program where performance is quite important, but not critical. Currently I am reading in text from a FILE* line by line and I use fgets to obtain each line. After using some performance tools, I've found that 20% to 30% of the time my application is running, it is inside fgets.
Are there faster ways to get a line of text? My application is single-threaded with no intentions to use multiple threads. Input could be from stdin or from a file. Thanks in advance.

You don't say which platform you are on, but if it is UNIX-like, then you may want to try the read() system call, which does not perform the extra layer of buffering that fgets() et al do. This may speed things up slightly, on the other hand it may well slow things down - the only way to find out is to try it and see.

Use fgets_unlocked(), but read carefully what it does first
Get the data with fgetc() or fgetc_unlocked() instead of fgets(). With fgets(), your data is copied into memory twice, first by the C runtime library from a file to an internal buffer (stream I/O is buffered), then from that internal buffer to an array in your program

Read the whole file in one go into a buffer.
Process the lines from that buffer.
That's the fastest possible solution.

You might try minimizing the amount of time you spend reading from the disk by reading large amounts of data into RAM then working on that. Reading from disk is slow, so minimize the amount of time you spend doing that by reading (ideally) the entire file once, then working on it.
Sorta like the way CPU cache minimizes the time the CPU actually goes back to RAM, you could use RAM to minimize the number of times you actually go to disk.

Depending on your environment, using setvbuf() to increase the size of the internal buffer used by file streams may or may not improve performance.
This is the syntax -
setvbuf (InputFile, NULL, _IOFBF, BUFFER_SIZE);
Where InputFile is a FILE* to a file just opened using fopen() and BUFFER_SIZE is the size of the buffer (which is allocated by this call for you).
You can try various buffer sizes to see if any have positive influence. Note that this is entirely optional, and your runtime may do absolutely nothing with this call.

If the data is coming from disk, you could be IO bound.
If that is the case, get a faster disk (but first check that you're getting the most out of your existing one...some Linux distributions don't optimize disk access out of the box (hdparm)), stage the data into memory (say by copying it to a RAM disk) ahead of time, or be prepared to wait.
If you are not IO bound, you could be wasting a lot of time copying. You could benefit from so-called zero-copy methods. Something like memory map the file and only access it through pointers.
That is a bit beyond my expertise, so you should do some reading or wait for more knowledgeable help.
BTW-- You might be getting into more work than the problem is worth; maybe a faster machine would solve all your problems...
NB-- It is not clear that you can memory map the standard input either...

If the OS supports it, you can try asynchronous file reading, that is, the file is read into memory whilst the CPU is busy doing something else. So, the code goes something like:

start asynchronous read
loop:
wait for asynchronous read to complete
if end of file goto exit
start asynchronous read
do stuff with data read from file
goto loop
exit:
If you have more than one CPU then one CPU reads the file and parses the data into lines, the other CPU takes each line and processes it.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight