I'm going to first admit that this is for a class project, since it will be pretty obvious. We are supposed to do reads to probe for the block size of the filesystem. My problem is that the time taken appears to increase linearly, with none of the steps I would expect.
I am timing the read like this:
double startTime = getticks();
read = fread(x, 1, toRead, fp);
double endTime = getticks();
where getticks uses rdtsc instructions. I am afraid caching/prefetching is making the data available before the fread, so the read itself takes almost no time. I tried creating a random file between each execution of my program, but that has not alleviated the problem.
What is the best way to accurately measure the time taken for a read from disk? I am pretty sure my block size is 4096, but how can I get data to support that?
The usual way of determining filesystem block size is to ask the filesystem what its blocksize is.
#include <sys/statvfs.h>
#include <stdio.h>

int main(void) {
    struct statvfs fs_stat;
    if (statvfs(".", &fs_stat) != 0) {
        perror("statvfs");
        return 1;
    }
    printf("%lu\n", fs_stat.f_bsize);
    return 0;
}
But if you really want, open(…,…|O_DIRECT) or posix_fadvise(…,…,…,POSIX_FADV_DONTNEED) will try to let you bypass the kernel's buffer cache (not guaranteed).
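For instance, here is a minimal sketch of the posix_fadvise route, assuming Linux and a hypothetical file name; the call is advisory, so the kernel may keep the pages anyway:

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDONLY);   /* hypothetical file name */
    if (fd < 0)
        return 1;

    /* flush dirty pages first; POSIX_FADV_DONTNEED only drops clean pages */
    fsync(fd);
    /* ask the kernel to discard cached pages for the whole file
       (len 0 means "to end of file"); advisory, not guaranteed */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    char buf[4096];
    ssize_t n = read(fd, buf, sizeof buf); /* this read should now hit the disk */
    (void)n;
    close(fd);
    return 0;
}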
You may want to use the system calls (open(), read(), write(), ...) directly to reduce the impact of the buffering done by the FILE* stuff.
Also, you may want to use synchronous I/O somehow. One way is opening the file with the O_SYNC flag set (or O_DIRECT as per ephemient's reply).
Quoting the Linux open(2) manual page:
O_SYNC The file is opened for synchronous I/O. Any write(2)s on the
resulting file descriptor will block the calling process until
the data has been physically written to the underlying hardware.
But see NOTES below.
Another option would be mounting the filesystem with -o sync (see mount(8)) or setting the S attribute on the file using the chattr(1) command.
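A minimal sketch of the O_SYNC route, with a hypothetical file name:

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    /* O_SYNC makes each write block until the data reaches the hardware */
    int fd = open("log.txt", O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (fd < 0)
        return 1;
    write(fd, "hello\n", 6);  /* returns only after the data is on disk */
    close(fd);
    return 0;
}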
Related
When is the kernel buffer cache empty? This does not seem to be line buffering: if I write() a string without a newline character, it is immediately output to the file.
In addition, do the input and output buffers of a socket also use the kernel buffer cache, like disk I/O? Also, do the kernel-space input and output buffers used by read() and write() exist per open file descriptor (fd)?
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* O_CREAT needs a mode argument */
    int fd = open("text", O_RDWR | O_CREAT, 0644);
    write(fd, "message", strlen("message"));
    // I can check the string in the file without fsync(fd).
    sleep(30);
    close(fd);
    return 0;
}
When is page cache bypassed?
The page cache is bypassed using direct I/O (a minimal sketch follows below), provided that:
the file is opened with the O_DIRECT flag
the offset/address alignment constraints are met
no extending writes are performed
See this link for more information.
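Here is a sketch of a direct I/O read meeting those constraints, assuming Linux and guessing 4096 as the device's logical block size:

#define _GNU_SOURCE   /* O_DIRECT is Linux-specific */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    void *buf;
    /* the buffer address must be aligned; 4096 is a guess at the
       logical block size of the device */
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;

    int fd = open("data.bin", O_RDONLY | O_DIRECT);  /* hypothetical name */
    if (fd < 0)
        return 1;

    /* the offset (0 here) and the length must be aligned as well */
    ssize_t n = read(fd, buf, 4096);
    (void)n;
    close(fd);
    free(buf);
    return 0;
}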
(I'm assuming Linux for the answers below)
When is the kernel buffer cache empty?
You would need more context for this to be answerable. Additionally, since you seem to be making files in a filesystem, I'll refer to the kernel cache being used as the page cache; see the "What is the major difference between the buffer cache and the page cache?" Quora question for the difference. For example, a write can be in the kernel page cache but not yet have made its way to disk (i.e. it is dirty), or it can be in BOTH the page cache AND on disk (i.e. it got written out but the kernel is choosing to hold on to it in RAM). Do you mean "made clean", or do you mean "entirely discarded from the page cache"? Or do you mean "when is the I/O done visible to other programs working on the same file"?
This does not seem to be line buffering
At the C library level there's a difference between I/O done on streams (which can be line buffered) and low-level I/O done on file descriptors. Your example is using file descriptors so there would never be line buffering. Further, C library buffering is orthogonal to kernel buffering.
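A tiny illustration of that split (redirect the output to a file to see it):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    printf("via stdio\n");                /* sits in the C library buffer */
    write(STDOUT_FILENO, "via fd\n", 7);  /* handed to the kernel immediately */
    /* redirected to a file, stdout is fully buffered, so "via fd" lands
       first and the stdio line only appears when the buffer flushes at exit */
    return 0;
}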
do the input and output buffers of a socket also use the kernel buffer cache, like disk I/O?
Sockets don't use the page cache as they aren't block or file backed. However, socket I/O IS buffered using sk_buff in the kernel.
Also, do the kernel-space input and output buffers used by read() and write() exist per open file descriptor (fd)?
Sorry, I don't understand the question. The page cache is shared for files/block devices so multiple file descriptors to the same file will be serviced by the same entries in the page cache (assuming they are requesting identical offsets).
(ETOOMANYQUESTIONS! #andoryu please can you limit one question per post? It's tough going for someone trying to answer otherwise. Thanks!)
I would like to know if we can use multiple threads to write binary data to the same file.
FILE *fd = fopen("test", "wb");
size_t SIZE = 1000000000;
int *table = malloc(sizeof(int) * SIZE);
// ... fill the table ...
fwrite(table, sizeof(*table), SIZE, fd);
So I wonder if I can use threads, with each thread calling fseek() to seek to a different location and then writing to the same file.
Any ideas?
fwrite should be thread safe, but you'll need a mutex anyway, because you need the seek and the write to be atomic. Depending on your platform, you might have a write function that takes an offset, or you might be able to open the file in each thread. A better option if you have everything in memory anyway as your code suggests, would just be for each thread to fill into a single large array and then write that out when everything is done.
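On POSIX systems, that offset-taking call is pwrite(2). A minimal sketch with two threads and placeholder names (compile with -pthread):

#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

enum { HALF = 4096 };
static int fd;
static char data[2 * HALF];

static void *writer(void *arg) {
    long i = (long)arg;
    /* each thread writes its own non-overlapping region; pwrite takes an
       explicit offset, so no seek+write pair needs to be made atomic */
    pwrite(fd, data + i * HALF, HALF, (off_t)(i * HALF));
    return NULL;
}

int main(void) {
    pthread_t t[2];
    fd = open("out.bin", O_WRONLY | O_CREAT, 0644);  /* placeholder name */
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, writer, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    close(fd);
    return 0;
}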
While fread() and fwrite() are thread safe, the stream buffer represented by the FILE* is not. So you can have multiple threads accessing the same file, but not via the same FILE* - each thread must have its own, and the file to which they refer must be shareable - which is OS dependent.
An alternative and possibly simpler approach is to use a memory mapped file, so that each thread treats the file as shared memory, and you let the OS deal with the file I/O. This has a significant advantage over normal file I/O as it is truly random access, so you don't need to worry about fseek() and sequential read/writes etc.
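A minimal sketch of that memory-mapped approach on POSIX (placeholder file name; the per-thread writes are represented by a single memset):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const size_t len = 1 << 20;
    int fd = open("out.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, len) != 0)   /* size the file first */
        return 1;

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return 1;

    /* threads could each fill their own disjoint slice of map[] here;
       no fseek needed, and the OS writes the pages back for us */
    memset(map, 0, len);

    munmap(map, len);
    close(fd);
    return 0;
}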
fseek and fwrite are thread-safe, so you can use them without additional synchronization.
Let each thread open the file, make sure they write to different positions, and finally let each thread close the file, and you're done.
Update:
This works on *IX-ish systems, at least.
I am using the low-level I/O function 'write' to write some data to disk in my code (C language on Linux). First, I accumulate the data in a memory buffer, and then I use 'write' to write the data to disk when the buffer is full. So what's the best buffer size for 'write'? According to my tests, bigger isn't always faster, so I am here to look for the answer.
There is probably some advantage in doing writes which are multiples of the filesystem block size, especially if you are updating a file in place. If you write less than a partial block to a file, the OS has to read the old block, combine in the new contents and then write it out. This doesn't necessarily happen if you rapidly write small pieces in sequence because the updates will be done on buffers in memory which are flushed later. Still, once in a while you could be triggering some inefficiency if you are not filling a block (and a properly aligned one: multiple of block size at an offset which is a multiple of the block size) with each write operation.
This issue of transfer size does not necessarily go away with mmap. If you map a file, and then memcpy some data into the map, you are making a page dirty. That page has to be flushed at some later time: it is indeterminate when. If you make another memcpy which touches the same page, that page could be clean by then and you're making it dirty again, so it gets written twice. Page-aligned copies in multiples of the page size will be the way to go.
You'll want it to be a multiple of the CPU page size, in order to use memory as efficiently as possible.
But ideally you want to use mmap instead, so that you never have to deal with buffers yourself.
You could use BUFSIZ defined in <stdio.h>
Otherwise, use a small multiple of the page size sysconf(_SC_PAGESIZE), e.g. twice that value; see the sketch at the end of this answer. Most Linux systems have 4 KB pages (which is often the same as, or a small multiple of, the filesystem block size).
As others replied, using the mmap(2) system call could help. GNU systems (e.g. Linux) have an extension: the mode string of fopen may contain the letter m, and when it does, the GNU libc tries to mmap.
If you deal with data nearly as large as your RAM (or half of it), you might want to also use madvise(2) to fine-tune the performance of mmap.
See also this answer to a question quite similar to yours. (You could use 64 KB as a reasonable buffer size.)
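A small sketch of that calculation (the factor of two and the 64 KB fallback are arbitrary choices):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE);
    /* twice the page size, falling back to 64 KB if the query fails */
    size_t bufsize = (page > 0) ? (size_t)(2 * page) : 65536;
    printf("using a %zu-byte buffer (BUFSIZ is %d)\n", bufsize, BUFSIZ);
    return 0;
}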
The "best" size depends a great deal on the underlying file system.
The stat and fstat calls fill in a data structure, struct stat, that includes the following field:
blksize_t st_blksize; /* blocksize for file system I/O */
The OS is responsible for filling this field with a "good size" for write() blocks. However, it's also important to call write() with memory that is "well aligned" (e.g., the result of malloc calls). The easiest way to get this to happen is to use the provided <stdio.h> stream interface (with FILE * objects).
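A minimal sketch of reading st_blksize for an open stream, with a placeholder file name:

#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    struct stat st;
    FILE *fp = fopen("out.bin", "wb");   /* placeholder name */
    if (!fp || fstat(fileno(fp), &st) != 0)
        return 1;
    /* st_blksize is the OS's preferred I/O transfer size for this file */
    printf("preferred block size: %ld\n", (long)st.st_blksize);
    fclose(fp);
    return 0;
}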
Using mmap, as in other answers here, can also be very fast for many cases. Note that it's not well suited to some kinds of streams (e.g., sockets and pipes) though.
It depends on the amount of RAM, VM, etc. as well as the amount of data being written. The more general answer is to benchmark what buffer works best for the load you're dealing with, and use what works the best.
Recently I ran into a "fun" problem with the Microsoft implementation of the CRTL: tmpfile places temp files in the root directory and completely ignores the temp file directory. This causes problems for users who do not have privileges to the root directory (say, on our cluster). Moreover, using _tempnam would require the application to remember to delete the temporary files, which it is unable to do without a considerable amount of rework.
Therefore I bit the bullet and wrote Win32 versions of all of the IO routines (create_temp, read, write, seek, flush) which call the appropriate method. One thing I've noticed is the now abysmal performance of the library.
Results from the test suite:
CRTL: 4:30.05 elapsed
Win32: 11:18.06 elapsed
Stats measured in my routines:
Writes: 3129934 ( 44,642,745,008 bytes)
Reads: 935903 ( 8,183,423,744 bytes)
Seeks: 2205757 (2,043,782,657,968 bytes traveled)
Flushes: 92442
Example of a CRTL v. Win32 method:
int io_write(FILE_POINTER fp, size_t words, const void *buffer)
{
#if !defined(USE_WIN32_IO)
    {
        size_t words_written = 0;

        /* write the data */
        words_written = fwrite(buffer, sizeof(uint32_t), words, fp);
        if (words_written != words)
        {
            return errno;
        }
    }
#else /* !defined(USE_WIN32_IO) */
    {
        DWORD bytesWritten;
        if (!WriteFile(fp, buffer, words * sizeof(uint32_t), &bytesWritten, NULL)
            || (bytesWritten != words * sizeof(uint32_t)))
        {
            return GetLastError();
        }
    }
#endif /* USE_WIN32_IO */
    return E_SUCCESS;
}
As you can see, they are effectively identical, yet the performance (in release mode) is wildly divergent. Time spent in WriteFile and SetFilePointer dwarfs the time spent in fwrite and fseeko, which seems counterintuitive.
Ideas?
UPDATE: perfmon shows that fflush is about 10x cheaper than FlushFileBuffers, and fwrite is ~1.1x slower than WriteFile. The net result is a huge performance loss when FlushFileBuffers is used in the same manner as fflush. Switching from FILE_ATTRIBUTE_NORMAL to FILE_FLAG_RANDOM_ACCESS made no difference either.
I think it's probably due to this issue, described on MSDN's page for FlushFileBuffers:
Due to disk caching interactions within the system, the FlushFileBuffers function can be inefficient when used after every write to a disk drive device when many writes are being performed separately. If an application is performing multiple writes to disk and also needs to ensure critical data is written to persistent media, the application should use unbuffered I/O instead of frequently calling FlushFileBuffers. To open a file for unbuffered I/O, call the CreateFile function with the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags. This prevents the file contents from being cached and flushes the metadata to disk with each write. For more information, see CreateFile.
In general, FlushFileBuffers is an "expensive" operation, since it flushes everything in the write-back cache:
FlushFileBuffers(): This function will flush everything in the write-back cache, as it does not know what part of the cache belongs to your file. This can take a lot of time, depending on the cache size and the speed of the media. How necessary is it? There is a thread which goes through and writes out dirty pages, so it is likely not very necessary.
I presume that fflush does not flush the entire write-back cache. In that case, it's much more efficient, but that efficiency comes at the risk of potential data loss. The CRT's source code for fflush confirms this, since _commit calls FlushFileBuffers:
/* lowio commit to ensure data is written to disk */
if (str->_flag & _IOCOMMIT) {
    return (_commit(_fileno(str)) ? EOF : 0);
}
From the implementation of _commit:
if ( !FlushFileBuffers((HANDLE)_get_osfhandle(filedes)) ) {
    retval = GetLastError();
}
Traditionally, the C runtime library functions buffer the data and only trigger the write operation when the buffer is full (hence the need for functions like fflush). I don't think WriteFile buffers the write operation, so every time you call WriteFile an I/O operation gets triggered, whereas with fwrite the I/O gets triggered only when the buffer has reached a certain size.
As you can see from your measurements, the buffered I/O tends to be more efficient...
I might be crazy, but wouldn't it be easier to just write a replacement for tmpfile that uses fopen(temporaryname, "wbTD+"), where you generate your own temporaryname?
At least then you don't have to worry about reimplementing <file.h>.
I'm still a little unclear on what the question is. You start out by talking about managing the lifetime of a temporary file and then jump to wrapping an entire file i/o interface. Are you asking about how to manage a temporary file without the performance penalty of wrapping all the file I/O? Or are you interested in how the CRT functions can be faster than the WinAPI functions they are built on top of?
Several of the comparisons being made between the C run-time functions and the WinAPi functions are of the apples and oranges variety.
The C run-time functions buffer the I/O in library memory. There is another layer of buffering (and caching) in the OS.
fflush flushes the data from the library buffers to the OS. It may go directly to disk, or it may go to OS buffers for later writing. FlushFileBuffers gets data from the OS buffers onto the disk, which generally takes longer than moving data from the library buffers to the OS buffers.
Unaligned writes are expensive. The OS buffers make unaligned writes possible, but they don't really speed up the process. The library buffers may accept several writes before pushing data to the OS, effectively reducing the number of unaligned writes to the disk.
It's also possible (though this is just a guess) that the library routines are taking advantage of overlapped (asynchronous) I/O to the disk, where your straight-to-WinAPI implementation is all synchronous.
I'm writing a program where performance is quite important, but not critical. Currently I am reading in text from a FILE* line by line and I use fgets to obtain each line. After using some performance tools, I've found that 20% to 30% of the time my application is running, it is inside fgets.
Are there faster ways to get a line of text? My application is single-threaded with no intentions to use multiple threads. Input could be from stdin or from a file. Thanks in advance.
You don't say which platform you are on, but if it is UNIX-like, then you may want to try the read() system call, which does not perform the extra layer of buffering that fgets() et al do. This may speed things up slightly, on the other hand it may well slow things down - the only way to find out is to try it and see.
Use fgets_unlocked(), but read carefully what it does first
Get the data with fgetc() or fgetc_unlocked() instead of fgets(). With fgets(), your data is copied into memory twice, first by the C runtime library from a file to an internal buffer (stream I/O is buffered), then from that internal buffer to an array in your program
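For instance, here is a sketch of a line reader built on getc_unlocked(), the POSIX relative of GNU's fgetc_unlocked(); the stream is locked once per line rather than once per character:

#include <stdio.h>

/* Reads one line (up to n-1 chars) into buf; returns its length, or -1 at EOF. */
static int get_line(FILE *fp, char *buf, int n) {
    int c = EOF, i = 0;
    flockfile(fp);   /* lock the stream once per line, not once per char */
    while (i < n - 1 && (c = getc_unlocked(fp)) != EOF) {
        buf[i++] = (char)c;
        if (c == '\n')
            break;
    }
    funlockfile(fp);
    buf[i] = '\0';
    return (i == 0 && c == EOF) ? -1 : i;
}

int main(void) {
    char line[256];
    while (get_line(stdin, line, sizeof line) >= 0)
        fputs(line, stdout);
    return 0;
}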
Read the whole file in one go into a buffer.
Process the lines from that buffer.
That's the fastest possible solution.
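A minimal sketch of that approach, assuming the file fits in RAM and using a placeholder file name:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    FILE *fp = fopen("input.txt", "rb");   /* placeholder name */
    if (!fp)
        return 1;

    /* find the file size, then slurp it in one read */
    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    rewind(fp);

    char *buf = malloc((size_t)size + 1);
    if (!buf || fread(buf, 1, (size_t)size, fp) != (size_t)size)
        return 1;
    buf[size] = '\0';
    fclose(fp);

    /* process lines in place; no further I/O happens here */
    for (char *line = strtok(buf, "\n"); line; line = strtok(NULL, "\n")) {
        /* ... work on line ... */
    }

    free(buf);
    return 0;
}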
You might try minimizing the amount of time you spend reading from the disk by reading large amounts of data into RAM then working on that. Reading from disk is slow, so minimize the amount of time you spend doing that by reading (ideally) the entire file once, then working on it.
Sorta like the way CPU cache minimizes the time the CPU actually goes back to RAM, you could use RAM to minimize the number of times you actually go to disk.
Depending on your environment, using setvbuf() to increase the size of the internal buffer used by file streams may or may not improve performance.
This is the syntax -
setvbuf (InputFile, NULL, _IOFBF, BUFFER_SIZE);
Where InputFile is a FILE* to a file just opened using fopen() and BUFFER_SIZE is the size of the buffer (which is allocated by this call for you).
You can try various buffer sizes to see if any have positive influence. Note that this is entirely optional, and your runtime may do absolutely nothing with this call.
If the data is coming from disk, you could be IO bound.
If that is the case, get a faster disk (but first check that you're getting the most out of your existing one...some Linux distributions don't optimize disk access out of the box (hdparm)), stage the data into memory (say by copying it to a RAM disk) ahead of time, or be prepared to wait.
If you are not IO bound, you could be wasting a lot of time copying. You could benefit from so-called zero-copy methods. Something like memory map the file and only access it through pointers.
That is a bit beyond my expertise, so you should do some reading or wait for more knowledgeable help.
BTW-- You might be getting into more work than the problem is worth; maybe a faster machine would solve all your problems...
NB-- It is not clear that you can memory map the standard input either...
If the OS supports it, you can try asynchronous file reading, that is, the file is read into memory whilst the CPU is busy doing something else. So, the code goes something like:
start asynchronous read
loop:
    wait for asynchronous read to complete
    if end of file goto exit
    start asynchronous read
    do stuff with data read from file
    goto loop
exit:
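On POSIX systems this can be sketched with the <aio.h> interface; this is one guess at realizing the loop above with a single buffer for brevity, where a real version would alternate two buffers so processing overlaps the read (older glibc needs -lrt):

#include <aio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    static char buf[65536];
    int fd = open("input.txt", O_RDONLY);   /* placeholder name */
    if (fd < 0)
        return 1;

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;

    aio_read(&cb);                           /* start asynchronous read */
    for (;;) {
        const struct aiocb *const list[1] = { &cb };
        aio_suspend(list, 1, NULL);          /* wait for it to complete */
        ssize_t n = aio_return(&cb);         /* bytes read; 0 at EOF */
        if (n <= 0)
            break;
        cb.aio_offset += n;
        aio_read(&cb);                       /* start the next read ... */
        /* ... and with a second buffer we could now process the n bytes
           just read while that next read is in flight */
    }
    close(fd);
    return 0;
}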
If you have more than one CPU then one CPU reads the file and parses the data into lines, the other CPU takes each line and processes it.