Read times of really small files on Linux with ext4 - c

I'm working on comparing the performance of ext4 and NTFS on different file operations. As part of the same, I'm benchmarking reads on really small files (a few bytes) using C. These are of special interest since such small files are stored within the MFT on Windows but on ext4 would have their own disk block. So I should be able to observe a significant difference between read times of such files on NTFS against ext4. The tests are on a set of 100 files 4 bytes each. I measure the time read takes on each file.
The plots for read times, however, show patterns that I'm not able to explain. Here's 4 graphs:
The first graph was generated by running echo 1 > /proc/sys/vm/drop_caches before reading each file, the second by echo 2 > /proc/sys/vm/drop_caches, the third by echo 3 > /proc/sys/vm/drop_caches, and the fourth without freeing any cache. For the second and third cases, why do the times go up and down? Shouldn't the times stay within the same range throughout?
My first approximation was that since these files are really small, a lot of them get stored in one block and thus multiple files get read in a single block read. Although I'm definitely not sure if this is right. Is it possible that some sort of prefetching of an entire file is going on (although I'd imagine this would not be a good thing to do)? I also thought of checking the actual disk blocks of these files using debugfs but I'm not sure if logical block numbers would correspond to the actual disk blocks.
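For reference, a per-file measurement of the kind described above could look like the following minimal sketch. It assumes POSIX read() and clock_gettime(); the function name time_small_read is made up for illustration and is not the asker's actual benchmark code.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Times only the read() call on one small file (open/close are excluded).
 * Returns the elapsed time in nanoseconds, or -1 on error.
 * The number of bytes actually read is stored in *nread. */
long long time_small_read(const char *path, char *buf, size_t cap, ssize_t *nread)
{
    struct timespec t0, t1;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    *nread = read(fd, buf, cap);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    if (*nread < 0)
        return -1;
    return (t1.tv_sec - t0.tv_sec) * 1000000000LL + (t1.tv_nsec - t0.tv_nsec);
}
```

Calling this once per file and logging the result gives one data point per file. Whether open() is inside or outside the timed region changes what is being measured, since the inode lookup happens in open().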

Related

Giving read() a start position

When you give read a start position - does it slow down read()? Does it have to read everything before the position to find the text it's looking for?
In other words, we have two different read commands,
read(fd,1000,2000)
read(fd,50000,51000)
where we give it two arguments:
read(file descriptor, start, end)
is there a way to implement read so that the two commands take the same amount of computing time?
You don't name a specific file system implementation or one specific language library so I will comment in general.
In general, a file interface will be built directly on top of the OS level file interface. In the OS level interface for most types of drives, data can be read in sectors with random access. The drive can seek to the start of a particular sector (without reading data) and can then read that sector without reading any of the data before it in the file. Because data is typically read in chunks by sector, if the data you request doesn't perfectly align on a sector boundary, it's possible the OS will read the entire sector containing the first byte you requested, but it won't be a lot and won't make a meaningful difference in performance as once the read/write head is positioned correctly, a sector is typically read in one DMA transfer.
Disk access times to read a given set of bytes for a spinning hard drive are not entirely predictable so it's not possible to design a function that will take exactly the same time no matter which bytes you're reading. This is because there's OS level caching, disk controller level caching and a difference in seek time for the read/write head depending upon what the read/write head was doing beforehand. If there are any other processes or services running on your system (which there always are) some of them may also be using the disk and contending for disk access too. In addition, depending upon how your files were written and how many bytes you're reading and how well your files are optimized, all the bytes you read may or may not be in one long readable sequence. It's possible the drive head may have to read some bytes, then seek to a new position on the disk and then read some more. All of that is not entirely predictable.
Oh, and some of this is different if it's a different type of drive (like an SSD) since there's no drive head to seek.
When you give read a start position - does it slow down read()?
No. The OS reads the directory entry to find out where the file is located on the disk, then calculates where on the disk your desired read should be, seeks to that position on the disk and starts reading.
Does it have to read everything before the position to find the text it's looking for?
No. Since it reads sectors at a time, it may read a few bytes before what you requested (whatever is before it in the sector), but sectors are not huge (typically 512 bytes or 4K) and are typically read in one fell swoop using DMA, so the extra part of the sector before your desired data is not likely to be noticeable.
Is there a way to implement read so that the two commands take the same amount of computing time?
So no, not really. Disk reads, even of identical number of bytes vary a bit depending upon the situation and what else might be happening on the computer and what else might be cached already by the OS or the drive itself.
If you share what problem you're really trying to solve, we could probably suggest alternate approaches rather than relying on a given disk read taking an exact amount of time.
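As an illustration of reading from a start position without touching the bytes before it, here is a minimal sketch using POSIX pread(). Note that the real POSIX read() takes (fd, buffer, count) rather than the (fd, start, end) form used in the question; the helper name read_range is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Reads up to len bytes starting at byte offset `start` of the file.
 * pread() positions directly at the offset; nothing before it is read
 * by the caller. Returns the number of bytes read, or -1 on error. */
ssize_t read_range(const char *path, off_t start, void *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, buf, len, start);
    close(fd);
    return n;
}
```

With this, read_range(path, 1000, buf, 1000) and read_range(path, 50000, buf, 1000) issue the same amount of work from the program's point of view; any timing difference comes from caching and the device, as described above.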
Well, filesystems usually split the data in a file into even-sized blocks. In most file systems the allocated blocks are organized in trees with a high branching factor, so finding the nth data block takes effectively the same time as finding the first data block of the file, computing-wise.
The only general exception to this rule is the brain-damaged floppy disk file system FAT from Microsoft that should have become extinct in 1980s, because in it the blocks of the file are organized in a singly-linked list so to find the nth block you need to scan through n items in the list. Of course decent operating systems then have all sorts of tricks to address the shortcomings here.
Then the next thing is that your reads should touch the same number of blocks or operating system memory pages. Usually operating system pages are 4K nowadays and disk blocks something like 4k too so having every count being a multiple of 4096, 8192 or 16384 is better design than to have decimal even numbers.
i.e.
read(fd, 4096, 8192)
read(fd, 50 * 4096, 51 * 4096)
While it does not affect the computing time in a multiprocessing system, the type of media affects a lot: in magnetic disks the heads need to move around to find the new read position, and the disk must have spun to be in the reading position whereas SSDs have identical random access timings regardless of where on disk the data is positioned. And additionally the operating system might cache frequently accessed locations or expect that the block that is read after N would be N + 1 and hence such order be faster. But most of the time you wouldn't care.
Finally: perhaps instead of read you should consider using memory mapped I/O for random accesses!
Read typically reads data from the given file descriptor into a buffer. The amount of data it reads is from start (arg2) - end (arg3). More generically put the amount of data read can be found with (end-start). So if you have the following reads
read(fd1, 0xffff, 0xffffffff)
and
read(fd2, 0xf, 0xff)
the second read will be quicker because its end (0xff) minus its start (0xf) is smaller than the first read's end (0xffffffff) minus its start (0xffff). In other words, fewer bytes are being read.

Read a file after write and closing it in C

My code does the following:
1. Do 100 times: open a new file; write 10M data; close it
2. Open the 100 files together, read and merge their data into a larger file
3. Do steps 1 and 2 many times in a loop
I was wondering if I can keep the 100 files open without opening and closing them so many times. What I can do is fopen them with w+. After writing, I set the position to the beginning to read; after reading, I set the position to the beginning to write; and so on.
The questions are:
if I read after write w/o closing, do we always read all the written data
would this save some overhead? File open and close must have some overhead, but is this overhead large enough to save?
Based on the comments and discussion, I will explain why I need to do this in my work. It is also related to my other post
how to convert large row-based tables into column-based tables efficently
I have a calculation that generates a stream of results. So far the results are saved in a row-storage table. This table has 1M columns, and each column could be 10M long. Each column is one attribute the calculation produces. As the calculation runs, I dump and append the intermediate results to the table. The intermediate results could be 2 or 3 double values for each column. I wanted to dump them soon because they already consume >16M memory, and the calculation needs more memory. This ends up as a table like the following
aabbcc...zzaabbcc..zz.........aabb...zz
A row of data are stored together. The problem happens when I want to analyze the data column by column. So I have to read 16 bytes then seek to the next row for reading 16 bytes then keep on going. There are too many seeks, it is much slower than if all columns are stored together so I can read them sequentially.
I can make the calculation dump less frequently. But to make the later read more efficient, I may want to have 4K of data stored together, since I assume each fread gets 4K by default even if I read only 16 bytes. But this means I need to buffer 1M*4K = 4G in memory...
So I was thinking about whether I can merge the data fragments into larger chunks, as this post describes
how to convert large row-based tables into column-based tables efficently
So I wanted to use files as offline buffers. I may need 256 files to get a 4K contiguous data after merge if each file contains 1M of 2 doubles. This work can be done as an asynchronous way in terms of the main calculation. But I wanted to ensure the merge overhead is small so when it runs in parallel it can finish before the main calculation is done. So I came up with this question.
I guess this is closely related to how column-based databases are constructed. When people create them, do they have similar issues? Is there any description of how this works at creation time?
You can use w+ as long as the maximum number of open files on your system allows it; this is usually 255 or 1024, and can be set (e.g. on Unix by ulimit).
But I'm not too sure this will be worth the effort.
On the other hand, 100 files of 10M each is one gigabyte; you might want to experiment with a RAM disk. Or with a large file system cache.
I suspect that larger savings might be reaped by analyzing your specific problem structure. Why is it 100 files? Why 10 M? What kind of "merge" are you doing? Are those 100 files always accessed in the same order and with the same frequency? Could some data be kept in RAM and never be written at all?
Update
So, you have several large buffers like,
ABCDEFG...
ABCDEFG...
ABCDEFG...
and you want to pivot them so they read
AAA...
BBB...
CCC...
If you already have the total size (i.e., you know that you are going to write 10 GB of data), you can do this with two files, pre-allocating the output file and using fseek() to write into it. With memory-mapped files, this should be quite efficient. In practice, row Y, column X (of 1,000,000 columns) has been dumped at address 16*X in file Y.dat; to store each column contiguously you need to write it to address 16*(X*R + Y) in largeoutput.dat, where R is the total number of rows.
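A minimal sketch of this scatter step follows, assuming a column-by-column output layout in which cell (row, col) lands at byte 16*(col*nrows + row). It uses POSIX pwrite() instead of fseek()+fwrite(); the names REC_SIZE, pivot_offset and write_row_pivoted are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

#define REC_SIZE 16  /* assumed: two doubles per cell, as in the question */

/* Byte offset of cell (row, col) in a column-major output file. */
off_t pivot_offset(long row, long col, long nrows)
{
    return (off_t)REC_SIZE * ((off_t)col * nrows + row);
}

/* Scatters one dumped row (ncols records of REC_SIZE bytes) into the
 * output file so that each column ends up contiguous.
 * Returns 0 on success, -1 on a failed or short write. */
int write_row_pivoted(int out_fd, const void *row_data,
                      long row, long ncols, long nrows)
{
    const char *p = row_data;
    for (long col = 0; col < ncols; col++) {
        ssize_t n = pwrite(out_fd, p + col * REC_SIZE, REC_SIZE,
                           pivot_offset(row, col, nrows));
        if (n != REC_SIZE)
            return -1;
    }
    return 0;
}
```

One pwrite() per 16-byte record means a system call per cell, which is likely too slow at the sizes in the question; buffering several rows and writing longer runs per column, or writing through a memory mapping, would reduce that cost.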
Actually, you could write the data even during the first calculation. Or you could have two processes communicating via a pipe, one calculating, one writing to both row-column and column-row files, so that you can monitor the performances of each.
Frankly, I think that adding more RAM and/or a fast I/O layer (SSD maybe?) could get you more bang for the same buck. Your time costs too, and the memory will remain available after this one work has been completed.
Yes. You can keep the 100 files open without doing the opening-closing-opening cycle. Most systems do have a limit on the number of open files though.
if I read after write w/o closing, do we always read all the written data
It depends on you. You can fseek to wherever you want in the file and read data from there. It is entirely up to you and your logic.
would this save some overhead? File open and close must have some overhead, but is this overhead large enough to save?
This would definitely save some overhead, such as additional unnecessary I/O operations. Also, on some systems the content you write to a file is not immediately flushed to the physical file; it may be buffered and flushed periodically, or flushed at the time of fclose.
So, such overheads are saved, but, the real question is what do you achieve by saving such overheads? How does it suit you in the overall picture of your application? This is the call which you must take before deciding on the logic.
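The write-then-read pattern on a stream opened with w+ can be sketched as below. The C standard requires a call to fflush() or a file-positioning function (fseek, fsetpos, rewind) between output and subsequent input on an update stream; here rewind() serves that purpose. The helper name write_then_read is hypothetical.

```c
#include <stdio.h>

/* Writes len bytes at the start of an update ("w+") stream, then reads
 * them back without closing the file. Returns the number of bytes read
 * back, or 0 if the write failed. */
size_t write_then_read(FILE *fp, const void *src, void *dst, size_t len)
{
    rewind(fp);                       /* position at the start for writing */
    if (fwrite(src, 1, len, fp) != len)
        return 0;
    /* A positioning call (or fflush) is required between output and
     * input on an update stream; rewind() also moves to the start. */
    rewind(fp);
    return fread(dst, 1, len, fp);
}
```

Note that rewinding does not truncate the file: if a later pass writes less data than an earlier one, the old tail is still there, so the reader needs to know how many bytes are valid.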

How to read/write at Maximum Speed from Hard Disk.Multi Threaded program I coded cannot go above 15 mb/ sec

I have 256 CSV files totalling 5 GB, which I need to read at optimum speed and then write back in binary form.
I made the following arrangements to achieve it:
For each file, there is one corresponding thread.
I am using the C functions fscanf and fwrite.
But Resource Monitor shows no more than 12 MB/s for the hard disk, with Highest Active Time at 100%.
Google says a hard disk can read/write at up to 100 MB/s.
Machine configuration:
Intel Core i7, 3.4 GHz, 8 cores.
Please give me your perspective.
My aim is to complete this process within 1 minute.
**Using one thread, it took me 12 minutes.**
If all the files reside on the same disk, using multiple threads is likely to be counter-productive. If you read from many files in parallel, the HDD heads will keep moving back and forth between different areas of the disk, drastically reducing throughput.
I would measure how long it takes a built-in OS utility to read the files (on Unix, something like dd or cat into /dev/null) and then use that as a baseline, bearing in mind that you also need to write stuff back. Writing can be costly both in terms of throughput and seek times.
I would then come up with a single-threaded implementation that reads and writes data in large chunks, and see whether I can get it to perform similarly to the OS tools.
P.S. If you have 5 GB of data and your HDD's top raw throughput is 100 MB/s, reading alone takes about 50 seconds; since you also need to write the converted data back onto the same disk, your goal of 1 minute is not realistic.
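The single-threaded, large-chunk approach suggested above could start from a sketch like this one. It only copies bytes through one big buffer, to establish an I/O baseline; the CSV-to-binary conversion itself is left out, and the name copy_in_chunks is made up for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Copies src to dst through a single heap buffer of chunk_size bytes.
 * Returns the number of bytes copied, or -1 on any error. */
long long copy_in_chunks(const char *src, const char *dst, size_t chunk_size)
{
    FILE *in = fopen(src, "rb");
    if (!in)
        return -1;
    FILE *out = fopen(dst, "wb");
    if (!out) { fclose(in); return -1; }
    char *buf = malloc(chunk_size);
    if (!buf) { fclose(in); fclose(out); return -1; }

    long long total = 0;
    int ok = 1;
    size_t n;
    while ((n = fread(buf, 1, chunk_size, in)) > 0) {
        /* A real converter would parse buf here instead of copying it. */
        if (fwrite(buf, 1, n, out) != n) { ok = 0; break; }
        total += (long long)n;
    }
    if (ferror(in))
        ok = 0;
    free(buf);
    fclose(in);
    if (fclose(out) != 0)   /* fclose flushes; a failure means lost data */
        ok = 0;
    return ok ? total : -1;
}
```

Timing this with a chunk size of a few megabytes and comparing it against the OS utility gives a rough upper bound on what any converter on the same disk can achieve.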

Will writing million times to a file, spoil my harddisk?

I have an I/O-intensive simulation program that logs the simulation trace/data to a file at every iteration. As the simulation runs for millions of iterations or more and logs the data to a file on the disk (overwriting the file each time), I am curious to know whether that would spoil the hard disk, since most storage disks have an upper limit on write/erase cycles (e.g. flash disks allow up to 100,000 write/erase cycles). Would splitting the file into multiple files be a better option?
You need to recognize that a million write calls to a single file may only write to each block of the disk once, which doesn't cause any harm to magnetic disks or SSD devices. If you overwrite the first block of the file one million times, you run a greater risk of wearing things out, but there are lots of mitigating factors. First, if it is a single run of a program, the o/s is likely to keep the disk image in memory without writing to disk at all in the interim — unless, perhaps, you're using a journalled file system. If it is a journalled file system, then the actual writing will be spread over lots of different blocks.
If you manage to write to the same block on a magnetic spinning hard disk a million times, you are still not at serious risk of wearing the disk out.
A Google search on 'hard disk write cycles' shows a lot of informative articles (more particularly, perhaps, about SSD), and the related searches may also help you out.
On an SSD, there is a limited number of writes (or erase cycles, to be more accurate) to any particular block. It's probably more than 100K to 1 million to any given block, and SSDs use "wear leveling" to avoid repeatedly writing to the same physical block. SSDs can only write zeros, so when you "reset" a bit to one, you have to erase the whole block. [One could put an inverter on the cell to make it the other way around, but you get one or t'other, so it doesn't help much].
Real hard disks are more of a mechanical device, so there isn't so much of a problem with how many times you write to the same place; it's more the head movements.
I wouldn't worry too much about it. Writing one file should be fine, it has little consequence whether you have one file or many.

High performance reading - linux/pthreads

I have moderately large binary file consisting of independent blocks like this:
header1
data1
header2
data2
header3
data3
...
The number of blocks, the size of each block and the total size of the file vary quite a lot, but typical numbers are ~1000 blocks and an average block size of 100 kB. The files are generated by an external application which I have no control over, but I want to read them as fast as possible. In many cases I am only interested in a fraction (e.g. 10%) of the blocks, and this is the case I will optimize for.
My current implementation is like this:
Open the file and read all the headers - using information in the header to fseek() to the next header location; retain an open FILE * pointer.
When data is requested use fseek() to locate the data block, read all the data and return it.
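The two steps above can be sketched as follows. The real header layout is not shown in the question, so this assumes a purely hypothetical format in which each header is a single uint32_t (host byte order) giving the size of the data that follows; BlockInfo, build_index and read_block are illustrative names.

```c
#include <stdint.h>
#include <stdio.h>

/* Where one block's data starts and how long it is. */
typedef struct {
    long     data_offset;
    uint32_t data_size;
} BlockInfo;

/* Step 1: read only the headers, skipping over the data with fseek().
 * Fills index[] with at most max entries.
 * Returns the number of blocks found, or -1 on a seek error. */
int build_index(FILE *fp, BlockInfo *index, int max)
{
    int count = 0;
    uint32_t size;
    rewind(fp);
    while (count < max && fread(&size, sizeof size, 1, fp) == 1) {
        index[count].data_offset = ftell(fp);
        index[count].data_size = size;
        count++;
        if (fseek(fp, (long)size, SEEK_CUR) != 0)
            return -1;
    }
    return count;
}

/* Step 2: fetch one block's data on request.
 * buf must hold at least b->data_size bytes. Returns 0 on success. */
int read_block(FILE *fp, const BlockInfo *b, void *buf)
{
    if (fseek(fp, b->data_offset, SEEK_SET) != 0)
        return -1;
    return fread(buf, 1, b->data_size, fp) == b->data_size ? 0 : -1;
}
```

Because fseek() past the end of a file does not fail, a truncated final block would still be indexed here; a production reader would compare offsets against the file size.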
This works fine - but I was thinking maybe(?) it was possible to speed things up using e.g. aio, mmap or other techniques I have only heard of.
Any thoughts?
Joakim
The speed difference between mmap and read is not that big (both need to read the data from disk), the biggest advantage of mmap is avoiding the double buffering.
If you are only interested in 10% of the contents, your biggest saving will be to not read the other 90%. This could be done by only reading the headers, and seeking to the next header or to the data block wanted. But it all depends on the fileformat, which the OP did not show in detail.
Most of the time is probably spent in accessing the disk. So perhaps buying an SSD is sensible. (Whatever you do, your application is I/O bound).
Apparently, your file is only about 100 MB. You could get it into the kernel's file (page) cache just by reading it, e.g. with cat yourfile > /dev/null before running your program. For such a small file (on a reasonable machine it fits in RAM), I wouldn't worry that much.
You could pre-process the file, e.g. to make a database (for sqlite, or a real RDBMS like PostgreSQL) or just a gdbm indexed file.
If using <stdio.h> you might set a bigger buffer with setbuffer, or call fopen with an "rmt" mode (the m is a GNU Glibc extension that asks for the file to be mmap-ed).
You could use mmap with madvise.
You could (perhaps in a separate thread) use the readahead syscall.
But your file seems small enough that you should not bother that much. Are you sure it is really a performance issue? Do you read that file many thousand times per day, or do you have many hundreds of such files?
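For the mmap with madvise suggestion, a minimal sketch on Linux/glibc might look like this. MADV_RANDOM is one possible hint for the "only 10% of the blocks" access pattern; whether it helps is something to measure rather than assume. The function name map_file is hypothetical.

```c
#define _DEFAULT_SOURCE 1   /* assumed glibc: exposes madvise() */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Maps a whole file read-only and hints that access will be random.
 * Returns the mapping (size stored in *len), or NULL on error or if the
 * file is empty. The caller releases it with munmap(). */
const unsigned char *map_file(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);               /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return NULL;
    /* Advisory only: the return value is deliberately ignored. */
    madvise(p, (size_t)st.st_size, MADV_RANDOM);
    *len = (size_t)st.st_size;
    return p;
}
```

With the file mapped, a block is reached by pointer arithmetic on the returned address, and only the pages actually touched are read from disk.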

Resources