High performance reading - linux/pthreads

I have moderately large binary file consisting of independent blocks like this:
header1
data1
header2
data2
header3
data3
...
The number of blocks, the size of each block and the total size of the file vary quite a lot, but typical numbers are ~1000 blocks and an average block size of 100 kB. The files are generated by an external application which I have no control over, but I want to read them as fast as possible. In many cases I am only interested in a fraction (e.g. 10%) of the blocks, and this is the case I want to optimize for.
My current implementation is like this:
Open the file and read all the headers - using information in the header to fseek() to the next header location; retain an open FILE * pointer.
When data is requested use fseek() to locate the data block, read all the data and return it.
This works fine - but I was thinking maybe(?) it was possible to speed things up using e.g. aio, mmap or other techniques I have only heard of.
Any thoughts?
Joakim

The speed difference between mmap and read is not that big (both need to read the data from disk); the biggest advantage of mmap is avoiding the double buffering.
If you are only interested in 10% of the contents, your biggest saving will be to not read the other 90%. This could be done by reading only the headers, and seeking either to the next header or to the wanted data block. But it all depends on the file format, which the OP did not show in detail.
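A minimal sketch of the "read only the headers, then fetch the wanted blocks" idea. The real file format is not shown in the question, so `struct block_header` (an id plus a data size, in native byte order) is purely hypothetical, as are the names `build_index` and `read_block`. It uses pread so that no shared file position has to be maintained, which also makes it usable from several threads.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* HYPOTHETICAL on-disk header: the real format is not shown in the question. */
struct block_header {
    uint32_t id;        /* block identifier */
    uint32_t data_size; /* number of data bytes following this header */
};

struct block_index {
    uint32_t id;
    uint32_t data_size;
    off_t    data_offset; /* where this block's data starts in the file */
};

/* Scan only the headers, skipping over the data. Returns the number of
 * blocks indexed (at most max), or -1 on error or a truncated header. */
static long build_index(int fd, struct block_index *idx, long max)
{
    off_t pos = 0;
    long n = 0;
    while (n < max) {
        struct block_header h;
        ssize_t got = pread(fd, &h, sizeof h, pos);
        if (got == 0)
            break;                      /* clean end of file */
        if (got != (ssize_t)sizeof h)
            return -1;
        idx[n].id = h.id;
        idx[n].data_size = h.data_size;
        idx[n].data_offset = pos + (off_t)sizeof h;
        pos = idx[n].data_offset + (off_t)h.data_size;
        n++;
    }
    return n;
}

/* Read one block's data into a malloc'd buffer (caller frees); NULL on error. */
static unsigned char *read_block(int fd, const struct block_index *b)
{
    unsigned char *buf = malloc(b->data_size ? b->data_size : 1);
    size_t done = 0;
    if (!buf)
        return NULL;
    while (done < b->data_size) {       /* pread may return fewer bytes */
        ssize_t got = pread(fd, buf + done, b->data_size - done,
                            b->data_offset + (off_t)done);
        if (got <= 0) {
            free(buf);
            return NULL;
        }
        done += (size_t)got;
    }
    return buf;
}
```

With roughly 1000 blocks the index scan costs about 1000 small reads, and only the requested 10% of the data is ever pulled from disk.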

Most of the time is probably spent in accessing the disk. So perhaps buying an SSD is sensible. (Whatever you do, your application is I/O bound).
Apparently, your file is only about 100 MB. You could get it into the kernel's file (page) cache just by reading it, e.g. with cat yourfile > /dev/null before running your program. For such a small file (on a reasonable machine it fits in RAM), I wouldn't worry that much.
You could pre-process the file, e.g. to build a database (SQLite, or a real RDBMS like PostgreSQL) or just a gdbm indexed file.
If using <stdio.h> you might set a bigger buffer with setbuffer (or the standard setvbuf), or call fopen with a "rmt" mode (the m is a GNU glibc extension that asks it to try mmap-ing the file).
You could use mmap with madvise.
You could (perhaps in a separate thread) use the readahead syscall.
But your file seems small enough that you should not bother that much. Are you sure it is really a performance issue? Do you read that file many thousand times per day, or do you have many hundreds of such files?
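As an illustration of hinting the kernel, here is a small sketch using posix_fadvise with POSIX_FADV_WILLNEED, which asks for a byte range to be brought into the page cache ahead of the actual read. The helper name `prefetch_range` is made up for this example; whether the hint helps at all depends on the kernel, the filesystem and the access pattern, so it should be measured.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>

/* Ask the kernel to start loading [offset, offset + len) into the page cache.
 * Returns 0 on success or an error number (posix_fadvise does not set errno). */
static int prefetch_range(int fd, off_t offset, off_t len)
{
    return posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
}
```

One plausible use is to call it for the offset and size of every block that will be needed soon, then do the real reads afterwards.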

Related

Reading a file of arbitrary length in C

What's the most idiomatic/efficient way to read a file of arbitrary length in C?
Get the filesize of the file in bytes and issue a single fread()
Keep fread()ing a constant size buffer until getting EOF
Anything else?

Avoid using any technique which requires knowing the size of the file in advance. That leaves exactly one technique: read the file a bit at a time, in blocks of a convenient size.
Here's why you don't want to try to find the filesize in advance:
If it is not a regular file, there may not be any way to tell. For example, you might be reading directly from a console, or taking piped input from a previous data generator. If your program requires the filesize to be knowable, these useful input mechanisms will not be available to your users, who will complain or choose a different tool.
Even if you can figure out the filesize, you have no way of preventing it from changing while you are reading the file. If you are not careful about how you read the file, you might open a vulnerability which could be exploited by adversarial programs.
For example, if you allocate a buffer of the "correct" size and then read until you get an end-of-file condition, you may end up overwriting random memory. (Multiple reads may be necessary if you use an interface like read() which might read less data than requested.) Or you might find that the file has been truncated; if you don't check the amount of data read, you might end up processing uninitialised memory leading to information leakage.
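A minimal sketch of the "read a bit at a time" approach, for the case where the whole content is still wanted in one buffer. It never asks for the file size, grows the buffer as needed, and reports exactly how many bytes were read. The function name `read_all` and the 64 KiB starting size are arbitrary choices for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read everything from fp until end-of-file. On success returns a malloc'd
 * buffer (caller frees) and stores its length in *len_out; NULL on error. */
static unsigned char *read_all(FILE *fp, size_t *len_out)
{
    size_t cap = 64 * 1024, len = 0;
    unsigned char *buf = malloc(cap);
    if (!buf)
        return NULL;
    for (;;) {
        len += fread(buf + len, 1, cap - len, fp);
        if (len < cap)
            break;                      /* short read: EOF or error */
        if (cap > (size_t)-1 / 2) {     /* doubling would overflow */
            free(buf);
            return NULL;
        }
        unsigned char *tmp = realloc(buf, cap * 2);
        if (!tmp) {
            free(buf);
            return NULL;
        }
        buf = tmp;
        cap *= 2;
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    *len_out = len;
    return buf;
}
```

Because the length comes from what fread actually returned, a file that grows or shrinks during the read cannot cause an overrun or expose uninitialised memory. The same function works on pipes and terminals.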

In practice, you usually don't need to keep the entire file content in memory. You'll often parse the file (notably if it is textual), or at least read the file in smaller pieces, and for that you don't need it entirely in memory. For a textual file, reading it line-by-line (perhaps with some state inside your parser) is often enough (using fgets or getline).
Files exist (notably on disks or SSDs) because they can usually be much "bigger" than your computer's RAM. In fact, files were invented (more than 50 years ago) to make it possible to deal with data larger than memory. Distributed file systems can also be very big (and accessed remotely even from a laptop, e.g. by NFS, CIFS, etc.).
Some file systems are capable of storing petabytes of data (on supercomputers), with individual files of many terabytes (much larger than available RAM).
You are also likely to use databases. These routinely hold terabytes of data. See also this answer (about realistic size of sqlite databases).
If you really want to read a file entirely in memory using stdio (but you should avoid doing that, because you generally want your program to be able to handle a lot of data on files; so reading the entire file in memory is generally a design error), you indeed could loop on fread (or fscanf, or even fgetc) till end-of-file. Notice that feof is useful only after some input operation.
On current laptop or desktop computers, you could prefer (for efficiency) to use buffers of a few megabytes, and you certainly can deal with big files of several hundreds of gigabytes (much larger than your RAM).
On POSIX file systems, you might do memory mapped IO with e.g. mmap(2) - but that might not be faster than read(2) with large buffers (of a few megabytes). You could use readahead(2) (Linux specific) and posix_fadvise(2) (or madvise(2) if using mmap) to tune performance by hinting your OS kernel.
If you have to code for Microsoft Windows, you could study its WinAPI and find some way to do memory mapped IO.
In practice, file data (notably if it was accessed recently) often stays in the page cache, which is of paramount importance for performance. When that is not the case, your hardware (disk, controller, ...) becomes the bottleneck and your program becomes I/O bound (in that case, no software trick can significantly improve the performance).

Improve throughput writing a lot of small files in C

I want to improve the throughput of a program that writes several, usually small, files to a network-attached volume.
The volume is limited to 100 IOPS and 80 MB/s of bandwidth.
At the moment I saturate the 100 IOPS, but the bandwidth I get is very far from the reachable 80 MB/s: about 4 MB/s, often less.
I believe the main issue is that we make a lot of small requests; those small requests saturate the IOPS while the bandwidth is left mostly unexploited.
The software is written in C and I control pretty much everything down to the actual write syscall.
At the moment the architecture is multithreaded, with several threads working as "spoolers" and making synchronous write calls, each for a different file.
So suppose we have files a, b and c and threads t1, t2 and t3.
t1 will open a and call in a loop something like write(fd_a, buff_a, 1024), and t2 (write(fd_b, buff_b, 1024)) and t3 (write(fd_c, buff_c, 1024)) will do the same.
Each file is a new file, so it gets created at the first write.
I believe the problem is that the requests the OS is making (after the Linux IO scheduler merges them) are pretty small, on the order of 10-20 blocks (5-10 kilobytes) each.
The only way I see to fix the issue is to make bigger requests, but each file is small so I am not quite sure what the best way forward is.
One possible idea is to make a single write request instead of a loop of several requests: look up how big the file is, allocate enough memory, populate the buffer and finally execute a single write.
Another idea is to switch to async IO, but I don't yet understand what the advantages would be in this case.
Do you have any other suggestion?
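The "single write request" idea from the question could look roughly like the sketch below: the whole file content is assembled in memory first and handed to the kernel in one write() call, with a loop only to cover partial writes. The name `write_whole_file` is invented for this example. Whether this actually lowers the IOPS count on the network volume is an assumption that has to be measured, since request merging still depends on the filesystem and the page cache.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) path and write the complete buffer to it.
 * Returns 0 on success, -1 on error. */
static int write_whole_file(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    while (len > 0) {
        ssize_t n = write(fd, p, len);  /* normally completes in one call */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted: just retry */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return close(fd);
}
```

Compared with a loop of 1024-byte writes, this gives the kernel the complete file at once, so it has the best chance of issuing large requests.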

You can put all the files into a tar archive in memory. Then you can write the tar archive as one large request and extract it in a separate process, which frees up the writing program.
Here is an idea that is a bit more "creative". First put the files into groups based on where they are being saved (possibly by directory). Then find the largest file in the group. Pad the content of every other file so that all files are the same size. Then append the files to each other so that you have one large file, and send that as a single write request. Now one large file has been written that contains a lot of equally sized smaller files, so use the Linux split command to split it back into the original files (https://kb.iu.edu/d/afar). This could work, but you have to be OK with having padding at the end of the files.
EDIT: It is important to note that these solutions are not scalable. The long term solution would be what @AndrewHenle suggested in the comments.

Read a file after write and closing it in C

My code does the following
1. Do 100 times: open a new file; write 10M of data; close it.
2. Open the 100 files together, read and merge their data into a larger file.
3. Do steps 1 and 2 many times in a loop.
I was wondering if I can keep the 100 files open without opening and closing them so many times. What I can do is fopen them with w+. After writing I set the position to the beginning to read, after reading I set the position to the beginning to write, and so on.
The questions are:
if I read after write w/o closing, do we always read all the written data
would this save some overhead? File open and close must have some overhead, but is this overhead large enough to save?
Based on the comments and discussion, I will explain why I need to do this in my work. It is also related to my other post
how to convert large row-based tables into column-based tables efficently
I have a calculation that generates a stream of results. So far the results are saved in a row-storage table. This table has 1M columns, and each column could be 10M long. Each column is one attribute the calculation produces. As the calculation runs, I dump and append the intermediate results to the table. The intermediate results could be 2 or 3 double values for each column. I want to dump them early because they already consume >16M of memory, and the calculation needs more memory. This ends up as a table like the following
aabbcc...zzaabbcc..zz.........aabb...zz
A row of data is stored together. The problem appears when I want to analyze the data column by column. I have to read 16 bytes, then seek to the next row to read 16 bytes, and keep going. There are too many seeks; it is much slower than if all values of a column were stored together so I could read them sequentially.
I can make the calculation dump less frequently. But to make the later read more efficient, I may want to have 4K of data stored together, since I assume each fread gets 4K by default even if I read only 16 bytes. But this means I need to buffer 1M*4K = 4G in memory...
So I was thinking about whether I can merge fragments of data into larger chunks, as this post describes
how to convert large row-based tables into column-based tables efficently
So I wanted to use files as offline buffers. I may need 256 files to get 4K of contiguous data after the merge if each file contains 1M entries of 2 doubles. This work can be done asynchronously with respect to the main calculation. But I wanted to ensure the merge overhead is small, so that when it runs in parallel it can finish before the main calculation is done. So I came up with this question.
I guess this is closely related to how column-based databases are constructed. When people create them, do they have similar issues? Is there any description of how this works at creation time?

You can use w+ as long as the maximum number of open files on your system allows it; this is usually 255 or 1024, and can be set (e.g. on Unix by ulimit).
But I'm not too sure this will be worth the effort.
On the other hand, 100 files of 10M each is one gigabyte; you might want to experiment with a RAM disk. Or with a large file system cache.
I suspect that larger savings might be reaped by analyzing your specific problem structure. Why is it 100 files? Why 10 M? What kind of "merge" are you doing? Are those 100 files always accessed in the same order and with the same frequency? Could some data be kept in RAM and never be written at all?
Update
So, you have several large buffers like,
ABCDEFG...
ABCDEFG...
ABCDEFG...
and you want to pivot them so they read
AAA...
BBB...
CCC...
If you already have the total size (i.e., you know that you are going to write 10 GB of data), you can do this with two files, pre-allocating the output file and using fseek() to write into it. With memory-mapped files, this should be quite efficient. In practice, row Y, column X (of 1,000,000 columns) has been dumped at address 16*X in file Y.dat; you need to write it to address 16*(X*R + Y) in largeoutput.dat, where R is the total number of rows, so that all values of one column end up contiguous.
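The address arithmetic for the pivot can be sketched as follows, assuming the goal is a column-major output in which cell (row Y, column X) lands at byte 16*(X*R + Y), with R the total number of rows. The names `pivot_offset` and `scatter_row` are invented for this example, and the cell-by-cell pread/pwrite loop is only meant to show the offsets; a real implementation would batch many cells per call or use a memory mapping.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <unistd.h>

#define CELL_BYTES 16   /* two doubles per cell, as in the question */

/* Byte offset of cell (row, col) in the column-major output file. */
static off_t pivot_offset(off_t row, off_t col, off_t num_rows)
{
    return CELL_BYTES * (col * num_rows + row);
}

/* Copy one row dump (num_cols cells stored back to back in in_fd) into
 * its column-major positions in out_fd. Returns 0 on success, -1 on error. */
static int scatter_row(int in_fd, int out_fd, off_t row,
                       off_t num_rows, off_t num_cols)
{
    unsigned char cell[CELL_BYTES];
    for (off_t col = 0; col < num_cols; col++) {
        if (pread(in_fd, cell, CELL_BYTES, CELL_BYTES * col) != CELL_BYTES)
            return -1;
        if (pwrite(out_fd, cell, CELL_BYTES,
                   pivot_offset(row, col, num_rows)) != CELL_BYTES)
            return -1;
    }
    return 0;
}
```

After every row file has been scattered this way, reading one column is a single sequential read of R * 16 bytes starting at pivot_offset(0, X, R).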
Actually, you could write the data even during the first calculation. Or you could have two processes communicating via a pipe, one calculating, one writing to both row-column and column-row files, so that you can monitor the performances of each.
Frankly, I think that adding more RAM and/or a fast I/O layer (SSD maybe?) could get you more bang for the same buck. Your time costs money too, and the memory will remain available after this one job has been completed.

Yes. You can keep the 100 files open without doing the opening-closing-opening cycle. Most systems do have a limit on the number of open files though.
if I read after write w/o closing, do we always read all the written data
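That is up to you. You can fseek to wherever you want in the file and read data from there; it all depends on your logic. Note that on a stream opened for update you must call fflush or a positioning function (fseek, fsetpos, rewind) between a write and a following read.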
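A small sketch of reading data back through the same "w+" stream without closing it. The C standard does not allow input to follow output directly on an update stream; an fflush or a positioning call has to come in between, and the fseek here serves that purpose. The function name `write_then_read` is made up for this example.

```c
#include <stdio.h>

/* Write len bytes to fp (opened with "w+"), rewind, and read them back
 * into dst. Returns the number of bytes read back, or 0 on failure. */
static size_t write_then_read(FILE *fp, const void *src, void *dst, size_t len)
{
    if (fwrite(src, 1, len, fp) != len)
        return 0;
    /* Required between output and input on an update stream. */
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return 0;
    return fread(dst, 1, len, fp);
}
```

Before switching back to writing, another fseek (or rewind) is needed unless the read ran into end-of-file.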
would this save some overhead? File open and close must have some overhead, but is this overhead large enough to save?
This would definitely save some overhead, such as additional unnecessary I/O operations. Also, on some systems the content you write to a file is not immediately flushed to the physical file; it may be buffered and flushed periodically and/or at the time of fclose.
So such overheads are saved, but the real question is what you achieve by saving them. How does it fit into the overall picture of your application? This is a call you must make before deciding on the logic.

Getting data from MATLAB Simulink every 0.008s in .txt file

I need to get data from my Simulink model, write it to a txt file, and have another program read it, all of this every 0.008 s.
Is there any way to do it? All I could manage is to get the data into the workspace.
Also, the system is discrete.

You should use a To File block to save the data to disk. It will figure out the correct buffer size, etc., for you and write the data to disk. You just have to poll from the other program to get new data.
8 milliseconds' worth of data is generally not enough to justify the overhead of disk IO, so the To File block needs more than this before it writes to disk, and your other program needs more than this to read. This obviously introduces latency.
If you want a lower-latency solution, consider using the UDP or TCP communication blocks that exist in the DSP System Toolbox library.

Of course, it's impossible to say anything without a lot more detail.
How much data? What operating system? What happens if you "miss"? What kind of disk is the file on? Does it really have to be a file on-disk, can't you use e.g. pipes or something to avoid hitting disk? What does the "other program" have to do with the data?
8 milliseconds is not a lot of time for a disk to do anything; for this to work you are basically assuming that all accesses hit the cache, so factor out the disk. Use a pipe or a RAM disk.
8 milliseconds is also not a lot of time for a typical desktop operating system.

mmap( ) vs read( )

I'm writing a bulk ID3 tag editor in C. ID3 tags are usually at the beginning of an mp3 encoded file, although older (version 1) tags are at the end. The app is designed to accept a directory and frame ID list from the command line, then recurse the directory structure updating all the ID3 tags it finds. The user may additionally choose to remove all older (version 1) tags. Another option is to simply display the current tags, without performing an update. The directory might contain 2 files or 2 million. If the user means to update the files, I was planning to load the entire file into memory, perform the updates, then save it (the file may be renamed as well). However, if the user only means to print the current ID3 tags, then loading the entire file seems excessive. After all, the file could be 200 MB.
I've read through this thread, which was insightful - mmap() vs. reading blocks
So my question is, what is the most efficient way to go about this -- read(), mmap() or some combination? Design ideas welcome.
Edit: It's my understanding that mmap essentially delegates loading a file into memory, to the virtual memory subsystem. It seems to me, the VMM would be highly optimized on most systems as it's critical for system performance.

It really depends on what you're trying to do. If all you need to do is hop to a known offset and read out a small tag, read() may be faster (mmap() has to do some rather complex internal accounting). If you are planning on copying out all 200 MB of the MP3, however, or scanning it for some tag that may appear at an unknown offset, then mmap() is likely a faster approach.
For example, if you need to shift the entire file down a few hundred bytes in order to insert an ID3 tag, one simple approach would be to expand the file with ftruncate(), mmap the file, then memmove() the contents down a bit. This, however, will destroy the file if your program crashes while it's running. You could also copy the contents of the file into a new file - this is another place where mmap() really shines; you can simply mmap() the old file, then copy all of its data into the new file with a single write().
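The ftruncate() + mmap() + memmove() approach described above might be sketched like this. The name `make_room_at_front` is invented, the descriptor is assumed to be open for reading and writing on a regular file, and, as the answer warns, a crash in the middle leaves the file damaged.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Grow the file by gap bytes and shift the old contents down, leaving
 * gap zeroed bytes at the front. NOT crash-safe. Returns 0 or -1. */
static int make_room_at_front(int fd, size_t gap)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    size_t old_len = (size_t)st.st_size;
    size_t new_len = old_len + gap;
    if (new_len == 0)
        return 0;                       /* nothing to map */
    if (ftruncate(fd, (off_t)new_len) != 0)
        return -1;
    unsigned char *p = mmap(NULL, new_len, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    memmove(p + gap, p, old_len);       /* regions overlap: memmove, not memcpy */
    memset(p, 0, gap);                  /* space for the new tag */
    int rc = msync(p, new_len, MS_SYNC);
    if (munmap(p, new_len) != 0)
        rc = -1;
    return rc;
}
```

The new tag can then be written into the first gap bytes, for example with pwrite at offset 0.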
In short, mmap() is great if you're doing a large amount of IO in terms of total bytes transferred; this is because it reduces the number of copies needed, and can significantly reduce the number of kernel entries needed for reading cached data. However mmap() requires a minimum of two trips into the kernel (three if you clean up the mapping when you're done!) and does some complex internal kernel accounting, and so the fixed overhead can be high.
read() on the other hand involves an extra memory-to-memory copy, and can thus be inefficient for large I/O operations, but is simple, and so the fixed overhead is relatively low. In short, use mmap() for large bulk I/O, and read() or pread() for one-off, small I/Os.
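For the "one-off, small I/O" case, a pread() sketch that checks for a version 1 tag without loading the file: an ID3v1 tag occupies the last 128 bytes of the file and starts with the three characters "TAG". The function name `has_id3v1` is just for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define ID3V1_SIZE 128  /* an ID3v1 tag is the final 128 bytes of the file */

/* Returns 1 if the file ends with an ID3v1 tag, 0 if not, -1 on error. */
static int has_id3v1(int fd)
{
    struct stat st;
    char magic[3];
    if (fstat(fd, &st) != 0)
        return -1;
    if (st.st_size < ID3V1_SIZE)
        return 0;                       /* too short to hold a tag */
    if (pread(fd, magic, sizeof magic, st.st_size - ID3V1_SIZE)
            != (ssize_t)sizeof magic)
        return -1;
    return memcmp(magic, "TAG", 3) == 0;
}
```

Only three bytes are transferred regardless of whether the MP3 is 2 MB or 200 MB, which is the situation where the fixed cost of setting up a mapping is hard to justify.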

Don't bother with mmap unless your code is CPU bound, specifically due to lots of small reads and writes. mmap may sound nice, but it isn't the awesome "why isn't everyone using this" alternative it looks like.
Given that you're recursing through potentially large directory structures, your bottleneck will be directory IO and concurrency. mmap is not going to help.
Update0
Reading the linked-to question turns up this answer, which supports my experience:
mmap() vs. reading blocks

If you're not normally going to be streaming the file in and then processing it, but rather hopping around (like reading the tags at the front and then jumping to the end, etc.), then I would use mmap simply because your code will be cleaner and easier to maintain when it treats the file as a large buffer, without having to actually manage the buffering and paging yourself.
As has been mentioned, if you're processing a lot of data, disk I/O is likely going to dominate your processing anyway. mmap may be faster than read, but for reasonable implementations it's likely not THAT much faster, especially on today's hardware, which has continually got faster and faster while disk drives have been stuck at 7200 and 10,000 RPM for years and years.
So, go with mmap and make your code easy and neat.

I don't know whether standard POSIX functions are among what you are allowed or willing to use for development, but consider these two functions:
int ftruncate(int fildes, off_t length);
int truncate(const char *path, off_t length);
defined in unistd.h, which can be used to truncate a file to a specified length. This way you could easily
find where the ID3 tag frame begins (I don't know if you can compute it easily by just reading the header of the MP3 file, but I guess so)
save the offset
close the file
truncate the file with one of the functions above
open the file in append binary mode and write the new tags
I'm not sure about the performance, so you should test this method, but it should load much less data into RAM while still being a sensible way of doing it.
