C BZ2_bzDecompress way slower than bzip2 command

I'm using mmap/read + BZ2_bzDecompress to sequentially decompress a large file (29GB). This is done because I need to parse the uncompressed xml data, but only need small bits of it, and it seemed like it would be way more efficient to do this sequentially than to uncompress the whole file (400GB uncompressed) and then parse it. Interestingly already the decompression part is extremely slow - while the shell command bzip2 is able to do a bit more than 52MB per second (used several runs of timeout 10 bzip2 -c -k -d input.bz2 > output and divided produced filesize by 10), my program is able to do not even 2MB/s, slowing down after a few seconds to 1.2MB/s
The file I'm trying to process uses multiple bz2 streams, so I'm checking BZ2_bzDecompress for BZ_STREAM_END, and if it occurs, use BZ2_bzDecompressEnd( strm ); and BZ2_bzDecompressInit( strm, 0, 0 ) to restart with the next stream, in case the file hasn't been completely processed. I also tried without BZ2_bzDecompressEnd but that didn't change anything (and I can't really see in the documentation how one should handle multiple streams correctly)
The file is mmap'ed beforehand. I tried different combinations of flags; currently I use PROT_READ and MAP_PRIVATE, with madvise set to MADV_SEQUENTIAL | MADV_WILLNEED | MADV_HUGEPAGE (I'm checking the return value, and madvise does not report any problems; I'm on a Debian setup with a Linux 3.2x kernel, which has hugepage support).
When profiling I made sure that, other than some counters for measuring speed and a printf limited to once every n iterations, nothing else was run. Also, this is on a modern multicore server processor where all other cores were idle, and it's bare metal, not virtualized.
Any ideas on what I could be doing wrong / do to improve performance?
Update: Thanks to James Chong's suggestion I tried "swapping" mmap() with read(), and the speed is still the same. So it seems mmap() is not the problem (either that, or mmap() and read() share an underlying problem)
Update 2: Thinking that maybe the malloc/free calls done in bzDecompressInit/bzDecompressEnd would be the cause, I set bzalloc/bzfree of the bz_stream struct to a custom implementation which only allocates memory the first time and does not free it unless a flag is set (passed by the opaque parameter = strm.opaque). It works perfectly fine, but again the speed did not increase.
Update 3: I also tried fread() instead of read() now, and still the speed stays the same. Also tried different amount of read bytes and decompressed-data-buffer sizes - no change.
Update 4: Reading speed is definitely not an issue, as I've been able to achieve speeds close to about 120MB/s in sequential reading using just mmap().

Swapping and mmap flags have little to do with it. If bzip2 is slow, it is not because of the file I/O.
I think your libbz2 wasn't fully optimized. Recompile it with the most aggressive gcc optimization flags you can think of.
My second idea is that there may be some ELF dynamic-linking overhead. In that case the problem will disappear if you link libbz2 statically. (After that you can work out how to make it fast with a dynamically loaded libbz2.)
Important extension from the future:
Libbz2 must be reentrant, thread-safe and position-independent. This means it has to be compiled with various C flags, and these flags do not have a good effect on performance (although they produce much faster code). In an extreme case I could even imagine a 5-10 times slowdown compared to the single-threaded, non-PIC, non-reentrant version.

Related

improving small file read times with USB2 attached ext2 volume

I'm a more experienced Windows programmer than I am a Linux programmer. Apologies if I'm missing something obvious.
I need to read >10,000 small files (~2->10k) on a USB2 attached ext2 volume running Linux. The distro is custom and runs busybox.
I'm hoping for tips on improving these reads. I'm doing the following
handle = open(O_CREAT|O_RDWR)
read(handle, 2kBuffer)
close(handle);
since my reads are small, this one read() tends to do the job in one call
Is there anything I can do to improve the performance? since it's a custom distro of Linux running on a USB2 (removable) disk are there any obvious kernel settings or mount options that I may be missing?
thanks!
I would definitely recommend opening the file readonly if you only intend to read from it.
Aside from this, have you tried doing several operations in parallel? Does it speed things up? What work are you actually doing with the data read from the files? Does the other work take significant time?
Have you profiled your application?
Mount the device with "atime" disabled (you really don't need every read() call to cause a write of metadata). See the noatime mount option. The open() call also takes an O_NOATIME flag that does the same on a per-file basis.
(Though, many kernels/distros have made the "relatime" option default for some time now, yielding mostly the same speedups)
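A minimal sketch of the per-file variant, assuming Linux and glibc (O_NOATIME is a Linux extension that needs _GNU_SOURCE). The helper name read_small_file() is made up for illustration. O_NOATIME fails with EPERM unless the caller owns the file or is privileged, so the sketch falls back to a plain read-only open in that case.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                    /* O_NOATIME is a Linux extension */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Read up to `cap` bytes of a small file, opening it read-only and, where
 * permitted, without updating its access time.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_small_file(const char *path, void *buf, size_t cap)
{
    int fd = open(path, O_RDONLY | O_NOATIME);
    if (fd < 0 && errno == EPERM)      /* not the owner: retry without the flag */
        fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    size_t total = 0;
    while (total < cap) {
        ssize_t n = read(fd, (char *)buf + total, cap - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;              /* interrupted by a signal: retry */
            close(fd);
            return -1;
        }
        if (n == 0)
            break;                     /* end of file */
        total += (size_t)n;
    }
    close(fd);
    return (ssize_t)total;
}
```

For 2-10k files a single read() usually returns everything, so the loop normally runs once or twice; it is only there to handle short reads correctly.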
Since reads from disk are block-sized (and ext* doesn't support block suballocation), if you've simply got a bunch of tiny files that don't come anywhere close to filling a block on their own, you'd be better off bundling them into archives. This may not be a win if you can't group related files together, though.
Consider ext4? The dir_index option in ext3 is standard in ext4 and speeds up anything with lots of files in the same directory. It places metadata, directory, and file blocks much more contiguously on disk, and greatly reduces the number of non-data blocks required to track each data block (although that matters more for large files than small). There's a proposal to inline a small file's data into its inode, but I don't think that's in upstream.
If you're seek-bound (as opposed to bandwidth-bound), it may help to call posix_fadvise() with POSIX_FADV_WILLNEED on a set of files before reading from any of them. The kernel takes this as a hint to read ahead into the file cache. Do be careful, though: reading ahead more than the cache can hold is wasteful and slower. There's a proposal to add fincore to determine when your files have gotten evicted, but I don't think that's upstream yet either.
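Something along these lines would issue the readahead hint for a batch of files. It is only a sketch: prefetch_files() is a hypothetical helper, and how many files to hint at once is left to the caller, since hinting more than the cache can hold defeats the purpose.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L        /* for posix_fadvise() */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Ask the kernel to start reading each file into the page cache.
 * Returns the number of files for which the hint was accepted. */
size_t prefetch_files(const char *const *paths, size_t count)
{
    size_t hinted = 0;
    for (size_t i = 0; i < count; i++) {
        int fd = open(paths[i], O_RDONLY);
        if (fd < 0)
            continue;                  /* skip files that cannot be opened */
        /* offset 0 with length 0 means "from the start to end of file" */
        if (posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED) == 0)
            hinted++;
        close(fd);
    }
    return hinted;
}
```

The idea is to call this on the next batch of paths and then open and read them as usual; with luck the data is already cached by the time the read() calls arrive.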
If it turns out you're bound by bandwidth, having the files compressed with LZO or gzip can help. The CPU should still be faster decompressing than the disk reads with these compression methods (as opposed to LZMA or bzip2).
Most distros are horrible about setting their blockio-level caching way too low. Try setting
blockdev --setra 8192 /dev/yourdatasdev
It will use a bit more RAM, but the extra caching works well in just about any situation. If you have lots of RAM, use bigger values; I have yet to see a downside to this, and the throughput and latency just get better and better with more RAM allocated to it. There's of course a 'saturation' level, but the stock settings are so low (512) that any improvement tends to have dramatic effects without allocating too much memory for these buffers.
If it is metadata access that slows you down, I like to use a silly trick of putting updatedb in crontab, running in short intervals, which keeps the metadata cache warm and preloaded with all the useful info.

How to avoid caching effects in read benchmarks

I have a read benchmark and between consecutive runs, I have to make sure that the data does not reside in memory to avoid effects seen due to caching. So far what I used to do is: run a program that writes a large file between consecutive runs of the read benchmark. Something like
./read_benchmark
./write --size 64G --path /tmp/test.out
./read_benchmark
The write program simply writes an array of size 1G 64 times to file. Since the size of the main memory is 64G, I write a file that is approx. the same size. The problem is that writing takes a long time and I was wondering if there are better ways to do this, i.e. avoid effects seen when data is cached.
Also, what happens if I write data to /dev/null?
./write --size 64G --path /dev/null
This way, the write program exits very fast, no I/O is actually performed, but I am not sure if it overwrites 64G of main memory, which is what I ultimately want.
Your input is greatly appreciated.
You can drop all caches using a special file in /proc like this:
echo 3 > /proc/sys/vm/drop_caches
That should make sure cache does not affect the benchmark.
You can just unmount the filesystem and mount it back. Unmounting flushes and drops the cache for the filesystem.
Use echo 3 > /proc/sys/vm/drop_caches to flush the pagecache, directory entries cache and inodes cache.
You can use posix_fadvise() calls with POSIX_FADV_DONTNEED to tell the kernel to keep certain files from being cached. You can also use mincore() to verify that the file is not cached. While the drop_caches solution is clearly simpler, this might be better than wiping out the entire cache, as that affects all processes on the box. I don't think you need elevated privileges to use fadvise, while I bet you do for writing to /proc. Here is a good example of how to use fadvise calls for this purpose: http://insights.oetiker.ch/linux/fadvise/
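As a rough illustration of that approach (a sketch, assuming Linux; drop_file_cache() and resident_pages() are made-up helper names): the first function asks the kernel to evict one file's pages, the second maps the file and uses mincore() to count how many of its pages are still resident. Note that POSIX_FADV_DONTNEED is only a hint, and that filesystems such as tmpfs keep their pages regardless.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                    /* for mincore() and posix_fadvise() */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Ask the kernel to drop one file's pages from the page cache.
 * Returns 0 on success, -1 on failure. */
int drop_file_cache(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    fsync(fd);                         /* dirty pages cannot be dropped */
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED); /* len 0 = to EOF */
    close(fd);
    return rc == 0 ? 0 : -1;
}

/* Count how many pages of the file are currently resident in memory.
 * Returns the count, or -1 on failure. */
long resident_pages(const char *path)
{
    struct stat st;
    long resident = -1;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) == 0 && st.st_size == 0) {
        resident = 0;                  /* empty file: nothing to map */
    } else if (fstat(fd, &st) == 0) {
        size_t len = (size_t)st.st_size;
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t pages = (len + page - 1) / page;
        unsigned char *vec = malloc(pages);
        /* Mapping alone does not fault pages in, so this reflects the cache. */
        void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (vec != NULL && map != MAP_FAILED && mincore(map, len, vec) == 0) {
            resident = 0;
            for (size_t i = 0; i < pages; i++)
                resident += vec[i] & 1;    /* low bit set = page is resident */
        }
        if (map != MAP_FAILED)
            munmap(map, len);
        free(vec);
    }
    close(fd);
    return resident;
}
```

Between benchmark runs you would call drop_file_cache() on each input file and, if you want to be sure, check that resident_pages() reports 0 before starting the next run.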
One (crude) way that almost never fails is to simply occupy all that excess memory with another program.
Make a trivial program that allocates nearly all the free memory (while leaving enough for your benchmark app). Then memset() the memory to something to ensure that the OS will commit it to physical memory. Finally, do a scanf() to halt the program without terminating it.
By "hogging" all the excess memory, the OS won't be able to use it as cache. And this works in both Linux and Windows. Now you can proceed to do your I/O benchmark.
(Though this might not go well if you're sharing the machine with other users...)
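The core of such a hog program is tiny. A sketch, with hog_memory() as an illustrative name; how many bytes to request depends on your machine and is not computed here:

```c
#include <stdlib.h>
#include <string.h>

/* Allocate `bytes` and write to every byte so the OS has to back the
 * allocation with physical memory. Returns the block, or NULL on failure.
 * Keep the block alive for as long as the benchmark runs. */
void *hog_memory(size_t bytes)
{
    void *block = malloc(bytes);
    if (block != NULL)
        memset(block, 0xA5, bytes);    /* touching the pages commits them */
    return block;
}
```

In main() you would call this with nearly all of the free memory, then block on scanf() or getchar() until the benchmark is done. If the machine has swap enabled, the OS may page the hog out instead of shrinking the cache, so this is less reliable there.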

mmap( ) vs read( )

I'm writing a bulk ID3 tag editor in C. ID3 tags are usually at the beginning of an mp3 encoded file, although older (version 1) tags are at the end. The app is designed to accept a directory and frame ID list from the command line, then recurse the directory structure updating all the ID3 tags it finds. The user may additionally choose to remove all older (version 1) tags. Another option is to simply display the current tags, without performing an update. The directory might contain 2 files or 2 million. If the user means to update the files, I was planning to load the entire file into memory, perform the updates, then save it (the file may be renamed as well). However, if the user only means to print the current ID3 tags, then loading the entire file seems excessive. After all the file could be 200mb.
I've read through this thread, which was insightful - mmap() vs. reading blocks
So my question is, what the most efficient way to go about this -- read(), mmap() or some combination? Design ideas welcome.
Edit: It's my understanding that mmap essentially delegates loading a file into memory, to the virtual memory subsystem. It seems to me, the VMM would be highly optimized on most systems as it's critical for system performance.
It really depends on what you're trying to do. If all you need to do is hop to a known offset and read out a small tag, read() may be faster (mmap() has to do some rather complex internal accounting). If you are planning on copying out all 200mb of the MP3, however, or scanning it for some tag that may appear at an unknown offset, then mmap() is likely a faster approach.
For example, if you need to shift the entire file down a few hundred bytes in order to insert an ID3 tag, one simple approach would be to expand the file with ftruncate(), mmap the file, then memmove() the contents down a bit. This, however, will destroy the file if your program crashes while it's running. You could also copy the contents of the file into a new file - this is another place where mmap() really shines; you can simply mmap() the old file, then copy all of its data into the new file with a single write().
In short, mmap() is great if you're doing a large amount of IO in terms of total bytes transferred; this is because it reduces the number of copies needed, and can significantly reduce the number of kernel entries needed for reading cached data. However mmap() requires a minimum of two trips into the kernel (three if you clean up the mapping when you're done!) and does some complex internal kernel accounting, and so the fixed overhead can be high.
read() on the other hand involves an extra memory-to-memory copy, and can thus be inefficient for large I/O operations, but is simple, and so the fixed overhead is relatively low. In short, use mmap() for large bulk I/O, and read() or pread() for one-off, small I/Os.
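As an example of the "one-off, small I/O" case: an ID3v1 tag is a fixed 128-byte block at the very end of the file that starts with the bytes "TAG", so checking for it needs one pread() of three bytes and no mapping at all. A minimal sketch (has_id3v1_tag() is a made-up name):

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L        /* for pread() */
#endif
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define ID3V1_SIZE 128                 /* ID3v1 tags are always 128 bytes */

/* Returns 1 if the file ends with an ID3v1 tag, 0 if not, -1 on error. */
int has_id3v1_tag(const char *path)
{
    struct stat st;
    int result = -1;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) == 0) {
        if (st.st_size < ID3V1_SIZE) {
            result = 0;                /* too short to hold a tag */
        } else {
            char magic[3];
            /* Read only the 3-byte marker at the start of the last 128 bytes. */
            ssize_t n = pread(fd, magic, sizeof magic, st.st_size - ID3V1_SIZE);
            if (n == (ssize_t)sizeof magic)
                result = memcmp(magic, "TAG", 3) == 0;
        }
    }
    close(fd);
    return result;
}
```

For a 200mb file this touches a single block near the end, which is about as cheap as it gets for the "just display the tags" mode.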
Don't bother with mmap unless your code is CPU bound, specifically due to lots of small reads and writes. mmap may sound nice, but it isn't the awesome "why isn't everyone using this?" alternative it looks like.
Given that you're recursing through potentially large directory structures, your bottleneck will be directory IO and concurrency. mmap is not going to help.
Update0
Reading the linked to question finds this answer that supports my experience:
mmap() vs. reading blocks
If you're not normally going to be streaming the file in and then processing it, but rather hopping around (like reading the tags at the front and then jumping to the end, etc.), then I would use mmap simply because your code will be cleaner and easier to maintain if it treats the file as a large buffer, without having to manage the buffering and paging yourself.
As has been mentioned, if you're processing a lot of data, disk I/O is likely going to dominate your processing anyway. mmap may be faster than read, but for reasonable implementations it's likely not THAT much faster, especially on today's hardware, which has continually gotten faster while disk drives have been stuck at 7200 and 10,000 RPM for years and years.
So, go with mmap and make your code easy and neat.
I don't know whether standard POSIX functions are among what you are allowed or willing to use for the development, but think about these two functions:
int ftruncate(int fildes, off_t length);
int truncate(const char *path, off_t length);
They are declared in unistd.h and can be used to truncate a file to a specified length. In this way you could easily:
1. find where the ID3 tag frame begins (I don't know if you can compute it easily by just reading the header of the MP3 file, but I guess yes)
2. save the offset
3. close the file
4. truncate the file with one of the functions above
5. open the file in append binary mode and write the new tags
I'm not sure about the performance; you should test this method, but it should load far less data into RAM while still being a sensible way of doing it.
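The truncate-then-append steps could look roughly like this. It is a sketch under the assumption that the tag data sits at the end of the file (as ID3v1 tags do) and that the offset where it begins is already known; replace_trailing_tag() is a hypothetical name.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L        /* for truncate() */
#endif
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Cut the file off at `tag_offset` and append `new_len` bytes of new tag
 * data in its place. Returns 0 on success, -1 on failure. */
int replace_trailing_tag(const char *path, off_t tag_offset,
                         const void *new_tag, size_t new_len)
{
    if (truncate(path, tag_offset) != 0)   /* drop the old trailing tag */
        return -1;

    FILE *f = fopen(path, "ab");           /* append, binary mode */
    if (f == NULL)
        return -1;
    size_t written = fwrite(new_tag, 1, new_len, f);
    if (fclose(f) != 0 || written != new_len)
        return -1;
    return 0;
}
```

Only the new tag bytes pass through user space; the audio data is never read. Be aware that a crash between the truncate and the append leaves the file without any tag.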

What posix_fadvise() args for sequential file write?

I am working on an application which sequentially writes a large file (and does not read at all), and I would like to use posix_fadvise() to optimize the filesystem behavior.
The function description in the manpage suggests that the most appropriate strategy would be POSIX_FADV_SEQUENTIAL. However, the Linux implementation description casts doubt on that:
Under Linux, POSIX_FADV_NORMAL sets the readahead window to the default size for the backing device; POSIX_FADV_SEQUENTIAL doubles this size, and POSIX_FADV_RANDOM disables file readahead entirely.
As I'm only writing data (overwriting files possibly too), I don't expect any readahead. Should I then stick with my POSIX_FADV_SEQUENTIAL or rather use POSIX_FADV_RANDOM to disable it?
How about other options, such as POSIX_FADV_NOREUSE? Or maybe do not use posix_fadvise() for writing at all?
Most of the posix_fadvise() flags (eg POSIX_FADV_SEQUENTIAL and POSIX_FADV_RANDOM) are hints about readahead rather than writing.
There's some advice from Linus here and here about getting good sequential write performance. The idea is to break the file into large-ish (8MB) windows, then loop around doing:
Write out window N with write();
Request asynchronous write-out of window N with sync_file_range(..., SYNC_FILE_RANGE_WRITE)
Wait for the write-out of window N-1 to complete with sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER)
Drop window N-1 from the pagecache with posix_fadvise(..., POSIX_FADV_DONTNEED)
This way you never have more than two windows worth of data in the page cache, but you still get the kernel writing out part of the pagecache to disk while you fill the next part.
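Put together, the loop might look like the sketch below. It assumes Linux (sync_file_range() is Linux-specific and needs _GNU_SOURCE); write_windowed() and write_all() are illustrative names, and the window size is a parameter rather than a fixed 8MB. The sync_file_range() and posix_fadvise() calls are treated as best-effort hints, so their return values are deliberately ignored here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                    /* for sync_file_range() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write exactly `len` bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const char *p, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Write `len` bytes at the current file offset in windows of `window`
 * bytes, keeping at most two windows in the page cache.
 * Returns 0 on success, -1 on failure. */
int write_windowed(int fd, const char *data, size_t len, size_t window)
{
    off_t base = lseek(fd, 0, SEEK_CUR);   /* where this write starts */
    off_t prev_off = 0;
    size_t prev_len = 0, done = 0;

    if (window == 0 || base < 0)
        return -1;
    while (done < len) {
        size_t n = len - done < window ? len - done : window;
        off_t off = base + (off_t)done;

        if (write_all(fd, data + done, n) != 0)
            return -1;
        /* Start asynchronous write-out of window N. */
        (void)sync_file_range(fd, off, (off_t)n, SYNC_FILE_RANGE_WRITE);
        if (prev_len > 0) {
            /* Wait for window N-1 to reach disk, then drop it from cache. */
            (void)sync_file_range(fd, prev_off, (off_t)prev_len,
                                  SYNC_FILE_RANGE_WAIT_BEFORE |
                                  SYNC_FILE_RANGE_WRITE |
                                  SYNC_FILE_RANGE_WAIT_AFTER);
            (void)posix_fadvise(fd, prev_off, (off_t)prev_len,
                                POSIX_FADV_DONTNEED);
        }
        prev_off = off;
        prev_len = n;
        done += n;
    }
    return 0;   /* the final window is left for normal write-back */
}
```

Note that sync_file_range() does not flush file metadata, so this is about throughput and cache footprint, not durability; call fsync() at the end if the data must be safely on disk.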
It all depends on the temporal locality of your data. If your application won't need the data soon after it was written, then you can go with POSIX_FADV_NOREUSE to avoid writing to the buffer cache (in a similar way as the O_DIRECT flag from open()).
As far as writes go, I think that you can just rely on the OS's disk I/O scheduler to do the right thing.
You should keep in mind that while posix_fadvise is there specifically to give the kernel hints about future file usage patterns the kernel also has other data to help it out.
If you don't open the file for reading then it would only need to read blocks in when they were partially written. If you were to truncate the file to 0 then it doesn't even have to do that (you said that you were overwriting).

C program stuck on uninterruptible wait while performing disk I/O on Mac OS X Snow Leopard

One line of background: I'm the developer of Redis, a NoSQL database. One of the new features I'm implementing is Virtual Memory, because Redis keeps all its data in memory. Thanks to VM, Redis is able to transfer rarely used objects from memory to disk. There are a number of reasons why this works much better than letting the OS do the swapping for us (Redis objects are built of many small objects allocated in non-contiguous places; when serialized to disk by Redis they take 10 times less space than the memory pages where they live; and so forth).
Now I've an alpha implementation that's working perfectly on Linux, but not so well on Mac OS X Snow Leopard. From time to time, while Redis tries to move a page from memory to disk, the redis process enters the uninterruptible wait state for minutes. I was unable to debug this, but this happens either in a call to fseeko() or fwrite(). After minutes the call finally returns and redis continues working without problems at all: no crash.
The amount of data transferred is very small, something like 256 bytes. So it should not be a matter of a very big amount of I/O being performed.
But there is an interesting detail about the swap file that is the target of the write operation. It's a big file (26 Gigabytes) created by opening a file with fopen() and then enlarging it using ftruncate(). Finally the file is unlink()ed, so that Redis keeps a reference to it but we are sure that when the Redis process exits the OS will really free the swap file.
Ok that's all but I'm here for any further detail. And BTW you can even find the actual code in the Redis git, but it's not trivial to understand in five minutes given that's a fairly complex system.
Thank you very much for any help.
As I understand it, HFS+ has very poor support for sparse files. So it may be that your write is triggering a file expansion that is initializing/materializing a large fraction of the file.
For example, I know mmap'ing a new large empty file and then writing at a few random locations produces a very large file on disk with HFS+. It's quite annoying since mmap and sparse files are an extremely convenient way of working with data, and virtually every other platform/filesystem out there handles this gracefully.
Is the swap file written to linearly? Meaning we either replace an existing block or write a new block at the end and increment a free space pointer? If so, perhaps doing more frequent smaller ftruncate calls to expand the file would result in shorter pauses.
As an aside, I'm curious why redis VM doesn't use mmap and then just move blocks around in an attempt to concentrate hot blocks into hot pages.
antirez, I'm not sure I'll be much help since my Apple experience is limited to the Apple ][, but I'll give it a shot.
First thing is a question. I would have thought that, for virtual memory, speed of operation would be a more important measure than disk space (especially for a NoSQL DB where speed is the whole point, otherwise you'd be using SQL, no?). But, if your swap file is 26G, maybe not :-)
Some things to try (if possible).
1. Try to actually isolate the problem to the seek or write. I have a hard time believing a seek could take that long since, at worst, it should be a buffer pointer change. Still, I didn't write OSX so I can't be sure.
2. Try adjusting the size of the swap file to see if that's what is causing the problem.
3. Do you ever dynamically expand the swap file (as opposed to pre-allocation)? If you do, that may be what is causing the problem.
4. Do you always write as low in the file as you can? It may be that creating a 26G file may not actually fill it with data but, if you create it then write to the last byte, the OS may have to zero out the bytes before then (deferring the initialization, if any).
5. What happens if you just pre-allocate the entire file (write to every byte) and not unlink it? In other words, leave the file there between runs of your program (creating it if it doesn't already exist of course). Then in your startup code for Redis, just initialize the file (pointers and such). This may get rid of any problems like those in point 4 above.
6. Ask on the various BSD sites as well. I'm not sure how much Apple changed under the covers but OSX is just BSD at the lowest level (Pax ducks for cover).
7. Also consider asking on the Apple sites (if you haven't already done so).
Well, that's my small contribution, hopefully it'll help. Good luck with your project.
Have you turned off file caching for your file? i.e. fcntl(fd, F_GLOBAL_NOCACHE, 1)
Have you tried debugging with DTrace and/or Instruments (Apple's experimental DTrace front-end)?
Exploring Leopard with DTrace
Debugging Chrome on OS X
As Linus said once on the Git mailing list:
"I realize that OS X people have a hard time accepting it, but OS X
filesystems are generally total and utter crap - even more so than
Windows."
