How to avoid damaging SD card for large writes? - c

Ok, first a little background to help make my question clear:
I am working on a device that collects certain data from sensors and posts them to a server using a GSM modem. As a GSM connection is not 100% reliable, it would contain a logging mechanism that would write unsent data to an SD card.
We are using Chan's FatFs module for providing us with a file system as we want the log to be readable on a PC.
Now I've been testing the FAT system for boundary conditions, i.e., trying to fill up the card completely.
In the first run I opened the file and set the code to keep writing a string until the drive was full. The program would synch after every write.
I left the code running overnight.
The next day, I examined the SD card. I found that the file was only 150 MB in size. There were about 1.2 million lines written to it. The card could still be read from but not written to or formatted.
Next time I tried the same type of test, but this time I used the f_lseek() function to pre-allocate the file to 1GB. It would then write to that file until that limit was reached. This time the data would be synced after 50 writes. It would then close that file and open another to do the same.
As you can guess another brave little card lost it's mind that day.
So these are what I would like help with :
How to prevent damage to the card while writing large amounts of data?
Does leaving the file open for extended periods have any negative effects?
Since the full code may be too long, here's the main part where the writing happens
for(file_count=3;file_count>=0;--file_count){
ax_log_msg(E_LOG_INFO,"===================================");
ax_log_msg(E_LOG_INFO,file_names[file_count]);
f_open(&file_ptr,file_names[file_count],FA_WRITE|FA_OPEN_ALWAYS);
if(result!=FR_OK){
ax_log_msg(E_LOG_INFO,"\n\rf_open Failed\n\rResult code");
ax_log_msg(E_LOG_INFO,FRESULT_S[result]);
continue;
}
ax_log_msg(E_LOG_INFO,"\n\rf_open Sucessfull");
result=f_lseek(&file_ptr,FILE_SIZE_LIMIT_1GB);
if(result!=FR_OK){
ax_log_msg(E_LOG_INFO,"\n\rf_lseek Failed for preallocation\n\rResult code");
ax_log_msg(E_LOG_INFO,FRESULT_S[result]);
f_close(&file_ptr);
continue;
}
ax_log_msg(E_LOG_INFO,"\n\rf_lseek Sucessfull for preallocation");
f_lseek(&file_ptr,0);
bytes_to_write=sizeof(messages[file_count]);
write_count=0;
while( (f_tell(&file_ptr) < FILE_SIZE_LIMIT_1GB )){
result=f_write(&file_ptr,messages[file_count],bytes_to_write,&bytes_written);
if(result==FR_OK){
++write_count;
if(write_count%50==0){
f_sync(&file_ptr);
}
}else{
ax_log_msg(E_LOG_INFO,"\n\rWrite failed\n\rFRESULT=");
ax_log_msg(E_LOG_INFO,FRESULT_S[result]);
break;
}
}
f_close(&file_ptr);
}
Note :
ax_log_msg() is part of the device firmware to print on console.
FRESULT_S[result] is used to convert the enum result code to a string.
If there is any data missing, please do mention it.
Thank You

You probably need to buffer an entire block of data, perhaps 4 KB, to avoid flashing an entire block with every flush. But, the filesystem or driver should do this for you, as long as you don't call fflush explicitly, which is the real lesson.
Why do you need it to be synced so often? Perhaps a timer would work better than an interval per number of records?

Due to 100,000 write cycles limit per sector it is a really challenging task to extend a flash memory lifespan. One of my cards died over one night after I run writing tests on it. I then counted time periods, and that's indeed easy to perform 100,000 writes (in the same sector) just in one night (without taking into account a calculation it comes through experience).
At that time I was told that there is a smart monitors in some filesystems and they count and keep writes number for every sector in order to writings number per every sector was the same, I guess. I neither took nor tested one.
I now found some extremely popular/highly voted answer/suggestion for Raspberrypi and I quote it here now:
These methods should increase the lifespan of the SD card by minimising the number of read/writes in various ways:
Disable Swap
Swapping is the process of using part of the SD card as volatile memory. This will increase the amount of RAM available, but it will result in a high number of read/writes. It is unlikely to increase performance significantly.
Disable swap with the swapoff command:
sudo swapoff --all
You must also prevent it from coming back after a reboot:
For Raspbian which uses dphys-swapfile to manage a swap file (instead of a "normal" swap partition) you can simply sudo apt-get remove dphys-swapfile to remove it permanently. Best to remove because setting the CONF_SWAPSIZE to 0, as explained in this answer, doesn't seem to work and still creates a 100MB swap file after reboot.
For other distributions that use a swap partition instead of a swap file, remove the appropriate line from /etc/fstab
Disabling Journaling on the Filesystem
Using a journaling filesystem such as ext3 or ext4 WITHOUT a journal is an option to decrease read/writes. The obvious drawback of using a filesystem with journaling disabled is data loss as a result of an ungraceful dismount (i.e. post power failure, kernel lockup, etc.).
You can disable journaling on ext3 by mounting it as ext2
You can disable journaling on ext4 on an unmounted drive like this:
tune4fs -O ^has_journal /dev/sdaX
e4fsck –f /dev/sdaX
sudo reboot
The noatime Mount Flag
Assign the noatime mount flag to partitions residing on the SD card by adding it to the options section of the partition in /etc/fstab.
Reading accesses to the file system will no longer result in an update to the atime information associated with the file. The importance of the noatime setting is that it eliminates the need by the system to make writes to the file system for files which are simply being read. Since writes can be somewhat expensive as mentioned in previous section, this can result in measurable performance gains. Note that the write time information to a file will continue to be updated anytime the file is written to with this option enabled.
Directories in RAM
Highly used directories such as /var/tmp/ and possibly /var/log can be relocated to RAM in /etc/fstab like this:
tmpfs /var/tmp tmpfs nodev,nosuid,size=50M 0 0
This will allow /var/tmp to use 50MB of RAM as disk space. The only issue with doing this is that any drives mounted in RAM will not persist past a reboot. Thus if you mount /var/log and your system encounters an error that causes it to reboot, you will not be able to find out why.
Directories in external Hard Disk
You can also mount some directories on a persistent USB hard disk. More details of this can be found in this question.
The Raspberry Pi can also boot it's root partition from an external drive. This could be via USB or Ethernet and means that the SD card will only be used to delegate to different device during boot. This requires a bit of kernel hacking to accomplish, as I don't think the default kernel supports USB storage. You can find more information at this question, or this external blog post.
Here is one more interesting consideration from another answerer:
Excellent article about flash filesystems.
Important question when talking about flash filesystems is following: What is wear leveling? Wikipedia article. Basically, on flash disks you can write limited number of times until block goes bad. After that, filesystem (if there is no built-in wear leveling management on hardware, as in case of SSDs there usually is) must mark that block as invalid, and avoid using it anymore.
Typical filesystems (for example reiserfs, ntfs, ext3 and so on) are designed for hard disks, that do not have such limitations.
JFFS2
Includes compression and elegant wear leveling protection.
YAFFS2
Single thing that makes the difference: short mount times, after successful umount.
Implements write once property: once data is written to one block, there is no need to rewrite it. This is important for protecting against wear leveling.
LogFS
Not very mature, but already included in Linux kernel tree.
Supports larger filesystems than JFFS2/YAFFS2 without problems.
UBIFS
More mature than LogFS
Write caching support
On scalability: article. On large disks, better performance than with JFFS2
ext4
If no driver or card (for example SSD drives do have internal wear leveling, at least usually) handle wear leveling, then ext4 is not the best idea, as it is not intended for raw flash usage.
What is best one?
Of course, it depends on usage and support. From what I read from the internet, I would recommend UBIFS. Good support for large filesystems, mature development phase, adequate performance and no huge downsides.
Thanks to answerers:
How can I extend the life of my SD card?
Choice of filesystem for GNU/Linux on an SD card

Related

Optimizing data stream to disk in C (also flash memory)

I have a C program running on Linux that acquires data from a USB device (sensor data), does some processing and streams the result to disk. Currently I save to a text file using fputs(), a line looks like this:
timestamp value1 value2 ... valueN
the sample rate being up to 250Hz.
The program should run on a RPi or similar board and possibly write the data to a flash memory (SD card).
I have following questions:
Should I be optimizing the data stream or let the OS do the job? More specifically, should I be trying to minimize how often data is actually written to disk (also given the use of a flash memory)?
I have read about setbuf() and setvbuf(), as I understand they should effectively delay writing until a "block" is filled. Are these appropriate or is there a better way other than perhaps implementing my own buffer?
Which output function is best suited for data streaming with the above in mind (fputs() / fprintf() / write())?
Should I be trying to increase randomness (as to use all sectors) when writing to a SD card? If yes what's the best way to achieve this?
Here some more thoughts:
I can consider using a binary format to decrease size, but I would prefer keeping the text format to simplify later data handling.
Using a hard drive is also an option in the final design, especially if a high acquisition rate is to be carried on over a long time.
The data rate being relatively low I do not expect bandwidth problem with either hard drive or SD card. It is possible that the rate will be higher in the future (kHz or more).
Thanks for your answers.
EDIT 20130128
Thank you for all the answers so far, they give me some good insight. I'll sum it up a bit:
In general I should not have bandwidth issues, however to avoid unnecessary large log files I might consider a binary format. Yes the log should be human readable, if not I'll make an export function or similar. Yes unwind's assumption is correct, about 10 or 15 data values each line.
The mentioned read/write cycles per cell should be enough for some time, at least in the testing phase, considering we don't always write and delete the same cells. I will play around with buffer size in setvbuf() and set the buffering mode to full buffering to see if I can optimize this while keeping a reasonable save interval (a few seconds or more also depending on sample rate).
In the final design I might use a hard drive to avoid most of the problems mentioned here, or a second SD card which can be easily replaced (might be also good to quickly retrieve the data). I will format this with one of the format suggested here (FAT or JFFS2/F2FS).
Following zmo's suggestion I will try to make the system as read only as possible (at least the system partition), I was already considering this.
A Beaglebone, also mentioned by zmo, is my next choice if I'm not happy with the RPi (I read that its USB bus is not always stable, USB is obviously very important for my application).
I have already implemented a UDP port to send data over network, still I would like to keep at least a local copy of that data and maybe only send a subset of or already processed data, as well as "control data".
Should I be optimizing the data stream or let the OS do the job? More specifically, should I be trying to minimize how often data is actually written to disk (also given the use of a flash memory)?
Well, you can usually assume that the OS does a pretty awesome job at buffering and handling output to the hard drive… As long as you don't do unbuffered writes.
Though, from my experience, you should not write logs to a SD Card, because it definitely kills the SD Card faster than you can imagine. On my first projects, I had installed linux on beaglebones, and between 6 months to 12 months after, all my SD Cards had to be replaced…
Since then, I've learned to run read only systems on the SD card and send any kind of regular updates over the network, the trick being to use a ramdisk for /tmp and /var.
In your case, using a hard drive is an easy solution (which will works smoothly), but you can also use a secondary SD Card where you write the logs. Then you'll be able to use a "stupid" filesystem such as a FAT one where you'll write your data aligned, as your data will be the only thing to be written on the SD. What is killing a SDCard is lots of little read/writes that happen a lot with temporary files, and defragmentation of the drive.
I have read about setbuf() and setvbuf(), as I understand they should effectively delay writing until a "block" is filled. Are these appropriate or is there a better way other than perhaps implementing my own buffer?
well, just keep it to full buffering, it will help write your data aligned on the filesystem.
Which output function is best suited for data streaming with the above in mind (fputs() / fprintf() / write())?
they should all behave similarly for your problematic.
Should I be trying to increase randomness (as to use all sectors) when writing to a SD card? If yes what's the best way to achieve this?
the firmware of the sdcard should be taking care of that for you. The only thing would be to use a simpler filesystem like FAT (or JFFS2/F2FS like ivan-voras suggets), because ext2/ext3/ext4 filesystems do automatic defragmentation which basically is moving around inodes to keep everything aligned. Though I'm not sure if it disables that behavior with SDcards and SSDs.
Writing to the SD card often will definitely kill it sooner, but it also means you can attempt to prolong this time by reducing the number of writes. As others have said, the best solution for you would be to write the logs over the network to a server or just another machine which has proper storage (in the simplest case, maybe you can use syslog(3) or just plain NFS).
If you want to continue with the original plan, then using setvbuf(3) to enable block buffered mode and setting a large buffer size (like 128 KiB or 256 KiB) would be best. A large buffer size also means that you will lose unwritten data from the buffer if power goes out, etc.
However, a large buffer only delays the inevitable and you should search for other options. It's not as alarming as Lundin's answer states because there are many cells and you're not writing always to the same one, so if you get the largest SD card you can buy, then using his method you can calculate approximately how many times you can rewrite the entire card before it fails. Using a flash-friendly file system such as F2FS or JFFS2 will be beneficial.
Here're my thoughts:
It might be a good idea to buffer some data in memory before writing to disk, but keep in mind that this might cause data loss in case of power failure.
I think this is highly dependent on the file system and type of storage you use. There is no generic answer but it could prove useful to implement and benchmark it on your specific configuration.
Considering the huge amount of data you're outputting, I'd choose a binary format (unless you want the file to be human readable)
The firmware of the flash drive should already take care of this. Basically this is the cornerstone of all modern SSDs. (SD card controllers should implement it too.)

How to prevent C read() from reading from cache

I have a program that is used to exercise several disk units in a raid configuration. 1 process synchronously (O_SYNC) writes random data to a file using write(). It then puts the name of the directory into a shared-memory queue, where a 2nd process is waiting for the queue to have entries to read the data back into memory using read().
The problem that I can't seem to overcome is that when the 2nd process attempts to read the data back into memory, none of the disk units show read accesses. The program has code to check whether or not the data read back in is equal to the code that is written to disk, and the data always matches.
My question is, how can I make the OS (IBM i) not buffer the data when it is written to disk so that the read() system call accesses the data on the disk rather than in cache? I am doing simple throughput calculations and the read() operations are always 10+ times faster than the write operations.
I have tried using the O_DIRECT flag, but cannot seem to get the data to write to the file. It could have to do with setting up the correct aligned buffers. I have also tried the posix_fadvise(fd, offset,len, POSIX_FADV_DONTNEED) system call.
I have read through this similar question but haven't found a solution. I can provide code if it would be helpful.
My though is that if you write ENOUGH data, then there simply won't be enough memory to cache it, and thus SOME data must be written to disk.
You can also, if you want to make sure that small writes to your file works, try writing ANOTHER large file (either from the same process or a different one - for example, you could start a process like dd if=/dev/zero of=myfile.dat bs=4k count=some_large_number) to force other data to fill the cache.
Another "trick" may be to "chew up" some (more like most) of the RAM in the system - just allocate a large lump of memory, then write to some small part of it at a time - for example, an array of integers, where you write to every 256th entry of the array in a loop, moving to one step forward each time - that way, you walk through ALL of the memory quickly, and since you are writing continuously to all of it, the memory will have to be resident. [I used this technique to simulate a "busy" virtual machine when running VM tests].
The other option is of course to nobble the caching system itself in OS/filesystem driver, but I would be very worried about doing that - it will almost certainly slow the system down to a slow crawl, and unless there is an existing option to disable it, you may find it hard to do accurately/correctly/reliably.
...exercise several disk units in a raid configuration... How? IBM i doesn't allow a program access to the hardware. How are you directing I/O to any specific physical disks?
ANSWER: The write/read operations are done in parallel against IFS so the stream file manager is selecting which disks to target. By having enough threads reading/writing, the busyness of SYSBASE or an IASP can be driven up.
...none of the disk units show read accesses. None of them? Unless you are running the sole job on a system in restricted state, there is going to be read activity on the disks from other tasks. Is the system divided into multiple LPARs? Multiple ASPs? I'm suggesting that you may be monitoring disks that this program isn't writing to, because IBM i handles physical I/O, not programs.
ANSWER I guess none of them is a slight exaggeration - I know which disks belong to SYSBASE and those disks are not being targeted with many read requests. I was just trying to generalize for an audience not familiar w/IBM i. In the picture below, you will see that the write reqs are driving the % busyness up, but the read reqs are not even though they are targeting the same files.
...how can I make the OS (IBM i) not buffer the data when it is written to disk... Use a memory starved main storage pool to maximise paging, write immense blocks of data so as to guarantee that the system and disk controller caches overflow and use a busy machine so that other tasks are demanding disk I/O as well.

When are sequential seeks with small reads slower than reading a whole file?

I've run into a situation where lseek'ing forward repeatedly through a 500MB file and reading a small chunk (300-500 bytes) between each seek appears to be slower than read'ing through the whole file from the beginning and ignoring the bytes I don't want. This appears to be true even when I only do 5-10 seeks (so when I only end up reading ~1% of the file). I'm a bit surprised by this -- why would seeking forward repeatedly, which should involve less work, be slower than reading which actually has to copy the data from kernel space to userspace?
Presumably on local disk when seeking the OS could even send a message to the drive to seek without sending any data back across the bus for even more savings. But I'm accessing a network mount, where I'd expect read to be much slower than seek (sending one packet saying to move N bytes ahead versus actually transferring data across the network).
Regardless of whether reading from local disk or a network filesystem, how could this happen? My only guess is the OS is prefetching a ton of data after each location I seek to. Is this something that can normally occur or does it likely indicate a bug in my code?
The magnitude of the difference will be a factor of the ratio of the seek count/data being read to the size of the entire file.
But I'm accessing a network mount, where I'd expect read to be much slower than seek (sending one packet saying to move N bytes ahead versus actually transferring data across the network).
If there's rotational magnetic drives at the other end of the network, the effect will still be present and likely significantly compounded by the round trip time. The network protocol may play a role too. Even solid state drives may take some penalty.
I/O schedulers may reorder requests in order to minimize head movements (perhaps naively even for storage devices without a head). A single bulk request might give you some greater efficiency across many layers. The filesystems have an opportunity to interfere here somewhat.
Regardless of whether reading from local disk or a network filesystem, how could this happen?
I wouldn't be quick to dismiss the effect of those layers -- do you have measurements which show the same behavior from a local disk? It's much easier to draw conclusions without quite so much between you and the hardware. Start with a raw device and bisect from there.
Have you considered using a memory map instead? It's perfect for this use case.
Depending on the filesystem, the specific lseek implementation make create some overhead.
For example, I believe when using NFS, lseek locks the kernel by calling remote_llseek().

improving small file read times with USB2 attached ext2 volume

I'm a more experienced Windows programmer than I am a Linux programmer. Apologies if I'm missing something obvious.
I need to read >10,000 small files (~2->10k) on a USB2 attached ext2 volume running Linux. The distro is a custom and runs busybox.
I'm hoping for tips on improving these writes. I'm doing the following
handle = open(O_CREAT|O_RDWR)
read(handle, 2kBuffer)
close(handle);
since my reads are small, this one read() tends to do the job in one call
Is there anything I can do to improve the performance? since it's a custom distro of Linux running on a USB2 (removable) disk are there any obvious kernel settings or mount options that I may be missing?
thanks!
I would definitely recommend opening the file readonly if you only intend to read from it.
Aside from this, have you tried doing several operations in parallel? Does it speed things up? What work are you actually doing with the data read from the files? Does the other work take significant time?
Have you profiled your application?
mount the device with "atime" disabled (you really don't need avery read() call to cause a write of meta data). See the noatime mount option. The open() call also takes a O_NOATIME flag doing the same, on a per file basis.
(Though, many kernels/distros have made the "relatime" option default for some time now, yielding mostly the same speedups)
Since reads from disk are block-sized (and ext* doesn't support block suballocation), if you've simply got a bunch of tiny files that don't come anywhere close to filling a block on their own, you'd be better off bundling them into archives. This may not be a win if you can't group related files together, though.
Consider ext4? The dir_index option in ext3 is standard in ext4 and speeds up anything with lots of files in the same directory. It places metadata, directory, and file blocks much more contiguously on disk, and greatly reduces the number of non-data blocks required to track each data block (although that matters more for large files than small). There's a proposal to inline a small file's data into its inode, but I don't think that's in upstream.
If you're seek-bound (as opposed to bandwidth-bound), it may help to call fadvise(FADV_WILLNEED) on a set of files before reading from any of them. The kernel takes this as a hint to readahead into the file cache. Do be careful, though: reading ahead more than cache can hold is wasteful and slower. There's a proposal to add fincore to determine when your files have gotten evicted, but I don't think that's upstream yet either.
If it turns out you're bound by bandwidth, having the files compressed with LZO or gzip can help. The CPU should still be faster decompressing than the disk reads with these compression methods (as opposed to LZMA or bzip2).
Most distros are horrible about setting their blockio-level caching way too low. Try setting
blockdev -setra 8192 /dev/yourdatasdev
it will use a bit more RAM, but the extra caching works well in just about any situations. If you have lots of RAM, use bigger values, I am yet to see a downside to this, the throughput and latency just gets better and better with more RAM allocated to it. There's of course a 'saturation' level, but the stock settings are so low (512) that any improvement tends to have dramatic effects without allocating too much memory for these buffers.
If it is metadata access that slows you down, I like to use a silly trick of putting updatedb in crontab, running in short intervals, which keeps the metadata cache warm and preloaded with all the useful info.

real-time writes to disk

I have a thread that needs to write data from an in-memory buffer to a disk thousands of times. I have some requirements of how long each write takes because the buffer needs to be cleared for a separate thread to write to it again.
I have tested the disk with dd. I'm not using any filesystem on it and writing directly to the disk (opening it with the direct flag). I am able to get about 100 MB/s with a 32K block size.
In my application, I noticed I wasn't able to write data to the disk at nearly this speed. So I looked into what was happening and I find that some writes are taking very long. My block of code looks like (this is in C by the way):
last = get_timestamp();
write();
now = get_timestamp();
if (longest_write < now - last)
longest_write = now - last;
And at the end I print out the longest write. I found that for a 32K buffer, I am seeing a longest write speed of about 47ms. This is way too long to meet the requirements of my application. I don't think this can be solely attributed to rotational latency of the disk. Any ideas what is going on and what I can do to get more stable write speeds? Thanks
Edit:
I am in fact using multiple buffers of the type I declare above and striping between them to multiple disks. One solution to my problem would be to just increase the number of buffers to amortize the cost of long writes. However I would like to keep the amount of memory being used for buffering as small as possible to avoid dirtying the cache of the thread that is producing the data written into the buffer. My question should be constrained to dealing with variance in the latency of writing a small block to disk and how to reduce it.
I'm assuming that you are using an ATA or SATA drive connected to the built-in disk controller in a standard computer. Is this a valid assumption, or are you using anything out of the ordinary (hardware RAID controller, SCSI drives, external drive, etc)?
As an engineer who does a lot of disk I/O performance testing at work, I would say that this sounds a lot like your writes are being cached somewhere. Your "high latency" I/O is a result of that cache finally being flushed. Even without a filesystem, I/O operations can be cached in the I/O controller or in the disk itself.
To get a better view of what is going on, record not just your max latency, but your average latency as well. Consider recording your max 10-15 latency samples so you can get a better picture of how (in-)frequent these high-latency samples are. Also, throw out the data recorded in the first two or three seconds of your test and start your data logging after that. There can be high-latency I/O operations seen at the start of a disk test that aren't indicative of the disk's true performance (can be caused by things like the disk having to rev up to full speed, the head having to do a large initial seek, disk write cache being flushed, etc).
If you are wanting to benchmark disk I/O performance, I would recommend using a tool like IOMeter instead of using dd or rolling your own. IOMeter makes it easy to see what kind of a difference it makes to change the I/O size, alignment, etc, plus it keeps track of a number of useful statistics.
Requiring an I/O operation to happen within a certain amount of time is a risky thing to do. For one, other applications on the system can compete with you for disk access or CPU time and it is nearly impossible to predict their exact effect on your I/O speeds. Your disk might encounter a bad block, in which case it has to do some extra work to remap the affected sectors before processing your I/O. This introduces an unpredictable delay. You also can't control what the OS, driver, and disk controller are doing. Your I/O request may get backed up in one of those layers for any number of unforseeable reasons.
If the only reason you have a hard limit on I/O time is because your buffer is being re-used, consider changing your algorithm instead. Try using a circular buffer so that you can flush data out of it while writing into it. If you see that you are filling it faster than flushing it, you can throttle back your buffer usage. Alternatively, you can also create multiple buffers and cycle through them. When one buffer fills up, write that buffer to disk and switch to the next one. You can be writing to the new buffer even if the first is still being written.
Response to comment:
You can't really "get the kernel out of the way", it's the lowest level in the system and you have to go through it to one degree or another. You might be able to build a custom version of the driver for your disk controller (provided it's open source) and build in a "high-priority" I/O path for your application to use. You are still at the mercy of the disk controller's firmware and the firmware/hardware of the drive itself, which you can't necessarily predict or do anything about.
Hard drives traditionally perform best when doing large, sequential I/O operations. Drivers, device firmware, and OS I/O subsystems take this into account and try to group smaller I/O requests together so that they only have to generate a single, large I/O request to the drive. If you are only flushing 32K at a time, then your writes are probably being cached at some level, coalesced, and sent to the drive all at once. By defeating this coalescing, you should reduce the number of I/O latency "spikes" and see more uniform disk access times. However, these access times will be much closer to the large times seen in your "spikes" than the moderate times that you are normally seeing. The latency spike corresponds to an I/O request that didn't get coalesced with any others and thus had to absorb the entire overhead of a disk seek. Request coalescing is done for a reason; by bundling requests you are amortizing the overhead of a drive seek operation over multiple commands. Defeating coalescing leads to doing more seek operations than you would normally, giving you overall slower I/O speeds. It's a trade-off: you reduce your average I/O latency at the expense of occasionally having an abnormal, high-latency operation. It is a beneficial trade-off, however, because the increase in average latency associated with disabling coalescing is nearly always more of a disadvantage than having a more consistent access time is an advantage.
I'm also assuming that you have already tried adjusting thread priorities, and that this isn't a case of your high-bandwidth producer thread starving out the buffer-flushing thread for CPU time. Have you confirmed this?
You say that you do not want to disturb the high-bandwidth thread that is also running on the system. Have you actually tested various output buffer sizes/quantities and measured their impact on the other thread? If so, please share some of the results you measured so that we have more information to use when brainstorming.
Given the amount of memory that most machines have, moving from a 32K buffer to a system that rotates through 4 32K buffers is a rather inconsequential jump in memory usage. On a system with 1GB of memory, the increase in buffer size represents only 0.0092% of the system's memory. Try moving to a system of alternating/rotating buffers (to keep it simple, start with 2) and measure the impact on your high-bandwidth thread. I'm betting that the extra 32K of memory isn't going to have any sort of noticeable impact on the other thread. This shouldn't be "dirtying the cache" of the producer thread. If you are constantly using these memory regions, they should always be marked as "in use" and should never get swapped out of physical memory. The buffer being flushed must stay in physical memory for DMA to work, and the second buffer will be in memory because your producer thread is currently writing to it. It is true that using an additional buffer will reduce the total amount of physical memory available to the producer thread (albeit only very slightly), but if you are running an application that requires high bandwidth and low latency then you would have designed your system such that it has quite a lot more than 32K of memory to spare.
Instead of solving the problem by trying to force the hardware and low-level software to perform to specific performance measurements, the easier solution is to adjust your software to fit the hardware. If you measure your max write latency to be 1 second (for the sake of nice round numbers), write your program such that a buffer that is flushed to disk will not need to be re-used for at least 2.5-3 seconds. That way you cover your worst-case scenario, plus provide a safety margin in case something really unexpected happens. If you use a system where you rotate through 3-4 output buffers, you shouldn't have to worry about re-using a buffer before it gets flushed. You aren't going to be able to control the hardware too closely, and if you are already writing to a raw volume (no filesystem) then there's not much between you and the hardware that you can manipulate or eliminate. If your program design is inflexible and you are seeing unacceptable latency spikes, you can always try a faster drive. Solid-state drives don't have to "seek" to do I/O operations, so you should see a fairly uniform hardware I/O latency.
As long as you are using O_DIRECT | O_SYNC, you can use ioprio_set() to set the IO scheduling priority of your process/thread (although the man page says "process", I believe you can pass a TID as given by gettid()).
If you set a real-time IO class, then your IO will always be given first access to the disk - it sounds like this is what you want.
I have a thread that needs to write data from an in-memory buffer to a disk thousands of times.
I have tested the disk with dd. I'm not using any filesystem on it and writing directly to the disk (opening it with the direct flag). I am able to get about 100 MB/s with a 32K block size.
The dd's block size is aligned with file system block size. I guess your log file isn't.
Plus probably your application writes not only the log file, but also does some other file operations. Or your application isn't alone using the disk.
Generally, disk I/O isn't optimized for latencies, it is optimized for the throughput. High latencies are normal - and networked file systems have them even higher.
In my application, I noticed I wasn't able to write data to the disk at nearly this speed. So I looked into what was happening and I find that some writes are taking very long.
Some writes take longer time because after some point of time you saturate the write queue and OS finally decides to actually flush the data to disk. The I/O queues by default configured pretty short: to avoid excessive buffering and information loss due to a crash.
N.B. If you want to see the real speed, try setting the O_DSYNC flag when opening the file.
If your blocks are really aligned you might try using the O_DIRECT flag, since that would remove contentions (with other applications) on the Linux disk cache level. The writes would work at the real speed of the disk.
100MB/s with dd - without any syncing - is a highly synthetic benchmark, as you never know that data have really hit the disk. Try adding conv=dsync to the dd's command line.
Also trying using larger block size. 32K is still small. IIRC 128K size was the optimal when I was testing sequential vs. random I/O few years ago.
I am seeing a longest write speed of about 47ms.
"Real time" != "fast". If I define max response time of 50ms, and your app consistently responds within the 50ms (47 < 50) then your app would classify as real-time.
I don't think this can be solely attributed to rotational latency of the disk. Any ideas what is going on and what I can do to get more stable write speeds?
I do not think you can avoid the write() delays. Latencies are the inherit property of the disk I/O. You can't avoid them - you have to expect and handle them.
I can think only of the following option: use two buffers. First would be used by write(), second - used for storing new incoming data from threads. When write() finishes, switch the buffers and if there is something to write, start writing it. That way there is always a buffer for threads to put the information into. Overflow might still happen if threads generate information faster than the write() can write. Dynamically adding more buffers (up to some limit) might help in the case.
Otherwise, you can achieve some sort of real-time-ness for (rotational) disk I/O only if your application is the sole user of the disk. (Old rule of real time applications applies: there can be only one.) O_DIRECT helps somehow to remove the influence of the OS itself from the equation. (Though you would still have the overhead of file system in form of occasional delays due to block allocation for the file extension. Under Linux that works pretty fast, but still can be avoided by preallocating the whole file in advance, e.g. by writing zeros.) If the timing is really important, consider buying dedicated disk for the job. SSDs have excellent throughput and do not suffer from the seeking.
Are you writing to a new file or overwriting the same file?
The big difference with dd is likely to be seek time, dd is streaming to a contigous (mostly) list of blocks, if you are writing lots of small files the head may be seeking all over the drive to allocate them.
The best way of solving the problem is likely to be removing the requirement for the log to be written in a specific time. Can you use a set of buffers so that one is being written (or at least sent to the drives's buffer) while new log data is arriving into another one?
linux does not write anything directly to the disk it will use the virtual memory and then, a kernel thread call pdflush will write these datas to the disk , the behavior of pdflush could be controlled through sysctl -w ""

Resources