I am running some experiments with I/O-intensive applications and am trying to understand the effects of varying the kernel I/O buffer size, of different elevator algorithms, and so on.
How can I find out the current size of the I/O buffer in the kernel? Does the kernel use more than one buffer as the need arises? How can I change the size of this buffer? Is there a config file somewhere that stores this info?
(To be clear, I am not talking about processor or disk caches; I am talking about the buffer the kernel uses internally to buffer reads/writes before flushing them out to disk from time to time.)
Thanks in advance.
The kernel does not buffer reads and writes the way you think... It maintains a "page cache" that holds pages from the disk. You do not get to manipulate its size (well, not directly, anyway); the kernel will always use all available free memory for the page cache.
You need to explain what you are really trying to do. If you want some control over how much data the kernel pre-fetches from disk, try a search for "linux readahead". (Hint: blockdev --setra XXX)
If you want some control over how long the kernel will hold dirty pages before flushing them to disk, try a search for "linux dirty_ratio".
A specific application can also bypass the page cache completely by using O_DIRECT, and it can exercise some control over it using fsync, sync_file_range, posix_fadvise, and posix_madvise. (O_DIRECT and sync_file_range are Linux-specific; the rest are POSIX.)
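As a minimal sketch (my own illustration, not from the question) of using those calls to keep written data from lingering in the page cache:

    #include <fcntl.h>
    #include <unistd.h>

    /* Write a buffer, push it to disk, then hint that the kernel may
       evict the pages from the page cache. */
    int write_and_drop(int fd, const char *buf, size_t len)
    {
        if (write(fd, buf, len) != (ssize_t)len)
            return -1;
        if (fsync(fd) != 0)          /* flush the file's dirty pages to disk */
            return -1;
        /* Advise over the whole file (len 0 means "to end of file"). */
        return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    }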
You will be able to ask a better question if you first educate yourself about the Linux VM subsystem, especially the page cache.
I think you mean the disk IO queues. For example:
$ cat /sys/block/sda/queue/nr_requests
128
How this queue is used depends on the IO scheduler that is in use.
$ cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
cfq is the most common choice, although on systems with advanced disk controllers, and in virtual guest systems, noop is also a very good choice.
There is no config file for this information that I am aware of. On systems where I need to change the queue settings, I put the changes into /etc/rc.local, although you could use a full-blown init script instead and place it in an RPM or DEB for mass distribution to a lot of systems.
Related
Ok, first a little background to help make my question clear:
I am working on a device that collects certain data from sensors and posts it to a server using a GSM modem. As a GSM connection is not 100% reliable, the device contains a logging mechanism that writes unsent data to an SD card.
We are using Chan's FatFs module to provide a file system, as we want the log to be readable on a PC.
Now I've been testing the FAT system for boundary conditions, i.e., trying to fill up the card completely.
In the first run I opened the file and set the code to keep writing a string until the drive was full. The program would sync after every write.
I left the code running overnight.
The next day, I examined the SD card. I found that the file was only 150 MB in size. There were about 1.2 million lines written to it. The card could still be read from but not written to or formatted.
The next time I tried the same type of test, but this time I used the f_lseek() function to pre-allocate the file to 1 GB. It would then write to that file until that limit was reached, syncing after every 50 writes. It would then close that file and open another to do the same.
As you can guess, another brave little card lost its mind that day.
So this is what I would like help with:
How to prevent damage to the card while writing large amounts of data?
Does leaving the file open for extended periods have any negative effects?
Since the full code may be too long, here's the main part where the writing happens:
for (file_count = 3; file_count >= 0; --file_count) {
    ax_log_msg(E_LOG_INFO, "===================================");
    ax_log_msg(E_LOG_INFO, file_names[file_count]);
    result = f_open(&file_ptr, file_names[file_count], FA_WRITE | FA_OPEN_ALWAYS);
    if (result != FR_OK) {
        ax_log_msg(E_LOG_INFO, "\n\rf_open Failed\n\rResult code");
        ax_log_msg(E_LOG_INFO, FRESULT_S[result]);
        continue;
    }
    ax_log_msg(E_LOG_INFO, "\n\rf_open Successful");
    /* Pre-allocate the file by seeking to the size limit. */
    result = f_lseek(&file_ptr, FILE_SIZE_LIMIT_1GB);
    if (result != FR_OK) {
        ax_log_msg(E_LOG_INFO, "\n\rf_lseek Failed for preallocation\n\rResult code");
        ax_log_msg(E_LOG_INFO, FRESULT_S[result]);
        f_close(&file_ptr);
        continue;
    }
    ax_log_msg(E_LOG_INFO, "\n\rf_lseek Successful for preallocation");
    f_lseek(&file_ptr, 0);              /* rewind to the start before writing */
    bytes_to_write = sizeof(messages[file_count]);
    write_count = 0;
    while (f_tell(&file_ptr) < FILE_SIZE_LIMIT_1GB) {
        result = f_write(&file_ptr, messages[file_count], bytes_to_write, &bytes_written);
        if (result == FR_OK) {
            ++write_count;
            if (write_count % 50 == 0) {
                f_sync(&file_ptr);      /* flush to the card every 50 writes */
            }
        } else {
            ax_log_msg(E_LOG_INFO, "\n\rWrite failed\n\rFRESULT=");
            ax_log_msg(E_LOG_INFO, FRESULT_S[result]);
            break;
        }
    }
    f_close(&file_ptr);
}
Note:
ax_log_msg() is part of the device firmware and prints to the console.
FRESULT_S[result] converts the enum result code to a string.
If there is any data missing, please do mention it.
Thank You
You probably need to buffer an entire block of data, perhaps 4 KB, to avoid re-flashing an entire erase block with every flush. But the filesystem or driver should do this for you, as long as you don't call a flush/sync function explicitly, which is the real lesson.
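To make that concrete, here is a minimal sketch of such an application-level buffer on top of FatFs (buffered_write and BLOCK_SIZE are illustrative names of mine; f_write/f_sync are the FatFs calls already used in the question):

    #include <string.h>
    #include "ff.h"                     /* FatFs */

    #define BLOCK_SIZE 4096             /* flush in whole 4 KB blocks */

    static BYTE block_buf[BLOCK_SIZE];
    static UINT block_used;

    FRESULT buffered_write(FIL *fp, const void *data, UINT len)
    {
        const BYTE *p = data;
        while (len > 0) {
            UINT chunk = BLOCK_SIZE - block_used;
            if (chunk > len)
                chunk = len;
            memcpy(block_buf + block_used, p, chunk);
            block_used += chunk;
            p += chunk;
            len -= chunk;
            if (block_used == BLOCK_SIZE) {     /* only full blocks hit the card */
                UINT written;
                FRESULT res = f_write(fp, block_buf, BLOCK_SIZE, &written);
                if (res != FR_OK)
                    return res;
                if (written != BLOCK_SIZE)      /* card full */
                    return FR_DENIED;
                block_used = 0;
                res = f_sync(fp);               /* one sync per block, not per line */
                if (res != FR_OK)
                    return res;
            }
        }
        return FR_OK;
    }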
Why do you need it to be synced so often? Perhaps a timer would work better than an interval per number of records?
With a limit of around 100,000 write cycles per sector, extending a flash memory's lifespan is a really challenging task. One of my cards died overnight after I ran write tests on it. When I did the arithmetic afterwards, it turned out that 100,000 writes to the same sector are indeed easy to reach in a single night.
At the time I was told that some filesystems include smart monitors that count and record the number of writes to every sector, so that writes are spread evenly across sectors. I have neither used nor tested one.
I have since found an extremely popular, highly voted answer with suggestions for the Raspberry Pi, and I quote it here:
These methods should increase the lifespan of the SD card by minimising the number of read/writes in various ways:
Disable Swap
Swapping is the process of using part of the SD card as virtual memory. This will increase the amount of memory available, but it will result in a high number of reads/writes. It is unlikely to increase performance significantly.
Disable swap with the swapoff command:
sudo swapoff --all
You must also prevent it from coming back after a reboot:
For Raspbian, which uses dphys-swapfile to manage a swap file (instead of a "normal" swap partition), you can simply run sudo apt-get remove dphys-swapfile to remove it permanently. It is best to remove it, because setting CONF_SWAPSIZE to 0, as explained in this answer, doesn't seem to work and still creates a 100 MB swap file after reboot.
For other distributions that use a swap partition instead of a swap file, remove the appropriate line from /etc/fstab
Disabling Journaling on the Filesystem
Using a journaling filesystem such as ext3 or ext4 WITHOUT a journal is an option to decrease read/writes. The obvious drawback of using a filesystem with journaling disabled is data loss as a result of an ungraceful dismount (e.g. after a power failure, kernel lockup, etc.).
You can disable journaling on ext3 by mounting it as ext2
You can disable journaling on ext4 on an unmounted drive like this:
tune4fs -O ^has_journal /dev/sdaX
e4fsck -f /dev/sdaX
sudo reboot
The noatime Mount Flag
Assign the noatime mount flag to partitions residing on the SD card by adding it to the options section of the partition in /etc/fstab.
Read accesses to the file system will no longer result in an update to the atime information associated with the file. The importance of the noatime setting is that it eliminates the need for the system to make writes to the file system for files which are simply being read. Since writes can be somewhat expensive, as mentioned in the previous section, this can result in measurable performance gains. Note that the write time information of a file will continue to be updated whenever the file is written to with this option enabled.
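For example, a root partition entry with noatime added might look like this (device and mount point are illustrative):
/dev/mmcblk0p2  /  ext4  defaults,noatime  0  1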
Directories in RAM
Highly used directories such as /var/tmp/ and possibly /var/log can be relocated to RAM in /etc/fstab like this:
tmpfs /var/tmp tmpfs nodev,nosuid,size=50M 0 0
This will allow /var/tmp to use 50MB of RAM as disk space. The only issue with doing this is that any drives mounted in RAM will not persist past a reboot. Thus if you mount /var/log and your system encounters an error that causes it to reboot, you will not be able to find out why.
Directories on an External Hard Disk
You can also mount some directories on a persistent USB hard disk. More details of this can be found in this question.
The Raspberry Pi can also boot its root partition from an external drive. This could be via USB or Ethernet, and means that the SD card is only used to hand off to a different device during boot. This requires a bit of kernel hacking to accomplish, as I don't think the default kernel supports USB storage. You can find more information at this question, or this external blog post.
Here is one more interesting consideration from another answerer:
Excellent article about flash filesystems.
An important question when talking about flash filesystems is the following: what is wear leveling? See the Wikipedia article. Basically, on flash disks each block can be written a limited number of times before it goes bad. After that, the filesystem (if there is no built-in wear-leveling management in hardware, as there usually is in the case of SSDs) must mark that block as invalid and avoid using it anymore.
Typical filesystems (for example reiserfs, ntfs, ext3 and so on) are designed for hard disks, which do not have such limitations.
JFFS2
Includes compression and elegant wear-leveling support.
YAFFS2
The one thing that sets it apart: short mount times after a successful umount.
Implements a write-once property: once data is written to a block, there is no need to rewrite it. This is important for reducing wear.
LogFS
Not very mature, but already included in the Linux kernel tree.
Supports larger filesystems than JFFS2/YAFFS2 without problems.
UBIFS
More mature than LogFS
Write caching support
On scalability: see the article. On large disks, it gives better performance than JFFS2.
ext4
If neither the driver nor the card handles wear leveling (SSD drives, for example, usually do have internal wear leveling), then ext4 is not the best idea, as it is not intended for raw flash usage.
Which one is best?
Of course, it depends on usage and support. From what I have read on the internet, I would recommend UBIFS: good support for large filesystems, a mature development phase, adequate performance, and no huge downsides.
Thanks to answerers:
How can I extend the life of my SD card?
Choice of filesystem for GNU/Linux on an SD card
Based on the sync manual page, there is no guarantee that the disk will flush its cache after calling sync:
"According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait. (This still does not guarantee data integrity: modern disks have large caches.) "
And the fsync manual makes no mention of this.
Is there a way to make sure that all writes to a disk, especially a portable device (USB), have finished after calling sync? I have encountered cases where data and metadata had not been fully written to disk after calling sync/fsync.
I am curious how "Safely remove device" in windows/linux knows that all data has been fully written by the device itself.
"I am curious how 'Safely remove device' in windows/linux knows that all data has been fully written by the device itself."
For Unix-ish systems:
Unmount the USB-device's partitions using the umount command or the umount() system call.
Running
blockdev --flushbufs /dev/sdX
might flush the device's buffers, but it does not keep anybody from accessing the device again and refilling them.
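The same flush can be requested from a program; this is a sketch of roughly what blockdev --flushbufs does under the hood (flush_block_device and the device path are illustrative):

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>               /* BLKFLSBUF */
    #include <unistd.h>

    /* Flush the kernel's buffers for a block device (usually needs root). */
    int flush_block_device(const char *dev)     /* e.g. "/dev/sdb" */
    {
        int fd = open(dev, O_RDONLY);
        if (fd < 0)
            return -1;
        int rc = ioctl(fd, BLKFLSBUF, 0);       /* flush buffers for dev */
        close(fd);
        return rc;
    }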
Also there is this kernel interface in the /proc file system:
/proc/sys/vm/drop_caches
which can be used to flush different buffers:
Verbatim from https://www.kernel.org/doc/Documentation/sysctl/vm.txt
[...]
To free dentries and inodes:
echo 2 > /proc/sys/vm/drop_caches
[...]
At least in principle, this is a Linux bug. The specification for sync functions is that the data is fully written to permanent storage; leaving it in a hardware cache is not conforming.
I'm not sure what the correct workaround is, but you can probably strace the hdparm utility running with the -F option (I think that's the right one) to see what it's doing (or read the source, but strace is a lot easier).
I was implementing an efficient text file loader and found some good advice from the author of GNU grep in this post:
http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
One of things he suggests is to do read() calls of page aligned blocks of data into page aligned buffers. Apparently this allows the kernel to avoid some extra buffering.
I've been searching and I haven't heard anyone else back up this claim. Is it true that calling read() into a page-aligned buffer (perhaps allocated with mmap/posix_memalign etc.) is actually more efficient? If it's not true, is it something that used to be true? Does it heavily depend on the underlying file system or other factors like that?
Thanks!
Normally, read() will read into a kernel buffer, then copy it to user space. This extra copy is what is being discussed.
Linux supports "direct I/O" via the O_DIRECT flag to open(). This will skip kernel buffering and read directly into the userspace buffer. However, this direct I/O requires aligned accesses and buffers. So I don't think the author of that post meant that magic happens when you're aligned, but rather that if you align carefully, you can use "closer-to-the-metal" techniques to extract more performance.
mmap() is a much easier way to get the same effect. When the mapping is first set up, no I/O happens. When the user first accesses a page in the mapping, a page fault is triggered, which the kernel handles by allocating the user's page and performing the I/O to fill it. No copy. But again, the I/O happens in page-sized chunks, on page-aligned boundaries.
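And a sketch of the mmap() route for comparison (illustrative, with error handling kept minimal):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Count newlines in a file via mmap: no read() copy into user space. */
    long count_newlines(const char *path)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) != 0)
            return -1;
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                              /* the mapping survives the close */
        if (p == MAP_FAILED)
            return -1;
        long n = 0;
        for (off_t i = 0; i < st.st_size; i++)  /* pages fault in on first touch */
            n += (p[i] == '\n');
        munmap(p, st.st_size);
        return n;
    }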
Whether this is a big deal or not depends on how fast memory copies are relative to the I/O, and what proportion of CPU time is spent copying rather than doing real work. A web server, for instance, often doesn't even have to look at what it's reading: it just writes it back out on a socket (which incurs another copy). That's why a bunch of work has gone into "zero-copy" techniques like the sendfile() and splice() system calls. These are specialized workloads. Normally, the buffering is too small an effect to worry about.
I have a program that is used to exercise several disk units in a RAID configuration. One process synchronously (O_SYNC) writes random data to a file using write(). It then puts the name of the directory into a shared-memory queue, where a second process waits for entries so it can read the data back into memory using read().
The problem that I can't seem to overcome is that when the second process attempts to read the data back into memory, none of the disk units show read accesses. The program checks whether or not the data read back in is equal to the data that was written to disk, and the data always matches.
My question is, how can I make the OS (IBM i) not buffer the data when it is written to disk so that the read() system call accesses the data on the disk rather than in cache? I am doing simple throughput calculations and the read() operations are always 10+ times faster than the write operations.
I have tried using the O_DIRECT flag, but cannot seem to get the data written to the file; it may have to do with setting up correctly aligned buffers. I have also tried the posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED) system call.
I have read through this similar question but haven't found a solution. I can provide code if it would be helpful.
My thought is that if you write ENOUGH data, then there simply won't be enough memory to cache it, and thus SOME data must be written to disk.
You can also, if you want to make sure that small writes to your file work, try writing ANOTHER large file (either from the same process or a different one - for example, you could start a process like dd if=/dev/zero of=myfile.dat bs=4k count=some_large_number) to force other data to fill the cache.
Another "trick" may be to "chew up" some (more like most) of the RAM in the system - just allocate a large lump of memory, then write to some small part of it at a time - for example, an array of integers, where you write to every 256th entry of the array in a loop, moving to one step forward each time - that way, you walk through ALL of the memory quickly, and since you are writing continuously to all of it, the memory will have to be resident. [I used this technique to simulate a "busy" virtual machine when running VM tests].
The other option is of course to nobble the caching system itself in the OS/filesystem driver, but I would be very worried about doing that - it would almost certainly slow the system to a crawl, and unless there is an existing option to disable it, you may find it hard to do accurately/correctly/reliably.
...exercise several disk units in a raid configuration... How? IBM i doesn't allow a program access to the hardware. How are you directing I/O to any specific physical disks?
ANSWER: The write/read operations are done in parallel against the IFS, so the stream file manager selects which disks to target. With enough threads reading/writing, the busyness of SYSBASE or an IASP can be driven up.
...none of the disk units show read accesses. None of them? Unless you are running the sole job on a system in restricted state, there is going to be read activity on the disks from other tasks. Is the system divided into multiple LPARs? Multiple ASPs? I'm suggesting that you may be monitoring disks that this program isn't writing to, because IBM i handles physical I/O, not programs.
ANSWER: I guess "none of them" is a slight exaggeration - I know which disks belong to SYSBASE, and those disks are not being targeted with many read requests. I was just trying to generalize for an audience not familiar with IBM i. In the disk statistics, you can see that the write requests drive the % busy up, but the read requests do not, even though they target the same files.
...how can I make the OS (IBM i) not buffer the data when it is written to disk... Use a memory-starved main storage pool to maximise paging, write immense blocks of data so as to guarantee that the system and disk controller caches overflow, and use a busy machine so that other tasks are demanding disk I/O as well.
I am working on an application which sequentially writes a large file (and does not read at all), and I would like to use posix_fadvise() to optimize the filesystem behavior.
The function description in the manpage suggests that the most appropriate strategy would be POSIX_FADV_SEQUENTIAL. However, the description of the Linux implementation casts doubt on that:
Under Linux, POSIX_FADV_NORMAL sets the readahead window to the default size for the backing device; POSIX_FADV_SEQUENTIAL doubles this size, and POSIX_FADV_RANDOM disables file readahead entirely.
As I'm only writing data (and possibly overwriting existing files), I don't expect any readahead. Should I then stick with POSIX_FADV_SEQUENTIAL, or rather use POSIX_FADV_RANDOM to disable readahead?
How about other options, such as POSIX_FADV_NOREUSE? Or maybe not use posix_fadvise() for writing at all?
Most of the posix_fadvise() flags (e.g. POSIX_FADV_SEQUENTIAL and POSIX_FADV_RANDOM) are hints about readahead rather than writing.
There's some advice from Linus here and here about getting good sequential write performance. The idea is to break the file into large-ish (8MB) windows, then loop around doing:
Write out window N with write();
Request asynchronous write-out of window N with sync_file_range(..., SYNC_FILE_RANGE_WRITE)
Wait for the write-out of window N-1 to complete with sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER)
Drop window N-1 from the pagecache with posix_fadvise(..., POSIX_FADV_DONTNEED)
This way you never have more than two windows worth of data in the page cache, but you still get the kernel writing out part of the pagecache to disk while you fill the next part.
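Putting those four steps together, a sketch of the loop (assuming Linux, where sync_file_range is available; error handling is omitted for brevity):

    #define _GNU_SOURCE                 /* sync_file_range is Linux-specific */
    #include <fcntl.h>
    #include <unistd.h>

    #define WINDOW (8 * 1024 * 1024)    /* 8 MB windows, as suggested */

    void windowed_write(int fd, const char *data, size_t total)
    {
        for (off_t off = 0; (size_t)off < total; off += WINDOW) {
            size_t n = total - off < WINDOW ? total - off : WINDOW;
            write(fd, data + off, n);                           /* 1. write window N */
            sync_file_range(fd, off, n, SYNC_FILE_RANGE_WRITE); /* 2. start async write-out */
            if (off >= WINDOW) {
                /* 3. wait until window N-1 is fully on disk... */
                sync_file_range(fd, off - WINDOW, WINDOW,
                                SYNC_FILE_RANGE_WAIT_BEFORE |
                                SYNC_FILE_RANGE_WRITE |
                                SYNC_FILE_RANGE_WAIT_AFTER);
                /* 4. ...then drop it from the page cache */
                posix_fadvise(fd, off - WINDOW, WINDOW, POSIX_FADV_DONTNEED);
            }
        }
    }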
It all depends on the temporal locality of your data. If your application won't need the data soon after it is written, then you can go with POSIX_FADV_NOREUSE to avoid writing to the buffer cache (in a similar way to the O_DIRECT flag of open()).
As far as writes go, I think you can just rely on the OS's disk I/O scheduler to do the right thing.
You should keep in mind that while posix_fadvise is there specifically to give the kernel hints about future file usage patterns, the kernel also has other data to help it out.
If you don't open the file for reading, then the kernel would only need to read blocks in when they are partially written. If you truncate the file to 0 first, it doesn't even have to do that (you said that you were overwriting).
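For instance, a one-line sketch of that truncation (the file name is illustrative):

    #include <fcntl.h>

    /* O_TRUNC discards the old contents, so the kernel never has to read
       old blocks back in before they are overwritten. */
    int fd = open("data.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);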