I have a C program running on Linux that acquires data from a USB device (sensor data), does some processing and streams the result to disk. Currently I save to a text file using fputs(), a line looks like this:
timestamp value1 value2 ... valueN
the sample rate being up to 250Hz.
The program should run on a RPi or similar board and possibly write the data to a flash memory (SD card).
I have the following questions:
Should I be optimizing the data stream or let the OS do the job? More specifically, should I be trying to minimize how often data is actually written to disk (also given the use of a flash memory)?
I have read about setbuf() and setvbuf(), as I understand they should effectively delay writing until a "block" is filled. Are these appropriate or is there a better way other than perhaps implementing my own buffer?
Which output function is best suited for data streaming with the above in mind (fputs() / fprintf() / write())?
Should I be trying to increase randomness (as to use all sectors) when writing to a SD card? If yes what's the best way to achieve this?
Here are some more thoughts:
I can consider using a binary format to decrease size, but I would prefer keeping the text format to simplify later data handling.
Using a hard drive is also an option in the final design, especially if a high acquisition rate is to be carried on over a long time.
Since the data rate is relatively low, I do not expect bandwidth problems with either a hard drive or an SD card. It is possible that the rate will be higher in the future (kHz or more).
EDIT 20130128
Thank you for all the answers so far, they give me some good insight. I'll sum it up a bit:
In general I should not have bandwidth issues; however, to avoid unnecessarily large log files I might consider a binary format. Yes, the log should be human readable; if not, I'll make an export function or similar. Yes, unwind's assumption is correct: about 10 or 15 data values on each line.
The mentioned read/write cycles per cell should be enough for some time, at least in the testing phase, considering we don't always write and delete the same cells. I will play around with buffer size in setvbuf() and set the buffering mode to full buffering to see if I can optimize this while keeping a reasonable save interval (a few seconds or more also depending on sample rate).
In the final design I might use a hard drive to avoid most of the problems mentioned here, or a second SD card which can be easily replaced (this might also be good for quickly retrieving the data). I will format this with one of the formats suggested here (FAT or JFFS2/F2FS).
Following zmo's suggestion I will try to make the system as read only as possible (at least the system partition), I was already considering this.
A Beaglebone, also mentioned by zmo, is my next choice if I'm not happy with the RPi (I read that its USB bus is not always stable, USB is obviously very important for my application).
I have already implemented a UDP port to send data over network, still I would like to keep at least a local copy of that data and maybe only send a subset of or already processed data, as well as "control data".

Should I be optimizing the data stream or let the OS do the job? More specifically, should I be trying to minimize how often data is actually written to disk (also given the use of a flash memory)?
Well, you can usually assume that the OS does a pretty awesome job at buffering and handling output to the hard drive… As long as you don't do unbuffered writes.
Though, from my experience, you should not write logs to an SD card, because it definitely kills the SD card faster than you can imagine. On my first projects, I had installed Linux on BeagleBones, and within 6 to 12 months all my SD cards had to be replaced…
Since then, I've learned to run read only systems on the SD card and send any kind of regular updates over the network, the trick being to use a ramdisk for /tmp and /var.
In your case, using a hard drive is an easy solution (which will work smoothly), but you can also use a secondary SD card where you write the logs. Then you'll be able to use a "stupid" filesystem such as FAT, where you'll write your data aligned, as your data will be the only thing written to the SD card. What kills an SD card is lots of little reads/writes, which happen a lot with temporary files, and defragmentation of the drive.
I have read about setbuf() and setvbuf(), as I understand they should effectively delay writing until a "block" is filled. Are these appropriate or is there a better way other than perhaps implementing my own buffer?
Well, just keep it to full buffering; it will help write your data aligned on the filesystem.
Which output function is best suited for data streaming with the above in mind (fputs() / fprintf() / write())?
They should all behave similarly for your problem.
Should I be trying to increase randomness (as to use all sectors) when writing to a SD card? If yes what's the best way to achieve this?
The firmware of the SD card should be taking care of that for you. The only thing would be to use a simpler filesystem like FAT (or JFFS2/F2FS like ivan-voras suggests), because ext2/ext3/ext4 filesystems do automatic defragmentation, which basically means moving inodes around to keep everything aligned. Though I'm not sure whether that behavior is disabled with SD cards and SSDs.

Writing to the SD card often will definitely kill it sooner, but it also means you can attempt to prolong this time by reducing the number of writes. As others have said, the best solution for you would be to write the logs over the network to a server or just another machine which has proper storage (in the simplest case, maybe you can use syslog(3) or just plain NFS).
If you want to continue with the original plan, then using setvbuf(3) to enable block buffered mode and setting a large buffer size (like 128 KiB or 256 KiB) would be best. A large buffer size also means that you will lose unwritten data from the buffer if power goes out, etc.
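A minimal sketch of that setup, assuming the 128 KiB figure mentioned above and the "timestamp value1 ... valueN" line format from the question. The function names `open_buffered_log` and `log_sample` are made up for illustration:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define LOG_BUF_SIZE (128 * 1024)  /* 128 KiB, the size suggested above */

/* Open a log file in fully buffered mode. The caller owns *buf_out and must
 * free it only after fclose(), because stdio keeps using it until then. */
FILE *open_buffered_log(const char *path, char **buf_out)
{
    assert(path != NULL && buf_out != NULL);
    FILE *fp = fopen(path, "a");
    if (fp == NULL)
        return NULL;

    /* setvbuf() must be called before any other operation on the stream. */
    char *buf = malloc(LOG_BUF_SIZE);
    if (buf == NULL || setvbuf(fp, buf, _IOFBF, LOG_BUF_SIZE) != 0) {
        free(buf);
        fclose(fp);
        return NULL;
    }
    *buf_out = buf;
    return fp;
}

/* Write one "timestamp value1 ... valueN" line; returns 0 on success. */
int log_sample(FILE *fp, double timestamp, const double *values, size_t n)
{
    assert(fp != NULL);
    if (fprintf(fp, "%.6f", timestamp) < 0)
        return -1;
    for (size_t i = 0; i < n; i++)
        if (fprintf(fp, " %g", values[i]) < 0)
            return -1;
    return fputc('\n', fp) == EOF ? -1 : 0;
}
```

With this in place, data reaches the kernel only when the buffer fills, or on `fflush()` or `fclose()`, so anything still in the buffer is what would be lost on a power cut.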
However, a large buffer only delays the inevitable and you should search for other options. It's not as alarming as Lundin's answer states because there are many cells and you're not writing always to the same one, so if you get the largest SD card you can buy, then using his method you can calculate approximately how many times you can rewrite the entire card before it fails. Using a flash-friendly file system such as F2FS or JFFS2 will be beneficial.

Here're my thoughts:
It might be a good idea to buffer some data in memory before writing to disk, but keep in mind that this might cause data loss in case of power failure.
I think this is highly dependent on the file system and type of storage you use. There is no generic answer but it could prove useful to implement and benchmark it on your specific configuration.
Considering the huge amount of data you're outputting, I'd choose a binary format (unless you want the file to be human readable)
The firmware of the flash drive should already take care of this. Basically this is the cornerstone of all modern SSDs. (SD card controllers should implement it too.)
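For the binary-format suggestion, here is a rough sketch of what a fixed-size record could look like. The field layout, the microsecond timestamp, the use of `float`, and the count of 15 values are all assumptions for illustration, not something the question specifies:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define N_VALUES 15   /* assumed number of values per sample */

/* One fixed-size sample. Using float instead of double halves the payload;
 * whether that precision is enough depends on the sensor. */
struct sample {
    uint64_t timestamp_us;      /* assumed unit: microseconds */
    float    values[N_VALUES];
};

/* Append one record; returns 0 on success. */
int write_sample(FILE *fp, const struct sample *s)
{
    assert(fp != NULL && s != NULL);
    return fwrite(s, sizeof *s, 1, fp) == 1 ? 0 : -1;
}

/* Read the next record; returns 0 on success, -1 at end of file or on error. */
int read_sample(FILE *fp, struct sample *s)
{
    assert(fp != NULL && s != NULL);
    return fread(s, sizeof *s, 1, fp) == 1 ? 0 : -1;
}
```

Dumping a struct like this ties the file to the writer's byte order and padding, so it is only suitable when the reader runs on a compatible platform. An export tool that turns the records back into text would keep the human-readable option open.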


How to avoid damaging SD card for large writes?

Ok, first a little background to help make my question clear:
I am working on a device that collects certain data from sensors and posts them to a server using a GSM modem. As a GSM connection is not 100% reliable, it would contain a logging mechanism that would write unsent data to an SD card.
We are using Chan's FatFs module for providing us with a file system as we want the log to be readable on a PC.
Now I've been testing the FAT system for boundary conditions, i.e., trying to fill up the card completely.
In the first run I opened the file and set the code to keep writing a string until the drive was full. The program would synch after every write.
I left the code running overnight.
The next day, I examined the SD card. I found that the file was only 150 MB in size. There were about 1.2 million lines written to it. The card could still be read from but not written to or formatted.
Next time I tried the same type of test, but this time I used the f_lseek() function to pre-allocate the file to 1GB. It would then write to that file until that limit was reached. This time the data would be synced after 50 writes. It would then close that file and open another to do the same.
As you can guess, another brave little card lost its mind that day.
So these are what I would like help with :
How to prevent damage to the card while writing large amounts of data?
Does leaving the file open for extended periods have any negative effects?
Since the full code may be too long, here's the main part where the writing happens
ax_log_msg(E_LOG_INFO,"\n\rf_open Failed\n\rResult code");
ax_log_msg(E_LOG_INFO,"\n\rf_open Sucessfull");
ax_log_msg(E_LOG_INFO,"\n\rf_lseek Failed for preallocation\n\rResult code");
ax_log_msg(E_LOG_INFO,"\n\rf_lseek Sucessfull for preallocation");
while( (f_tell(&file_ptr) < FILE_SIZE_LIMIT_1GB )){
ax_log_msg(E_LOG_INFO,"\n\rWrite failed\n\rFRESULT=");
Note :
ax_log_msg() is part of the device firmware to print on console.
FRESULT_S[result] is used to convert the enum result code to a string.
If there is any data missing, please do mention it.
You probably need to buffer an entire block of data, perhaps 4 KB, to avoid flashing an entire block with every flush. But, the filesystem or driver should do this for you, as long as you don't call fflush explicitly, which is the real lesson.
Why do you need it to be synced so often? Perhaps a timer would work better than an interval per number of records?
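The timer idea could be sketched roughly as below. The `sync_timer` struct and `sync_due` function are hypothetical names, and the clock is passed in as a plain `time_t` so that whatever time source the firmware has can drive it:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Decides when a sync is due, based on elapsed time instead of write count. */
struct sync_timer {
    time_t last_sync;     /* time of the most recent sync */
    time_t interval_s;    /* minimum number of seconds between syncs */
};

/* Returns true (and restarts the interval) if at least interval_s seconds
 * have passed since the last sync; otherwise returns false. */
bool sync_due(struct sync_timer *t, time_t now)
{
    assert(t != NULL);
    if (now - t->last_sync < t->interval_s)
        return false;
    t->last_sync = now;
    return true;
}
```

In the write loop, the sync call (f_sync() in FatFs) would then run only when `sync_due()` returns true, which caps the number of syncs per hour regardless of how many records arrive.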
Because of the limit of about 100,000 write cycles per sector, extending the lifespan of flash memory is a real challenge. One of my cards died overnight after I ran write tests on it. I later worked out the timing, and it is indeed easy to perform 100,000 writes to the same sector in a single night (I did not need the calculation, though; I learned it from experience).
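As a rough check against the numbers in the question (about 1.2 million lines, with a sync after every write), and assuming that each sync rewrites the same metadata sector and that the card does no wear leveling (both are assumptions, not measurements):

```latex
\frac{1.2 \times 10^{6}\ \text{syncs}}{1 \times 10^{5}\ \text{write cycles per sector}} = 12
```

Under those assumptions, the first test alone would have exceeded the quoted limit about twelve times over.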
At that time I was told that some filesystems keep a count of the writes to every sector and use it to even out the number of writes per sector, I guess. I have never used or tested one.
I have since found a highly voted answer with suggestions for the Raspberry Pi, and I quote it here:
These methods should increase the lifespan of the SD card by minimising the number of read/writes in various ways:
Disable Swap
Swapping is the process of using part of the SD card as volatile memory. This will increase the amount of RAM available, but it will result in a high number of read/writes. It is unlikely to increase performance significantly.
Disable swap with the swapoff command:
sudo swapoff --all
You must also prevent it from coming back after a reboot:
For Raspbian which uses dphys-swapfile to manage a swap file (instead of a "normal" swap partition) you can simply sudo apt-get remove dphys-swapfile to remove it permanently. Best to remove because setting the CONF_SWAPSIZE to 0, as explained in this answer, doesn't seem to work and still creates a 100MB swap file after reboot.
For other distributions that use a swap partition instead of a swap file, remove the appropriate line from /etc/fstab
Disabling Journaling on the Filesystem
Using a journaling filesystem such as ext3 or ext4 WITHOUT a journal is an option to decrease read/writes. The obvious drawback of using a filesystem with journaling disabled is data loss as a result of an ungraceful dismount (i.e. post power failure, kernel lockup, etc.).
You can disable journaling on ext3 by mounting it as ext2
You can disable journaling on ext4 on an unmounted drive like this:
tune4fs -O ^has_journal /dev/sdaX
e4fsck -f /dev/sdaX
sudo reboot
The noatime Mount Flag
Assign the noatime mount flag to partitions residing on the SD card by adding it to the options section of the partition in /etc/fstab.
Reading accesses to the file system will no longer result in an update to the atime information associated with the file. The importance of the noatime setting is that it eliminates the need by the system to make writes to the file system for files which are simply being read. Since writes can be somewhat expensive as mentioned in previous section, this can result in measurable performance gains. Note that the write time information to a file will continue to be updated anytime the file is written to with this option enabled.
Directories in RAM
Highly used directories such as /var/tmp/ and possibly /var/log can be relocated to RAM in /etc/fstab like this:
tmpfs /var/tmp tmpfs nodev,nosuid,size=50M 0 0
This will allow /var/tmp to use 50MB of RAM as disk space. The only issue with doing this is that any drives mounted in RAM will not persist past a reboot. Thus if you mount /var/log and your system encounters an error that causes it to reboot, you will not be able to find out why.
Directories in external Hard Disk
You can also mount some directories on a persistent USB hard disk. More details of this can be found in this question.
The Raspberry Pi can also boot its root partition from an external drive. This could be via USB or Ethernet and means that the SD card will only be used to delegate to a different device during boot. This requires a bit of kernel hacking to accomplish, as I don't think the default kernel supports USB storage. You can find more information at this question, or this external blog post.
Here is one more interesting consideration from another answerer:
Excellent article about flash filesystems.
An important question when talking about flash filesystems is the following: what is wear leveling? (See the Wikipedia article.) Basically, on flash disks you can write only a limited number of times before a block goes bad. After that, the filesystem (if there is no built-in wear leveling management in hardware, as there usually is in SSDs) must mark that block as invalid and avoid using it anymore.
Typical filesystems (for example reiserfs, ntfs, ext3 and so on) are designed for hard disks, which do not have such limitations.
Includes compression and elegant wear leveling protection.
Single thing that makes the difference: short mount times, after successful umount.
Implements a write-once property: once data is written to a block, there is no need to rewrite it. This is important for protecting against wear.
Not very mature, but already included in Linux kernel tree.
Supports larger filesystems than JFFS2/YAFFS2 without problems.
More mature than LogFS
Write caching support
On scalability: article. On large disks, better performance than with JFFS2
If no driver or card (for example SSD drives do have internal wear leveling, at least usually) handle wear leveling, then ext4 is not the best idea, as it is not intended for raw flash usage.
What is best one?
Of course, it depends on usage and support. From what I read from the internet, I would recommend UBIFS. Good support for large filesystems, mature development phase, adequate performance and no huge downsides.

How to prevent C read() from reading from cache

I have a program that is used to exercise several disk units in a RAID configuration. One process synchronously (O_SYNC) writes random data to a file using write(). It then puts the name of the directory into a shared-memory queue, where a second process is waiting for the queue to have entries so that it can read the data back into memory using read().
The problem that I can't seem to overcome is that when the second process attempts to read the data back into memory, none of the disk units show read accesses. The program has code to check whether or not the data read back in is equal to the data that was written to disk, and the data always matches.
My question is, how can I make the OS (IBM i) not buffer the data when it is written to disk so that the read() system call accesses the data on the disk rather than in cache? I am doing simple throughput calculations and the read() operations are always 10+ times faster than the write operations.
I have tried using the O_DIRECT flag, but cannot seem to get the data to write to the file. It could have to do with setting up the correct aligned buffers. I have also tried the posix_fadvise(fd, offset,len, POSIX_FADV_DONTNEED) system call.
I have read through this similar question but haven't found a solution. I can provide code if it would be helpful.
My thought is that if you write ENOUGH data, then there simply won't be enough memory to cache it, and thus SOME data must be written to disk.
You can also, if you want to make sure that small writes to your file work, try writing ANOTHER large file (either from the same process or a different one - for example, you could start a process like dd if=/dev/zero of=myfile.dat bs=4k count=some_large_number) to force other data to fill the cache.
Another "trick" may be to "chew up" some (more like most) of the RAM in the system - just allocate a large lump of memory, then write to some small part of it at a time - for example, an array of integers, where you write to every 256th entry of the array in a loop, moving to one step forward each time - that way, you walk through ALL of the memory quickly, and since you are writing continuously to all of it, the memory will have to be resident. [I used this technique to simulate a "busy" virtual machine when running VM tests].
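The memory-walking loop described above might look roughly like this. The function names are invented for the sketch, and the buffer itself would come from a large malloc() sized to most of the machine's RAM:

```c
#include <assert.h>
#include <stddef.h>

#define STRIDE 256  /* write to every 256th int, as described above */

/* One pass over the array: writes to entries start, start + STRIDE, ...
 * Returns the number of entries written. */
size_t touch_pass(volatile int *mem, size_t count, size_t start)
{
    assert(mem != NULL && start < STRIDE);
    size_t touched = 0;
    for (size_t i = start; i < count; i += STRIDE) {
        mem[i]++;
        touched++;
    }
    return touched;
}

/* Walk the whole array by shifting the starting offset one step per pass. */
size_t touch_all(volatile int *mem, size_t count)
{
    size_t total = 0;
    for (size_t start = 0; start < STRIDE; start++)
        total += touch_pass(mem, count, start);
    return total;
}
```

With 4-byte ints the stride is 1 KiB, which is smaller than a typical 4 KiB page, so every single pass already touches every page and keeps the allocation resident. The `volatile` qualifier is there so the compiler does not optimize the writes away.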
The other option is of course to nobble the caching system itself in OS/filesystem driver, but I would be very worried about doing that - it will almost certainly slow the system down to a slow crawl, and unless there is an existing option to disable it, you may find it hard to do accurately/correctly/reliably.
...exercise several disk units in a raid configuration... How? IBM i doesn't allow a program access to the hardware. How are you directing I/O to any specific physical disks?
ANSWER: The write/read operations are done in parallel against IFS so the stream file manager is selecting which disks to target. By having enough threads reading/writing, the busyness of SYSBASE or an IASP can be driven up.
...none of the disk units show read accesses. None of them? Unless you are running the sole job on a system in restricted state, there is going to be read activity on the disks from other tasks. Is the system divided into multiple LPARs? Multiple ASPs? I'm suggesting that you may be monitoring disks that this program isn't writing to, because IBM i handles physical I/O, not programs.
ANSWER: I guess none of them is a slight exaggeration - I know which disks belong to SYSBASE and those disks are not being targeted with many read requests. I was just trying to generalize for an audience not familiar w/IBM i. In the picture below, you will see that the write reqs are driving the % busyness up, but the read reqs are not even though they are targeting the same files.
...how can I make the OS (IBM i) not buffer the data when it is written to disk... Use a memory starved main storage pool to maximise paging, write immense blocks of data so as to guarantee that the system and disk controller caches overflow and use a busy machine so that other tasks are demanding disk I/O as well.

When are sequential seeks with small reads slower than reading a whole file?

I've run into a situation where lseek'ing forward repeatedly through a 500MB file and reading a small chunk (300-500 bytes) between each seek appears to be slower than read'ing through the whole file from the beginning and ignoring the bytes I don't want. This appears to be true even when I only do 5-10 seeks (so when I only end up reading ~1% of the file). I'm a bit surprised by this -- why would seeking forward repeatedly, which should involve less work, be slower than reading which actually has to copy the data from kernel space to userspace?
Presumably on local disk when seeking the OS could even send a message to the drive to seek without sending any data back across the bus for even more savings. But I'm accessing a network mount, where I'd expect read to be much slower than seek (sending one packet saying to move N bytes ahead versus actually transferring data across the network).
Regardless of whether reading from local disk or a network filesystem, how could this happen? My only guess is the OS is prefetching a ton of data after each location I seek to. Is this something that can normally occur or does it likely indicate a bug in my code?
The magnitude of the difference will be a factor of the ratio of the seek count/data being read to the size of the entire file.
But I'm accessing a network mount, where I'd expect read to be much slower than seek (sending one packet saying to move N bytes ahead versus actually transferring data across the network).
If there are rotational magnetic drives at the other end of the network, the effect will still be present and likely significantly compounded by the round trip time. The network protocol may play a role too. Even solid state drives may take some penalty.
I/O schedulers may reorder requests in order to minimize head movements (perhaps naively even for storage devices without a head). A single bulk request might give you some greater efficiency across many layers. The filesystems have an opportunity to interfere here somewhat.
Regardless of whether reading from local disk or a network filesystem, how could this happen?
I wouldn't be quick to dismiss the effect of those layers -- do you have measurements which show the same behavior from a local disk? It's much easier to draw conclusions without quite so much between you and the hardware. Start with a raw device and bisect from there.
Have you considered using a memory map instead? It's perfect for this use case.
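A minimal sketch of the memory-map approach on a POSIX system: the file is mapped once, and each small chunk is then just a pointer into the mapping. The `mapped_file` struct and the three helper names are made up for illustration, and how well this performs over a network mount would still need to be measured:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct mapped_file {
    const unsigned char *data;  /* start of the read-only mapping */
    size_t size;                /* file size in bytes */
};

/* Map the whole file read-only. Returns 0 on success, -1 on error. */
int map_file(const char *path, struct mapped_file *mf)
{
    assert(path != NULL && mf != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return -1;
    mf->data = p;
    mf->size = (size_t)st.st_size;
    return 0;
}

/* Pointer to `len` bytes at `offset`, or NULL if the range is out of bounds. */
const unsigned char *chunk_at(const struct mapped_file *mf, size_t offset, size_t len)
{
    if (offset > mf->size || len > mf->size - offset)
        return NULL;
    return mf->data + offset;
}

void unmap_file(struct mapped_file *mf)
{
    munmap((void *)mf->data, mf->size);
}
```

Only the pages that are actually touched get faulted in, so reading 5 to 10 chunks of a few hundred bytes each should pull in a correspondingly small part of the file, plus whatever readahead the kernel decides to do.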
Depending on the filesystem, the specific lseek implementation may create some overhead.
For example, I believe when using NFS, lseek locks the kernel by calling remote_llseek().

NAND RAW access

I'm working with a C++ application on an embedded system running Linux. This device receives messages (small chunks of a few bytes) that need to be stored in non-volatile memory in case of power failure. This worked well on another platform because a static RAM was available.
The problem on this platform is that we only have a NAND flash to do this, and we would like to append different messages in the same block without having to erase the whole block before updating it with a new message! Writing a file per message is not a good solution because there can be a lot of them! Moreover, this must be efficient and should spare the life of the flash by avoiding too many erases! What I would like to be able to do is write byte after byte into the flash without worrying about bad blocks.
I found "Petit FAT File System" and I'm wondering if this would suit my needs?
Could someone tell me if this is possible with "Petit FAT File System" or give me any suggestion on how to handle this ?
Thanks !
I haven't looked into Petit file system, but your real limitation is the NAND flash. The manufacturer's data sheet will likely indicate how many writes you can successfully make to each block before an erase is required. It's possible that there is no hard limit, but the integrity of the data will not be guaranteed after a max write count.
The answer depends on the process technology and flash cell design. For example, is it SLC or MLC NAND? SLC is going to be able to handle multiple block writes better.
Another question would be what type of flash controller is on your system? If it uses hardware ECC, then you might be limited by the controller, since 2nd writes will invalidate the ECC value of the 1st data write. If it is possible that you can do ECC calculations in software, then it comes back to the NAND limitation.
Small write support might be addressed in the data sheet, via a special set aside memory area that might be provided. So again, check the data sheet.
If you post a link, or indicate what hardware you are using, I can try and give you a more definite answer.
If you are dealing with flash, there's no way around deleting it before writing. All flash memory works in that way. Depending on your real-time requirements and the size of the data, this may or may not be an issue. But since you are using embedded Linux, real-time is probably not a major concern for the application anyhow.
I don't see why you would need a complete file system to store a few bytes?! Why do you need an external memory for this in the first place, can't you write to the internal flash of the MCU? If you just need to store a few bytes, an MCU with on-chip eeprom/data flash would likely suit your needs the best.
Also, that flash circuit doesn't look too promising. First, I find it mighty fishy that they don't state the number of cycles or the data retention but refer to the "qualification report". This might indicate that the memory is of poor quality.
And the data sheet says year 2009 and Samsung. If I may be cynical, that probably means that the chip is already obsolete. Samsung doesn't exactly have the best long-life reputation.
I'm curious why you want to use raw flash. Why not use something like JFFS2 or UBIFS on top of the MTD drive? Let the MTD driver manage the ECC while JFFS2 or UBIFS manages the wear-leveling. Then just open one file and write to it whenever you need.
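The "open one file and write to it" approach could be sketched as below, assuming a flash filesystem such as JFFS2 or UBIFS is mounted underneath. The record layout (a 1-byte length followed by the payload) and the function names are assumptions made for this example:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Open (or create) the message log for appending. Returns a file
 * descriptor, or -1 on error. */
int open_message_log(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
}

/* Append one message as [1-byte length][payload] and ask the kernel to push
 * it to storage before returning. Returns 0 on success, -1 on error. */
int append_message(int fd, const void *msg, uint8_t len)
{
    assert(fd >= 0 && msg != NULL);
    unsigned char rec[1 + UINT8_MAX];
    rec[0] = len;
    memcpy(rec + 1, msg, len);
    /* A single write() keeps the length byte and payload together. */
    if (write(fd, rec, 1u + len) != (ssize_t)(1u + len))
        return -1;
    return fsync(fd);
}
```

Calling fsync() after every message favors durability over flash wear. If losing the last few messages on a power cut is acceptable, syncing less often would be gentler on the flash.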

C program stuck on uninterruptible wait while performing disk I/O on Mac OS X Snow Leopard

One line of background: I'm the developer of Redis, a NoSQL database. One of the new features I'm implementing is Virtual Memory, because Redis takes all the data in memory. Thanks to VM, Redis is able to transfer rarely used objects from memory to disk. There are a number of reasons why this works much better than letting the OS do the swapping for us (Redis objects are built of many small objects allocated in non-contiguous places, when serialized to disk by Redis they take 10 times less space compared to the memory pages where they live, and so forth).
Now I've an alpha implementation that's working perfectly on Linux, but not so well on Mac OS X Snow Leopard. From time to time, while Redis tries to move a page from memory to disk, the redis process enters the uninterruptible wait state for minutes. I was unable to debug this, but this happens either in a call to fseeko() or fwrite(). After minutes the call finally returns and redis continues working without problems at all: no crash.
The amount of data transferred is very small, something like 256 bytes. So it should not be a matter of a very big amount of I/O performed.
But there is an interesting detail about the swap file that's the target of the write operation. It's a big file (26 gigabytes) created by opening a file with fopen() and then enlarging it using ftruncate(). Finally the file is unlink()ed, so that Redis keeps holding a reference to it but we are sure that when the Redis process exits the OS will really free the swap file.
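The sequence described here (fopen(), then ftruncate(), then unlink()) can be sketched as follows. This is not the actual Redis code, just an illustration of the pattern, and the function name is made up:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a file of `size` bytes at `path`, then remove its directory entry.
 * The returned stream stays usable; the OS frees the space once the stream
 * is closed or the process exits. Returns NULL on failure. */
FILE *create_unlinked_swap(const char *path, off_t size)
{
    assert(path != NULL && size >= 0);
    FILE *fp = fopen(path, "w+b");
    if (fp == NULL)
        return NULL;
    /* ftruncate() only sets the length. On filesystems with sparse file
     * support, no data blocks are allocated at this point. */
    if (ftruncate(fileno(fp), size) != 0) {
        fclose(fp);
        unlink(path);
        return NULL;
    }
    unlink(path);   /* the file lives on until fp is closed */
    return fp;
}
```

On a filesystem without good sparse file support, the blocks skipped by ftruncate() may have to be materialized later, when a write lands far into the file, which is one place a long stall could come from.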
Ok that's all but I'm here for any further detail. And BTW you can even find the actual code in the Redis git, but it's not trivial to understand in five minutes given that's a fairly complex system.
Thank you very much for any help.
As I understand it, HFS+ has very poor support for sparse files. So it may be that your write is triggering a file expansion that is initializing/materializing a large fraction of the file.
For example, I know mmap'ing a new large empty file and then writing at a few random locations produces a very large file on disk with HFS+. It's quite annoying since mmap and sparse files are an extremely convenient way of working with data, and virtually every other platform/filesystem out there handles this gracefully.
Is the swap file written to linearly? Meaning we either replace an existing block or write a new block at the end and increment a free space pointer? If so, perhaps doing more frequent smaller ftruncate calls to expand the file would result in shorter pauses.
As an aside, I'm curious why redis VM doesn't use mmap and then just move blocks around in an attempt to concentrate hot blocks into hot pages.
antirez, I'm not sure I'll be much help since my Apple experience is limited to the Apple ][, but I'll give it a shot.
First thing is a question. I would have thought that, for virtual memory, speed of operation would be a more important measure than disk space (especially for a NoSQL DB where speed is the whole point, otherwise you'd be using SQL, no?). But, if your swap file is 26G, maybe not :-)
Some things to try (if possible).
Try to actually isolate the problem to the seek or write. I have a hard time believing a seek could take that long since, at worst, it should be a buffer pointer change. Still, I didn't write OSX so I can't be sure.
Try adjusting the size of the swap file to see if that's what is causing the problem.
Do you ever dynamically expand the swap file (as opposed to pre-allocation)? If you do, that may be what is causing the problem.
Do you always write as low in the file as you can? It may be that creating a 26G file may not actually fill it with data but, if you create it then write to the last byte, the OS may have to zero out the bytes before then (deferring the initialization, if any).
What happens if you just pre-allocate the entire file (write to every byte) and not unlink it? In other words, leave the file there between runs of your program (creating it if it doesn't already exist of course). Then in your startup code for Redis, just initialize the file (pointers and such). This may get rid of any problems like those in point 4 above.
Ask on the various BSD sites as well. I'm not sure how much Apple changed under the covers but OSX is just BSD at the lowest level (Pax ducks for cover).
Also consider asking on the Apple sites (if you haven't already done so).
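The "write to every byte" pre-allocation suggested in the list above could look roughly like this. The chunk size and the function name are assumptions for the sketch:

```c
#include <assert.h>
#include <stdio.h>

#define CHUNK 65536  /* assumed write size; tune for the target filesystem */

/* Fill `fp` with `size` zero bytes from the start, so that every block is
 * really allocated before normal operation begins. Returns 0 on success. */
int preallocate_by_writing(FILE *fp, unsigned long long size)
{
    assert(fp != NULL);
    static const char zeros[CHUNK];   /* static storage is zero-initialized */
    rewind(fp);
    while (size > 0) {
        size_t n = size < CHUNK ? (size_t)size : CHUNK;
        if (fwrite(zeros, 1, n, fp) != n)
            return -1;
        size -= n;
    }
    return fflush(fp);
}
```

For a 26 GB file this takes a while, which is why keeping the file around between runs, as suggested above, would pay off: the cost is paid once rather than at every startup.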
Well, that's my small contribution, hopefully it'll help. Good luck with your project.
Have you turned off file caching for your file? i.e. fcntl(fd, F_GLOBAL_NOCACHE, 1)
Have you tried debugging with DTrace and/or Instruments (Apple's experimental DTrace front-end)?
Exploring Leopard with DTrace
Debugging Chrome on OS X
As Linus said once on the Git mailing list:
"I realize that OS X people have a hard time accepting it, but OS X
filesystems are generally total and utter crap - even more so than
