I have a small amount of sensitive data (less than 1K) on flash memory which I would like to protect against some forms of data loss. Most notably, I would like to make sure that the data survives if the flash block it resides on fails.
The obvious answer is to have a backup of the file. Then all I need is to ensure somehow that the two files are located on different blocks. Is there a way to do this?
I'm mostly interested in having this work on Linux, so I'm looking for either a Linux-specific solution, or if there isn't any, a file system specific solution will do too.
EDIT: I'm also open to other approaches of protecting against flash block failure.
The easiest way is to create an extra partition on this memory and put the file there. I would avoid a filesystem-based solution: most filesystem damage starts with the directory structure. And don't forget about the wear-leveling controller; you can't be 100% sure where your data actually is.
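To illustrate the raw-partition idea, here is a minimal sketch, assuming a dedicated partition such as /dev/mmcblk0p3 (the device name and the 128 KiB spacing are illustrative assumptions, not from this answer). It writes two copies of the record at widely separated offsets so that, on raw flash, they are unlikely to share an erase block; as noted above, a wear-leveling controller can still remap both copies wherever it likes.

/* Sketch only: store two copies of a small record on a raw partition,
 * spaced far enough apart that they should not share an erase block.
 * /dev/mmcblk0p3 and COPY_SPACING are assumptions; check your device's
 * erase-block size. */
#include <fcntl.h>
#include <unistd.h>

#define COPY_SPACING (128 * 1024)   /* assumed >= erase-block size */

int save_record(const void *data, size_t len)
{
    int fd = open("/dev/mmcblk0p3", O_WRONLY | O_SYNC);
    if (fd < 0)
        return -1;
    int rc = 0;
    for (int copy = 0; copy < 2; copy++) {
        off_t off = (off_t)copy * COPY_SPACING;
        if (pwrite(fd, data, len, off) != (ssize_t)len)
            rc = -1;                /* keep going: the other copy may still succeed */
    }
    close(fd);
    return rc;
}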
The best solution I can come up with is to put a write counter and an (optional) CRC on each page, and increment the counter on every write. You can allocate as many pages as you want (2-8?). You always overwrite the page with the lowest counter; if that write fails (or its CRC check fails), you overwrite the page with the next-lowest counter instead.
When booting, the app only needs to find the page with the highest counter and an intact CRC, and carry on from there.
Pages should be a multiple of the sector size of your memory (1K or larger). Check the specs.
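A minimal sketch of that scheme, assuming four 1 KiB slots; the read_page()/write_page()/crc32() helpers are hypothetical wrappers around whatever flash driver you have.

/* Sketch of the counter+CRC rotation described above. Slot count, page
 * size and the I/O helpers are illustrative assumptions. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 1024
#define NUM_SLOTS 4
#define DATA_SIZE (PAGE_SIZE - 2 * sizeof(uint32_t))

struct slot {
    uint32_t counter;                 /* incremented on every write */
    uint32_t crc;                     /* CRC over data[] */
    uint8_t  data[DATA_SIZE];
};

/* Assumed to exist elsewhere: raw page I/O and a CRC32 routine. */
extern int read_page(int idx, struct slot *out);
extern int write_page(int idx, const struct slot *in);
extern uint32_t crc32(const void *buf, size_t len);

/* Boot: return the index of the newest intact slot, or -1 if none. */
int find_newest(struct slot *best)
{
    int best_idx = -1;
    for (int i = 0; i < NUM_SLOTS; i++) {
        struct slot s;
        if (read_page(i, &s) != 0 || crc32(s.data, DATA_SIZE) != s.crc)
            continue;
        if (best_idx < 0 || s.counter > best->counter) {
            *best = s;
            best_idx = i;
        }
    }
    return best_idx;
}

/* Save: overwrite the slot with the lowest counter; if that write fails,
 * move on to the slot with the next-lowest counter. */
int save(const uint8_t *data, size_t len)
{
    if (len > DATA_SIZE)
        return -1;

    uint32_t counters[NUM_SLOTS];
    uint32_t newest = 0;
    for (int i = 0; i < NUM_SLOTS; i++) {
        struct slot s;
        counters[i] = 0;              /* invalid slots get overwritten first */
        if (read_page(i, &s) == 0 && crc32(s.data, DATA_SIZE) == s.crc) {
            counters[i] = s.counter;
            if (s.counter > newest)
                newest = s.counter;
        }
    }

    struct slot out;
    memset(&out, 0xFF, sizeof(out));
    memcpy(out.data, data, len);
    out.counter = newest + 1;
    out.crc = crc32(out.data, DATA_SIZE);

    for (int tries = 0; tries < NUM_SLOTS; tries++) {
        int victim = 0;
        for (int i = 1; i < NUM_SLOTS; i++)
            if (counters[i] < counters[victim])
                victim = i;
        if (write_page(victim, &out) == 0)
            return 0;
        counters[victim] = UINT32_MAX; /* failed: never pick this slot again */
    }
    return -1;
}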
Related
Does the FTL have private storage space that is not flash?
If not, how does the FTL store its metadata while still wear-leveling the flash?
Actually, I don't know whether there is a superblock in the FTL, but if you want to locate the mapping data and the bad-block information, whose contents change frequently, some fixed physical address may be needed. The content at that physical address would then change frequently, so how do you avoid wearing out that address?
There are many possible solutions to this problem and it's very intertwined with the data representation that the drive uses to store its data, so I'm sure it differs a lot based on the drive / manufacturer. I'll just outline a general approach that could work.
Let's say you design an FTL that maintains several fixed-size, append-only "logs", and for simplicity we always have one "active" log that all writes are appended to. If the user is issuing random writes, the order of LBAs in the active log will be random too. When the active log fills all the space allocated to it, it gets "frozen" and we switch the active log to some empty log elsewhere in the flash. As the data in the frozen log becomes stale, we will eventually need to garbage collect it by copying any still-referenced blocks to a different log before erasing the original so that it can be reused for new writes.
Now, for each write to a log, nothing in our interface so far requires that the blocks be exactly 4KiB (or whatever), so you could append a small header to the data that tells you what its LBA is, and perhaps some other metadata -- write sequence number so you can tell if it's the most recent copy of a block, and maybe a checksum for read integrity checking. When a write finishes, you update an in-RAM copy of the map with the new location for the LBAs that were updated (RAM inside the SSD, not RAM for the main CPU of the computer obviously).
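To make the per-write header concrete, here is a minimal sketch; all field names and widths are my own illustrative assumptions, not something taken from a real drive.

/* Sketch of a log entry header and the in-RAM map entry it feeds.
 * Field names and sizes are assumptions for illustration only. */
#include <stdint.h>

struct log_entry_header {
    uint64_t lba;        /* logical block address the payload belongs to */
    uint64_t seq;        /* global write sequence number; highest seq wins */
    uint32_t checksum;   /* covers the payload, for read integrity checks */
    uint32_t data_len;   /* payload length, e.g. 4096 */
} __attribute__((packed));

/* In-RAM forward map: LBA -> physical location of its newest copy. */
struct map_entry {
    uint32_t log_id;     /* which log holds the newest copy */
    uint32_t offset;     /* entry offset within that log */
};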
If the FTL crashes or loses power, you can reconstruct the map by reading all headers from all the logs. The downside is that scanning the logs will scale O(number of logs * number of blocks per log), so you optimize that somehow:
you could write the headers to a separate part of the disk by themselves so that you can scan them without also reading the user data (same big-O runtime but a lot faster in practice)
you could periodically flush the in-RAM copy of the map to flash somewhere, along with the latest IO sequence number, so that you only have to read the parts of the logs that were written since the latest map flush
How do you find the portion of the log to start scanning from? Do a binary search on the IO sequence numbers in the log headers (sketched below). So the boot runtime is now O(number of logs * (log_2(number of blocks per log) + number of blocks that need to be scanned)).
How do you know when to stop scanning? Either you recognize that the block you read is all 1s because that part of the log hasn't been written to yet, or you recognize that the checksum and data don't match.
Minor optimization: during a clean shutdown, always write the map to flash, so that this binary search + scanning only needs to happen if there's a crash or unclean shutdown.
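Here is a minimal sketch of that binary search over one log; read_entry_seq() is an assumed helper that returns false for an erased or corrupt slot and otherwise reports the entry's sequence number.

/* Sketch of the recovery step above: binary-search one log for the first
 * entry written after the last map flush, then scan forward from there. */
#include <stdbool.h>
#include <stdint.h>

extern bool read_entry_seq(uint32_t log_id, uint32_t idx, uint64_t *seq_out);

/* Returns the index of the first entry that must be replayed into the map,
 * or entries_per_log if the whole log is already covered by the flush. */
uint32_t find_scan_start(uint32_t log_id, uint64_t last_flushed_seq,
                         uint32_t entries_per_log)
{
    uint32_t lo = 0, hi = entries_per_log;
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        uint64_t seq;
        if (!read_entry_seq(log_id, mid, &seq))
            hi = mid;            /* erased or corrupt: valid data ends earlier */
        else if (seq <= last_flushed_seq)
            lo = mid + 1;        /* already reflected in the flushed map */
        else
            hi = mid;            /* newer than the flush: replay from here */
    }
    return lo;
}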
So far, this lowers how often you need to write the map by a lot, but it's still probably too often to overwrite it to a fixed location for a drive with a very long lifetime. To resolve that, we have to cycle where we write the map:
The simplest solution would be to designate a small set of X special logs to store all map data and write to them like a circular buffer, where X is chosen to make the map updates last the expected lifetime of the device. To find the most recent map in the log on boot, you'd do binary search within those logs to find the last one that was written. So boot = O(X * log_2(number of maps per log) + runtime to scan the other logs if unclean shutdown).
Probably a more optimal solution (but one that might be more complicated), would include the map writes directly into the logs where the updates are happening. Then you need some way to find where the maps are at boot time -- the most obvious way to do that would be to write the map into the beginning of each active log, or you could allow arbitrary map writes by adding backpointers into the block headers that point back to the latest map in their log.
Another aspect of this is that full map flushes could be expensive, which would add tail latency if it ever interferes with the performance of user IOs -- would it be better to allow incremental updates? That's when you start looking at using something like a log-structured merge (LSM) tree to store your map, so that each incremental write is pretty small and you can amortize the full map write cost.
Obviously there are a bunch of tiny details that this explanation leaves out, but hopefully that's enough to get you started. :-)
I would like to be able to zero out a range of a file memory-mapping without invoking any io (in order to efficiently sequentially overwrite huge files without incurring any disk read io).
Doing std::memset(ptr, 0, length) will cause pages to be read from disk if they are not already in memory even if the entire pages are overwritten thus totally trashing disk performance.
I would like to be able to do something like madvise(ptr, length, MADV_ZERO) which would zero out the range (similar to FALLOC_FL_ZERO_RANGE) in order to cause zero fill page faults instead of regular io page faults when accessing the specified range.
Unfortunately, MADV_ZERO does not exist. The corresponding flag FALLOC_FL_ZERO_RANGE does exist in fallocate and can be used with fwrite to achieve a similar effect, though without instant cross-process coherency.
One possible alternative I would guess is to use MADV_REMOVE. However, from my understanding, that can cause file fragmentation, and it also blocks other operations while completing, which makes me unsure of its long-term performance implications. My experience with Windows is that the similar FSCTL_SET_ZERO_DATA command can incur significant performance spikes when invoked.
My question is how one could implement or emulate MADV_ZERO for shared mappings, preferably in user mode?
1. /dev/zero
I have read suggestions to simply read /dev/zero into the selected range, though I am not quite sure what "reading into the range" means or how to do it. Is it like a fread from /dev/zero into the memory range? I am also not sure how that would avoid a regular page fault on access.
For Linux, simply read /dev/zero into the selected range. The
kernel already optimises this case for anonymous mappings.
If doing it in general turns out to be too hard to implement, I
propose MADV_ZERO should have this effect: exactly like reading
/dev/zero into the range, but always efficient.
EDIT: Following the thread a bit further it turns out that it will actually not work.
It does not do tricks when you are dealing with a shared mapping.
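For reference, "reading /dev/zero into the range" would look roughly like the sketch below. Per the thread above, the kernel only optimises this for anonymous mappings, so for a shared file mapping the target pages would still be faulted in from disk before being overwritten.

/* What "reading /dev/zero into the range" looks like in practice. Note:
 * per the thread above, this does not help for shared file mappings. */
#include <fcntl.h>
#include <unistd.h>

int zero_range_via_dev_zero(void *ptr, size_t length)
{
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0)
        return -1;
    size_t done = 0;
    while (done < length) {
        ssize_t n = read(fd, (char *)ptr + done, length - done);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }
    close(fd);
    return 0;
}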
2. MADV_REMOVE
One guess at implementing it in Linux (i.e. not in the user application, which is what I would prefer) would be to copy and modify the MADV_REMOVE implementation, i.e. madvise_remove, to use FALLOC_FL_ZERO_RANGE instead of FALLOC_FL_PUNCH_HOLE. Though I am a bit over my head in guessing this, especially as I don't quite understand what the code around the vfs_fallocate call is doing:
// madvise.c
static long madvise_remove(...)
...
	/*
	 * Filesystem's fallocate may need to take i_mutex. We need to
	 * explicitly grab a reference because the vma (and hence the
	 * vma's reference to the file) can go away as soon as we drop
	 * mmap_sem.
	 */
	get_file(f); // Increment ref count.
	up_read(&current->mm->mmap_sem); // Release a read lock? Why?
	error = vfs_fallocate(f,
				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, // FALLOC_FL_ZERO_RANGE?
				offset, end - start);
	fput(f); // Decrement ref count.
	down_read(&current->mm->mmap_sem); // Acquire read lock. Why?
	return error;
}
You probably cannot do what you want (in user space, without hacking the kernel). Notice that writing zero pages might not incur physical disk IO because of the page cache.
You might want to replace a file segment with a file hole (but this is not exactly what you want) in a sparse file; some file systems (e.g. VFAT) don't have holes or sparse files. See lseek(2) with SEEK_HOLE and ftruncate(2).
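For completeness, hole punching (what madvise_remove does internally, as quoted above) can also be requested directly on the file descriptor; a minimal sketch follows. This deallocates the range rather than merely zeroing it, and it needs a file system that supports holes.

#define _GNU_SOURCE
#include <fcntl.h>

/* Punch a hole over [offset, offset+len) so the range reads back as zeros
 * without writing zero blocks. Supported on ext4, XFS, tmpfs and others,
 * but not on VFAT. */
int punch_hole(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}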
If not using mmap(), it seems like there should be a way to give certain files "priority", so that the only time they're swapped out is for page faults trying to bring in, e.g., executing code, or memory that was malloc()'d by some process, but never other files. One can think of situations where this could be useful. Consider search engines, which should keep their index files in cache, but which may be simultaneously writing new files (not being used for search).
There are a few ways.
The best way is with madvise(): you can inform the kernel that you will need a particular range of memory soon (MADV_WILLNEED), which gives it priority over other memory. You can also use it to say that a particular range will not be needed soon (MADV_DONTNEED), so it should be reclaimed sooner. A short sketch of madvise() and mlock() usage follows at the end of this answer.
The hack way is with mlock(), which forces a range of memory to stay in RAM. This is generally not a good idea, and should only be used in special cases. The most common case is to store passwords in RAM so that the password cannot be recovered from the swap file after the computer is powered off. I would not use mlock() for performance tuning unless I had exhausted other options.
The worst way is to constantly poke memory, forcing it to stay fresh.
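A minimal sketch of the first two approaches on an mmap'd index file; whether mlock() is acceptable depends on RLIMIT_MEMLOCK and on how well you can bound the locked size.

#define _DEFAULT_SOURCE   /* for madvise() under strict -std= modes */
#include <sys/mman.h>

/* Keep a mapped range hot: madvise() is a hint the kernel may ignore,
 * mlock() is a hard pin that counts against RLIMIT_MEMLOCK. */
int keep_index_hot(void *addr, size_t len, int pin)
{
    if (pin)
        return mlock(addr, len);
    return madvise(addr, len, MADV_WILLNEED);
}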
I'm working on a C++ application in an embedded system running Linux. This device receives messages (small chunks of a few bytes) which need to be stored in non-volatile memory in case of power failure. This worked well on another platform because static RAM was available.
The problem on this platform is that we only have NAND flash, and we would like to append different messages to the same block without having to erase the whole block before each new message. Writing one file per message is not a good solution because there can be a lot of them. Moreover, this must be efficient and should spare the flash by avoiding too many erases. What I would like is to be able to write byte after byte into the flash without worrying about bad blocks.
I found "Petit FAT File System" and I'm wondering if this would suit my needs?
Could someone tell me if this is possible with "Petit FAT File System" or give me any suggestion on how to handle this ?
Thanks !
I haven't looked into the Petit file system, but your real limitation is the NAND flash itself. The manufacturer's data sheet will likely indicate how many writes you can successfully make to each block before an erase is required. It's possible that there is no hard limit, but the integrity of the data will not be guaranteed beyond that maximum write count.
The answer depends on the process technology and flash cell design. For example, is it SLC or MLC NAND? SLC is going to be able to handle multiple block writes better.
Another question would be what type of flash controller is on your system? If it uses hardware ECC, then you might be limited by the controller, since 2nd writes will invalidate the ECC value of the 1st data write. If it is possible that you can do ECC calculations in software, then it comes back to the NAND limitation.
Small write support might be addressed in the data sheet, via a special set aside memory area that might be provided. So again, check the data sheet.
If you post a link, or indicate what hardware you are using, I can try and give you a more definite answer.
If you are dealing with flash, there's no way around erasing it before writing; all flash memory works that way. Depending on your real-time requirements and the size of the data, this may or may not be an issue. But since you are using embedded Linux, real-time is probably not a major concern for the application anyhow.
I don't see why you would need a complete file system to store a few bytes. Why do you need an external memory for this in the first place; can't you write to the internal flash of the MCU? If you just need to store a few bytes, an MCU with on-chip EEPROM/data flash would likely suit your needs best.
Also, that flash circuit doesn't look too promising. First, I find it mighty fishy that they don't spell out the number of cycles or the data retention, but refer to the "qualification report". This might indicate that the memory is of poor quality.
And the data sheet says year 2009 and Samsung. If I may be cynical, that probably means that the chip is already obsolete. Samsung doesn't exactly have the best long-life reputation.
I'm curious why you want to use raw flash. Why not use something like JFFS2 or UBIFS on top of the MTD device? Let the MTD driver manage the ECC while JFFS2 or UBIFS manages the wear-leveling. Then just open one file and append to it whenever you need to.
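A minimal sketch of that approach, assuming the flash is already mounted somewhere like /mnt/flash (the mount point and file name are illustrative, not prescribed):

#include <fcntl.h>
#include <unistd.h>

/* Append one message to a single log file on a JFFS2/UBIFS mount and
 * flush it so it survives a power failure. */
int append_message(const void *msg, size_t len)
{
    int fd = open("/mnt/flash/messages.log",
                  O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, msg, len);
    int rc = (n == (ssize_t)len && fsync(fd) == 0) ? 0 : -1;
    close(fd);
    return rc;
}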
I am considering using a FAT file system for an embedded data-logging application. The logger will only create one file, to which it continually appends 40 bytes of data every minute. After a couple of years of use this would be over one million write cycles. MY QUESTION IS: does a FAT file system change the File Allocation Table every time a file is appended to? How does it keep track of where the end of the file is? Does it just put an end-of-file marker at the end, or does it store the length in the FAT table? If it does change the FAT table on every write, I would wear out the flash memory in just a couple of years. Is a FAT system the right thing to use for this application?
My other thought is that I could just store the raw data bytes on the memory card and put an end-of-file marker at the end of my data every time I do a write. This is less desirable, though, because it means the only way of getting data out of the logger is through serial transfers rather than via a PC and a card reader.
FAT updates the directory table when you modify the file (at least, it will if you close the file, I'm not sure what happens if you don't). It's not just the file size, it's also the last-modified date:
http://en.wikipedia.org/wiki/File_Allocation_Table#Directory_table
If your flash controller doesn't do transparent wear levelling, and your flash driver doesn't relocate things in an effort to level wear, then I guess you could cause wear. Consult your manual, but if you're using consumer hardware I would have thought that everything has wear-levelling somewhere.
On the plus side, if the event you're worried about only occurs once a minute, then you should be able to speed that up considerably in a test and see whether two years' worth of log entries really does trash your actual hardware. It might even be faster than trying to find the relevant manufacturer docs...
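Something like the sketch below would replay roughly two years of one-minute appends (about one million 40-byte records) as fast as the card allows. The target path is an assumption, and re-opening the file for every record is deliberate, to mimic the logger's per-write FAT/directory updates.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char record[40];
    memset(record, 'x', sizeof(record));

    for (long i = 0; i < 1000000L; i++) {
        /* Re-open, append and close each time, like the real logger would. */
        int fd = open("/mnt/sdcard/LOG.BIN",
                      O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return 1;
        if (write(fd, record, sizeof(record)) != (ssize_t)sizeof(record))
            return 1;
        fsync(fd);
        close(fd);
    }
    return 0;
}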
No, a flash file system driver is explicitly designed to minimize wear and to spread it across the memory cells, taking advantage of the near-zero seek time. Your data rates are low, so it's going to last a long time. Specifying a yearly replacement of the media is a simple way to minimize the risk.
If your only operation is appending to one file it may be simpler to forgo a filesystem and use the flash device as a data tape. You have to take into account the type of flash and its block size, though.
Large flash chips are divided into sub-pages that are a power-of-two multiple of 264 (256+8) bytes in size, pages that are a power-of-two multiple of that, and blocks which are a power-of-two multiple of that. A blank page will read as all FF's. One can write a page at a time; the smallest unit one can write is a sub-page. Once a sub-page is written, it may not be rewritten until the entire block containing it is erased. Note that on smaller flash chips, it's possible to write the bytes of a page individually, provided one only writes to blank bytes, but on many larger chips that is not possible. I think in present-generation chips, the sub-page size is 528 bytes, the page size is 2048+64 bytes, and the block size is 128K+4096 bytes.
An MMC, SD, CompactFlash, or other such card (basically anything other than SmartMedia) combines a flash chip with a processor to handle PC-style sector writes. Essentially what happens is that when a sector is written, the controller locates a blank page, writes a new version of that sector there along with up to 16 bytes of 'header' information indicating what sector it is, etc. The controller then keeps a map of where all the different pages of information are located.
A SmartMedia card exposes the flash interface directly, and relies upon the camera, card reader, or other device using it to perform such data management according to standard methods.
Note that keeping track of the whereabouts of all 4,000,000 pages on a 2 GB card (at roughly 3-4 bytes of mapping data per page) would require either having 12-16 MB of RAM, or else using 12-16 MB of flash as a secondary lookup table. Using the latter approach would mean that every write to a flash page would also require a write to the lookup table. I wouldn't be at all surprised if slower flash devices use such an approach (so as to only have to track the whereabouts of about 16,000 'indirect' pages).
In any case, the most important observation is that flash write times are not predictable, but you shouldn't normally have to worry about flash wear.
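If you do go the raw "data tape" route from the first paragraph of this answer, a sketch might look like the following. The device path, the write-unit size, and the assumption that erase scheduling and bad-block handling happen elsewhere are all illustrative; note also that a record which is legitimately all 0xFF would be mistaken for erased space here.

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define WRITE_UNIT 2048          /* assumed NAND page size: check the data sheet */

static int slot_is_erased(const uint8_t *buf)
{
    for (size_t i = 0; i < WRITE_UNIT; i++)
        if (buf[i] != 0xFF)
            return 0;
    return 1;
}

/* Append one record, padded to WRITE_UNIT, at the first erased slot. */
int tape_append(const char *dev, const void *rec, size_t len)
{
    if (len > WRITE_UNIT)
        return -1;
    int fd = open(dev, O_RDWR | O_SYNC);
    if (fd < 0)
        return -1;

    uint8_t buf[WRITE_UNIT];
    off_t off = 0;
    while (pread(fd, buf, WRITE_UNIT, off) == WRITE_UNIT) {
        if (slot_is_erased(buf))
            break;               /* found the current end of the "tape" */
        off += WRITE_UNIT;
    }

    memset(buf, 0xFF, WRITE_UNIT);       /* unwritten bytes stay erased */
    memcpy(buf, rec, len);
    int rc = (pwrite(fd, buf, WRITE_UNIT, off) == WRITE_UNIT) ? 0 : -1;
    close(fd);
    return rc;
}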
Did you check what happens to the FAT file system's consistency in the case of a power failure or reset of your device?
When your device experiences such a failure, you may lose at most the log entry you were just writing; older entries must stay valid.
No, FAT is not the right thing if you need to be able to read the data back after such a failure.
You should further consider what happens when the flash memory is filled with data. How do you get space for new data? You need to define the requirements for this point.