How does Flash Translation Layer store mapping data, unusable block and super block? - filesystems

Does ftl have private storage space that is not flash?
If not, how does ftl store those meta data while avoiding wear leveling.
Actually I don’t know if there is a super block in ftl, but if you want to locate the mapping data and unusable block whose physical address changes frequently, a certain physical address may be needed. The content on this physical address must change frequently, how To avoid wear leveling of this physical address?

There are many possible solutions to this problem and it's very intertwined with the data representation that the drive uses to store its data, so I'm sure it differs a lot based on the drive / manufacturer. I'll just outline a general approach that could work.
Let's say you design an FTL that maintains several fixed-size, append-only "logs", and for simplicity we always have one "active" log that all writes are appended to. If the user is issuing random writes, the order of LBAs in the active log will be random too. When the active log fills all the space allocated to it, it gets "frozen" and we switch the active log to some empty log elsewhere in the flash. As the data in the frozen log becomes stale, we will eventually need to garbage collect it by copying any still-referenced blocks to a different log before erasing the original so that it can be reused for new writes.
Now, for each write to a log, nothing in our interface so far requires that the blocks be exactly 4KiB (or whatever), so you could append a small header to the data that tells you what its LBA is, and perhaps some other metadata -- write sequence number so you can tell if it's the most recent copy of a block, and maybe a checksum for read integrity checking. When a write finishes, you update an in-RAM copy of the map with the new location for the LBAs that were updated (RAM inside the SSD, not RAM for the main CPU of the computer obviously).
If the FTL crashes or loses power, you can reconstruct the map by reading all headers from all the logs. The downside is that scanning the logs will scale O(number of logs * number of blocks per log), so you optimize that somehow:
you could write the headers to a separate part of the disk by themselves so that you can scan them without also reading the user data (same big-O runtime but a lot faster in practice)
you could periodically flush the in-RAM copy of the map to flash somewhere, along with the latest IO sequence number, so that you only have to read the parts of the logs that were written since the latest map flush
How do you find the portion of the log to start scanning from? do a binary search on the IO sequence numbers in the log headers. So the boot runtime is now O(number of logs * (log_2(number of blocks per log) + number of blocks that need to be scanned))
How do you know when to stop scanning? either you recognize that all data in the block you read is 1's because that part of the log hasn't been written to yet, or you recognize that the checksum and data don't match.
Minor optimization: during a clean shutdown, always write the map to flash, so that this binary search + scanning only needs to happen if there's a crash or unclean shutdown.
So far, this lowers how often you need to write the map by a lot, but it's still probably too often to overwrite it to a fixed location for a drive with a very long lifetime. To resolve that, we have to cycle where we write the map:
The simplest solution would be to designate a small set of X special logs to store all map data and write to them like a circular buffer, where X is chosen to make the map updates last the expected lifetime of the device. To find the most recent map in the log on boot, you'd do binary search within those logs to find the last one that was written. So boot = O(X * log_2(number of maps per log) + runtime to scan the other logs if unclean shutdown).
Probably a more optimal solution (but one that might be more complicated), would include the map writes directly into the logs where the updates are happening. Then you need some way to find where the maps are at boot time -- the most obvious way to do that would be to write the map into the beginning of each active log, or you could allow arbitrary map writes by adding backpointers into the block headers that point back to the latest map in their log.
Another aspect of this is that full map flushes could be expensive, which would add tail latency if it ever interferes with the performance of user IOs -- would it be better to allow incremental updates? That's when you start looking at using something like a log-structured merge (LSM) tree to store your map, so that each incremental write is pretty small and you can amortize the full map write cost.
Obviously there are a bunch of tiny details that this explanation leaves out, but hopefully that's enough to get you started. :-)

Related

How to protect against flash block failure?

I have a small amount of sensitive data (less than 1K) on flash memory which I would like to protect against some forms of data loss. Most notably, I would like to make sure that the data survives if the flash block it resides on fails.
The obvious answer is to have a backup of the file. Then all I need is to ensure somehow that the two files are located on different blocks. Is there a way to do this?
I'm mostly interested in having this work on Linux, so I'm looking for either a Linux-specific solution, or if there isn't any, a file system specific solution will do too.
EDIT: I'm also open to other approaches of protecting against flash block failure.
The easiest way is to create extra partition on this memory and put file there. I would avoid filesystem solution - most filesystem damages start with directory structure. And don't forget about wear leveling controller - you can't be 100% sure, where your data acutally is.
Best solution I can figure out is to put a write counter and CRC (optional) on each page, and increment counter on each write. You can allocate as many pages as you want (2-8?). You overwrite the page with the lowest counter. If a page write fails (and CRC fails?), overwrite over the next lowest number page.
When booting, the app only needs to find the page with the highest block number and intact CRC, and carry on from there.
Pages should be multiples of 1K per sector size of your memory. Check the specs.

what is a sequential write and what is random write

I want to know what exactly is sequential write and what is random write in definition. I will be even more helpful with example. I tried to google the result. But not much google explanation.
Thanks
When you write two blocks that are next to each-other on disk, you have a sequential write.
When you write two blocks that are located far away from eachother on disk, you have random writes.
With a spinning hard disk, the second pattern is much slower (can be magnitudes), because the head has to be moved around to the new position.
Database technology is (or has been, maybe not that important with SSD anymore) to a large part about optimizing disk access patterns. So what you often see, for example, is trading direct updates of data in their on-disk location (random access) versus writing to a transaction log (sequential access). Makes it more complicated and time-consuming to reconstruct the actual value, but makes for much faster commits (and you have checkpoints to eventually consolidate the logs that build up).

How to prevent C read() from reading from cache

I have a program that is used to exercise several disk units in a raid configuration. 1 process synchronously (O_SYNC) writes random data to a file using write(). It then puts the name of the directory into a shared-memory queue, where a 2nd process is waiting for the queue to have entries to read the data back into memory using read().
The problem that I can't seem to overcome is that when the 2nd process attempts to read the data back into memory, none of the disk units show read accesses. The program has code to check whether or not the data read back in is equal to the code that is written to disk, and the data always matches.
My question is, how can I make the OS (IBM i) not buffer the data when it is written to disk so that the read() system call accesses the data on the disk rather than in cache? I am doing simple throughput calculations and the read() operations are always 10+ times faster than the write operations.
I have tried using the O_DIRECT flag, but cannot seem to get the data to write to the file. It could have to do with setting up the correct aligned buffers. I have also tried the posix_fadvise(fd, offset,len, POSIX_FADV_DONTNEED) system call.
I have read through this similar question but haven't found a solution. I can provide code if it would be helpful.
My though is that if you write ENOUGH data, then there simply won't be enough memory to cache it, and thus SOME data must be written to disk.
You can also, if you want to make sure that small writes to your file works, try writing ANOTHER large file (either from the same process or a different one - for example, you could start a process like dd if=/dev/zero of=myfile.dat bs=4k count=some_large_number) to force other data to fill the cache.
Another "trick" may be to "chew up" some (more like most) of the RAM in the system - just allocate a large lump of memory, then write to some small part of it at a time - for example, an array of integers, where you write to every 256th entry of the array in a loop, moving to one step forward each time - that way, you walk through ALL of the memory quickly, and since you are writing continuously to all of it, the memory will have to be resident. [I used this technique to simulate a "busy" virtual machine when running VM tests].
The other option is of course to nobble the caching system itself in OS/filesystem driver, but I would be very worried about doing that - it will almost certainly slow the system down to a slow crawl, and unless there is an existing option to disable it, you may find it hard to do accurately/correctly/reliably.
...exercise several disk units in a raid configuration... How? IBM i doesn't allow a program access to the hardware. How are you directing I/O to any specific physical disks?
ANSWER: The write/read operations are done in parallel against IFS so the stream file manager is selecting which disks to target. By having enough threads reading/writing, the busyness of SYSBASE or an IASP can be driven up.
...none of the disk units show read accesses. None of them? Unless you are running the sole job on a system in restricted state, there is going to be read activity on the disks from other tasks. Is the system divided into multiple LPARs? Multiple ASPs? I'm suggesting that you may be monitoring disks that this program isn't writing to, because IBM i handles physical I/O, not programs.
ANSWER I guess none of them is a slight exaggeration - I know which disks belong to SYSBASE and those disks are not being targeted with many read requests. I was just trying to generalize for an audience not familiar w/IBM i. In the picture below, you will see that the write reqs are driving the % busyness up, but the read reqs are not even though they are targeting the same files.
...how can I make the OS (IBM i) not buffer the data when it is written to disk... Use a memory starved main storage pool to maximise paging, write immense blocks of data so as to guarantee that the system and disk controller caches overflow and use a busy machine so that other tasks are demanding disk I/O as well.

How to write few bytes to a file in NAND without continuously trashing pages?

I have a persistent dictionary (string -> string map) that I write to a NAND filesystem as a log file. My binary format is:
1 byte - key size
<key size> bytes - key without the terminating nul-character
2 bytes - value size and flags
<value size> bytes - value without terminating nul-character
Usually, I write between 20 to 100 bytes with each change in the dictionary. But since this is been persisted on NAND this design causes a new page to be allocated each time, reducing my available NAND space by 2k on each iteration.
I can't cache the writes because I can't lose this data. It's running point of sales terminals, saving transaction data. My algorithm effectively writes and call flush, so this doesn't happen.
Now, is there a de-facto or standard algorithm to solve this problem? I've looked at power-down triggers on the target platform, but the SDK suggests using the NAND as the persistent layer.
Well ... Just keep track of where in the page you are, and before writing, read in the last page, append your new data in memory, and write the same page back.
If the new data broke the page limit, start on a new page. This will happen if the new data doesn't fit in the remaining space for the last page written. When done, update the information about where the last write happened.
You say that your appends "cause a new page to be allocated each time", I'm not sure why that would be the case if you're appending to a byte sequence.
Of course, this requires page-level access to the memory. If all you have is file-level (like C's FILE * functions) access, you can't do this unless you know that your file layer is somewhat clever.
Since the NAND FLASH is limited to page-granularity on writes, I'm out of luck.
I could rely on some external persistent device, or on-shutdown routines, but I don't have access to any other secondary memory and old point of sale terminals even have capacitors for last resort routines, but they aren't reliable enough for transaction data.
I'll have to live with the FLASH limitation.

FAT File system and lots of writes

I am considering using a FAT file system for an embedded data logging application. The logger will only create one file to which it continually appends 40 bytes of data every minute. After a couple years of use this would be over one million write cycles. MY QUESTION IS: Does a FAT system change the File Allocation Table every time a file is appended? How does it keep track where the end of the file is? Does it just put an EndOfFile marker at the end or does it store the length in the FAT table? If it does change the FAT table every time I do a write, I would ware out the FLASH memory in just a couple of years. Is a FAT system the right thing to use for this application?
My other thought is that I could just store the raw data bytes in the memory card and put an EndOfFile marker at the end of my data every time I do a write. This is less desirable though because it means the only way of getting data out of the logger is through serial transfers and not via a PC and a card reader.
FAT updates the directory table when you modify the file (at least, it will if you close the file, I'm not sure what happens if you don't). It's not just the file size, it's also the last-modified date:
http://en.wikipedia.org/wiki/File_Allocation_Table#Directory_table
If your flash controller doesn't do transparent wear levelling, and your flash driver doesn't relocate things in an effort to level wear, then I guess you could cause wear. Consult your manual, but if you're using consumer hardware I would have thought that everything has wear-levelling somewhere.
On the plus side, if the event you're worried about only occurs every minute, then you should be able to speed that up considerably in a test to see whether 2 years worth of log entries really does trash your actual hardware. Might even be faster than trying to find the relevant manufacturer docs...
No, a flash file system driver is explicitly designed to minimize the wear and spread it across the memory cells. Taking advantage of the near-zero seek time. Your data rates are low, it's going to last a long time. Specifying a yearly replacement of the media is a simple way to minimize the risk.
If your only operation is appending to one file it may be simpler to forgo a filesystem and use the flash device as a data tape. You have to take into account the type of flash and its block size, though.
Large flash chips are divided into sub-pages that are a power-of-two multiple of 264 (256+8) bytes in size, pages that are a power-of-two multiple of that, and blocks which are a power-of-two multiple of that. A blank page will read as all FF's. One can write a page at a time; the smallest unit one can write is a sub-page. Once a sub-page is written, it may not be rewritten until the entire block containing it is erased. Note that on smaller flash chips, it's possible to write the bytes of a page individually, provided one only writes to blank bytes, but on many larger chips that is not possible. I think in present-generation chips, the sub-page size is 528 bytes, the page size is 2048+64 bytes, and the block size is 128K+4096 bytes.
An MMC, SD, CompactFlash, or other such card (basically anything other than SmartMedia) combines a flash chip with a processor to handle PC-style sector writes. Essentially what happens is that when a sector is written, the controller locates a blank page, writes a new version of that sector there along with up to 16 bytes of 'header' information indicating what sector it is, etc. The controller then keeps a map of where all the different pages of information are located.
A SmartMedia card exposes the flash interface directly, and relies upon the camera, card reader, or other device using it to perform such data management according to standard methods.
Note that keeping track of the whereabouts of all 4,000,000 pages on a 2 gig card would require either having 12-16 megs of RAM, or else using 12-16 meg of flash as a secondary lookup table. Using the latter approach would mean that every write to a flash page would also require a write to the lookup table. I wouldn't be at all surprised if slower flash devices use such an approach (so as to only have to track the whereabouts of about 16,000 'indirect' pages).
In any case, the most important observation is that flash write times are not predictable, but you shouldn't normally have to worry about flash wear.
Did you check what happens to the FAT file system consistency in case of a power failure or reset of your device?
When your device experience such a failure you must not lose only that log entry, that you are just writing. Older entries must stay valid.
No, FAT is not the right thing if you need to read back the data.
You should further consider what happens, if the flash memory is filled with data. How do you get space for new data? You need to define the requirements for this point.

Resources