Working with block special files/devices to implement a filesystem

Working with block special files/devices to implement a filesystem - c

I've implemented a basic filesystem using FUSE, with all foreseeable POSIX functionality implemented [naturally I haven't even profiled yet ;)]. Currently I'm able to run the filesystem on a regular file (st_mode & S_IFREG), but the next step in development is to host it on an actual block device. Running my code as is, immediately fails on reading st_size after calling fstat on the device. Of course I don't expect the problems to stop there so:
What changes are required to operate on block devices as opposed to regular files?
What are some special considerations I need to make with regard to performance, limitations, special features and the like?
Are there any tutorials and references with dealing with block special files? Googling has turned up very little useful; I only have background knowledge (ironically from MSDN in my dark past) and some scanty information in the manpages.
Update0
I've pointed out what I mean by "regular file".
I don't want to concentrate on getting the device size, I want general guidelines for differences between regular files and device files with respect to performance and usage.

Currently I'm able to run the
filesystem on a regularly file, but
the next step in development is to
host it on an actual block device
I don't completely understand what you mean - I assume you are saying that "you currently save your filesystem data to a plain file on a normally mounted filesystem - but now wish to use a raw block device for your data storage".
If so - having done this a few times - I'd advise the following:
Never use an "actual" block device for you filesystem. Always use a partition. There are several rarely-used partition-types that you can use to denote that such a filesystem may be your filesystem type, and that your filesystem can check and mount it if it is such. Thus, you will never be running on something like "/dev/sdb", but rather you will store you data on one such as /dev/sdb1, and assign it some partition type. This has obvious advantages, like allowing your filesystem to be co-resident on a single phyiscal disk as another, etc.
If you are implementing any caching in your filesystem (like Linux does with the Page Cache), do all I/Os to the block devices with O_DIRECT. This requires you to pass page-alligned memory to do all I/O, and requires that the requests be sector/block aligned - but will remove a data copy which would otherwise be required when data is moved from the block device to the page cache, then from the page-cache to your user-space [filesystem] reader.
What do you mean that the fstat "fails"? This is an fstat trying to determing the length of the block device? Do you receive an error? What is it?

block devices behave very much like files - tools like dd can operate on them without any special handling. fstat, though, returns information about the special-file node, not the blockdev it refers to. you probably want to use the BLKGETSIZE64 ioctl to read the size.
there's no particular reason to use a partition over a raw device, though - a blockdev is a blockdev. O_DIRECT is good, as well, assuming your workload won't generate repeated accesses. don't confuse it with a real protocol for ensuring the permanence and atomicity of your filesystem, though (fsync, barriers, etc).

Related

windows - plain shared memory between 2 processes (no file mapping, no pipe, no other extra)

How to have an isolated part of memory, that is NOT at all backed to any file or extra management layers such as piping and that can be shared between two dedicated processes on the same Windows machine?
Majority of articles point me into the direction of CreateFileMapping. Let's start from there:
How does CreateFileMapping with hFile=INVALID_HANDLE_VALUE actually work?
According to
https://msdn.microsoft.com/en-us/library/windows/desktop/aa366537(v=vs.85).aspx
it
"...creates a file mapping object of a specified size that is backed by the system paging file instead of by a file in the file system..."
Assume I write something into the memory which is mapped by CreateFileMapping with hFile=INVALID_HANDLE_VALUE. Under which conditions will this content be written to the page file on disk?
Also my understanding of what motivates the use of shared memory is to keep performance up and optimized. Why is the article "Creating Named Shared Memory"
(https://msdn.microsoft.com/de-de/library/windows/desktop/aa366551(v=vs.85).aspx) refering to
CreateFileMapping, if there is not a single attribute combination, that would prevent writing to files, e.g. the page file?
Going back to the original question: I am afraid, that CreateFileMapping is not good enough... So what would work?

You misunderstand what it means for memory to be "backed" by the system paging file. (Don't feel bad; Raymond Chen has described the text you quoted from MSDN as "one of the most misunderstood sentences in the Win32 documentation.") Almost all of the computer's memory is "backed" by something on disk; only the "non-paged pool", used exclusively by the kernel and as little as possible, isn't. If a page isn't backed by an ordinary named file, then it's backed by the system paging file. The operating system won't write pages out to the system paging file unless it needs to, but it can if it does need to.
This architecture is intended to ensure that processes can be completely "paged out" of RAM when they have nothing to do. This used to be much more important than it is nowadays, but it's still valuable; a typical Windows desktop will have dozens of processes "idle" waiting for events (e.g. needing to spool a print job) that may never happen. Those processes can get paged out and the memory can be put to more constructive use.
CreateFileMapping with hfile=INVALID_HANDLE_VALUE is, in fact, what you want. As long as the processes sharing the memory are actively doing stuff with it, it will remain resident in RAM and there will be no performance problem. If they go idle, yeah, it may get paged out, but that's fine because they're not doing anything with it.
You can direct the system not to page out a chunk of memory; that's what VirtualLock is for. But it's meant to be used for small chunks of memory containing secret information, where writing it to the page file could conceivably leak the secret. The MSDN page warns you that "Each version of Windows has a limit on the maximum number of pages a process can lock. This limit is intentionally small to avoid severe performance degradation."

pread/pwrite, buffers and disk cache

If my code does something like fd = open("/dev/sdXY", ...) and pwrite(fd, ...)/pread(fd, ...), do the I/O operations skip the buffers or disk cache? Suppose /dev/sdXY is a unmounted, formatted disk partition (ext4, ufs, etc.).
I ask that because there is a need to grant contiguous file storage in an application I'm working on and I read that the only way to achieve it is doing something like what I described. However, I may remove the need for contiguous storage if that would lead in lost of buffers, disk cache or some other useful feature.
I'm also confused about if I would need to re-implement low level stuff since the partition would already be formatted with a file system. I read that would be the case for RAW disks/partitions. I already know it will be needed to handle which blocks are free or in use, files and folders structures, etc., I'm already working on that.
Another question: I have only seen something about buffers when reading about fopen()/fread()/fwrite() and C++'s file streams. Is it right that only these streams and the f* family of functions have some kind of buffer, unlike open/write/read/pwrite/pread/etc? Is this buffer the same as disk cache or something different?
A last one: Is HDD cache handled by its own drive or by file system (ext4, ufs, etc.)?

The simple answer is 'it depends'. What's hard is characterizing what it depends on.
Simply using open() doesn't avoid the kernel disk buffer pool. To do that, you need special options (O_DIRECT) on Linux. However, using open() does avoid using hidden application buffers; you get to choose where the data is read from or written to without any intermediate copies. By contrast, the f* family of functions do have a 'hidden' application buffer; the data is frequently read into an I/O buffer associated with the FILE * file stream, and then copied into your application buffers.
If your /dev/sdXY device is already formatted with a file system but you want to ensure contiguous file storage for a file, you are going to have to replicate a significant portion of the file system driver to ensure you allocate the space correctly. It is unlikely to be a sensible use of your time or energy. Yes, you would need to reimplement all sorts of low-level disk space management — it would be entirely non-trivial. Further, the implementation for ext4 would be quite different from the implementation for ufs, etc — so you'd really have your work cut out for you.

Optimizing data stream to disk in C (also flash memory)

I have a C program running on Linux that acquires data from a USB device (sensor data), does some processing and streams the result to disk. Currently I save to a text file using fputs(), a line looks like this:
timestamp value1 value2 ... valueN
the sample rate being up to 250Hz.
The program should run on a RPi or similar board and possibly write the data to a flash memory (SD card).
I have following questions:
Should I be optimizing the data stream or let the OS do the job? More specifically, should I be trying to minimize how often data is actually written to disk (also given the use of a flash memory)?
I have read about setbuf() and setvbuf(), as I understand they should effectively delay writing until a "block" is filled. Are these appropriate or is there a better way other than perhaps implementing my own buffer?
Which output function is best suited for data streaming with the above in mind (fputs() / fprintf() / write())?
Should I be trying to increase randomness (as to use all sectors) when writing to a SD card? If yes what's the best way to achieve this?
Here some more thoughts:
I can consider using a binary format to decrease size, but I would prefer keeping the text format to simplify later data handling.
Using a hard drive is also an option in the final design, especially if a high acquisition rate is to be carried on over a long time.
The data rate being relatively low I do not expect bandwidth problem with either hard drive or SD card. It is possible that the rate will be higher in the future (kHz or more).
Thanks for your answers.
EDIT 20130128
Thank you for all the answers so far, they give me some good insight. I'll sum it up a bit:
In general I should not have bandwidth issues, however to avoid unnecessary large log files I might consider a binary format. Yes the log should be human readable, if not I'll make an export function or similar. Yes unwind's assumption is correct, about 10 or 15 data values each line.
The mentioned read/write cycles per cell should be enough for some time, at least in the testing phase, considering we don't always write and delete the same cells. I will play around with buffer size in setvbuf() and set the buffering mode to full buffering to see if I can optimize this while keeping a reasonable save interval (a few seconds or more also depending on sample rate).
In the final design I might use a hard drive to avoid most of the problems mentioned here, or a second SD card which can be easily replaced (might be also good to quickly retrieve the data). I will format this with one of the format suggested here (FAT or JFFS2/F2FS).
Following zmo's suggestion I will try to make the system as read only as possible (at least the system partition), I was already considering this.
A Beaglebone, also mentioned by zmo, is my next choice if I'm not happy with the RPi (I read that its USB bus is not always stable, USB is obviously very important for my application).
I have already implemented a UDP port to send data over network, still I would like to keep at least a local copy of that data and maybe only send a subset of or already processed data, as well as "control data".

Should I be optimizing the data stream or let the OS do the job? More specifically, should I be trying to minimize how often data is actually written to disk (also given the use of a flash memory)?
Well, you can usually assume that the OS does a pretty awesome job at buffering and handling output to the hard drive… As long as you don't do unbuffered writes.
Though, from my experience, you should not write logs to a SD Card, because it definitely kills the SD Card faster than you can imagine. On my first projects, I had installed linux on beaglebones, and between 6 months to 12 months after, all my SD Cards had to be replaced…
Since then, I've learned to run read only systems on the SD card and send any kind of regular updates over the network, the trick being to use a ramdisk for /tmp and /var.
In your case, using a hard drive is an easy solution (which will works smoothly), but you can also use a secondary SD Card where you write the logs. Then you'll be able to use a "stupid" filesystem such as a FAT one where you'll write your data aligned, as your data will be the only thing to be written on the SD. What is killing a SDCard is lots of little read/writes that happen a lot with temporary files, and defragmentation of the drive.
I have read about setbuf() and setvbuf(), as I understand they should effectively delay writing until a "block" is filled. Are these appropriate or is there a better way other than perhaps implementing my own buffer?
well, just keep it to full buffering, it will help write your data aligned on the filesystem.
Which output function is best suited for data streaming with the above in mind (fputs() / fprintf() / write())?
they should all behave similarly for your problematic.
Should I be trying to increase randomness (as to use all sectors) when writing to a SD card? If yes what's the best way to achieve this?
the firmware of the sdcard should be taking care of that for you. The only thing would be to use a simpler filesystem like FAT (or JFFS2/F2FS like ivan-voras suggets), because ext2/ext3/ext4 filesystems do automatic defragmentation which basically is moving around inodes to keep everything aligned. Though I'm not sure if it disables that behavior with SDcards and SSDs.

Writing to the SD card often will definitely kill it sooner, but it also means you can attempt to prolong this time by reducing the number of writes. As others have said, the best solution for you would be to write the logs over the network to a server or just another machine which has proper storage (in the simplest case, maybe you can use syslog(3) or just plain NFS).
If you want to continue with the original plan, then using setvbuf(3) to enable block buffered mode and setting a large buffer size (like 128 KiB or 256 KiB) would be best. A large buffer size also means that you will lose unwritten data from the buffer if power goes out, etc.
However, a large buffer only delays the inevitable and you should search for other options. It's not as alarming as Lundin's answer states because there are many cells and you're not writing always to the same one, so if you get the largest SD card you can buy, then using his method you can calculate approximately how many times you can rewrite the entire card before it fails. Using a flash-friendly file system such as F2FS or JFFS2 will be beneficial.

Here're my thoughts:
It might be a good idea to buffer some data in memory before writing to disk, but keep in mind that this might cause data loss in case of power failure.
I think this is highly dependent on the file system and type of storage you use. There is no generic answer but it could prove useful to implement and benchmark it on your specific configuration.
Considering the huge amount of data you're outputting, I'd choose a binary format (unless you want the file to be human readable)
The firmware of the flash drive should already take care of this. Basically this is the cornerstone of all modern SSDs. (SD card controllers should implement it too.)

NAND RAW access

I'm working with a C++ application in an embedded systems running Linux. This device receives messages (small chunk of few bytes) and need to be stored in a non volatile memory in case of power failure. This worked well with another platform because a static RAM was available.
The problem on this platform is that we only have a NAND Flash to do this and we would like to append different message in the same block without having to erase the whole block before updating it with a new message ! Writing a file per messages is not a good solution because there can be a lot of them ! Moreover, this must be efficient and should be life sparing for the flash by avoiding too much erases ! What I would like to be able to do is writing byte after byte into the flash without worrying about bad blocks.
I found "Petit FAT File System" and I'm wondering if this would suite my needs ... ?
Could someone tell me if this is possible with "Petit FAT File System" or give me any suggestion on how to handle this ?
Thanks !

I haven't looked into Petit file system, but your real limitation is the NAND flash. The manufacture data sheet will likely indicate how many writes you can successfully make to each block, before an erase is required. It's possible that there is no hard limit, but the integrity of the data will not be guaranteed after a max write count.
The answer depends on the process technology and flash cell design. For example, is it SLC or MLC NAND? SLC is going to be able to handle multiple block writes better.
Another question would be what type of flash controller is on your system? If it uses hardware ECC, then you might be limited by the controller, since 2nd writes will invalidate the ECC value of the 1st data write. If it is possible that you can do ECC calculations in software, then it comes back to the NAND limitation.
Small write support might be addressed in the data sheet, via a special set aside memory area that might be provided. So again, check the data sheet.
If you post a link, or indicate what hardware you are using, I can try and give you a more definite answer.

If you are dealing with flash, there's no way around deleting it before writing. All flash memory works in that way. Depending on your real-time requirements and the size of the data, this may or may not be an issue. But since you are using embedded Linux, real-time is probably not a major concern for the application anyhow.
I don't see why you would need a complete file system to store a few bytes?! Why do you need an external memory for this in the first place, can't you write to the internal flash of the MCU? If you just need to store a few bytes, an MCU with on-chip eeprom/data flash would likely suit your needs the best.
Also, that flash circuit doesn't look too promising. First I find it mighty fishy that they don't type out the number of cycles nor the data retention but refer to the "gualification report". This might indicate that the the memory is of poor quality.
And the data sheet says year 2009 and Samsung. If I may be cynical, that probably means that the chip is already obsolete. Samsung doesn't exactly have the best long-life reputation.

I'm curious why you want to use raw flash. Why not use something like JFFS2 or UBIFS on top of the MTD drive? Let the MTD driver manage the ECC while JFFS2 or UBIFS manages the wear-leveling. Then just open one file and write to it whenever you need.

Direct access to hard disk with no FS from C program on Linux

I want to access the whole hard disk directly from a C program. There's no FS on it and never's gonna be one.
I just want to open /dev/sda (for example) and do I/O at the block/sector level of the disk.
I'm planning to write some programs for learning C programming in the Linux environment (I know C language, Python, Perl and Java) but lack confidence with the Linux environment.
For my learning purposes I'm thinking about playing with kyoto-cabinet and saving the value corresponding to the computed hash directly into a "block/sector" of the hard disk, recording the pair: "hash, block/sector reference" into a kyoto-cabinet hash database file.
I don't know if this is feasible using standard C I/O functions or otherwise I'd have to write a "device driver" or something like...

As mentioned elsewhere, under *NIX systems, block devices like /dev/sda can be accessed as plain files. Note that if file system is mounted from the device, opening it as file for writing would fail.
If you want to play with block devices, I would advise to first use the loop device, which presents a plain file as a block device. For example:
dd if=/dev/zero of=./loop_file_10MB bs=1024 count=10K
losetup /dev/loop0 $PWD/loop_file_10MB
After that, /dev/loop0 would behave as if it was a block device, but all information written would be stored in the file.

As device files for drives (e.g. /dev/sda) are block devices, this means you can open, seek and use the file almost like a normal file.

Yes, as others have noted, you can simply open the block device.
However, it's a really good idea to do IO (writes anyway) on block boundaries and whole blocks. You can use something like pread() and pwrite() to do these IO, or mmap some or all of the device.
There are a bunch of ioctls which can be used, see "man sd" for some more info. They don't seem to all be documented in the same place.
In linux/fs.h BLKROSET and a bunch of other ioctls are defined, you have to look around to find out how to use them. You can do useful things like find out how big the device is, and what the block size is.
The source code of the util-linux-ng package is your friend, it contains examples.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight