How are "write-anywhere file systems" useful, and how are they implemented? - filesystems

Basically, I was wondering how write-anywhere file systems provide any advantage over the other kinds of filesystems out there, and how the write-anywhere model manages to do this (in a broad sense)?
Thanks

Three popular file systems follow, in a very broad sense, the write-anywhere approach: the original WAFL used by NetApp (old technical report), ZFS, and BTRFS.
The key properties of these file systems are
that there are no pre-assigned parts of the underlying block storage for data and meta data (hence the write-anywhere) and
that data is never overwritten, but redirected to a different location on the block storage. The latter property is shared with Flash Translation Layers or special Flash file systems, but usually they don't have the first property.
They have a few nice advantages (as a short summary):
It is easier and more straightforward to implement advanced file system features like snapshots, CDP, data deduplication.
Consistency is easier. Recovery after a crash is faster. In theory, a file system check should never be necessary.
RAID writes can be optimized. Multiple unrelated writes can be placed in a single RAID group, so that the number of I/Os needed for the writes is reduced.
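To make the "never overwrite, redirect instead" property concrete, here is a minimal sketch in Python. It is a toy model, not how WAFL, ZFS or BTRFS are actually implemented: the class name `WriteAnywhereStore` and its methods are made up for illustration, and a real file system shares parts of the block map between snapshots instead of copying the whole map.

```python
class WriteAnywhereStore:
    """Toy redirect-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = []      # append-only "disk": physical block contents
        self.live = {}        # logical block number -> physical index
        self.snapshots = {}   # snapshot name -> frozen copy of the block map

    def write(self, logical, data):
        # Redirect: new data always goes to a fresh physical location.
        self.blocks.append(data)
        self.live[logical] = len(self.blocks) - 1

    def read(self, logical, snapshot=None):
        block_map = self.live if snapshot is None else self.snapshots[snapshot]
        return self.blocks[block_map[logical]]

    def snapshot(self, name):
        # A snapshot copies only the map, never the data blocks.
        self.snapshots[name] = dict(self.live)


wa_store = WriteAnywhereStore()
wa_store.write(0, b"version 1")
wa_store.snapshot("before")
wa_store.write(0, b"version 2")
```

After the second write, `wa_store.read(0)` sees the new data while `wa_store.read(0, snapshot="before")` still sees the old block, because the old block was never touched. This is why snapshots come almost for free in this model.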

Related

Why don't Operating Systems (Windows,Linux) use Relational Databases (RDBMS) Instead of File Systems?

We all know that most operating systems use file systems to store all data but don't you think it is more efficient to use databases as we use in websites/web apps?
tl;dr: Diversity.
First of all, if you look at the original FAT filesystem, and the original Unix filesystem, they were both key-value stores, they did not have a directory hierarchy.
Second, this link suggests that there are filesystems implemented with an RDBMS backend, which is tangential to your question.
Having said these, comparing RDBMS to a filesystem as storage for an OS, there are several drawbacks to using RDBMS:
First, an RDBMS makes very strong guarantees (ACID) by means of locking, at the cost of performance. However, most programs do not require such guarantees (for example, think of every program that works with a NoSQL DB). In comparison, POSIX makes strong-ish guarantees about metadata, but barely any guarantees about I/O. You can build an RDBMS on top of POSIX and add locking, but you can't build a filesystem on top of an RDBMS and remove locking.
Second, an RDBMS requires a schema. Imagine that you create a new storage volume for an OS. Instead of formatting a filesystem, you need to decide on a schema. What schema will be the most useful?
With filesystems, the "schema" is basically one table, with the columns "path", "data", and a column for each file attribute like modification time, type, and size. Using an RDBMS for this schema allows you to perform operations like mass truncate, mass rename, mass access control etc. atomically. However, it will not allow you to modify the data of the same record (file) concurrently. Nor will it allow you to implement hard links. Extended attributes or Alternate Data Streams will still have to be implemented as they are today rather than leveraging RDBMS capabilities. You will also need special index logic for the path column in order to implement features like changing directory, listing a directory, and checking permissions for every directory in the path of a file, and special logic for the data column because files can be TBs in size. At that point, the ROI of the RDBMS goes down with every feature you add.
Alternatively you can have the schema be per-program (i.e. every program can do CREATE TABLE etc.), but then your features are again limited by what the RDBMS can do. For example, how do you get the equivalent of find / -size +1GB or md5sum, or even cat or ls? which columns will these programs read? You'll find that all generic programs now need to take a set of columns that are of interest. It also makes scripting much harder.
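As a rough illustration of the one-table "schema" described above, here is a sketch using SQLite from Python's standard library as a stand-in for an RDBMS. The table name `files`, the column set, and the sample paths are assumptions made for the example only. It shows an atomic mass rename and a rough equivalent of a `find -size` query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (path TEXT PRIMARY KEY, data BLOB, mtime REAL, size INTEGER)"
)
rows = [
    ("/photos/2021/a.jpg", b"aaa", 0.0, 3),
    ("/photos/2021/b.jpg", b"bbbb", 0.0, 4),
    ("/docs/notes.txt", b"hi", 0.0, 2),
]
conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", rows)

# Mass rename: move everything under /photos/2021/ to /archive/ in one transaction.
old, new = "/photos/2021/", "/archive/"
with conn:  # commits on success, rolls back if an exception is raised
    conn.execute(
        "UPDATE files SET path = ? || substr(path, ?) WHERE substr(path, 1, ?) = ?",
        (new, len(old) + 1, len(old), old),
    )

paths = [row[0] for row in conn.execute("SELECT path FROM files ORDER BY path")]

# Rough equivalent of `find / -size +N`: files larger than 3 bytes.
big = [row[0] for row in conn.execute("SELECT path FROM files WHERE size > ?", (3,))]
```

Note how even this tiny example already needs string surgery on the path column to emulate a directory rename, which is the kind of special logic the answer refers to.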
Third, hierarchical systems are typically easier to scale.
One example is when you want to add storage. In a hierarchical filesystem, even without any fancy filesystem features, you can simply mount another filesystem onto a directory, and you have new storage. The tradeoff vs increasing the storage capacity of the current filesystem is that hard links and renames don't work across filesystems, and the two filesystems don't share storage capacity. However, on an RDBMS your options are either to create a new table and have your programs/scripts manage both tables, or to add more storage volume, for which you might need to do more advanced things like partitioning.
Another example is ecosystem requirements. As an end user wanting to put some order into their 60,000 pictures, 5000 songs, hundreds of work spreadsheets, 10,000 memes, hundreds of eBooks, videos etc. - things that are convenient to arrange in a hierarchy - you currently only need two programs - a file manager (Explorer, bash, Nautilus etc.), and a search capability (e.g. find(1)). On an RDBMS, you either have different tables with different columns, or one table with generic columns. Either way, you have to have a set of SQL scripts to work with these specific collections, which would be equivalent to having a shell script or a program for each type of collection. Meaning, managing large collections requires more programming.
Since hierarchical systems are useful in a generic context (which is the context the major OSes operate in), and since it's easier to build a non-hierarchical system on top of hierarchical one than doing the other way around (hierarchical filesystem cache even makes the job easier for libsqlfs), it is valuable for OSes to support hierarchical systems first-class.
The executive summary is: OSes serve many use cases, and storage access is a major part of that. It would be wise for an OS to build a storage access mechanism that's as minimal as possible, but that allows applications to build more specialized storage access mechanism on top of the OS.
That means providing a small but useful set of features (like permissions, locking, mounting, and symlinks) without forcing too many requirements (like locking, or specifying the data format to the OS).
RDBMSes are just too specific.

Why database is considered different from a file system

Well, every database book starts with the story of how people used to store data as files and how inconvenient that was. After databases came along, things became really easy and seamless, because we can now query data etc. My question is: how are the tables really stored on disk and retrieved? Aren't they stored as files, or are they just copied into the address space bit by bit and accessed via address only? Or is there an underlying file system, with the database server handling access to it and presenting the abstraction of a table to us?
Might be a very trivial question but, I have not found answer in any book
The question is not trivial, but the distinction between the two is quite apparent.
File systems provide a way to logically view the streams in a hierarchical manner.
A virtual representation of what lies on the disk; which would otherwise just be a binary stream, unreadable.
When we talk about storing data, we can extend a method of writing data to files and later define our own protocols for CRUD'ing on it; thus mimicking a fractional part of what databases do.
There are numerous limitations to storing data in files. If you store them in file and define your own protocol, it will be very specific to you. Plus, there are various other concerns like security, disaster recovery etc etc.
Even though everything is stored in some or the other way on disk, the main advantage databases bring to the table versus files are the mechanisms that they offer.
To minimize the io, we have db caches and numerous other features.
As you imagine a File system to be something that helps visualize and access the data on the disk in streams, we can imagine a database to be such a tool for data - Data systems, which organizes your data. Files can only fractionally do that; again, unless you extend your program to mimic a database.
How the tables are really stored on the disk and retrieved is a vast topic. I advise reading about the internals of your favourite database. A book by Korth might also be a good read.

Database vs File system storage

Database ultimately stores the data in files, whereas File system also stores the data in files. In this case what is the difference between DB and File System. Is it in the way it is retrieved or anything else?
A database is generally used for storing related, structured data, with well defined data formats, in an efficient manner for insert, update and/or retrieval (depending on application).
On the other hand, a file system is a more unstructured data store for storing arbitrary, probably unrelated data. The file system is more general, and databases are built on top of the general data storage services provided by file systems. [Quora]
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database.
For very complex operations, the filesystem is likely to be very slow.
Main RDBMS advantages:
Tables are related to each other
SQL query/data processing language
Transaction processing addition to SQL (Transact-SQL)
Server-client implementation with server-side objects like stored procedures, functions, triggers, views, etc.
Advantages of a file system over a database management system:
When handling small data sets with arbitrary, probably unrelated data, a file is more efficient than a database.
For simple read and write operations, file operations are faster and simpler.
You can find any number of further differences on the internet.
"They're the same"
Yes, storing data is just storing data. At the end of the day, you have files. You can store lots of stuff in lots of files & folders, there are situations where this will be the way. There is a well-known versioning solution (svn) that finally ended up using a filesystem-based model to store data, ditching their BerkeleyDB. Rare but happens. More info.
"They're quite different"
In a database, you have options you don't have with files. Imagine a textfile (something like tsv/csv) with 99999 rows. Now try to:
Insert a column. It's painful, you have to alter each row and read+write the whole file.
Find a row. You either scan the whole file or build an index yourself.
Delete a row. Find row, then read+write everything after it.
Reorder columns. Again, full read+write.
Sort rows. Full read, some kind of sort - then do it next time all over.
There are lots of other good points but these are the first mountains you're trying to climb when you think of a file based db alternative. Those guys programmed all this for you, it's yours to use; think of the likely (most frequent) scenarios, enumerate all possible actions you want to perform on your data, and decide which one works better for you. Think in benefits, not fashion.
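The "find a row" point above can be sketched as follows. This is only an illustration with made-up data: the same 1000 rows are stored once as tab-separated text and once in an SQLite table (a stand-in for "a database"), and the helper `find_in_tsv` is a hypothetical name.

```python
import csv
import io
import sqlite3

# Build a small TSV "file" in memory with id and name columns (made-up data).
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
for i in range(1000):
    writer.writerow([i, f"name{i}"])
text = buf.getvalue()


def find_in_tsv(text, wanted_id):
    """Linear scan: every lookup reads rows until it hits the match."""
    scanned = 0
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        scanned += 1
        if int(row[0]) == wanted_id:
            return row[1], scanned
    return None, scanned


# Same data in a table; the primary key serves as an index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    ((int(r[0]), r[1]) for r in csv.reader(io.StringIO(text), delimiter="\t")),
)

tsv_name, rows_scanned = find_in_tsv(text, 999)
db_name = conn.execute("SELECT name FROM people WHERE id = ?", (999,)).fetchone()[0]
```

Both lookups return the same name, but the text-file version had to walk all 1000 rows to reach the last one, while the table lookup goes through the key index.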
Again, if you're storing JPG pictures and only ever look for them by one key (their id maybe?), a well-thought filesystem storage is better. Filesystems, btw, are close to databases today, as many of them use a balanced tree approach, so on a BTRFS you can just put all your pictures in one folder - and the OS will silently implement something like an early SQL query each time you access your files.
So, database or files?...
Let's see a few typical examples when one is better than the other. (These are no complete lists, surely you can stuff in a lot more on both sides.)
DB tables are much better when:
You want to store many rows with the exact same structure (no block waste)
You need lightning-fast lookup / sorting by more than one value (indexed tables)
You need atomic transactions (data safety)
Your users will read/write the same data all the time (better locking)
Filesystem is way better if:
You like to use version control on your data (a nightmare with dbs)
You have big chunks of data that grow frequently (typically, logfiles)
You want other apps to access your data without API (like text editors)
You want to store lots of binary content (pictures or mp3s)
TL;DR
Programming rarely says "never" or "always". Those who say "database always wins" or "files always win" probably just don't know enough. Think of the possible actions (now + future), consider both ways, and choose the fastest / most efficient for the case. That's it.
Something one should be aware of is that Unix has what is called an inode limit. If you are storing millions of records then this can be a serious problem. You should run df -i to view the % used as effectively this is a filesystem file limit - EVEN IF you have plenty of disk space.
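If you want to check this from a program rather than with df -i, something along these lines should work on Unix-like systems. It is a small sketch using Python's `os.statvfs`; the function name `inode_usage_percent` is made up, and some filesystems report zero total inodes, in which case no percentage can be computed.

```python
import os


def inode_usage_percent(path="/"):
    """Return the percentage of inodes in use on the filesystem holding path.

    Returns None when the filesystem reports 0 total inodes, i.e. it does
    not expose a fixed inode count.
    """
    st = os.statvfs(path)
    if st.f_files == 0:
        return None
    used = st.f_files - st.f_ffree
    return 100.0 * used / st.f_files


usage = inode_usage_percent("/")
```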
The difference between file processing system and database management system is as follow:
A file processing system is a collection of programs that store and manage files on a computer's hard disk. On the other hand, a database management system is a collection of programs that enables you to create and maintain a database.
File processing system has more data redundancy, less data redundancy in dbms.
File processing system provides less flexibility in accessing data, whereas dbms has more flexibility in accessing data.
File processing system does not provide data consistency, whereas dbms provides data consistency through normalization.
File processing system is less complex, whereas dbms is more complex.
Context: I've written a filesystem that has been running in production for 7 years now. [1]
The key difference between a filesystem and a database is that the filesystem API is part of the OS, thus filesystem implementations have to implement that API and thus follow certain rules, whereas databases are built by 3rd parties having complete freedom.
Historically, databases were created when the filesystem provided by the OS was not good enough for the problem at hand. Just think about it: if you had special requirements, you couldn't just call Microsoft or Apple to redesign their filesystem API. You would either go ahead and write your own storage software or you would look around for existing alternatives. So the need created a market for 3rd party data storage software which ended up being called databases. That's about it.
While it may seem that filesystems have certain rules like having files and directories, this is not true. The biggest operating systems work like that but there are many small OSs that work differently. It's certainly not a hard requirement. (Just remember, to build a new filesystem, you also need to write a new OS, which will make adoption quite a bit harder. Why not focus on just the storage engine and call it a database instead?)
In the end, both databases and filesystems come in all shapes and sizes. Transactional, relational, hierarchical, graph, tabled; whatever you can think of.
[1] I've worked on the Boomla Filesystem which is the storage system behind the Boomla OS & Web Application Platform.
The main differences between database and file system storage are:
The database is a software application used to insert, update and delete data, while the file system is software used to add, update and delete files.
Saving and retrieving files is simpler in a file system, while SQL needs to be learned to perform any query on the database to get (SELECT), add (INSERT) and update the data.
A database provides a proper data recovery process, while a file system does not.
In terms of security, the database is (usually) more secure than the file system.
The migration process is very easy in a file system (just copy and paste into the target), while for a database this task is not as simple.

Understanding KeyValue embedded datastore vs FileSystem

I have a basic question with regards to FileSystem usage
I want to use a embedded KeyValue store, which is very write oriented. (persistent) Say my value size is
a) 10 K
b) 1 M
and read and updates are equal in number
Can't I simply create files containing the values, with their names acting as keys?
Won't it be as fast as using a key-value store such as LevelDB or RocksDB?
Can anybody please help me understand?
In principle, yes, a filesystem can be used as a key-value store. The differences only come in when you look at individual use cases and limitations in the implementations.
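A minimal sketch of the "one file per key" idea from the question, in Python. The class `FileKV` and its methods are hypothetical names, and the write-to-temp-then-rename pattern is one common way to avoid half-written values, not something the question prescribes. Keys are hex-encoded so they are safe as file names, which also limits how long a key can be.

```python
import os
import tempfile


class FileKV:
    """Each key is a file name inside one directory; the value is the file body."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key):
        # Hex-encode the key so arbitrary characters are safe in a file name.
        return os.path.join(self.root, key.encode("utf-8").hex())

    def put(self, key, value):
        # Write to a temp file, then rename: readers see the old or the new
        # value, never a half-written one.
        fd, tmp = tempfile.mkstemp(dir=self.root)
        with os.fdopen(fd, "wb") as f:
            f.write(value)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self._path(key))

    def get(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None


kv = FileKV(tempfile.mkdtemp())
kv.put("user:1", b"x" * 10 * 1024)   # a 10 KiB value, as in the question
kv.put("user:1", b"updated")
```

This gives atomic updates for a single key only. Updating several keys together, or iterating keys in order, is where the differences listed below start to matter.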
Without going into too much detail here, there are some things likely to be very different:
A filesystem splits data into fixed size blocks. Two files can't typically occupy parts of the same block. Common block sizes are 4-16 KiB; you can calculate how much overhead your 10 KiB example would cause. Key/value stores tend to account for smaller-sized pieces of data.
Directory indexes in filesystems are often not capable of efficiently iterating over the filenames/keys in sort order. You can efficiently look up a specific key, but you can't retrieve ranges without reading pretty much all of the directory entries. Some key/value stores, including LevelDB, support efficient ordered iterating.
Some key/value stores, including LevelDB, are transactional. This means you can bundle several updates together, and LevelDB will make sure that either all of these updates make it through, or none of them do. This is very important to prevent your data getting inconsistent. Filesystems make this much harder to implement, especially when multiple files are involved.
Key/value stores usually try to keep data contiguous on disk (so data can be retrieved with less seeking), whereas modern filesystems deliberately do not do this across files. This can impact performance rather severely when reading many records. It's not an issue on solid-state disks, though.
While some filesystems do offer compression features, they are usually either per-file or per-block. As far as I can see, LevelDB compresses entire chunks of records, potentially yielding better compression (though they biased their compression strategy towards performance over compression efficiency).
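The block-size overhead mentioned in the first point can be worked out with a few lines of Python. This is a simplified calculation: it assumes every file occupies whole blocks and ignores inodes, directory entries, and features such as inline data or tail packing that some filesystems offer. The function name `allocated_bytes` is made up.

```python
KIB = 1024


def allocated_bytes(value_size, block_size):
    """Bytes of whole blocks needed to hold value_size bytes."""
    blocks = -(-value_size // block_size)   # integer ceiling division
    return blocks * block_size


# Wasted space for the 10 KiB value from the question, per block size.
waste = {
    block: allocated_bytes(10 * KIB, block) - 10 * KIB
    for block in (4 * KIB, 16 * KIB)
}
```

With 4 KiB blocks the 10 KiB value occupies three blocks (12 KiB, so 2 KiB wasted); with 16 KiB blocks it occupies one block (6 KiB wasted). The 1 MiB value is a multiple of both block sizes, so the relative overhead there is negligible.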
Let's try to build a minimal NoSQL DB server using Linux and a modern file system in 2022, just for fun, not for a serious environment.
DO NOT TRY THIS IN PRODUCTION
—————————————————————————————————————————————
POSIX file Api for read write,
POSIX ACL for native user accounts and group permission management.
POSIX filename as key ((root db folder)/(tablename folder)/(partition folder)/(64bitkey)). Per db and table we can define permission for read/write using POSIX ACL. (64bitkey) is generated in compute function.
Mount BTRFS/OpenZFS/F2fs as filesystem to provide compression (Lz4/zstd) and encryption (fscrypt) as native support. F2fs may be the more suitable choice, as it is a log-structured filesystem, a design related to the LSM trees that many NoSQL DBs use in their low-level architecture.
Meta data is handled by filesystem so no need to implement it.
Use Linux and/or the filesystem to configure the page, file, or disk block cache according to the read/write patterns implemented in the business logic written in the compute function or db procedure.
Use RAID and sshfs for remote replication to create Master/Slave high availability and/or backup
Compute function or db procedure for writing logic could be NodeJS file or Go binary or whatever along with standard http/tcp/ws server module which reads and write contents to DB.

Non-file FileSystems?

I've been thinking on this for a while now (you know, that dangerous thing programmers tend to do) and I've been wondering, is the method of storing data that we're so accustomed to really all that efficient? The trouble with answering this question is that I really don't have anything to compare it to, since it's the only thing I've ever used.
I don't mean FAT or NTFS or a particular type of file system, I mean the filesystem structure as a whole. We are simply used to thinking of "files" inside "folders" like our hard drive was one giant filing cabinet. This is a great analogy and indeed, it makes it a lot easier to learn when we think of it this way, but is it really the best way to go about describing programs and their respective parts?
I'd like to know if anyone can think of (or knows about) a data storage technique that might be used to store data for an Operating System to use that would organize the parts of data in a different manner. Does anything... different even exist?
Emails are often stored in folders. But ever since I have migrated to Gmail, I have become accustomed to classifying my emails with tags.
I often wondered if we could manage a whole file-system that way: instead of storing files in folders, you could tag files with the tags you like. A file identifier would not look like this:
/home/john/personal/contacts.txt
but more like this:
contacts[john,personal]
Well... just food for thought (maybe this already exists!)
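As a rough idea of what such tag-based lookup could look like, here is a small in-memory sketch in Python. `TagStore` and its methods are made-up names; a real tagging filesystem would also have to deal with persistence, name collisions and so on.

```python
from collections import defaultdict


class TagStore:
    """Files are identified by a name plus a set of tags, not by a path."""

    def __init__(self):
        self.by_tag = defaultdict(set)   # tag -> set of file names

    def add(self, name, tags):
        for tag in tags:
            self.by_tag[tag].add(name)

    def find(self, *tags):
        # Files carrying every requested tag: intersect the per-tag sets.
        sets = [self.by_tag.get(tag, set()) for tag in tags]
        return set.intersection(*sets) if sets else set()


tagfs = TagStore()
tagfs.add("contacts", {"john", "personal"})
tagfs.add("budget", {"john", "work"})
```

Here `tagfs.find("john", "personal")` plays the role of contacts[john,personal], and the order of the tags no longer matters, unlike the order of directories in a path.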
You can for example have dedicated solutions, like Oracle Raw Partitions. Other databases support similar things. In these cases the filesystem adds unnecessary overhead and can be omitted - the DB software will take care of organising the structure.
The problem seems very application dependent and files/folders seem to be a reasonable compromise for many applications (and is easy for human beings to comprehend).
Mainframes used to just give programmers a number of 'devices' to use. The device corresponded to a drive or a partition thereof and the programmer was responsible for the organisation of all data on it. Of course they quickly built up libraries to help with that.
The only OS I can think of that doesn't use the common hierarchical arrangement of flat files (like UNIX) is PICK. That used a sort of relational database as the filesystem.
Microsoft had originally planned to introduce a new file-system for windows vista (WinFS - windows future storage). The idea was to store everything in a relational database (SQL Server). As far as I know, this project was never (or not yet?) finished.
There's more information about it on wikipedia.
I knew a guy who wrote his doctorate about a hard disk that comes with its own file system. It was based on an extension of SCSI commands that allowed the usual open, read, write and close commands to be sent to the disk directly, bypassing the file system drivers of the OS. I think the conclusion was that it is inflexible, and does not add much efficiency.
Anyway, this disk based file system still had a folder like structure I believe, so I don't think it really counts for you ;-)
Well, there's always Pick, where the OS and file system were an integrated database.
Traditional file systems are optimized for fast file access if you know the name of the file you want (including its path). Directories are a way of grouping files together so that they're easier to find if you know properties of the file but not its actual name.
Traditional file systems are not good at finding files if you know very little about them, however they are robust enough that one can add a layer on top of them to aid in retrieving files based on content or meta-information such as tags. That's what indexers are for.
The bottom line is we need a way to store persistently the bytes that the CPU needs to execute. So we have traditional file systems which are very good at organizing sequential sets of bytes. We also need to store persistently the bytes of files that aren't executed directly, but are used by things that do execute. Why create a new system for the same fundamental thing?
What more should a file system do other than store and retrieve bytes?
I'll echo the other responses. If I could pick a filesystem type, I personally would rather see a hybrid approach: a flat database of subtrees, where each subtree is considered as a cohesive unit, but if you consider the subtrees themselves as discrete units they would have no hierarchy, but instead could have metadata + be queryable on that metadata.
The reason for files is that humans like to attach names to "things" they have to use. Otherwise, it becomes hard to talk or think about or even distinguish them.
When we have too many things on a heap, we like to separate the heap. We sort it by some means and we like to build hierarchies where you can navigate arbitrarily sized amounts of things.
Hence directories and files just map our natural way of working with real objects, since you can put anything in a file. On Unix, even hardware is mapped into the file system as "device nodes", which are special files that you can read from and write to in order to send commands to the hardware.
I think the metaphor is so powerful, it will stay.
I spent a while trying to come up with an automagically versioning file system that would maintain versions (and version history) of any specific file and/or directory structure.
The idea was that all of the standard access command (e.g. dir, read, etc.) would have an optional date/time parameter that could be passed to access the file system as it looked at that point in time.
I got pretty far with it, but had to abandon it when I had to actually go out and earn some money. It's been on the back-burner since then.
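The "optional date/time parameter" idea could look roughly like this in Python. This is only a guess at the shape of such an API, not the design described above: `VersionedStore`, its methods and the integer timestamps are all invented for the example.

```python
import bisect


class VersionedStore:
    """Keeps every version of each file; reads can ask for a past point in time."""

    def __init__(self):
        self.history = {}   # name -> (list of timestamps, list of contents)

    def write(self, name, data, when):
        times, contents = self.history.setdefault(name, ([], []))
        # This sketch assumes writes arrive in increasing timestamp order.
        times.append(when)
        contents.append(data)

    def read(self, name, when=None):
        times, contents = self.history[name]
        if when is None:
            return contents[-1]
        # Index of the last version written at or before `when`.
        i = bisect.bisect_right(times, when) - 1
        if i < 0:
            raise FileNotFoundError(f"{name} did not exist at {when}")
        return contents[i]


vs = VersionedStore()
vs.write("notes.txt", "draft", when=100)
vs.write("notes.txt", "final", when=200)
```

A plain `vs.read("notes.txt")` returns the latest version, while `vs.read("notes.txt", when=150)` returns the file as it looked at time 150.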
If you take a look at the start-up times for operating systems, it should be clear that improvements in accessing disks can be made. I'm not sure if the changes should be in the file system or rather in the OS start-up code.
Personally, I'm really sorry WinFS didn't fly. I loved the concept..
From Wikipedia (http://en.wikipedia.org/wiki/WinFS) :
WinFS includes a relational database for storage of information, and allows any type of information to be stored in it, provided there is a well defined schema for the type. Individual data items could then be related together by relationships, which are either inferred by the system based on certain attributes or explicitly stated by the user. As the data has a well defined schema, any application can reuse the data; and using the relationships, related data can be effectively organized as well as retrieved. Because the system knows the structure and intent of the information, it can be used to make complex queries that enable advanced searching through the data and aggregating various data items by exploiting the relationships between them.
