Non-file FileSystems? - filesystems

I've been thinking on this for a while now (you know, that dangerous thing programmers tend to do) and I've been wondering, is the method of storing data that we're so accustomed to really all that efficient? The trouble with answering this question is that I really don't have anything to compare it to, since it's the only thing I've ever used.
I don't mean FAT or NTFS or a particular type of file system, I mean the filesystem structure as a whole. We are simply used to thinking of "files" inside "folders" like our hard drive was one giant filing cabinet. This is a great analogy and indeed, it makes it a lot easier to learn when we think of it this way, but is it really the best way to go about describing programs and their respective parts?
I'd like to know if anyone can think of (or knows about) a data storage technique that might be used to store data for an Operating System to use that would organize the parts of data in a different manner. Does anything... different even exist?

Emails are often stored in folders. But ever since I have migrated to Gmail, I have become accustomed to classifying my emails with tags.
I often wondered if we could manage a whole file-system that way: instead of storing files in folders, you could tag files with the tags you like. A file identifier would not look like this:
/home/john/personal/contacts.txt
but more like this:
contacts[john,personal]
Well... just food for thought (maybe this already exists!)

You can for example have dedicated solutions, like Oracle Raw Partitions. Other databases support similar thing. In these cases the filesystem provides unnecessary overhead and can be ommited - DB software will take care of organising the structure.
The problem seems very application dependent and files/folders seem to be a reasonable compromise for many applications (and is easy for human beings to comprehend).

Mainframes used to just give programmers a number of 'devices' to use. The device corresponsed to a drive or a partition thereof and the programmer was responsible for the organisation of all data on it. Of course they quickly built up libraries to help with that.
The only OS I think think of that does use the common hierachical arrangement of flat files (like UNIX) is PICK. That used a sort of relational database as the filesystem.

Microsoft had originally planned to introduce a new file-system for windows vista (WinFS - windows future storage). The idea was to store everything in a relational database (SQL Server). As far as I know, this project was never (or not yet?) finished.
There's more information about it on wikipedia.

I knew a guy who wrote his doctorate about a hard disk that comes with its own file system. It was based on an extension of SCSI commands that allowed the usual open, read, write and close commands to be sent to the disk directly, bypassing the file system drivers of the OS. I think the conclusion was that it is inflexible, and does not add much efficiency.
Anyway, this disk based file system still had a folder like structure I believe, so I don't think it really counts for you ;-)

Well, there's always Pick, where the OS and file system were an integrated database.

Traditional file systems are optimized for fast file access if you know the name of the file you want (including its path). Directories are a way of grouping files together so that they're easier to find if you know properties of the file but not its actual name.
Traditional file systems are not good at finding files if you know very little about them, however they are robust enough that one can add a layer on top of them to aid in retrieving files based on content or meta-information such as tags. That's what indexers are for.
The bottom line is we need a way to store persistently the bytes that the CPU needs to execute. So we have traditional file systems which are very good at organizing sequential sets of bytes. We also need to store persistently the bytes of files that aren't executed directly, but are used by things that do execute. Why create a new system for the same fundamental thing?
What more should a file system do other than store and retrieve bytes?

I'll echo the other responses. If I could pick a filesystem type, I personally would rather see a hybrid approach: a flat database of subtrees, where each subtree is considered as a cohesive unit, but if you consider the subtrees themselves as discrete units they would have no hierarchy, but instead could have metadata + be queryable on that metadata.

The reason for files is that humans like to attach names to "things" they have to use. Otherwise, it becomes hard to talk or think about or even distinguish them.
When we have too many things on a heap, we like to separate the heap. We sort it by some means and we like to build hierarchies where you can navigate arbitrarily sized amounts of things.
Hence directories and files just map our natural way of working with real objects. Since you can put anything in a file. On Unix, even hardware is mapped as "device nodes" into the file system which are special files which you can read/write to send commands to the hardware.
I think the metaphor is so powerful, it will stay.

I spent a while trying to come up with an automagically versioning file system that would maintain versions (and version history) of any specific file and/or directory structure.
The idea was that all of the standard access command (e.g. dir, read, etc.) would have an optional date/time parameter that could be passed to access the file system as it looked at that point in time.
I got pretty far with it, but had to abandon it when I had to actually go out and earn some money. It's been on the back-burner since then.

If you take a look at the start-up times for operating systems, it should be clear that improvements in accessing disks can be made. I'm not sure if the changes should be in the file system or rather in the OS start-up code.

Personally, I'm really sorry WinFS didn't fly. I loved the concept..
From Wikipedia (http://en.wikipedia.org/wiki/WinFS) :
WinFS includes a relational database
for storage of information, and allows
any type of information to be stored
in it, provided there is a well
defined schema for the type.
Individual data items could then be
related together by relationships,
which are either inferred by the
system based on certain attributes or
explicitly stated by the user. As the
data has a well defined schema, any
application can reuse the data; and
using the relationships, related data
can be effectively organized as well
as retrieved. Because the system knows
the structure and intent of the
information, it can be used to make
complex queries that enable advanced
searching through the data and
aggregating various data items by
exploiting the relationships between
them.

Related

Why database is considered different from a file system

Well every database book starts with the story that how earlier people used to store data as files and it was very inconvenient. After database came, things became really easy and seamless, because we can now query data etc. My question is how are the tables really stored in the disk and retrieved ? Aren't they stored as files only or they are just copied to the address space bit by bit, and access via address only ? Or there is a underneath file system and the database server handles accessing the file system and presents the abstraction of a table in front of us.
Might be a very trivial question but, I have not found answer in any book
The question is not trivial, but the distinction between the two is quite apparent.
File systems provide a way to logically view the streams in a hierarchical manner.
A virtual representation of what lies on the disk; which would otherwise just be a binary stream, unreadable.
When we talk about storing data, we can extend a method of writing data to files and later define our own protocols for CRUD'ing on it; thus mimicking a fractional part of what databases do.
There are numerous limitations to storing data in files. If you store them in file and define your own protocol, it will be very specific to you. Plus, there are various other concerns like security, disaster recovery etc etc.
Even though everything is stored in some or the other way on disk, the main advantage databases bring to the table versus files are the mechanisms that they offer.
To minimize the io, we have db caches and numerous other features.
As you imagine a File system to be something that helps visualize and access the data on the disk in streams, we can imagine a database to be such a tool for data - Data systems, which organizes your data. Files can only fractionally do that; again, unless you extend your program to mimic a database.
How the tables are really stored on the disk and retrieved, that's a vast topic. Advise reading your favourite database internals. A book by Korth might also be a good read.

Database vs File system storage

Database ultimately stores the data in files, whereas File system also stores the data in files. In this case what is the difference between DB and File System. Is it in the way it is retrieved or anything else?
A database is generally used for storing related, structured data, with well defined data formats, in an efficient manner for insert, update and/or retrieval (depending on application).
On the other hand, a file system is a more unstructured data store for storing arbitrary, probably unrelated data. The file system is more general, and databases are built on top of the general data storage services provided by file systems. [Quora]
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database.
For very complex operations, the filesystem is likely to be very slow.
Main RDBMS advantages:
Tables are related to each other
SQL query/data processing language
Transaction processing addition to SQL (Transact-SQL)
Server-client implementation with server-side objects like stored procedures, functions, triggers, views, etc.
Advantage of the File System over Data base Management System is:
When handling small data sets with arbitrary, probably unrelated data, file is more efficient than database.
For simple operations, read, write, file operations are faster and simple.
You can find n number of difference over internet.
"They're the same"
Yes, storing data is just storing data. At the end of the day, you have files. You can store lots of stuff in lots of files & folders, there are situations where this will be the way. There is a well-known versioning solution (svn) that finally ended up using a filesystem-based model to store data, ditching their BerkeleyDB. Rare but happens. More info.
"They're quite different"
In a database, you have options you don't have with files. Imagine a textfile (something like tsv/csv) with 99999 rows. Now try to:
Insert a column. It's painful, you have to alter each row and read+write the whole file.
Find a row. You either scan the whole file or build an index yourself.
Delete a row. Find row, then read+write everything after it.
Reorder columns. Again, full read+write.
Sort rows. Full read, some kind of sort - then do it next time all over.
There are lots of other good points but these are the first mountains you're trying to climb when you think of a file based db alternative. Those guys programmed all this for you, it's yours to use; think of the likely (most frequent) scenarios, enumerate all possible actions you want to perform on your data, and decide which one works better for you. Think in benefits, not fashion.
Again, if you're storing JPG pictures and only ever look for them by one key (their id maybe?), a well-thought filesystem storage is better. Filesystems, btw, are close to databases today, as many of them use a balanced tree approach, so on a BTRFS you can just put all your pictures in one folder - and the OS will silently implement something like an early SQL query each time you access your files.
So, database or files?...
Let's see a few typical examples when one is better than the other. (These are no complete lists, surely you can stuff in a lot more on both sides.)
DB tables are much better when:
You want to store many rows with the exact same structure (no block waste)
You need lightning-fast lookup / sorting by more than one value (indexed tables)
You need atomic transactions (data safety)
Your users will read/write the same data all the time (better locking)
Filesystem is way better if:
You like to use version control on your data (a nightmare with dbs)
You have big chunks of data that grow frequently (typically, logfiles)
You want other apps to access your data without API (like text editors)
You want to store lots of binary content (pictures or mp3s)
TL;DR
Programming rarely says "never" or "always". Those who say "database always wins" or "files always win" probably just don't know enough. Think of the possible actions (now + future), consider both ways, and choose the fastest / most efficient for the case. That's it.
Something one should be aware of is that Unix has what is called an inode limit. If you are storing millions of records then this can be a serious problem. You should run df -i to view the % used as effectively this is a filesystem file limit - EVEN IF you have plenty of disk space.
The difference between file processing system and database management system is as follow:
A file processing system is a collection of programs that store and manage files in computer hard-disk. On the other hand, A database management system is collection of programs that enables to create and maintain a database.
File processing system has more data redundancy, less data redundancy in dbms.
File processing system provides less flexibility in accessing data, whereas dbms has more flexibility in accessing data.
File processing system does not provide data consistency, whereas dbms provides data consistency through normalization.
File processing system is less complex, whereas dbms is more complex.
Context: I've written a filesystem that has been running in production for 7 years now. [1]
The key difference between a filesystem and a database is that the filesystem API is part of the OS, thus filesystem implementations have to implement that API and thus follow certain rules, whereas databases are built by 3rd parties having complete freedom.
Historically, databases where created when the filesystem provided by the OS were not good enough for the problem at hand. Just think about it: if you had special requirements, you couldn't just call Microsoft or Apple to redesign their filesystem API. You would either go ahead and write your own storage software or you would look around for existing alternatives. So the need created a market for 3rd party data storage software which ended up being called databases. That's about it.
While it may seem that filesystems have certain rules like having files and directories, this is not true. The biggest operating systems work like that but there are many mall small OSs that work differently. It's certainly not a hard requirement. (Just remember, to build a new filesystem, you also need to write a new OS, which will make adoption quite a bit harder. Why not focus on just the storage engine and call it a database instead?)
In the end, both databases and filesystems come in all shapes and sizes. Transactional, relational, hierarchical, graph, tabled; whatever you can think of.
[1] I've worked on the Boomla Filesystem which is the storage system behind the Boomla OS & Web Application Platform.
The main differences between the Database and File System storage is:
The database is a software application used to insert, update and delete
data while the file system is a software used to add, update and delete
files.
Saving the files and retrieving is simpler in file system
while SQL needs to be learn to perform any query on the database to
get (SELECT), add (INSERT) and update the data.
Database provides a proper data recovery process while file system did not.
In terms of security the database is more secure then the file system (usually).
The migration process is very easy in File system just copy and paste into the target
while for database this task is not as simple.

How Fast is Windows File System for Database Operations?

I came across a crazy thought and I wanted to share it with you and ask about its feasibility, especially performance wise:
The idea is to manage object database operations by:
creating a folder for each class named after class name
creating a sub-folder for each sub-class named after sub-class name
creating a file for each object named after its unique ID
creating a sub-folder for each index named after names of indexed fields
creating a shortcut file for each index entry referring to the original object file
reading/writing binary objects by very fast serializer/deserializer
inserting/updating/deleting objects and index entries by renaming object and shortcut files
caching/paging by using memory-mapped files
querying would utilize binary search on sorted file names
UPDATE: Thank you all for your replies. I was thinking this can be even improved by using some compression/encryption library such as 7z, instead of dealing with the OS file system. Otherwise, all of your stated concerns so far are valid. I'm wondering what kind of underlying file system does, for example, Oracle uses
Cons:
On most filesystems, even a 1 byte file takes a full block of 4kb. Can be a huge problem depending of the kind of object you wanna store in your database.
Most filesystems are not designed to scale with directories containing millions of files.
Complex queries will require the opening/reading/deserializing/closing of million of files, and thus will be very slow.
Its an interesting concept, a few thoughts on immediate issues you will have to resolve.
Windows file performance takes a hit after a few hundred thousands files, you need to alter certain aspects (turn off 8.3 and last update timestamps) to get it to not cause delays when reading the file system.
Locking - the locking mechanism will be an interesting challenge, you need to be able to lock things for update, but permit reads simultaneously.
ACID - whilst performing operations against this 'database' how would you enforce the ACID principles - each of them is a non-trivial problem.
As a practice to learn more about databases sure, but for a real world projekt that should do anyting, no.
There is so much under the hood of databases that unless you actually know exactly what you are doing, and in that case you would not be asking here ;), you will most probable never match existing solutions.
Go fo an existing object database instead and concentrate on the specifics of you application/site/...

When to use an Embedded Database

I am writing an application, which parses a large file, generates a large amount of data and do some complex visualization with it. Since all this data can't be kept in memory, I did some research and I'm starting to consider embedded databases as a temporary container for this data.
My question is: is this a traditional way of solving this problem? And is an embedded database (other than structuring data) supposed to manage data by keeping in memory only a subset (like a cache), while the rest is kept on disk? Thank you.
Edit: to clarify: I am writing a desktop application. The application will be inputted with a file of size of 100s of Mb. After reading the file, the application will generate a large number of graphs which will be visualized. Since, the graphs may have such a large number of nodes, they may not fit into memory. Should I save them into an embedded database which will take care of keeping only the relevant data in memory? (Do embedded databases do that?), or I should write my own sophisticated module which does that?
Tough question - but I'll share my experience and let you decide if it helps.
If you need to retain the output from processing the source file, and you use that to produce multiple views of the derived data, then you might consider using an embedded database. The reasons to use an embedded database (IMHO):
To take advantage of RDBMS features (ACID, relationships, foreign keys, constraints, triggers, aggregation...)
To make it easier to export the data in a flexible manner
To enable access to your processed data to external clients (known format)
To allow more flexible transformation of the data when preparing for viewing
Factors which you should consider when making the decision:
What is the target platform(s) (windows, linux, android, iPhone, PDA)?
What technology base? (Java, .Net, C, C++, ...)
What resource constraints are expected or need to be designed for? (RAM, CPU, HD space)
What operational behaviours do you need to take into account (connected to network, disconnected)?
On the typical modern desktop there is enough spare capacity to handle most operations. On eeePCs, PDAs, and other portable devices, maybe not. On embedded devices, very likely not. The language you use may have build in features to help with memory management - maybe you can take advantage of those. The connectivity aspect (stateful / stateless / etc.) may impact how much you really need to keep in memory at any given point.
If you are dealing with really big files, then you might consider a streaming process approach so you only have in memory a small portion of the overall data at a time - but that doesn't really mean you should (or shouldn't) use an embedded database. Straight text or binary files could work just as well (record based, column based, line based... whatever).
Some databases will allow you more effective ways to interact with the data once it is stored - it depends on the engine. I find that if you have a lot of aggregation required in your base files (by which I mean the files you generate initially from the original source) then an RDBMS engine can be very helpful to simplify your logic. Other options include building your base transform and then adding additional steps to process that into other temporary stores for each specific view, which are then in turn processed for rendering to the target (report?) format.
Just a stream-of-consciousness response - hope that helps a little.
Edit:
Per your further clarification, I'm not sure an embedded database is the direction you want to take. You either need to make some sort of simplifying assumptions for rendering your graphs or investigate methods like segmentation (render sections of the graph and then cache the output before rendering the next section).

Best way to bundles photos with app: files or in sqlite database?

Lets say that I have an app that lets you browse through a listing of cars found in a Sqlite database. When you click on a car in the listing, it'll open up a view with the description of the car and a photo of the car.
My question is: should I keep the photo in the database as a binary data column in the row for this specific car, or should I have the photo somewhere in the resources directory? Which is better to do? Are there any limitations of Sqlite in terms of how big a binary data column can be?
The database will be pretty much be read only and bundled with the app (so the user wouldn't be inserting any cars and their photos).
This is a decision which is discussed quite a lot. And in my opinion, it's a matter of personal taste. Pretty much like vim/emacs, windows/linux kind of debates. Not that heated though.
Both sides have their advantages and disadvantages. When you store them in a database, you don't need to worry much about filename and location. Management is also easier (you delete the row(s) containing the BLOBs, and that's it). But the files are also harder to access and you may need to write wrapper code in one way or the other (f.ex. some of those "download.php" links).
On the other hand, if the binary data is stored as a separate file, management is more complicated (You need to open the proper file from the disk by constructing the filename first). On large data sets you may run into filesystem bottlenecks when the number of files in one directory grows very large (but this can be prevented by creating subdirs easily). But then, if the data is stored as files, replacing them becomes so much easier. Other people can also access it without the need to know the internals (imagine for example a user who takes fun in customizing his/her UI).
I'm certain that there are other points to be raised, but I don't want to write too much now...
I would say: Think about what operations you would like to do with the photos (and the limitations of both storage methods), and from there take an informed decision. There's not much that can go wrong.
TX-Log and FS-Journal
On further investigation I found some more info:
SQLite uses a Transaction Log
Android uses YAFFS on the system mount points and VFAT on the SD-Card. Both are (to the best of my knowlege) unjournaled.
I don't know the exact implementation of SQLite's TX-Log, but it is to be expected that each INSERT/UPDATE operation will perform two writes on the disk. I may be mistaken (it largely depends on the implementation of transactions), but frankly, I couldn't be bothered to skim through the SQLite source code. It feels to me like we start splitting hairs here (premature optimization anyone?)...
As both file systems (YAFFS and VFAT) are not journalled, you have no additional "hidden" write operations.
These two points speak in favour of the file system.
Note that this info is to be taken with a grain of salt. I only skimmed over the Google results of YAFFS journaling and sqlite transaction log. I may have missed some details.
The default MAX SQLite size for a BLOB or String is 231-1 bytes. That value also applies to the maximum bytes to be stored in a ROW.
As to which method is better, I do not know. What I would do in your case is test both methods and monitor the memory usage and its effect on battery life. Maybe you'll find the filesystem method has a clear advantage over the other in that area and the ease of storing it in SQLite is not worth it for your use case, or maybe you'll find the opposite.

Resources