Database vs File system storage - database

A database ultimately stores its data in files, and a file system also stores data in files. In that case, what is the difference between a DB and a file system? Is it in the way the data is retrieved, or something else?

A database is generally used for storing related, structured data, with well defined data formats, in an efficient manner for insert, update and/or retrieval (depending on application).
On the other hand, a file system is a more unstructured data store for storing arbitrary, probably unrelated data. The file system is more general, and databases are built on top of the general data storage services provided by file systems. [Quora]
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database.
For very complex operations, the filesystem is likely to be very slow.
Main RDBMS advantages:
Tables are related to each other
SQL query/data processing language
Transaction processing, plus procedural extensions to SQL (e.g. Transact-SQL)
Server-client implementation with server-side objects like stored procedures, functions, triggers, views, etc.
Advantages of a file system over a database management system:
When handling small data sets of arbitrary, probably unrelated data, files are more efficient than a database.
For simple read and write operations, file operations are faster and simpler.
You can find many more differences listed on the internet.

"They're the same"
Yes, storing data is just storing data. At the end of the day, you have files. You can store lots of stuff in lots of files & folders, and there are situations where this is the right way. There is a well-known versioning solution (svn) that finally ended up using a filesystem-based model to store data, ditching their BerkeleyDB. Rare, but it happens.
"They're quite different"
In a database, you have options you don't have with files. Imagine a textfile (something like tsv/csv) with 99999 rows. Now try to:
Insert a column. It's painful, you have to alter each row and read+write the whole file.
Find a row. You either scan the whole file or build an index yourself.
Delete a row. Find row, then read+write everything after it.
Reorder columns. Again, full read+write.
Sort rows. Full read, some kind of sort - then do it next time all over.
There are lots of other good points but these are the first mountains you're trying to climb when you think of a file based db alternative. Those guys programmed all this for you, it's yours to use; think of the likely (most frequent) scenarios, enumerate all possible actions you want to perform on your data, and decide which one works better for you. Think in benefits, not fashion.
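To make the "find a row" and "delete a row" cases above concrete, here is a minimal sketch using only Python's standard library. The table name `people`, the two-column layout and the sample data are made up for illustration; the point is only that the CSV lookup has to scan, while the database version gets an index from its primary key and deletes a row without rewriting everything after it.

```python
import csv
import io
import sqlite3

# Hypothetical sample data: "id,name" rows, as they might sit in a CSV file.
SAMPLE = "1,alice\n2,bob\n3,carol\n"


def find_row_in_csv(text, key):
    """Linear scan: every lookup reads rows until the key matches."""
    for row in csv.reader(io.StringIO(text)):
        if row and row[0] == key:
            return row
    return None


def build_indexed_table(text):
    """Load the same rows into SQLite; the PRIMARY KEY is backed by an index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (id TEXT PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?)",
                     csv.reader(io.StringIO(text)))
    conn.commit()
    return conn


def find_row_in_db(conn, key):
    """Indexed lookup by primary key."""
    cur = conn.execute("SELECT id, name FROM people WHERE id = ?", (key,))
    return cur.fetchone()


def delete_row_in_db(conn, key):
    """Delete one row; no manual read+write of the remaining rows."""
    with conn:
        conn.execute("DELETE FROM people WHERE id = ?", (key,))


conn = build_indexed_table(SAMPLE)
csv_hit = find_row_in_csv(SAMPLE, "2")
db_hit = find_row_in_db(conn, "2")
```

With 3 rows the difference is invisible; it only starts to matter at the row counts discussed above.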
Again, if you're storing JPG pictures and only ever look for them by one key (their id maybe?), a well-thought filesystem storage is better. Filesystems, btw, are close to databases today, as many of them use a balanced tree approach, so on a BTRFS you can just put all your pictures in one folder - and the OS will silently implement something like an early SQL query each time you access your files.
So, database or files?...
Let's see a few typical examples of when one is better than the other. (These are not complete lists; surely you can add a lot more on both sides.)
DB tables are much better when:
You want to store many rows with the exact same structure (no block waste)
You need lightning-fast lookup / sorting by more than one value (indexed tables)
You need atomic transactions (data safety)
Your users will read/write the same data all the time (better locking)
Filesystem is way better if:
You like to use version control on your data (a nightmare with dbs)
You have big chunks of data that grow frequently (typically, logfiles)
You want other apps to access your data without API (like text editors)
You want to store lots of binary content (pictures or mp3s)
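As a rough illustration of the "atomic transactions" item in the database list above, the sketch below uses SQLite from Python's standard library. The `accounts` table and the `transfer` helper are hypothetical; the idea is that two related updates either both take effect or neither does, which is something you would have to build yourself on top of plain files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is an assumption for this example: balances may not go negative.
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
with conn:
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 0)])


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit was rolled back too.
        return False


def balances(conn):
    """Return the current balances as a dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

The credit is deliberately executed first so that a failed debit shows the rollback undoing an update that had already been applied.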
TL;DR
Programming rarely says "never" or "always". Those who say "database always wins" or "files always win" probably just don't know enough. Think of the possible actions (now + future), consider both ways, and choose the fastest / most efficient for the case. That's it.

Something one should be aware of is that Unix has what is called an inode limit. If you are storing millions of records then this can be a serious problem. You should run df -i to view the % used as effectively this is a filesystem file limit - EVEN IF you have plenty of disk space.
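If you would rather check this from a program than from the shell, something like the following sketch should work on Unix-like systems. It relies on `os.statvfs`, which is in Python's standard library; the helper name `inode_usage_percent` is made up here, and the number it returns may differ slightly from what df -i prints, since df can take reserved inodes into account.

```python
import os


def inode_usage_percent(path="/"):
    """Return the percentage of inodes in use on the filesystem holding `path`.

    Returns None when the filesystem does not report a fixed inode count
    (some filesystems report 0 total inodes).
    """
    st = os.statvfs(path)
    if st.f_files == 0:
        return None
    used = st.f_files - st.f_ffree
    return 100.0 * used / st.f_files


usage = inode_usage_percent("/")
```

A value creeping towards 100 means new files can no longer be created, regardless of free disk space.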

The differences between a file processing system and a database management system are as follows:
A file processing system is a collection of programs that store and manage files on a computer's hard disk. A database management system, on the other hand, is a collection of programs that lets you create and maintain a database.
A file processing system has more data redundancy; a DBMS has less.
A file processing system provides less flexibility in accessing data, whereas a DBMS provides more.
A file processing system does not provide data consistency, whereas a DBMS provides data consistency through normalization.
A file processing system is less complex, whereas a DBMS is more complex.

Context: I've written a filesystem that has been running in production for 7 years now. [1]
The key difference between a filesystem and a database is that the filesystem API is part of the OS, thus filesystem implementations have to implement that API and thus follow certain rules, whereas databases are built by 3rd parties having complete freedom.
Historically, databases were created when the filesystem provided by the OS was not good enough for the problem at hand. Just think about it: if you had special requirements, you couldn't just call Microsoft or Apple to redesign their filesystem API. You would either go ahead and write your own storage software or you would look around for existing alternatives. So the need created a market for 3rd party data storage software which ended up being called databases. That's about it.
While it may seem that filesystems have certain rules like having files and directories, this is not true. The biggest operating systems work like that, but there are many small OSs that work differently. It's certainly not a hard requirement. (Just remember, to build a new filesystem, you also need to write a new OS, which will make adoption quite a bit harder. Why not focus on just the storage engine and call it a database instead?)
In the end, both databases and filesystems come in all shapes and sizes. Transactional, relational, hierarchical, graph, tabled; whatever you can think of.
[1] I've worked on the Boomla Filesystem which is the storage system behind the Boomla OS & Web Application Platform.

The main differences between database and file system storage are:
A database is a software application used to insert, update and delete data, while a file system is software used to add, update and delete files.
Saving and retrieving files is simpler in a file system, while SQL needs to be learned to perform any query on the database to get (SELECT), add (INSERT) and update the data.
A database provides a proper data recovery process, while a file system does not.
In terms of security, the database is more secure than the file system (usually).
Migration is very easy with a file system (just copy and paste into the target), while for a database this task is not as simple.

Related

Why don't Operating Systems (Windows, Linux) use Relational Databases (RDBMS) Instead of File Systems?

We all know that most operating systems use file systems to store all data but don't you think it is more efficient to use databases as we use in websites/web apps?
tl;dr: Diversity.
First of all, if you look at the original FAT filesystem, and the original Unix filesystem, they were both key-value stores, they did not have a directory hierarchy.
Second, there are reportedly filesystems implemented with an RDBMS backend, which is tangential to your question.
Having said these, comparing RDBMS to a filesystem as storage for an OS, there are several drawbacks to using RDBMS:
First, RDBMS makes very strong guarantees (ACID) by means of locking, at the cost of performance. However, most programs do not require such guarantees (for examples, think of every program that works with a NoSQL DB). In comparison, POSIX makes strong-ish guarantees about metadata, but barely any guarantees about I/O. You can build an RDBMS on top of POSIX and add locking, but you can't build a filesystem on top of an RDBMS and remove locking.
Second, an RDBMS requires a schema. Imagine that you create a new storage volume for an OS. Instead of formatting a filesystem, you need to decide on a schema. What schema will be the most useful?
With filesystems, the "schema" is basically one table, with the columns "path", "data", and a column for each file attribute such as modification time, type, and size. Using an RDBMS for this schema allows you to perform operations like mass truncate, mass rename, mass access control etc. atomically. However, it will not allow you to modify the data of the same record (file) concurrently. Nor will it allow you to implement hard links. Extended attributes or Alternate Data Streams will still have to be implemented as they are today rather than leveraging RDBMS capabilities, as will special index logic for the path column in order to implement features like changing directory, listing a directory, checking permissions for every directory in the path of a file etc., and special logic for the data column because files can be TBs in size. At that point, the ROI of the RDBMS goes down the more features you add.
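To picture the one-table "schema" described above, here is a small sketch using SQLite from Python's standard library. The table name `files`, its column set, and the helpers are assumptions chosen for illustration, not how any real filesystem is implemented. It shows the one thing the paragraph credits the RDBMS with: a mass rename that happens atomically.

```python
import sqlite3


def make_volume():
    """Create the hypothetical one-table 'filesystem'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files ("
        " path TEXT PRIMARY KEY,"
        " data BLOB NOT NULL,"
        " mtime REAL NOT NULL,"
        " size INTEGER NOT NULL)"
    )
    return conn


def write_file(conn, path, data, mtime=0.0):
    """Create or overwrite a 'file' record."""
    with conn:
        conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                     (path, data, mtime, len(data)))


def rename_prefix(conn, old, new):
    """Mass rename in one transaction: all matching paths change or none do."""
    with conn:
        conn.execute(
            "UPDATE files SET path = ? || substr(path, ?)"
            " WHERE substr(path, 1, ?) = ?",
            (new, len(old) + 1, len(old), old),
        )


def list_dir(conn, prefix):
    """'Directory listing' is just a prefix match on the path column."""
    rows = conn.execute(
        "SELECT path FROM files WHERE substr(path, 1, ?) = ? ORDER BY path",
        (len(prefix), prefix),
    )
    return [r[0] for r in rows]


vol = make_volume()
write_file(vol, "/tmp/a.txt", b"hello")
write_file(vol, "/tmp/b.txt", b"hi")
write_file(vol, "/etc/c.conf", b"x")
rename_prefix(vol, "/tmp/", "/var/")
```

Note how directory listing already needs its own prefix logic on the path column, which is the kind of special-casing the answer warns about.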
Alternatively you can have the schema be per-program (i.e. every program can do CREATE TABLE etc.), but then your features are again limited by what the RDBMS can do. For example, how do you get the equivalent of find / -size +1GB or md5sum, or even cat or ls? which columns will these programs read? You'll find that all generic programs now need to take a set of columns that are of interest. It also makes scripting much harder.
Third, hierarchical systems are typically easier to scale.
One example is when you want to add storage. In a hierarchical filesystem, even without any fancy filesystem features, you can simply mount another filesystem onto a directory, and you have new storage. The tradeoff vs increasing the storage capacity of the current filesystem is that hard links & renames don't work across filesystems, and the two filesystems don't share storage capacity. However, on an RDBMS your options are either to create a new table and have your programs/scripts manage both tables, or to add more storage volume, for which you might need to do more advanced things like partitioning.
Another example is ecosystem requirements. As an end user wanting to put some order into their 60,000 pictures, 5000 songs, hundreds of work spreadsheets, 10,000 memes, hundreds of eBooks, videos etc. - things that are convenient to arrange in a hierarchy - you currently only need two programs - a file manager (Explorer, bash, Nautilus etc.), and a search capability (e.g. find(1)). On an RDBMS, you either have different tables with different columns, or one table with generic columns. Either way, you have to have a set of SQL scripts to work with these specific collections, which would be equivalent to having a shell script or a program for each type of collection. Meaning, managing large collections requires more programming.
Since hierarchical systems are useful in a generic context (which is the context the major OSes operate in), and since it's easier to build a non-hierarchical system on top of hierarchical one than doing the other way around (hierarchical filesystem cache even makes the job easier for libsqlfs), it is valuable for OSes to support hierarchical systems first-class.
The executive summary is: OSes serve many use cases, and storage access is a major part of that. It would be wise for an OS to build a storage access mechanism that's as minimal as possible, but that allows applications to build more specialized storage access mechanism on top of the OS.
That means providing a small but useful set of features (like permissions, locking, mounting, and symlinks) without forcing too many requirements (like locking, or specifying the data format to the OS).
RDBMSes are just too specific.

Why database is considered different from a file system

Well, every database book starts with the story of how people used to store data as files, and how inconvenient that was. After databases came along, things became really easy and seamless, because we can now query data, etc. My question is: how are the tables really stored on disk and retrieved? Aren't they just stored as files, or are they copied into the address space bit by bit and accessed via address only? Or is there an underlying file system, with the database server handling access to it and presenting the abstraction of a table to us?
This might be a very trivial question, but I have not found an answer in any book.
The question is not trivial, but the distinction between the two is quite apparent.
File systems provide a way to logically view the streams in a hierarchical manner.
A virtual representation of what lies on the disk; which would otherwise just be a binary stream, unreadable.
When we talk about storing data, we can extend a method of writing data to files and later define our own protocols for CRUD'ing on it; thus mimicking a fractional part of what databases do.
There are numerous limitations to storing data in files. If you store them in file and define your own protocol, it will be very specific to you. Plus, there are various other concerns like security, disaster recovery etc etc.
Even though everything is stored in some or the other way on disk, the main advantage databases bring to the table versus files are the mechanisms that they offer.
To minimize I/O, we have DB caches and numerous other features.
As you imagine a File system to be something that helps visualize and access the data on the disk in streams, we can imagine a database to be such a tool for data - Data systems, which organizes your data. Files can only fractionally do that; again, unless you extend your program to mimic a database.
How the tables are really stored on the disk and retrieved is a vast topic. I'd advise reading about the internals of your favourite database. A book by Korth might also be a good read.

Write performance between Filesystem and Database

I have a very simple program for data acquisition. The data comes frequently (around 5200 Hz). One piece of data has around 24 kB, so it is around 122 MB/s.
What would be more efficient only for storing this data? Saving it in raw binary files, or use the database? If the database, then which? SQLite, or maybe some other?
The database, of course, is more tempting, because when saving it to file I would have to separate them by delimiters (data can have different sizes), also processing data would be much easier with the database. I'm not sure about database performance compared to files though, I couldn't find any specific pieces of information about it.
[EDIT]
I am using Linux based OS and SSD disk which supports writing up to 350 MB/s. Data will be acquired with that frequency all the time (with a small service break every day to transfer the data to another machine)
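On the raw-file side, the worry about delimiters for variable-sized data can be sidestepped by prefixing each record with its length instead. The following is only a sketch of that framing idea, assuming a 4-byte little-endian length header (a choice made here, not something from the question); the helper names are hypothetical and it says nothing about whether the disk can sustain the required rate.

```python
import io
import struct

# Assumed framing: each record is a 4-byte little-endian length, then the payload.
HEADER = struct.Struct("<I")


def append_record(fh, payload):
    """Write one length-prefixed record to a binary file object."""
    fh.write(HEADER.pack(len(payload)))
    fh.write(payload)


def read_records(fh):
    """Yield payloads back in order; stop at EOF or at a truncated final record."""
    while True:
        header = fh.read(HEADER.size)
        if len(header) < HEADER.size:
            return
        (length,) = HEADER.unpack(header)
        payload = fh.read(length)
        if len(payload) < length:
            return
        yield payload


# Usage with an in-memory buffer standing in for a file opened in "ab"/"rb" mode.
buf = io.BytesIO()
for payload in (b"abc", b"", b"\x00" * 24_000):
    append_record(buf, payload)
buf.seek(0)
records = list(read_records(buf))
```

Because the payload is never scanned for a delimiter, it can contain any bytes, and appending a record is a pure sequential write.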
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database.
Another point is understanding the relational model, meaning how you design your database so that data doesn't need to be repeated over and over.
Moreover, understanding types is important as well. If you have a txt file, you'll need to parse numbers, dates, etc.
From the performance point of view, I would say that DBs are slower to start (it is usually faster to open a file than to open a connection to a DB). However, once they are open, I can guarantee that a DB is faster than XML or whatever file you are thinking of using. BTW, this is the main purpose of a database: managing huge amounts of data. Filesystems are made for storing files.
A last point for DBs is that they usually can handle multi-threading and concurrency problems, which a file cannot. And last but not least, in a database you cannot delete a file by mistake and lose your data.
So my choice would be a DB. Anyway, I hope this info helps you decide what is best for you.
-- UPDATE --
Since your needs are more specific now, I tried to dig deeper. I found some solutions that could be interesting for you; however, I don't have experience with any of them, so I can't give a personal recommendation:
SharedHashFile: SharedHashFile is a lightweight NoSQL key value store / hash table, a zero-copy IPC queue, & a multiplexed IPC logging library written in C for Linux. There is no server process. Data is read and written directly from/to shared memory or SSD; no sockets are used between SharedHashFile and the application program. APIs for C, C++, & nodejs. However, keep an eye out for issues, because this project seems to be no longer maintained on GitHub.
WhiteDB: another NoSQL database that claims to be really fast; see the speed section of their website.
Symas: an extraordinarily fast, memory-efficient database.
Just take a look at them, and if you ever use them, please post your feedback here for the community.

Best way storing binary or image files

What is the best way storing binary or image files?
Database System
File System
Would you please explain, why?
There is no real best way, just a bunch of trade offs.
Database Pros:
1. Much easier to deal with in a clustering environment.
2. No reliance on additional resources like a file server.
3. No need to set up "sync" operations in load balanced environment.
4. Backups automatically include the files.
Database Cons:
1. Size / Growth of the database.
2. Depending on DB Server and your language, it might be difficult to put in and retrieve.
3. Speed / Performance.
4. Depending on DB server, you have to virus scan the files at the time of upload and export.
File Pros:
1. For single web/single db server installations, it's fast.
2. Well understood ability to manipulate files. In other words, it's easy to move the files to a different location if you run out of disk space.
3. Can virus scan when the files are "at rest". This allows you to take advantage of scanner updates.
File Cons:
1. In multi web server environments, requires an accessible share. Which should also be clustered for failover.
2. Additional security requirements to handle file access. You have to be careful that the web server and/or share does not allow file execution.
3. Transactional Backups have to take the file system into account.
That said, SQL Server 2008 has a feature called FILESTREAM which combines both worlds. You upload to the database and it transparently stores the files in a directory on disk. When retrieving, you can either pull from the database or go directly to where the file lives on the file system.
Pros of storing binary files in a DB:
Some decrease in complexity, since the data access layer of your system need only interface to a DB and not a DB + file system.
You can secure your files using the same comprehensive permissions-based security that protects the rest of the database.
Your binary files are protected against loss along with the rest of your data by way of database backups. No separate filesystem backup system required.
Cons of storing binary files in a DB:
Depending on size/number of files, can take up significant space, potentially decreasing performance (depending on whether your binary files are stored in a table that is often queried for other content or not) and making for longer backup times.
Pros of storing binary files in the file system:
This is what file systems are good at. File systems will handle defragmenting well, and retrieving files (say, to stream a video file through a web server) will likely be faster than with a DB.
Cons of storing binary files in the file system:
Slightly more complex data access layer. Needs its own backup system. Need to consider referential integrity issues (e.g. a deleted pointer in the database will need to result in deletion of the file so as to not have 'orphaned' files in the filesystem).
On balance I would use the file system. In the past, using SQL Server 2005 I would simply store a 'pointer' in db tables to the binary file. The pointer would typically be a GUID.
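The "pointer in the DB, bytes on disk" approach can be sketched roughly as follows. This is not the answerer's actual SQL Server code; it is a stand-in using SQLite and the Python standard library, with a made-up `documents` table and helper names. It also shows the orphan-file concern from the cons list: the file and its pointer row have to be created and deleted together by your own code.

```python
import os
import sqlite3
import tempfile
import uuid


def open_store(db_path):
    """Open (or create) the metadata database holding the pointers."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        " id TEXT PRIMARY KEY,"          # GUID, doubles as the file name on disk
        " original_name TEXT NOT NULL)"
    )
    return conn


def save_document(conn, root, original_name, data):
    """Write the bytes to disk under a fresh GUID, then record the pointer."""
    doc_id = str(uuid.uuid4())
    path = os.path.join(root, doc_id)
    with open(path, "wb") as fh:
        fh.write(data)
    try:
        with conn:
            conn.execute("INSERT INTO documents VALUES (?, ?)",
                         (doc_id, original_name))
    except Exception:
        os.remove(path)  # do not leave an orphaned file behind
        raise
    return doc_id


def load_document(conn, root, doc_id):
    """Look up the pointer, then read the file it refers to."""
    row = conn.execute("SELECT original_name FROM documents WHERE id = ?",
                       (doc_id,)).fetchone()
    if row is None:
        raise KeyError(doc_id)
    with open(os.path.join(root, doc_id), "rb") as fh:
        return row[0], fh.read()


def delete_document(conn, root, doc_id):
    """Remove the pointer row first, then the file it referred to."""
    with conn:
        conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
    try:
        os.remove(os.path.join(root, doc_id))
    except FileNotFoundError:
        pass


root = tempfile.mkdtemp()          # stand-in for the real storage directory
store = open_store(":memory:")
doc_id = save_document(store, root, "cat.jpg", b"\xff\xd8fake-jpeg-bytes")
```

The delete order is a design choice: if the process dies between the two steps, you are left with an unreferenced file (which a cleanup job can find) rather than a row pointing at nothing.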
Here's the good news if you are using SQL Server 2008 (and maybe others - I don't know): there is built-in support for a hybrid solution with the new VARBINARY(MAX) FILESTREAM data type. These behave logically like VARBINARY(MAX) columns, but behind the scenes SQL Server 2008 will store the data in the file system.
There is no best way.
What? You need more info?
There are three ways I know of... One, as byte arrays in the database. Two, as a file with the path stored in the database. Three, as a hybrid (only if DB allows, such as with the FileStream type).
The first is pretty cool because you can query and get your data in the same step. Which is always nice. But what happens when you have LOTS of files? Your database gets big. Now you have to deal with big database maintenance issues, such as the trials of backing up databases that are over a terabyte. And what happens if you need outside access to the files? Such as type conversions, mass manipulation (resize all images, apply watermarks, etc)? It's much harder to do than when you have files.
The second is great for somewhat large numbers of files. You can store them on NAS devices, back them up incrementally, keep your database small, etc etc. But then, when you have LOTS of files, you start running into limitations in the file system. And if you spread them over the network, you get latency issues, user rights issues, etc. Also, I take pity on you if your network gets rearranged. Now you have to run massive updates on the database to change your file locations, and I pity you if something screws up.
Then there's the hybrid option. It's almost perfect: you can get your files via your query, yet your database isn't massive. Does this solve all your problems? Probably not. Your database isn't portable anymore; you're locked to a particular DBMS. And this stuff isn't mature yet, so you get to enjoy the teething process. And who says this solves all the different issues?
Fact is, there is no "best" way. You just have to determine your requirements, make the best choice depending on them, and then suck it up when you figure out you did the wrong thing.
I like storing images in a database. It makes it easy to switch from development to production just by changing databases (no copying files). And the database can keep track of properties like created/modified dates just as well as the File System.
I personally never store images IN the database, for performance reasons. In all of my sites I have a "/files" folder where I can put sub-folders based on what kind of images I'm going to store. Then I name them by convention.
For example, if I'm storing a profile picture, I'll store it in "/files/profile/" as profile_2.jpg (if 2 is the ID of the account). I always make it a rule to resize the image on the server to the largest size I'll need, and then to smaller ones if I need them. So I'd save "profile_2_thumb.jpg" and "profile_2_full.jpg".
By creating rules for yourself you can simply in the code call img src="/files/profile__thumb.jpg"
That's how I do it anyway!

BLOB Storage - 100+ GB, MySQL, SQLite, or PostgreSQL + Python

I have an idea for a simple application which will monitor a group of folders, index any files it finds. A gui will allow me quickly tag new files and move them into a single database for storage and also provide an easy mechanism for querying the db by tag, name, file type and date. At the moment I have about 100+ GB of files on a couple removable hard drives, the database will be at least that big. If possible I would like to support full text search of the embedded binary and text documents. This will be a single user application.
Not trying to start a DB war, but what open source DB is going to work best for me? I am pretty sure SQLite is off the table, but I could be wrong.
I'm still researching this option for one of my own projects, but CouchDB may be worth a look.
Why store the files in the database at all? Simply store your meta-data and a filename. If you need to copy them to a new location for some reason, just do that as a file system copy.
Once you remove the file contents then any competent database will be able to handle the meta-data for a few hundred thousand files.
My preference would be to store the document with the metadata. One reason, is relational integrity. You can't easily move the files or modify the files without the action being brokered by the db. I am sure I can handle these problems but it isn't as clean as I would like and my experience has been that most vendors can handle huge amounts of binary data in the database these days. I guess I was wondering if PostgreSQL or MySQL have any obvious advantages in these areas, I am primarily familiar with Oracle. Anyway, thanks for the response, if the DB knows where the external file is it will also be easy to bring the file in at a later date if I want. Another aspect of the question was if either database is easier to work with when using Python. I'm assuming that is a wash.
I always hate to answer "don't", but you'd be better off indexing with something like Lucene (PyLucene). That and storing the paths in the database rather than the file contents is almost always recommended.
To add to that, none of those database engines will store LOBs in a separate dataspace (they'll be embedded in the table's data space), so any of those engines should perform nearly equally well (well, except SQLite). You need to move to Informix, DB2, SQL Server or others to get that kind of binary object handling.
Pretty much any of them would work (even though SQLite wasn't meant to be used in a concurrent multi-user environment, which could be a problem...) since you don't want to index the actual contents of the files.
The only limiting factor is the maximum "packet" size of the given DB (by packet I'm referring to a query/response). Usually these limits are around 2MB, meaning that your files must be smaller than 2MB. Of course you could increase this limit, but the whole process is rather inefficient, since, for example, to insert a file you would have to:
Read the entire file into memory
Transform the file in a query (which usually means hex encoding it - thus doubling the size from the start)
Executing the generated query (which itself means - for the database - that it has to parse it)
I would go with a simple DB and the associated files stored using a naming convention which makes them easy to find (for example based on the primary key). Of course this design is not "pure", but it will perform much better and is also easier to use.
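For comparison with the hex-encoding step listed above, here is a small sketch of inserting binary content through a bound parameter instead of building it into the query text. It uses SQLite via Python's standard `sqlite3` module purely as a stand-in; the `blobs` table and helper names are made up, and whether a particular MySQL or PostgreSQL driver avoids the encoding overhead in the same way depends on that driver, so treat this as an illustration rather than a benchmark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blobs (id INTEGER PRIMARY KEY, name TEXT, content BLOB)"
)


def insert_blob(conn, name, content):
    """Insert bytes via a bound parameter; the SQL text never contains the data."""
    with conn:
        cur = conn.execute(
            "INSERT INTO blobs (name, content) VALUES (?, ?)", (name, content)
        )
    return cur.lastrowid


def fetch_blob(conn, blob_id):
    """Return the stored bytes, or None if there is no such row."""
    row = conn.execute(
        "SELECT content FROM blobs WHERE id = ?", (blob_id,)
    ).fetchone()
    return None if row is None else row[0]


payload = bytes(range(256)) * 16   # 4 KiB of arbitrary binary data
blob_id = insert_blob(conn, "sample.bin", payload)
```

Even with parameter binding, the whole file is still read into memory here, so the first step in the list above remains a real cost for large files.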
Why are you wasting time emulating something that the filesystem should be able to handle? More storage + grep is your answer.

Resources