RDBMS vs file system for file storage

Are there any advantages of storing entire files in an RDBMS over storing the files in the file system with references to the file path in the RDBMS?
Which approach shall be faster? When do I choose one over the other? Does it matter which file system is in use? (say ext3)
I do not expect the files to change at all. The files may be json or xml or might be pdf (less likely). Also, there might be no need to refer to these files often. They are only meant for archival.
Thank you.

Given that the files are not expected to change, there is limited value in keeping the files in the DBMS. The primary advantage of keeping files in the DBMS is that the DBMS knows how to manage transactions, but if the files won't change, then that advantage becomes minuscule.
Another advantage of storing files in the DBMS is that the database backup will contain the files; with the files stored separately, you have to back up the separate stash of files as well as the DBMS itself to keep all the data secure.
Another advantage of storing files in the DBMS is that the database can enforce more subtle controls on access to the files.
The primary advantage of storing the files in the file system is that it is easy (easier) to see what you've got.
A secondary advantage is that you can back up or manipulate the files outside the DBMS - though that is also a disadvantage from some points of view.
If the files are stored in blobs in the DBMS, then the normal SQL client software can retrieve the contents over a normal SQL connection. If the files are in the file system instead, and the SQL client software is not on the same machine as the DBMS and the files, then you have to worry about how clients get hold of the file data.
Another advantage of separating the files from the DBMS is that the files could be stored off the DBMS machine. On the other hand, that then complicates getting the files loaded 'into the DBMS'.
On the whole, given the issues outlined above, there seem to be some advantages with going with the 'files in DBMS' approach. On the other hand, many people do go with 'files in file system' approach, and they survive. It may be that their SQL clients are on the same machine as the DBMS, so the file transfer issues are not insurmountable, but that's the bit that has me most worried.
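For the archival case in the question, the 'files in file system, path in the DBMS' approach might look roughly like the sketch below. It is only an illustration under stated assumptions: SQLite stands in for the RDBMS, and the table name `archived_file`, the helpers `open_db` and `archive_file`, and the hash-based directory layout are all made up for this example.

```python
import hashlib
import shutil
import sqlite3
from pathlib import Path


def open_db(db_path: str) -> sqlite3.Connection:
    """Open the metadata database and create the (hypothetical) table if needed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS archived_file ("
        " sha256 TEXT PRIMARY KEY,"
        " original_name TEXT NOT NULL,"
        " path TEXT NOT NULL)"
    )
    return conn


def archive_file(conn: sqlite3.Connection, archive_root: Path, source: Path) -> str:
    """Copy `source` into the archive directory and record its path in the database."""
    # The files never change, so a content hash works as a stable identifier.
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    # Assumed layout: the first two hex characters pick a subdirectory.
    target = archive_root / digest[:2] / (digest + source.suffix)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        shutil.copy2(source, target)
    # Only the reference goes into the database, not the file contents.
    conn.execute(
        "INSERT OR IGNORE INTO archived_file (sha256, original_name, path)"
        " VALUES (?, ?, ?)",
        (digest, source.name, str(target)),
    )
    conn.commit()
    return digest
```

Note that the backup concern raised above still applies: the archive directory and the database have to be backed up together.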

If file size is less than 1MB you may store them in the RDBMS, but otherwise consider storing them on the file system. See http://technet.microsoft.com/en-us/library/bb933993.aspx
Are there any advantages of storing entire files in an RDBMS over storing the files in the file system with references to the file path in the RDBMS?
If money etc. is no problem, then storing in the RDBMS is advantageous, since you get all the benefits of an RDBMS, plus no overhead of dereferencing the file from the reference stored in the DB.
Which approach shall be faster?
RDBMS
When do I choose one over the other?
Dictated by practical considerations. Consider the file system if the file is > 1MB. Many shared-hosting providers do not enable FILESTREAM.
Does it matter which file system is in use?
I don't know about this.

To add to what Jonathan Leffler has written:
DBMS are not as efficient when dealing with BLOBs as when dealing with fixed-size objects, so we can say that DBMS don't "like" large objects. Also many DBMS store BLOBs outside of tables, in a separate storage.
If your goal is archiving purposes, it makes sense to store files separately for easier backup and retrieval. Also you can move files to some backend storage and make them "offline" to free space on the server (if needed).

Related

Database vs File system storage

A database ultimately stores its data in files, and a file system also stores data in files. In that case, what is the difference between a DB and a file system? Is it in the way the data is retrieved, or anything else?
A database is generally used for storing related, structured data, with well defined data formats, in an efficient manner for insert, update and/or retrieval (depending on application).
On the other hand, a file system is a more unstructured data store for storing arbitrary, probably unrelated data. The file system is more general, and databases are built on top of the general data storage services provided by file systems. [Quora]
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database.
For very complex operations, the filesystem is likely to be very slow.
Main RDBMS advantages:
Tables are related to each other
SQL query/data processing language
Transaction processing addition to SQL (Transact-SQL)
Server-client implementation with server-side objects like stored procedures, functions, triggers, views, etc.
Advantages of the file system over a database management system:
When handling small data sets with arbitrary, probably unrelated data, a file is more efficient than a database.
For simple read and write operations, file operations are faster and simpler.
You can find many more differences on the internet.
"They're the same"
Yes, storing data is just storing data. At the end of the day, you have files. You can store lots of stuff in lots of files and folders, and there are situations where this is the way to go. A well-known version control system (svn) ended up using a filesystem-based model to store data, ditching BerkeleyDB. Rare, but it happens.
"They're quite different"
In a database, you have options you don't have with files. Imagine a textfile (something like tsv/csv) with 99999 rows. Now try to:
Insert a column. It's painful, you have to alter each row and read+write the whole file.
Find a row. You either scan the whole file or build an index yourself.
Delete a row. Find row, then read+write everything after it.
Reorder columns. Again, full read+write.
Sort rows. Full read, some kind of sort - then do it next time all over.
There are lots of other good points but these are the first mountains you're trying to climb when you think of a file based db alternative. Those guys programmed all this for you, it's yours to use; think of the likely (most frequent) scenarios, enumerate all possible actions you want to perform on your data, and decide which one works better for you. Think in benefits, not fashion.
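To make the "delete a row" point concrete, here is a small sketch comparing the two. The function names and the `items` table are invented for the example, and SQLite is only a stand-in for whatever database you would use. With the text file, every surviving row is read and written again; with the table, you state what to delete and the engine does the rest.

```python
import csv
import io
import sqlite3


def delete_row_from_csv(text: str, key: str) -> str:
    """Drop the row whose first column equals `key`.

    Every other row is read and written back out, which is the
    "full read+write" cost described above.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        if row and row[0] == key:
            continue  # skip the row being deleted
        writer.writerow(row)
    return out.getvalue()


def delete_row_from_table(conn: sqlite3.Connection, key: str) -> None:
    """Same operation against a table; the primary key index locates the row."""
    conn.execute("DELETE FROM items WHERE id = ?", (key,))
```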
Again, if you're storing JPG pictures and only ever look for them by one key (their id maybe?), a well-thought filesystem storage is better. Filesystems, btw, are close to databases today, as many of them use a balanced tree approach, so on a BTRFS you can just put all your pictures in one folder - and the OS will silently implement something like an early SQL query each time you access your files.
So, database or files?...
Let's see a few typical examples when one is better than the other. (These are no complete lists, surely you can stuff in a lot more on both sides.)
DB tables are much better when:
You want to store many rows with the exact same structure (no block waste)
You need lightning-fast lookup / sorting by more than one value (indexed tables)
You need atomic transactions (data safety)
Your users will read/write the same data all the time (better locking)
Filesystem is way better if:
You like to use version control on your data (a nightmare with dbs)
You have big chunks of data that grow frequently (typically, logfiles)
You want other apps to access your data without API (like text editors)
You want to store lots of binary content (pictures or mp3s)
TL;DR
Programming rarely says "never" or "always". Those who say "database always wins" or "files always win" probably just don't know enough. Think of the possible actions (now + future), consider both ways, and choose the fastest / most efficient for the case. That's it.
Something one should be aware of is that Unix has what is called an inode limit. If you are storing millions of records then this can be a serious problem. You should run df -i to view the % used as effectively this is a filesystem file limit - EVEN IF you have plenty of disk space.
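If you would rather check this from code than from the shell, something like the following gives roughly the figure that df -i reports. It is a sketch for Unix-like systems only (`os.statvfs` is not available on Windows), and the function name `inode_usage_percent` is just a name chosen here. Some filesystems do not report an inode total at all, so that case returns None.

```python
import os


def inode_usage_percent(path: str = "/"):
    """Return the percentage of inodes in use on the filesystem holding `path`.

    Returns None when the filesystem reports no inode total.
    """
    stats = os.statvfs(path)  # Unix-only call
    if stats.f_files == 0:
        return None
    used = stats.f_files - stats.f_ffree
    return 100.0 * used / stats.f_files
```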
The differences between a file processing system and a database management system are as follows:
A file processing system is a collection of programs that store and manage files on a computer's hard disk. A database management system, on the other hand, is a collection of programs that enables you to create and maintain a database.
A file processing system has more data redundancy; a DBMS has less.
A file processing system provides less flexibility in accessing data, whereas a DBMS provides more.
A file processing system does not provide data consistency, whereas a DBMS provides data consistency through normalization.
A file processing system is less complex, whereas a DBMS is more complex.
Context: I've written a filesystem that has been running in production for 7 years now. [1]
The key difference between a filesystem and a database is that the filesystem API is part of the OS, thus filesystem implementations have to implement that API and thus follow certain rules, whereas databases are built by 3rd parties having complete freedom.
Historically, databases were created when the filesystem provided by the OS was not good enough for the problem at hand. Just think about it: if you had special requirements, you couldn't just call Microsoft or Apple to redesign their filesystem API. You would either go ahead and write your own storage software or you would look around for existing alternatives. So the need created a market for 3rd party data storage software which ended up being called databases. That's about it.
While it may seem that filesystems have certain rules like having files and directories, this is not true. The biggest operating systems work like that but there are many small OSs that work differently. It's certainly not a hard requirement. (Just remember, to build a new filesystem, you also need to write a new OS, which will make adoption quite a bit harder. Why not focus on just the storage engine and call it a database instead?)
In the end, both databases and filesystems come in all shapes and sizes. Transactional, relational, hierarchical, graph, tabled; whatever you can think of.
[1] I've worked on the Boomla Filesystem which is the storage system behind the Boomla OS & Web Application Platform.
The main differences between database and file system storage are:
The database is a software application used to insert, update and delete data, while the file system is software used to add, update and delete files.
Saving and retrieving files is simpler in the file system, while SQL needs to be learned to perform any query on the database to get (SELECT), add (INSERT) and update the data.
The database provides a proper data recovery process, while the file system does not.
In terms of security, the database is (usually) more secure than the file system.
Migration is very easy with the file system, just copy and paste into the target, while for a database this task is not as simple.

Database storage engine for images

My application uses a database for (among other things) storing scanned documents. These are usually JPEG images, and mostly just text.
I know it's recommended to use files for images and link to them from the database, but it's just easier to retrieve them this way. The application uses one single server and multiple clients, and storing files would waste a lot of hard drive space (overhead, sector allocations, etc.) and require the use of shared folders and mapped drives and stuff. Slow, and less secure.
Anyway, some of our customers have been using our application for a few years, and their databases have grown into the tens of gigabytes, mostly because of these images. At the moment they're stored in an InnoDB table (MySQL).
Question is: is there a better way?
The data itself is write-once, and not normally deleted (but theoretically possible), so something that is slow to store (within reason), unchangeable and fast to access would be perfect.
I'm thinking of making my own storage engine that would include compression, indexing, and caching that only allows two columns (bigint, blob) per table. Does something like this already exist?
SQL Server 2008 introduced FILESTREAM, which allows you to store pointers in the database but keep the blobs out of the database and in a directory (as they should be). The neat thing is that when you back up the database, the blobs get backed up as well. Best of both worlds. Access to the directories is restricted as well; typically only SQL Server itself gets access.
Store the images as files, keep the pointers in the database.

Would implementing files (all types: audio, video, text, etc.) as database tables and their content as their rows be worthwhile?

Consider that I modify the way files are stored in a system, so that every file name would actually be a table name in a database and each line in the file would actually be a row of that table. Would that increase overall system speed? Would it be worthwhile? What are the tradeoffs?
To further clarify the distinction between this and the normal use of a database, consider the files not to simply be text files, but also audio, video, binary, etc. where they are stored in the manner mentioned in the previous paragraph.
The immediate benefit that I can see is that I can read/write any line from/to a file without having to repeatedly read/write the previous lines until reaching the desired line. Another benefit would be simultaneous reading/writing of files.
Please do not confuse this with a database file system, this is a file implementation
To add to your benefit of reading and writing by location, the additional benefits are:
Pros
Indexing
Indexing the text with full-text indexing can give searching benefits. Of course, the size of the database will probably be larger than the conventional file sizes. But you benefit in terms of performance because the database system will have only one file handle open, and it will do caching, which will improve performance and cause less fragmentation.
Lock/Performance
Opening/closing multiple files adds a little more overhead in terms of performance because each open/close requires an access control check and locking.
Replication
If you put them in MySQL, MySQL replication is easy to set up and you can keep multiple backups easily.
Maintenance
Transferring, maintaining and querying a database will be much easier than doing the same with files.
Cons
File Browser Access
You cannot access files through Explorer or the normal file system API; you will need some sort of access API, probably a REST-based API, or some viewer that can read the database.
You can check my blog about more detailed analysis.
Maybe some speed benefits are present, but there are a lot of issues that make the cons outweigh the pros.
No easy way to support Transactions
No easy way to support typification of fields (think about a file with BLOB objects in it)
Relationship constraint
No trivial support for cache, accessing RAM is faster than accessing disk
No easy way to support something like "ALTER TABLE"
I guess that if you write something that supports all of these, you've written a SQL engine...
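For text files at least, the "lines as rows" idea from the question can be tried out quickly with an existing engine. The sketch below is only an illustration: it uses SQLite, and instead of one table per file it keeps a single hypothetical `file_line` table keyed by file name and line number (table names cannot be bound as query parameters, so one table per file is awkward to do safely). It shows the random access by line number that the question is after; it does not address audio, video or other binary content.

```python
import sqlite3


def import_lines(conn: sqlite3.Connection, name: str, text: str) -> None:
    """Store each line of `text` as one row, keyed by (file name, line number)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_line ("
        " file_name TEXT NOT NULL,"
        " line_no INTEGER NOT NULL,"
        " content TEXT NOT NULL,"
        " PRIMARY KEY (file_name, line_no))"
    )
    rows = [(name, i, line) for i, line in enumerate(text.splitlines(), start=1)]
    conn.executemany("INSERT INTO file_line VALUES (?, ?, ?)", rows)


def read_line(conn: sqlite3.Connection, name: str, line_no: int):
    """Fetch a single line through the primary key, without scanning earlier lines."""
    row = conn.execute(
        "SELECT content FROM file_line WHERE file_name = ? AND line_no = ?",
        (name, line_no),
    ).fetchone()
    return row[0] if row else None
```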

Best way storing binary or image files

What is the best way storing binary or image files?
Database System
File System
Would you please explain, why?
There is no real best way, just a bunch of trade offs.
Database Pros:
1. Much easier to deal with in a clustering environment.
2. No reliance on additional resources like a file server.
3. No need to set up "sync" operations in load balanced environment.
4. Backups automatically include the files.
Database Cons:
1. Size / Growth of the database.
2. Depending on DB Server and your language, it might be difficult to put in and retrieve.
3. Speed / Performance.
4. Depending on DB server, you have to virus scan the files at the time of upload and export.
File Pros:
1. For single web/single db server installations, it's fast.
2. Well understood ability to manipulate files. In other words, it's easy to move the files to a different location if you run out of disk space.
3. Can virus scan when the files are "at rest". This allows you to take advantage of scanner updates.
File Cons:
1. In multi web server environments, requires an accessible share. Which should also be clustered for failover.
2. Additional security requirements to handle file access. You have to be careful that the web server and/or share does not allow file execution.
3. Transactional Backups have to take the file system into account.
The above said, SQL 2008 has a thing called FILESTREAM which combines both worlds. You upload to the database and it transparently stores the files in a directory on disk. When retrieving you can either pull from the database; or you can go direct to where it lives on the file system.
Pros of Storing binary files in a DB:
Some decrease in complexity since the data access layer of your system need only interface to a DB and not a DB + file system.
You can secure your files using the same comprehensive permissions-based security that protects the rest of the database.
Your binary files are protected against loss along with the rest of your data by way of database backups. No separate filesystem backup system required.
Cons of Storing binary files in a DB:
Depending on size/number of files, can take up significant space, potentially decreasing performance (depending on whether your binary files are stored in a table that is queried for other content often or not) and making for longer backup times.
Pros of Storing binary files in file system:
This is what file systems are good at. File systems will handle defragmenting well, and retrieving files (say to stream a video file through a web server) will likely be faster than with a db.
Cons of Storing binary files in file system:
Slightly more complex data access layer.
Needs its own backup system.
Need to consider referential integrity issues (e.g. deleted pointer in database will need to result in deletion of file so as to not have 'orphaned' files in the filesystem).
On balance I would use the file system. In the past, using SQL Server 2005 I would simply store a 'pointer' in db tables to the binary file. The pointer would typically be a GUID.
Here's the good news if you are using SQL Server 2008 (and maybe others - I don't know): there is built-in support for a hybrid solution with the new VARBINARY(MAX) FILESTREAM data type. These behave logically like VARBINARY(MAX) columns but behind the scenes, SQL Server 2008 will store the data in the file system.
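If you go with the pointer-in-the-database approach, the 'orphaned' files issue listed in the cons can be caught with a periodic sweep. The sketch below is only one way to do it and makes several assumptions: SQLite stands in for the database, the table is called `document`, its `file_name` column holds the GUID used as the file name, and all files sit directly in one storage directory.

```python
import sqlite3
from pathlib import Path


def find_orphans(conn: sqlite3.Connection, storage_root: Path):
    """Return files under `storage_root` that no database row points to."""
    # Assumed schema: document.file_name holds the GUID-style file name.
    referenced = {row[0] for row in conn.execute("SELECT file_name FROM document")}
    return sorted(
        p for p in storage_root.iterdir()
        if p.is_file() and p.name not in referenced
    )
```

Whether to delete the orphans or just report them depends on how much you trust the sweep; reporting first is the safer default.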
There is no best way.
What? You need more info?
There are three ways I know of... One, as byte arrays in the database. Two, as a file with the path stored in the database. Three, as a hybrid (only if DB allows, such as with the FileStream type).
The first is pretty cool because you can query and get your data in the same step. Which is always nice. But what happens when you have LOTS of files? Your database gets big. Now you have to deal with big database maintenance issues, such as the trials of backing up databases that are over a terabyte. And what happens if you need outside access to the files? Such as type conversions, mass manipulation (resize all images, apply watermarks, etc)? It's much harder to do than when you have files.
The second is great for somewhat large numbers of files. You can store them on NAS devices, back them up incrementally, keep your database small, etc etc. But then, when you have LOTS of files, you start running into limitations in the file system. And if you spread them over the network, you get latency issues, user rights issues, etc. Also, I take pity on you if your network gets rearranged. Now you have to run massive updates on the database to change your file locations, and I pity you if something screws up.
Then there's the hybrid option. It's almost perfect--you can get your files via your query, yet your database isn't massive. Does this solve all your problems? Probably not. Your database isn't portable anymore; you're locked to a particular DBMS. And this stuff isn't mature yet, so you get to enjoy the teething process. And who says this solves all the different issues?
Fact is, there is no "best" way. You just have to determine your requirements, make the best choice depending on them, and then suck it up when you figure out you did the wrong thing.
I like storing images in a database. It makes it easy to switch from development to production just by changing databases (no copying files). And the database can keep track of properties like created/modified dates just as well as the File System.
I personally never store images IN the database, for performance reasons. In all of my sites I have a "/files" folder where I can put sub-folders based on what kind of images I'm going to store. Then I name them by convention.
For example, if I'm storing a profile picture, I'll store it in "/files/profile/" as profile_2.jpg (if 2 is the ID of the account). I always make it a rule to resize the image on the server to the largest size I'll need, and then smaller ones if I need them. So I'd save "profile_2_thumb.jpg" and "profile_2_full.jpg".
By creating rules for yourself you can simply, in the code, call img src="/files/profile__thumb.jpg"
That's how I do it anyway!
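A naming rule like this is easy to keep in one place. Here is a minimal sketch in Python; the helper name `image_path` and its parameters are made up for the example, and it simply follows the "/files/<kind>/<kind>_<id>_<variant>.jpg" pattern described above.

```python
from pathlib import PurePosixPath

FILES_ROOT = PurePosixPath("/files")


def image_path(kind: str, record_id: int, variant: str) -> str:
    """Build the URL path for an image by convention, with no database lookup."""
    return str(FILES_ROOT / kind / f"{kind}_{record_id}_{variant}.jpg")
```

For example, `image_path("profile", 2, "thumb")` gives "/files/profile/profile_2_thumb.jpg".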

BLOB Storage - 100+ GB, MySQL, SQLite, or PostgreSQL + Python

I have an idea for a simple application which will monitor a group of folders and index any files it finds. A GUI will allow me to quickly tag new files and move them into a single database for storage, and also provide an easy mechanism for querying the db by tag, name, file type and date. At the moment I have about 100+ GB of files on a couple of removable hard drives, so the database will be at least that big. If possible I would like to support full text search of the embedded binary and text documents. This will be a single user application.
Not trying to start a DB war, but what open source DB is going to work best for me? I am pretty sure SQLite is off the table but I could be wrong.
I'm still researching this option for one of my own projects, but CouchDB may be worth a look.
Why store the files in the database at all? Simply store your meta-data and a filename. If you need to copy them to a new location for some reason, just do that as a file system copy.
Once you remove the file contents then any competent database will be able to handle the meta-data for a few hundred thousand files.
My preference would be to store the document with the metadata. One reason, is relational integrity. You can't easily move the files or modify the files without the action being brokered by the db. I am sure I can handle these problems but it isn't as clean as I would like and my experience has been that most vendors can handle huge amounts of binary data in the database these days. I guess I was wondering if PostgreSQL or MySQL have any obvious advantages in these areas, I am primarily familiar with Oracle. Anyway, thanks for the response, if the DB knows where the external file is it will also be easy to bring the file in at a later date if I want. Another aspect of the question was if either database is easier to work with when using Python. I'm assuming that is a wash.
I always hate to answer "don't", but you'd be better off indexing with something like Lucene (PyLucene). That and storing the paths in the database rather than the file contents is almost always recommended.
To add to that, none of those database engines will store LOBs in a separate dataspace (they'll be embedded in the table's data space), so any of those engines should perform nearly equally well (well, except SQLite). You need to move to Informix, DB2, SQL Server or others to get that kind of binary object handling.
Pretty much any of them would work (even though SQLite wasn't meant to be used in a concurrent multi-user environment, which could be a problem...) since you don't want to index the actual contents of the files.
The only limiting factor is the maximum "packet" size of the given DB (by packet I'm referring to a query/response). Usually these limits are around 2MB, meaning that your files must be smaller than 2MB. Of course you could increase this limit, but the whole process is rather inefficient, since for example to insert a file you would have to:
Read the entire file into memory
Transform the file into a query (which usually means hex encoding it - thus doubling the size from the start)
Execute the generated query (which itself means - for the database - that it has to parse it)
I would go with a simple DB and the associated files stored using a naming convention which makes them easy to find (for example based on the primary key). Of course this design is not "pure", but it will perform much better and is also easier to use.
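One note on the insert steps listed above: the hex-encoding step applies when the file is spliced into the SQL text itself. Many drivers also accept binary values as bound parameters, which avoids that step, although the file is still read into memory and any packet limit of the server still applies. A minimal sketch with Python's built-in sqlite3 module, assuming a hypothetical `document` table with `id`, `name` and `body` columns:

```python
import sqlite3


def store_blob(conn: sqlite3.Connection, name: str, data: bytes) -> int:
    """Insert binary data as a bound parameter; no hex encoding of the SQL text."""
    cur = conn.execute(
        "INSERT INTO document (name, body) VALUES (?, ?)", (name, data)
    )
    return cur.lastrowid


def load_blob(conn: sqlite3.Connection, doc_id: int):
    """Fetch the stored bytes back, or None if the id is unknown."""
    row = conn.execute(
        "SELECT body FROM document WHERE id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else None
```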
Why are you wasting time emulating something that the filesystem should be able to handle? More storage + grep is your answer.

Resources