BLOB Storage - 100+ GB, MySQL, SQLite, or PostgreSQL + Python - database

I have an idea for a simple application which will monitor a group of folders and index any files it finds. A GUI will allow me to quickly tag new files and move them into a single database for storage, and also provide an easy mechanism for querying the DB by tag, name, file type and date. At the moment I have about 100+ GB of files on a couple of removable hard drives, so the database will be at least that big. If possible I would like to support full text search of the embedded binary and text documents. This will be a single user application.
Not trying to start a DB war, but what open source DB is going to work best for me? I am pretty sure SQLite is off the table but I could be wrong.

I'm still researching this option for one of my own projects, but CouchDB may be worth a look.

Why store the files in the database at all? Simply store your meta-data and a filename. If you need to copy them to a new location for some reason, just do that as a file system copy.
Once you remove the file contents then any competent database will be able to handle the meta-data for a few hundred thousand files.

My preference would be to store the document with the metadata. One reason is relational integrity: you can't easily move or modify the files without the action being brokered by the DB. I am sure I can handle these problems, but it isn't as clean as I would like, and my experience has been that most vendors can handle huge amounts of binary data in the database these days. I guess I was wondering if PostgreSQL or MySQL have any obvious advantages in these areas, since I am primarily familiar with Oracle. Anyway, thanks for the response. If the DB knows where the external file is, it will also be easy to bring the file in at a later date if I want. Another aspect of the question was whether either database is easier to work with when using Python. I'm assuming that is a wash.

I always hate to answer "don't", but you'd be better off indexing with something like Lucene (PyLucene). That and storing the paths in the database rather than the file contents is almost always recommended.
To add to that, none of those database engines will store LOBs in a separate dataspace (they'll be embedded in the table's data space), so any of those engines should perform nearly equally well (well, except SQLite). You need to move to Informix, DB2, SQL Server or others to get that kind of binary object handling.

Pretty much any of them would work (even though SQLite wasn't meant to be used in a concurrent multi-user environment, which could be a problem...) since you don't want to index the actual contents of the files.
The only limiting factor is the maximum "packet" size of the given DB (by packet I'm referring to a query/response). Usually these limits are around 2MB, meaning that your files must be smaller than 2MB. Of course you could increase this limit, but the whole process is rather inefficient, since for example to insert a file you would have to:
Read the entire file into memory
Transform the file into a query (which usually means hex encoding it, thus doubling the size from the start)
Execute the generated query (which itself means, for the database, that it has to parse it)
I would go with a simple DB and the associated files stored using a naming convention which makes them easy to find (for example based on the primary key). Of course this design is not "pure", but it will perform much better and is also easier to use.
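To make the "simple DB plus files named after the primary key" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the column names and the helper names (open_index, add_file) are my own assumptions for illustration, not something from the posts above.

```python
import shutil
import sqlite3
from pathlib import Path


def open_index(db_path):
    """Create (or open) the metadata database; file contents stay on disk."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " id INTEGER PRIMARY KEY,"
        " original_name TEXT NOT NULL,"
        " tags TEXT NOT NULL DEFAULT '')"
    )
    return conn


def add_file(conn, store_dir, source, tags=""):
    """Insert a metadata row, then copy the file to <store_dir>/<id><suffix>."""
    source = Path(source)
    cur = conn.execute(
        "INSERT INTO files (original_name, tags) VALUES (?, ?)",
        (source.name, tags),
    )
    file_id = cur.lastrowid
    target = Path(store_dir) / f"{file_id}{source.suffix}"
    try:
        shutil.copy2(source, target)
    except OSError:
        # Undo the metadata row if the copy did not succeed.
        conn.rollback()
        raise
    conn.commit()
    return file_id, target
```

The stored file can always be found again from the row id alone, and the parameterised INSERT only ever carries a few short strings, so the packet size limit never comes into play.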

Why are you wasting time emulating something that the filesystem should be able to handle? More storage + grep is your answer.

Related

Database vs File system storage

A database ultimately stores its data in files, and a file system also stores data in files. In that case, what is the difference between a DB and a file system? Is it in the way the data is retrieved, or anything else?
A database is generally used for storing related, structured data, with well defined data formats, in an efficient manner for insert, update and/or retrieval (depending on application).
On the other hand, a file system is a more unstructured data store for storing arbitrary, probably unrelated data. The file system is more general, and databases are built on top of the general data storage services provided by file systems. [Quora]
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database.
For very complex operations, the filesystem is likely to be very slow.
Main RDBMS advantages:
Tables are related to each other
SQL query/data processing language
Transaction processing on top of SQL (atomic commit and rollback)
Server-client implementation with server-side objects like stored procedures, functions, triggers, views, etc.
Advantages of the file system over a database management system:
When handling small data sets with arbitrary, probably unrelated data, a file is more efficient than a database.
For simple read and write operations, file operations are faster and simpler.
You can find any number of other differences on the internet.
"They're the same"
Yes, storing data is just storing data. At the end of the day, you have files. You can store lots of stuff in lots of files & folders, and there are situations where this is the way to go. There is a well-known versioning solution (svn) that finally ended up using a filesystem-based model to store data, ditching their BerkeleyDB. Rare, but it happens.
"They're quite different"
In a database, you have options you don't have with files. Imagine a textfile (something like tsv/csv) with 99999 rows. Now try to:
Insert a column. It's painful, you have to alter each row and read+write the whole file.
Find a row. You either scan the whole file or build an index yourself.
Delete a row. Find row, then read+write everything after it.
Reorder columns. Again, full read+write.
Sort rows. Full read, some kind of sort - then do it next time all over.
There are lots of other good points but these are the first mountains you're trying to climb when you think of a file based db alternative. Those guys programmed all this for you, it's yours to use; think of the likely (most frequent) scenarios, enumerate all possible actions you want to perform on your data, and decide which one works better for you. Think in benefits, not fashion.
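Here is a small sketch of the contrast above, using only Python's standard library (csv and sqlite3). The table name "people", the columns and the helper names are made up for illustration. The CSV lookup has to scan rows until it finds a match, while the database lookup goes through the primary key index, and adding a column is a single statement.

```python
import csv
import io
import sqlite3


def rows_to_csv(rows):
    """Serialise (id, name) rows to CSV text, standing in for a flat file."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


def find_in_csv(text, wanted_id):
    """Linear scan: rows are read one by one until the id matches."""
    for row in csv.reader(io.StringIO(text)):
        if int(row[0]) == wanted_id:
            return row[1]
    return None


def build_db(rows):
    """Load the same rows into a table; the primary key is indexed."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO people (id, name) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def find_in_db(conn, wanted_id):
    """Lookup by primary key; no hand-written scan or index is needed."""
    row = conn.execute(
        "SELECT name FROM people WHERE id = ?", (wanted_id,)
    ).fetchone()
    return row[0] if row else None


def add_column(conn):
    """Adding a column is one statement instead of rewriting every row."""
    conn.execute("ALTER TABLE people ADD COLUMN email TEXT")
```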
Again, if you're storing JPG pictures and only ever look for them by one key (their id maybe?), a well-thought filesystem storage is better. Filesystems, btw, are close to databases today, as many of them use a balanced tree approach, so on a BTRFS you can just put all your pictures in one folder - and the OS will silently implement something like an early SQL query each time you access your files.
So, database or files?...
Let's see a few typical examples of when one is better than the other. (These are not complete lists; surely you can add a lot more on both sides.)
DB tables are much better when:
You want to store many rows with the exact same structure (no block waste)
You need lightning-fast lookup / sorting by more than one value (indexed tables)
You need atomic transactions (data safety)
Your users will read/write the same data all the time (better locking)
Filesystem is way better if:
You like to use version control on your data (a nightmare with dbs)
You have big chunks of data that grow frequently (typically, logfiles)
You want other apps to access your data without API (like text editors)
You want to store lots of binary content (pictures or mp3s)
TL;DR
Programming rarely says "never" or "always". Those who say "database always wins" or "files always win" probably just don't know enough. Think of the possible actions (now + future), consider both ways, and choose the fastest / most efficient for the case. That's it.
Something one should be aware of is that Unix has what is called an inode limit. If you are storing millions of records then this can be a serious problem. You should run df -i to view the % used as effectively this is a filesystem file limit - EVEN IF you have plenty of disk space.
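If you would rather check this from Python than from the shell, a rough equivalent of df -i can be sketched with os.statvfs, which is available on Unix only. The function name inode_usage and the returned dictionary layout are my own choices; also note that some filesystems report zero total inodes, in which case no percentage is computed.

```python
import os


def inode_usage(path="/"):
    """Return inode counts for the filesystem that contains `path`."""
    st = os.statvfs(path)
    total = st.f_files   # total number of inodes
    free = st.f_ffree    # number of free inodes
    used = total - free
    # Some filesystems report 0 total inodes; avoid dividing by zero.
    percent = (used / total * 100) if total else None
    return {"total": total, "free": free, "used": used, "percent_used": percent}
```

The percentage is only approximately what df -i prints, since df may account for reserved inodes differently.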
The differences between a file processing system and a database management system are as follows:
A file processing system is a collection of programs that store and manage files on the computer's hard disk. A database management system, on the other hand, is a collection of programs that lets you create and maintain a database.
A file processing system has more data redundancy, whereas a DBMS has less.
A file processing system provides less flexibility in accessing data, whereas a DBMS provides more.
A file processing system does not provide data consistency, whereas a DBMS provides data consistency through normalization.
A file processing system is less complex, whereas a DBMS is more complex.
Context: I've written a filesystem that has been running in production for 7 years now. [1]
The key difference between a filesystem and a database is that the filesystem API is part of the OS, thus filesystem implementations have to implement that API and thus follow certain rules, whereas databases are built by 3rd parties having complete freedom.
Historically, databases were created when the filesystems provided by the OS were not good enough for the problem at hand. Just think about it: if you had special requirements, you couldn't just call Microsoft or Apple to redesign their filesystem API. You would either go ahead and write your own storage software or you would look around for existing alternatives. So the need created a market for 3rd party data storage software, which ended up being called databases. That's about it.
While it may seem that filesystems have certain rules like having files and directories, this is not true. The biggest operating systems work like that, but there are many small OSs that work differently. It's certainly not a hard requirement. (Just remember, to build a new filesystem, you also need to write a new OS, which will make adoption quite a bit harder. Why not focus on just the storage engine and call it a database instead?)
In the end, both databases and filesystems come in all shapes and sizes. Transactional, relational, hierarchical, graph, tabled; whatever you can think of.
[1] I've worked on the Boomla Filesystem which is the storage system behind the Boomla OS & Web Application Platform.
The main differences between database and file system storage are:
A database is a software application used to insert, update and delete data, while a file system is software used to add, update and delete files.
Saving and retrieving files is simpler in a file system, while SQL needs to be learned to perform any query on the database to get (SELECT), add (INSERT) and update the data.
A database provides a proper data recovery process, while a file system does not.
In terms of security, the database is (usually) more secure than the file system.
Migration is very easy with a file system (just copy and paste into the target), while for a database this task is not as simple.

Is CouchDB good for a lot of documents with file attachments over multiple servers?

I would love to hear your thoughts about CouchDB, and whether it would handle my use case.
I will have a database where I store documents of about 20kb each, with an attachment of 1-10MB for each.
Will Couch handle a database of 10TB or more per server with my schema? (In a 4U case you can put 24 2TB drives; is this too much per Couch node? There will be very few reads, so I don't need speed.)
Will Couch be able to replicate all documents with attachments?
How about splitting all the data across multiple servers (for example 4 nodes)? Will it handle that many attachments?
What problems do you see here?
If you need more info, please ask :)
I don't think you will hit a physical limitation with a 10TB file, that is I don't think couch has some inbuilt "can't use files bigger than X" with X being < 10TB.
However.
The biggest issue is the file compaction. In order to reclaim space, Couch wants to compact the file. This effectively means copying the file. So, for some period at least, 10TB needs to be 20TB, as it duplicates the live data in the new copy.
If you are mostly appending to the file, that is, you are simply adding new data and not updating or overwriting old data, then this will be less of a problem, as compaction won't gain you quite that much. If your data is basically static, then I would build the file, compact it a final time and be done with it.
There are "3rd party" sharding solutions for Couch; Lounge is popular.
When I approach a Couch solution, the primary thing to consider is what your query criteria are. Couch is all about the views, really. What kind of views are you looking at? If you're simply storing data by some simple key (file name, the date, or whatever), you may well be better off simply using a file system and an appropriate directory structure, frankly.
So I'd like to hear more about your views you plan to use since you don't intend to do a lot of reading.
Addenda:
You still haven't mentioned what kind of queries you're looking for. The queries are, effectively, THE design component, especially for a Couch DB since it gets more and more difficult to add new queries on large datasets.
When you said attachments, I assumed you meant attachments to the Couch DB payload (since it can handle attachments).
So, all that said, you could easily create a metadata document capturing whatever information you want to capture, and as part of that document add a path name to the actual file stored on the file system. This will reduce the overall size of the Couch file dramatically, which makes the maintenance faster and more efficient. You lose some of the "self contained" part of having it all in a single document, of course.
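A metadata document of that kind might look roughly like the sketch below. CouchDB documents are JSON with an "_id" field; everything else here (the other field names, using a SHA-256 content hash as the id, the helper name build_metadata_doc) is an assumption on my part for illustration. The sketch only builds the document and does not talk to a CouchDB server.

```python
import hashlib
import json
import mimetypes
from pathlib import Path


def build_metadata_doc(file_path, tags):
    """Describe a file that stays on the file system as a small JSON document."""
    path = Path(file_path)
    data = path.read_bytes()  # fine for 1-10MB files; stream for larger ones
    mime, _ = mimetypes.guess_type(path.name)
    return {
        "_id": hashlib.sha256(data).hexdigest(),  # assumption: content hash as id
        "type": "file_metadata",
        "path": str(path),
        "size_bytes": len(data),
        "content_type": mime or "application/octet-stream",
        "tags": sorted(tags),
    }


def to_json(doc):
    """Serialise the document the way it would be sent to the database."""
    return json.dumps(doc, indent=2, sort_keys=True)
```

The document stays a few hundred bytes no matter how large the file is, which is what keeps the Couch file (and its compaction) small.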

How to efficiently store hundreds of thousands of documents?

I'm working on a system that will need to store a lot of documents (PDFs, Word files etc.). I'm using Solr/Lucene to search for relevant information extracted from those documents, but I also need a place to store the original files so that they can be opened/downloaded by the users.
I was thinking about several possibilities:
file system - probably not that good an idea to store 1m documents
sql database - but I won't need most of its relational features, as I need to store only the binary document and its id, so this might not be the fastest solution
no-sql database - I don't have any experience with them so I'm not sure if they are any good either; there are also many of them so I don't know which one to pick
The storage I'm looking for should be:
fast
scalable
open-source (not crucial but nice to have)
Can you recommend what, in your opinion, would be the best way of storing those files?
A filesystem -- as the name suggests -- is designed and optimised to store large numbers of files in an efficient and scalable way.
You can follow Facebook's example, as it stores a lot of files (15 billion photos):
They initially started with an NFS share served by commercial storage appliances.
Then they moved to their own implementation of an HTTP file server called Haystack.
Here is a Facebook note if you want to learn more: http://www.facebook.com/note.php?note_id=76191543919
Regarding the NFS share: keep in mind that NFS shares usually limit the number of files in one folder for performance reasons. (This could be a bit counter-intuitive if you assume that all recent file systems use b-trees to store their structure.) So if you are using commercial NFS shares (like NetApp) you will likely need to keep files in multiple folders.
You can do that if you have any kind of id for your files. Just divide its ASCII representation into groups of a few characters and make a folder for each group.
For example, we use integers for ids, so the file with id 1234567891 is stored as storage/0012/3456/7891.
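A sketch of that id-to-path scheme in Python. Zero-padding to 12 digits and splitting into groups of four is inferred from the single example above, so treat both numbers as assumptions; the function name sharded_path is mine.

```python
from pathlib import PurePosixPath


def sharded_path(file_id, root="storage", width=12, group=4):
    """Map an integer id to a nested path, e.g. 1234567891 -> storage/0012/3456/7891."""
    if file_id < 0:
        raise ValueError("file_id must be non-negative")
    digits = str(file_id).zfill(width)
    if len(digits) > width:
        raise ValueError("file_id does not fit in the configured width")
    # Cut the padded id into fixed-size groups; the last group names the file.
    parts = [digits[i:i + group] for i in range(0, width, group)]
    return PurePosixPath(root, *parts)
```

With four digits per level, no folder ever holds more than 10,000 entries.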
Hope that helps.
In my opinion...
I would store the files compressed on disk (file system) and use a database to keep track of them, possibly SQLite if this is its only job.
File system: looking at the big picture, the DBMS uses the file system anyway. And the file system is dedicated to keeping files, so you get the benefit of its optimizations (as LukeH mentioned).

Best way of storing binary or image files

What is the best way of storing binary or image files?
Database System
File System
Would you please explain, why?
There is no real best way, just a bunch of trade offs.
Database Pros:
1. Much easier to deal with in a clustering environment.
2. No reliance on additional resources like a file server.
3. No need to set up "sync" operations in load balanced environment.
4. Backups automatically include the files.
Database Cons:
1. Size / Growth of the database.
2. Depending on DB Server and your language, it might be difficult to put in and retrieve.
3. Speed / Performance.
4. Depending on DB server, you have to virus scan the files at the time of upload and export.
File Pros:
1. For single web/single db server installations, it's fast.
2. Well understood ability to manipulate files. In other words, it's easy to move the files to a different location if you run out of disk space.
3. Can virus scan when the files are "at rest". This allows you to take advantage of scanner updates.
File Cons:
1. In multi web server environments, requires an accessible share. Which should also be clustered for failover.
2. Additional security requirements to handle file access. You have to be careful that the web server and/or share does not allow file execution.
3. Transactional Backups have to take the file system into account.
The above said, SQL Server 2008 has a thing called FILESTREAM which combines both worlds. You upload to the database and it transparently stores the files in a directory on disk. When retrieving, you can either pull from the database or go direct to where the file lives on the file system.
Pros of storing binary files in a DB:
Some decrease in complexity, since the data access layer of your system need only interface to a DB and not a DB + file system.
You can secure your files using the same comprehensive permissions-based security that protects the rest of the database.
Your binary files are protected against loss along with the rest of your data by way of database backups. No separate filesystem backup system required.
Cons of storing binary files in a DB:
Depending on size/number of files, can take up significant space, potentially decreasing performance (depending on whether your binary files are stored in a table that is queried for other content often or not) and making for longer backup times.
Pros of storing binary files in the file system:
This is what file systems are good at. File systems will handle defragmenting well, and retrieving files (say, to stream a video file through a web server) will likely be faster than with a DB.
Cons of storing binary files in the file system:
Slightly more complex data access layer. Needs its own backup system. Need to consider referential integrity issues (e.g. a deleted pointer in the database will need to result in deletion of the file, so as to not have 'orphaned' files in the filesystem).
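The referential integrity problem mentioned above can at least be detected with a periodic check. Below is a minimal sketch, assuming a SQLite table named "files" with a "stored_name" column holding the file name inside a single storage directory; the table, the column and the function names are hypothetical.

```python
import sqlite3
from pathlib import Path


def make_index(db_path=":memory:"):
    """Create the (hypothetical) table that points at files on disk."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " id INTEGER PRIMARY KEY, stored_name TEXT NOT NULL UNIQUE)"
    )
    return conn


def find_orphans(conn, store_dir):
    """Compare database pointers with the directory contents.

    Returns (orphaned_files, dangling_rows): files on disk that no row
    references, and rows whose file is missing from disk.
    """
    referenced = {row[0] for row in conn.execute("SELECT stored_name FROM files")}
    on_disk = {p.name for p in Path(store_dir).iterdir() if p.is_file()}
    return sorted(on_disk - referenced), sorted(referenced - on_disk)
```

What to do with the two lists (delete, re-import, or just log) depends on the application; the sketch only reports them.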
On balance I would use the file system. In the past, using SQL Server 2005 I would simply store a 'pointer' in db tables to the binary file. The pointer would typically be a GUID.
Here's the good news if you are using SQL Server 2008 (and maybe others, I don't know): there is built-in support for a hybrid solution with the new VARBINARY(MAX) FILESTREAM data type. These behave logically like VARBINARY(MAX) columns, but behind the scenes SQL Server 2008 will store the data in the file system.
There is no best way.
What? You need more info?
There are three ways I know of... One, as byte arrays in the database. Two, as a file with the path stored in the database. Three, as a hybrid (only if DB allows, such as with the FileStream type).
The first is pretty cool because you can query and get your data in the same step. Which is always nice. But what happens when you have LOTS of files? Your database gets big. Now you have to deal with big database maintenance issues, such as the trials of backing up databases that are over a terabyte. And what happens if you need outside access to the files? Such as type conversions, mass manipulation (resize all images, apply watermarks, etc)? It's much harder to do than when you have files.
The second is great for somewhat large numbers of files. You can store them on NAS devices, back them up incrementally, keep your database small, etc etc. But then, when you have LOTS of files, you start running into limitations in the file system. And if you spread them over the network, you get latency issues, user rights issues, etc. Also, I take pity on you if your network gets rearranged. Now you have to run massive updates on the database to change your file locations, and I pity you if something screws up.
Then there's the hybrid option. It's almost perfect--you can get your files via your query, yet your database isn't massive. Does this solve all your problems? Probably not. Your database isn't portable anymore; you're locked to a particular DBMS. And this stuff isn't mature yet, so you get to enjoy the teething process. And who says this solves all the different issues?
Fact is, there is no "best" way. You just have to determine your requirements, make the best choice depending on them, and then suck it up when you figure out you did the wrong thing.
I like storing images in a database. It makes it easy to switch from development to production just by changing databases (no copying files). And the database can keep track of properties like created/modified dates just as well as the File System.
I personally never store images IN the database, for performance reasons. In all of my sites I have a "/files" folder where I can put sub-folders based on what kind of images I'm going to store. Then I name them by convention.
For example, if I'm storing a profile picture, I'll store it in "/files/profile/" as profile_2.jpg (if 2 is the ID of the account). I always make it a rule to resize the image on the server to the largest size I'll need, and then to smaller ones if I need them. So I'd save "profile_2_thumb.jpg" and "profile_2_full.jpg".
By creating rules for yourself, you can simply call img src="/files/profile__thumb.jpg" in the code.
That's how I do it anyway!
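The naming rule from the example above (profile_2_thumb.jpg and profile_2_full.jpg for account 2) is easy to wrap in a helper so the convention lives in one place. This is just a sketch; the function name and the fixed set of variants are my assumptions.

```python
def profile_image_path(account_id, variant, root="/files/profile"):
    """Build the conventional path for an account's profile image."""
    if variant not in ("thumb", "full"):
        raise ValueError("variant must be 'thumb' or 'full'")
    # int() rejects anything that is not a plain account number.
    return f"{root}/profile_{int(account_id)}_{variant}.jpg"
```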

Storing a file in a database as opposed to the file system?

Generally, how bad of a performance hit is storing a file in a database (specifically mssql) as opposed to the file system? I can't come up with a reason outside of application portability that I would want to store my files as varbinaries in SQL Server.
Have a look at this answer:
Storing Images in DB - Yea or Nay?
Essentially, the space and performance hit can be quite big, depending on the number of users. Also, keep in mind that Web servers are cheap and you can easily add more to balance the load, whereas the database is the most expensive and hardest to scale part of a web architecture usually.
There are some opposite examples (e.g., Microsoft Sharepoint), but usually, storing files in the database is not a good idea.
The exception is possibly when you write desktop apps and/or know roughly how many users you will ever have. On something as random and unpredictable as a public web site, you may pay a high price for storing files in the database.
If you can move to SQL Server 2008, you can take advantage of the FILESTREAM support which gives you the best of both - the files are stored in the filesystem, but the database integration is much better than just storing a filepath in a varchar field. Your query can return a standard .NET file stream, which makes the integration a lot simpler.
Getting Started with FILESTREAM Storage
I'd say, it depends on your situation. For example, I work in local government, and we have lots of images like mugshots, etc. We don't have a high number of users, but we need to have good security and auditing around the data. The database is a better solution for us since it makes this easier and we aren't going to run into scaling problems.
What's the question here?
Modern DBMSs such as SQL Server 2008 have a variety of ways of dealing with BLOBs which aren't just sticking them in a table. There are pros and cons, of course, and you might need to think about it a little deeper.
This is an interesting paper, by the late (?) Jim Gray
To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem
In my own experience, it is always better to store files as files. The reason is that the filesystem is optimised for file storage, whereas a database is not. Of course, there are some exceptions (e.g. the much heralded next-gen MS filesystem is supposed to be built on top of SQL Server), but in general that's my rule.
While performance is an issue, I think modern database designs have made it much less of an issue for small files.
Performance aside, it also depends on just how tightly-coupled the data is. If the file contains data that is closely related to the fields of the database, then it conceptually belongs close to it and may be stored in a blob. If it contains information which could potentially relate to multiple records or may have some use outside of the context of the database, then it belongs outside. For example, an image on a web page is fetched on a separate request from the page that links to it, so it may belong outside (depending on the specific design and security considerations).
Our compromise, and I don't promise it's the best, has been to store smallish XML files in the database but images and other files outside it.
We made the decision to store as varbinary for http://www.freshlogicstudios.com/Products/Folders/ halfway expecting performance issues. I can say that we've been pleasantly surprised at how well it's worked out.
I agree with @ZombieSheep.
Just one more thing - I generally don't think that databases actually need be portable because you miss all the features your DBMS vendor provides. I think that migrating to another database would be the last thing one would consider. Just my $.02
The overhead of having to parse a blob (image) into a byte array and then write it to disk in the proper file name and then reading it is enough of an overhead hit to discourage you from doing this too often, especially if the files are rather large.
Not to be vague or anything, but I think the type of 'file' you will be storing is one of the biggest determining factors. If you are essentially talking about a large text field which could be stored as a file, my preference would be for DB storage.
Interesting topic.
There is no single correct answer to this question.
There are a few key elements to consider:
What’s your database engine?
What’s the route of the file from the database to the end user and/or back?
What are the security requirements?
If the files are meant for a public audience and accessible via a website, you shouldn’t even consider storing them in the database. Use some smart indexing for the files instead.
If the files contain highly sensitive information, then it might be worth storing them in the database. But you have to implement proper safe gateways too.
If performance is crucial, it’s better not to store files in the database.
Backing up, restoring and migrating the database might become a nightmare if it grows big just because of files. If you are a DBA, you would like to kill the person who “invented” the idea of putting files into the database.
I recommend storing files in the database only as a last option, when there is absolutely no better alternative available.
