How are huge files stored in a database?

I was just wondering how exactly huge files are stored in databases. Most BLOBs are limited to 1 GB as far as I know, but if you take YouTube for example, they have multiple full-HD videos with over an hour of running time (I think that's a bit larger than 1 GB).
Are they using some kind of special database, is there another datatype I've never heard of, or are they just using a simple method like splitting the files?
If they use, let's say, a method where they split and rearrange the bits and bytes, how can the end user watch a video without noticing?
It's just a question out of pure curiosity but I would be happy if you could answer it.

It is not really the best idea to store files in a database. YouTube and other websites are web applications that store files in file systems. The database is then only needed to hold the information used to locate the required files on the file system before serving them to users.

They could be stored on disk, with a DB holding only the paths. I'm not sure what you're asking.
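A minimal sketch of that layout, assuming SQLite for the catalog and a local directory for the files. The table name `files` and the helpers `open_catalog`, `store_file` and `load_file` are made up for illustration; this is not how YouTube or any particular site does it.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_catalog(db_path=":memory:"):
    """Create (if needed) a table that holds file names and paths only."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, path TEXT NOT NULL)"
    )
    return conn


def store_file(conn, storage_dir, name, data):
    """Write the bytes to disk and record only the path in the database."""
    path = Path(storage_dir) / name
    path.write_bytes(data)
    cur = conn.execute(
        "INSERT INTO files (name, path) VALUES (?, ?)", (name, str(path))
    )
    conn.commit()
    return cur.lastrowid


def load_file(conn, file_id):
    """Look up the path in the database, then read the file from disk."""
    row = conn.execute(
        "SELECT path FROM files WHERE id = ?", (file_id,)
    ).fetchone()
    if row is None:
        raise KeyError(file_id)
    return Path(row[0]).read_bytes()


# Usage: the file contents live on disk; the database row stays tiny.
with tempfile.TemporaryDirectory() as storage_dir:
    conn = open_catalog()
    file_id = store_file(conn, storage_dir, "clip.flv", b"fake video bytes")
    restored = load_file(conn, file_id)
    conn.close()
```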

Why do you want to store it as a BLOB? You can just store it as a file (FLV or whatever) and stream it from there.

Related

Database or Filesystem?

I know there are several similar questions, but I can't find one that answers my specific problem:
I need to save some data on a server for a game I'm developing.
Each user will save one binary file and ask for it again some time later.
The file can be anything from just a handful of bytes to around 50 KB.
A lot of questions (mostly about images) say to use the filesystem, because the file can then be served as static content. In this case that's not possible, since I will have to check somehow that I'm sending the file to the right user, and I also need some logic to return the file only if it's not the same one the user already has.
Should I save that file in the database or in the filesystem?
Note: The server will be hosted on Linux, and the DB will probably be MySQL.
Thanks!
I'm afraid you're far from providing enough information to answer your question correctly. If I read your question "naively", all you're trying to do is write a save game system.
In such a case, the file system is really all you need. Databases are good for storing structured data that you're going to search, sum, combine and index, not for storing an arbitrary bunch of small blobs.
Now, if there are other requirements, for instance if you're writing a web-based game that stores the data for all players in a central location, the answer MIGHT be different (again, you need to provide many more details about what you're doing, though).
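The two checks from the question (send the file only to its owner, and only if it differs from the copy the client already has) do not require a database. A rough sketch, assuming one save file per user named after the user id and a SHA-256 digest sent by the client; `save_path`, `fetch_save` and the `.sav` extension are illustrative names, and authenticating the user is assumed to happen elsewhere.

```python
import hashlib
import tempfile
from pathlib import Path


def digest(data: bytes) -> str:
    """Hex SHA-256 of the save data; the client computes the same value."""
    return hashlib.sha256(data).hexdigest()


def save_path(root: Path, user_id: int) -> Path:
    """One file per user; the path is derived from the authenticated id."""
    return root / f"{user_id}.sav"


def fetch_save(root, user_id, client_digest=None):
    """Return the user's save bytes, or None if the client copy is current."""
    data = save_path(Path(root), user_id).read_bytes()
    if client_digest is not None and client_digest == digest(data):
        return None  # nothing to send, the client already has this version
    return data


# Usage: the first request gets the file, the second one gets nothing.
with tempfile.TemporaryDirectory() as root:
    save_path(Path(root), 42).write_bytes(b"level=3;gold=120")
    first = fetch_save(root, 42)
    second = fetch_save(root, 42, digest(first))
```

Because the path is built from the server-side user id and never from client input, a user cannot request someone else's file.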
I suggest reading this whitepaper (To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem) by Microsoft Research (it deals with SQL Server 2005).
The conclusion is that if the files are under about 256 KB, you should use the DB to store them:
In essence, filesystems seem to have better fragmentation handling than databases and this drives the break-even point down from about 1MB to about 256KB.
Of course, you should test for your own DB and OS.
I think you could use some sort of database. The filesystem is slow because it has to go to the hard disk and move the head to find the file. With a DB, access is faster. If your server is hosted on Windows, you can use a Microsoft DB, Access, which is small and fast.

database for huge files like audio and video

My application creates a large number of files, each up to 100 MB. Currently we store these files in the file system, which works pretty well. But I am wondering whether it would be better to store the files in some kind of file database. The simple advantage of a database would be that it can split a file and store it in small chunks instead of as one 100 MB file.
A file system is perfectly suited for storing files. If you need to associate them with a database, do it by filename. The filesystem already does numerous fancy things to assure it is efficient. It's probably best that you don't try to outsmart it.
Relational databases are no good at files this big. You could go to something like HDFS, but it may not be worth the trouble if what you have is doing the job. I believe it does break large files down into chunks though.
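To illustrate what "breaking a file into chunks" means in general, here is a small sketch that splits a byte stream into fixed-size pieces and joins them back together. It is not HDFS's actual mechanism; `iter_chunks`, `reassemble` and the 4 MB default are assumptions chosen for the example.

```python
import io
from typing import BinaryIO, Iterable, Iterator

CHUNK_SIZE = 4 * 1024 * 1024  # illustrative default, not a recommendation


def iter_chunks(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield consecutive pieces of at most chunk_size bytes."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def reassemble(chunks: Iterable[bytes]) -> bytes:
    """Concatenate chunks in their original order."""
    return b"".join(chunks)


# Usage with a tiny chunk size so the split is easy to see.
stream = io.BytesIO(b"abcdefghij")
chunks = list(iter_chunks(stream, chunk_size=4))
original = reassemble(chunks)
```

As long as the chunks are stored with their order, the reader gets back exactly the bytes that were written, which is why a user never notices the split.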

How to efficiently store hundreds of thousands of documents?

I'm working on a system that will need to store a lot of documents (PDFs, Word files etc.). I'm using Solr/Lucene to search for relevant information extracted from those documents, but I also need a place to store the original files so that they can be opened/downloaded by the users.
I was thinking about several possibilities:
file system - probably not that good an idea for storing 1m documents
sql database - but I won't need most of its relational features, as I need to store only the binary document and its id, so this might not be the fastest solution
no-sql database - I don't have any experience with them, so I'm not sure if they are any good either; there are also many of them, so I don't know which one to pick
The storage I'm looking for should be:
fast
scalable
open-source (not crucial but nice to have)
Can you recommend what, in your opinion, would be the best way of storing those files?
A filesystem -- as the name suggests -- is designed and optimised to store large numbers of files in an efficient and scalable way.
You can follow Facebook's example, as it stores a lot of files (15 billion photos):
They initially started with an NFS share served by commercial storage appliances.
Then they moved to their own implementation of an HTTP file server, called Haystack.
Here is a Facebook note if you want to learn more: http://www.facebook.com/note.php?note_id=76191543919
Regarding the NFS share: keep in mind that NFS shares usually limit the number of files in one folder for performance reasons. (This could be a bit counterintuitive if you assume that all recent file systems use b-trees to store their structure.) So if you are using commercial NFS shares (like NetApp), you will likely need to keep files in multiple folders.
You can do that if you have any kind of id for your files. Just divide its ASCII representation into groups of a few characters and make a folder for each group.
For example, we use integers for ids, so the file with id 1234567891 is stored as storage/0012/3456/7891.
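The id-to-folder mapping above can be sketched as follows. The zero-padding to 12 digits and the groups of 4 are inferred from the single example given (1234567891 becomes storage/0012/3456/7891); the function name `shard_path` and its parameters are illustrative.

```python
from pathlib import PurePosixPath


def shard_path(file_id: int, root: str = "storage",
               width: int = 12, group: int = 4) -> PurePosixPath:
    """Map an integer id to a nested path built from its padded digits."""
    if file_id < 0:
        raise ValueError("file_id must be non-negative")
    digits = str(file_id).zfill(width)
    if len(digits) > width:
        raise ValueError(f"file_id does not fit in {width} digits")
    # Split the padded digit string into fixed-size groups, one per level.
    parts = [digits[i:i + group] for i in range(0, width, group)]
    return PurePosixPath(root, *parts)


print(shard_path(1234567891))  # storage/0012/3456/7891
```

With groups of 4 digits, no directory holds more than 10,000 entries, whatever the total number of files.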
Hope that helps.
In my opinion...
I would store the files compressed on disk (file system) and use a database to keep track of them,
possibly SQLite if this is its only job.
File system: looking at the big picture, the DBMS itself sits on top of the file system anyway. And the file system is dedicated to storing files, so you benefit from its optimizations (as LukeH mentioned).

Storing a small number of images: blob or fs?

I'm adding some functionality to my site so that users can upload their own profile pictures, so I was wondering about whether to store them in the database as a BLOB, or put them in the file system.
I found a question similar to this here: Storing images in DB: Yea or Nay, but the answers given were geared more towards people expecting many many thousands or even millions of images, whereas I'm more concerned about small images (JPEGs up to maybe 150x150 pixels), and small numbers of them: perhaps up to one or two thousand.
What are the feelings about DB BLOB vs Filesystem for this scenario? How do clients go with caching images from the DB vs from the filesystem?
If BLOBs stored in the DB are the way to go - is there anything I should know about where to store them? Since I imagine that a majority of my users won't be uploading a picture, should I create a user_pics table to (outer) join to the regular users table when needed?
Edit: I'm reopening this question, because it's not a duplicate of those two you linked to. This question is specifically about the pros/cons of using a DB or FS for a SMALL number of images. As I said above, the other question is targeted towards people who need to store thousands upon thousands of large images.
To answer parts of your question:
How do clients go with caching images from the DB vs from the filesystem?
For a database: Have a last_modified field in your database. Use the Last-Modified HTTP header so the client's browser can cache properly. Be sure to send the appropriate response when the browser requests an image only "if newer" (the If-Modified-Since request header): reply with 304 Not Modified when the image has not changed.
For a filesystem: Do the same thing, but with the file's modified time.
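A framework-independent sketch of that exchange, assuming the server knows the image's last-modified time as a timezone-aware `datetime` (from the last_modified column or the file's mtime). The function name `conditional_response` is made up; a real web framework may already handle this for static files.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict, Optional, Tuple


def conditional_response(last_modified: datetime,
                         if_modified_since: Optional[str]) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for an image last changed at last_modified.

    last_modified must be timezone-aware.
    """
    # HTTP dates have one-second resolution, so drop microseconds first.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall back to a full response
        if since is not None and since.tzinfo is not None and last_modified <= since:
            return 304, headers  # Not Modified: send no body
    return 200, headers


changed_at = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
status, headers = conditional_response(changed_at, None)
```

On a 200 the image bytes follow as the body; on a 304 the browser reuses its cached copy.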
If BLOBs stored in the DB are the way to go - is there anything I should know about where to store them? Since I imagine that a majority of my users won't be uploading a picture, should I create a user_pics table to (outer) join to the regular users table when needed?
I would put the BLOB and related metadata in its own table, with some kind of relation between it and your user table. Doing this will make it easier to optimize the table storage method for your data, makes things tidier, and leaves room for expandability (e.g. a general "files" table).
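A small sketch of that layout, using SQLite through Python's `sqlite3` module for illustration. The `user_pics` table name comes from the question; the column names and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- Pictures live in their own table; users without a picture have no row here.
CREATE TABLE user_pics (
    user_id       INTEGER PRIMARY KEY REFERENCES users(id),
    mime_type     TEXT NOT NULL,
    last_modified TEXT NOT NULL,
    image         BLOB NOT NULL
);
""")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])
conn.execute("INSERT INTO user_pics VALUES (?, ?, ?, ?)",
             (1, "image/jpeg", "2024-01-01T12:00:00Z", b"\xff\xd8\xff"))

# The outer join keeps users that have not uploaded a picture (image is NULL).
rows = conn.execute("""
    SELECT u.name, p.image
    FROM users AS u
    LEFT OUTER JOIN user_pics AS p ON p.user_id = u.id
    ORDER BY u.id
""").fetchall()
conn.close()
```

Queries that only need user data never touch the BLOB table, and the picture table can later grow into a more general "files" table.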
I once faced a similar question with a small DMS for PDF files. The scenario was different from yours: a maximum of maybe 100 files with sizes up to 10 MB each, which is not what you would expect for profile pictures. But the answer a friend gave me back then applies to your case as well:
Use each storage system for what it is designed to do.
Store data in a database. Store files in a file system.
This is not the ultimate answer(*), but it's a good rule of thumb for starters.
I have never heard of the Windows FS being slow and sometimes unreliable, as Aaron Digulla states in his answer. If there are such problems, this certainly needs to be factored in. But for avatar pictures, it does not strike me as important.
(*) I know, I know, 42...
DB is optimized for latency, transactions, etc.
Image storage is optimized for read latency, storage cost, etc.
A blob store is ideal for storing millions of images. I work on SeaweedFS. It was based on Facebook's design for storing their user photos.
What would be more convenient, from the perspective of serving them, writing the code to serve them, backup procedures, etc.? You want the right answer for you, not the right answer for someone else.
From my point of view, anything that can be left outside of the database should stay outside. It may go to the file system, or to separate tables which you do not replicate or back up every day. This makes the database much lighter; it grows more slowly and is easier to understand and maintain.
If you are on MSSQL, make sure that BLOBs are stored in a separate data file, not in PRIMARY like everything else.
On Windows, put as much as you can in the database. The filesystem is somewhat slow and sometimes even unreliable.
On Linux, you have more options. Here, you should consider moving big files into a filesystem and just keeping the name in the DB. If you use a modern filesystem like Ext3 or ReiserFS, you can even create many small files with pretty good performance.
You also need to take into account how you can access the data. If you have everything in the DB, you have one access path, need not worry about another set of permissions, but you have to deal with the extra complexity of reading/writing BLOBs. In many DBs, BLOBs can't be searched.
On the filesystem, you can run other tools on your data which isn't possible if the files are stored in a DB.
I would store them in the database:
Backup/restore is easy (if you back up files as well as the database, point-in-time recovery is more complicated)
Transactions in the db mean you should never end up pointing at a file-name that is not there
Less chance someone is going to figure out a sneaky way of putting a script onto your server via a dodgy image upload hack
Since you are talking about a small number of images, ease of use/administration should take preference over performance issues which are debated in the linked questions.
I think there is a manageability advantage to storing them in the database: they can be backed up and restored consistently with the other data, you won't forget to delete obsolete ones (well, you might, but it's a bit less likely), and if you migrate the database to another machine, the images go with it.

BLOB Storage - 100+ GB, MySQL, SQLite, or PostgreSQL + Python

I have an idea for a simple application which will monitor a group of folders and index any files it finds. A GUI will allow me to quickly tag new files and move them into a single database for storage, and will also provide an easy mechanism for querying the DB by tag, name, file type and date. At the moment I have about 100+ GB of files on a couple of removable hard drives, so the database will be at least that big. If possible, I would like to support full text search of the embedded binary and text documents. This will be a single-user application.
Not trying to start a DB war, but what open source DB is going to work best for me? I am pretty sure SQLite is off the table, but I could be wrong.
I'm still researching this option for one of my own projects, but CouchDB may be worth a look.
Why store the files in the database at all? Simply store your meta-data and a filename. If you need to copy them to a new location for some reason, just do that as a file system copy.
Once you remove the file contents then any competent database will be able to handle the meta-data for a few hundred thousand files.
My preference would be to store the document with the metadata. One reason is relational integrity: you can't easily move or modify the files without the action being brokered by the DB. I am sure I can handle these problems, but it isn't as clean as I would like, and my experience has been that most vendors can handle huge amounts of binary data in the database these days. I guess I was wondering if PostgreSQL or MySQL have any obvious advantages in these areas; I am primarily familiar with Oracle. Anyway, thanks for the response. If the DB knows where the external file is, it will also be easy to bring the file in at a later date if I want. Another aspect of the question was whether either database is easier to work with when using Python. I'm assuming that is a wash.
I always hate to answer "don't", but you'd be better off indexing with something like Lucene (PyLucene). That and storing the paths in the database rather than the file contents is almost always recommended.
To add to that, none of those database engines will store LOBs in a separate dataspace (they'll be embedded in the table's data space), so any of those engines should perform nearly equally well (well, except SQLite). You need to move to Informix, DB2, SQL Server or others to get that kind of binary object handling.
Pretty much any of them would work (even though SQLite wasn't meant to be used in a concurrent multi-user environment, which could be a problem...) since you don't want to index the actual contents of the files.
The only limiting factor is the maximum "packet" size of the given DB (by packet I'm referring to a query/response). Usually these limits are around 2 MB, meaning that your files must be smaller than 2 MB. Of course you could increase this limit, but the whole process is rather inefficient, since to insert a file, for example, you would have to:
Read the entire file into memory
Transform the file into a query (which usually means hex encoding it, thus doubling the size from the start)
Execute the generated query (which itself means that the database has to parse it)
I would go with a simple DB and the associated files stored using a naming convention which makes them easy to find (for example based on the primary key). Of course this design is not "pure", but it will perform much better and is also easier to use.
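One way to read "a naming convention based on the primary key" is sketched below, again with SQLite and a local directory as stand-ins. The `documents` table, the `.bin` extension and the helper names are assumptions made for the example.

```python
import sqlite3
import tempfile
from pathlib import Path


def document_path(storage_dir, doc_id: int) -> Path:
    """The file name is derived from the primary key, so no path is stored."""
    return Path(storage_dir) / f"{doc_id}.bin"


def add_document(conn, storage_dir, title: str, data: bytes) -> int:
    """Insert the metadata row, then write the file under its row id."""
    cur = conn.execute("INSERT INTO documents (title) VALUES (?)", (title,))
    doc_id = cur.lastrowid
    try:
        document_path(storage_dir, doc_id).write_bytes(data)
    except OSError:
        conn.rollback()  # do not keep a row whose file could not be written
        raise
    conn.commit()
    return doc_id


# Usage: only the title goes into the database; the bytes go to disk.
with tempfile.TemporaryDirectory() as storage_dir:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL)"
    )
    doc_id = add_document(conn, storage_dir, "report.pdf", b"%PDF-1.4 ...")
    stored = document_path(storage_dir, doc_id).read_bytes()
    conn.close()
```

The file contents never pass through a SQL statement here, so the packet-size limit and the encoding overhead described above do not come into play.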
Why are you wasting time emulating something that the filesystem should be able to handle? More storage + grep is your answer.

Resources