Storing Data in Files on the Server rather than in Databases? - database

What are the problems associated with storing your Data in files rather than databases? I'm thinking in terms of something like a blog engiene. I read that MoveableType used to do this. What are the pros/cons of working this way?

Databases provide means to perform interesting queries more easily.
Examples: You would want to list the 10 most recent posts on the front page. Make an archive page that lists all articles published in a given year (taken from the url).

I think the main one is data consistency. If you keep everything together in one db table, you don't have to worry (as much) about the file being externally modified or deleted without the meta data being modified in sync. There's also the possibility of an incomplete write if the server fails while you're updating. In this case you have to take your own steps to implement transactions.
I think that with an appropriate level of care and file permissions though, these problems can be overcome.

It is much easier and more comfortable to specify access rights (to data or file) in database than to use OS specific access rights.
You can easily share data across machines and/or websites using database-stored files.
Unfortunately, it is (often) much slower to serve files stored in database.

Related

NoSQL for filesystem storage organization and replication?

We've been discussing design of a data warehouse strategy within our group for meeting testing, reproducibility, and data syncing requirements. One of the suggested ideas is to adapt a NoSQL approach using an existing tool rather than try to re-implement a whole lot of the same on a file system. I don't know if a NoSQL approach is even the best approach to what we're trying to accomplish but perhaps if I describe what we need/want you all can help.
Most of our files are large, 50+ Gig in size, held in a proprietary, third-party format. We need to be able to access each file by a name/date/source/time/artifact combination. Essentially a key-value pair style look-up.
When we query for a file, we don't want to have to load all of it into memory. They're really too large and would swamp our server. We want to be able to somehow get a reference to the file and then use a proprietary, third-party API to ingest portions of it.
We want to easily add, remove, and export files from storage.
We'd like to set up automatic file replication between two servers (we can write a script for this.) That is, sync the contents of one server with another. We don't need a distributed system where it only appears as if we have one server. We'd like complete replication.
We also have other smaller files that have a tree type relationship with the Big files. One file's content will point to the next and so on, and so on. It's not a "spoked wheel," it's a full blown tree.
We'd prefer a Python, C or C++ API to work with a system like this but most of us are experienced with a variety of languages. We don't mind as long as it works, gets the job done, and saves us time. What you think? Is there something out there like this?
Have you had a look at MongoDB's GridFS.
http://www.mongodb.org/display/DOCS/GridFS+Specification
You can query files by the default metadata, plus your own additional metadata. Files are broken out into small chunks and you can specify which portions you want. Also, files are stored in a collection (similar to a RDBMS table) and you get Mongo's replication features to boot.
Whats wrong with a proven cluster file system? Lustre and ceph are good candidates.
If you're looking for an object store, Hadoop was built with this in mind. In my experience Hadoop is a pain to work with and maintain.
For me both Lustre and Ceph has some problems that databases like Cassandra dont have. I think the core question here is what disadvantage Cassandra and other databases like it would have as a FS backend.
Performance could obviously be one. What about space usage? Consistency?

database vs. flat files

The company I work for is trying to switch a product that uses flat file format to a database format. We're handling pretty big files of data (ie: 25GB/file) and they get updated really quick. We need to run queries that randomly access the data, as well as in a contiguous way. I am trying to convince them of the advantages of using a database, but some of my colleagues seem reluctant to this. So I was wondering if you guys can help me out here with some reasons or links to posts of why we should use databases, or at least clarify why flat files are better (if they are).
Databases can handle querying
tasks, so you don't have to walk
over files manually. Databases can
handle very complicated queries.
Databases can handle indexing tasks,
so if tasks like get record with id
= x can be VERY fast
Databases can handle multiprocess/multithreaded access.
Databases can handle access from
network
Databases can watch for data
integrity
Databases can update data easily
(see 1) )
Databases are reliable
Databases can handle transactions
and concurrent access
Databases + ORMs let you manipulate
data in very programmer friendly way.
This is an answer I've already given some time ago:
It depends entirely on the
domain-specific application needs. A
lot of times direct text file/binary
files access can be extremely fast,
efficient, as well as providing you
all the file access capabilities of
your OS's file system.
Furthermore, your programming language
most likely already has a built-in
module (or is easy to make one) for
specific parsing.
If what you need is many appends
(INSERTS?) and sequential/few access
little/no concurrency, files are the
way to go.
On the other hand, when your
requirements for concurrency,
non-sequential reading/writing,
atomicity, atomic permissions, your
data is relational by the nature etc.,
you will be better off with a
relational or OO database.
There is a lot that can be
accomplished with SQLite3, which
is extremely light (under 300kb), ACID
compliant, written in C/C++, and
highly ubiquitous (if it isn't already
included in your programming language
-for example Python-, there is surely one available). It can be useful even
on db files as big as 140 terabytes, or 128 tebibytes (Link to Database Size), possible
more.
If your requirements where bigger,
there wouldn't even be a discussion,
go for a full-blown RDBMS.
As you say in a comment that "the system" is merely a bunch of scripts, then you should take a look at pgbash.
Don't build it if you can buy it.
I heard this quote recently, and it really seems fitting as a guide line. Ask yourself this... How much time was spent working on the file handling portion of your app? I suspect a fair amount of time was spent optimizing this code for performance. If you had been using a relational database all along, you would have spent considerably less time handling this portion of your application. You would have had more time for the true "business" aspect of your app.
They're faster; unless you're loading the entire flat file into memory, a database will allow faster access in almost all cases.
They're safer; databases are easier to safely backup; they have mechanisms to check for file corruption, which flat files do not. Once corruption in your flat file migrates to your backups, you're done, and you might not even know it yet.
They have more features; databases can allow many users to read/write at the same time.
They're much less complex to work with, once they're setup.
What types of files is not mentioned. If they're media files, go ahead with flat files. You probably just need a DB for tags and some way to associate the "external BLOBs" to the records in the DB. But if full text search is something you need, there's no other way to go but migrate to a full DB.
Another thing, your filesystem might provide the ceiling as far as number of physical files are concerned.
Databases all the way.
However, if you still have a need for storing files, don't have the capacity to take on a new RDBMS (like Oracle, SQLServer, etc), than look into XML.
XML is a structure file format which offers you the ability to store things as a file but give you query power over the file and data within it. XML Files are easier to read than flat files and can be easily transformed applying an XSLT for even better human-readability. XML is also a great way to transport data around if you must.
I strongly suggest a DB, but if you can't go that route, XML is an ok second.
What about a non-relational (NoSQL) database such as Amazon's SimpleDB, Tokio Cabinet, etc? I've heard that Google, Facebook, LinkedIn are using these to store their huge datasets.
Can you tell us if your data is structured, if your schema is fixed, if you need easy replicability, if access times are important, etc?
Difference between database and flat files are given below:
Database provide more flexibility whereas flat file provide less flexibility.
Database system provide data consistency whereas flat file can not provide data consistency.
Database is more secure over flat files.
Database support DML and DDL whereas flat files can not support these.
Less data redundancy in database whereas more data redundancy in flat files.
SQL ad hoc query abilities are enough of a reason for me. With a good schema and indexing on the tables, this is fast and effective and will have good performance.
Unless you are loading the files into memory each time you boot, use a database. Simple as that.
That is assuming that your colleges already have the program to handle queries to the files. If not, then use a database.
Although other answers are good, I would like to emphasize a point that was not really well talked about:
The developer's ease of use. databases are much simpler to work with! If you don't have any strong reason(s) for using files, use a database.

Storing a small number of images: blob or fs?

I'm adding some functionality to my site so that users can upload their own profile pictures, so I was wondering about whether to store them in the database as a BLOB, or put them in the file system.
I found a question similar to this here: Storing images in DB: Yea or Nay, but the answers given were geared more towards people expecting many many thousands or even millions of images, whereas I'm more concerned about small images (JPEGs up to maybe 150x150 pixels), and small numbers of them: perhaps up to one or two thousand.
What are the feelings about DB BLOB vs Filesystem for this scenario? How do clients go with caching images from the DB vs from the filesystem?
If BLOBs stored in the DB are the way to go - is there anything I should know about where to store them? Since I imagine that a majority of my users won't be uploading a picture, should I create a user_pics table to (outer) join to the regular users table when needed?
Edit: I'm reopening this question, because it's not a duplicate of those two you linked to. This question is specifically about the pros/cons of using a DB or FS for a SMALL number of images. As I said above, the other question is targeted towards people who need to store thousands upon thousands of large images.
To answer parts of your question:
How do clients go with caching images from the DB vs from the filesystem?
For a database: Have a last_modified field in your database. Use the Last-Modified HTTP header so the client's browser can cache properly. Be sure to send the appropriate responses when the browser requests for an image "if newer" (can't recall what it's called; some HTTP request header).
For a filesystem: Do the same thing, but with the file's modified time.
If BLOBs stored in the DB are the way to go - is there anything I should know about where to store them? Since I imagine that a majority of my users won't be uploading a picture, should I create a user_pics table to (outer) join to the regular users table when needed?
I would put the BLOB and related metadata in its own table, with some kind of relation between it and your user table. Doing this will make it easier to optimize the table storage method for your data, makes things tidier, and leaves room for expandability (e.g. a general "files" table).
I once faced a similar question with a small DMS for pdf files. The scenario was different from yours: A maximum of may be 100 files with sizes up to 10 MB each - not what you expect for profile pictures. But the answer a friend gave me back then applies to your case as well:
Use each storage system for what it is designed to do.
Store data in a database. Store files in a file system.
This is not the ultimate answer(*), but its a good rule of thumb for starters.
I have never heard of the Windows FS being slow and sometimes unreliable, as Aaron Digulla states in his answer. If there are such problems, this certainly needs to be factored in. But for avatar pictures, it does not strike me as important.
(*) I know, I know, 42...
DB is optimized for latency, transactions, etc.
Image storage is optimized for read latency, storage cost, etc.
A blob store is ideal for storing millions of images. I work on SeaweedFS. It was based on Facebook's design for storing their user photos.
What would be more convenient, from the perspective of serving them, writing the code to serve them, backup procedures, etc.? You want the right answer for you, not the right answer for someone else.
From my point of view anything what may be left outside of database should stay outside. It may be file system or separate tables which you do not replicate or backup every day. It makes database much lighter, it grows slower and it easier to understand and maintain.
If you are on MSSQL make sure that blobs are stored in separate data file. Not in PRIMARY as everything else.
On Windows, put as much as you can in the database. The filesystem is somewhat slow and sometimes even unreliable.
On Linux, you have more options. Here, you should consider moving big files into a filesystem and just keep the name in the DB. If you use a modern filesystem like Ext3 or ReiseFS, you can even create many small files with pretty good performance.
You also need to take into account how you can access the data. If you have everything in the DB, you have one access path, need not worry about another set of permissions, but you have to deal with the extra complexity of reading/writing BLOBs. In many DBs, BLOBs can't be searched.
On the filesystem, you can run other tools on your data which isn't possible if the files are stored in a DB.
I would store them in the database:
Backup/restore is easy (if you backup files and also the database, point-in-time recovery is more complicated)
Transactions in the db mean you should never end up pointing at a file-name that is not there
Less chance someone is going to figure out a sneaky way of putting a script onto your server via a dodgy image upload hack
Since you are talking about a small number of images, ease of use/administration should take preference over performance issues which are debated in the linked questions.
I think there is a managability advantage storing them in the database; they can be backed up and restored consistently with the other data - you won't forget to delete obsolete ones (well, you might, but it's a bit less likely), and if you migrate the database to another machine, the images go with it.

How do you handle small sets of data?

With really small sets of data, the policy where I work is generally to stick them into text files, but in my experience this can be a development headache. Data generally comes from the database and when it doesn't, the process involved in setting it/storing it is generally hidden in the code. With the database you can generally see all the data available to you and the ways with which it relates to other data.
Sometimes for really small sets of data I just store them in an internal data structure in the code (like A Perl hash) but then when a change is needed, it's in the hands of a developer.
So how do you handle small sets of infrequently changed data? Do you have set criteria of when to use a database table or a text file or..?
I'm tempted to just use a database table for absolutely everything but I'm not sure if there are any implications to this.
Edit: For context:
I've been asked to put a new contact form on the website for a handful of companies, with more to be added occasionally in the future. Except, companies don't have contact email addresses.. the users inside these companies do (as they post jobs through their own accounts). Now though, we want a "speculative application" type functionality and the form needs an email address to send these applications to. But we also don't want to put an email address as a property in the form or else spammers can just use it as an open email gateway. So clearly, we need an ID -> contact_email type relationship with companies.
SO, I can either add a column to a table with millions of rows which will be used, literally, about 20 times OR create a new table that at most is going to hold about 20 rows. Typically how we handle this in the past is just to create a nasty text file and read it from there. But this creates maintenance nightmares and these text files are frequently looked over when data that they depend on changes. Perhaps this is a fault with the process, but I'm just interested in hearing views on this.
Put it in the database. If it changes infrequently, cache it in your middle tier.
The example that springs to mind immediately is what is appropriate to have stored as an enumeration and what is appropriate to have stored in a "lookup" database table.
I tend to "draw the line" with the rule that if it will result in a column in the database containing a "magic number" that maps to an enumeration value, then the enumeration should really exist as a lookup table. If it's unrelated to the data stored in the database (eg. Application configuration data rather than user generated data), then it's an enumeration all the way.
Surely it depends on the user of the software tool you've developed to consume the set of data, regardless of size?
It might just be that they know Excel, so your tool would have to parse a .csv file that they create.
If it's written for the developers, then who cares what you use. I'm not a fan of cluttering databases with minor or transient data however.
We have a standard config file format (key:value) and a class to handle it. We just use that on all projects. Mostly we're just setting persistent properties for our applications (mobile phone development) so that's an appropriate thing to do. YMMV
In cases where the program accesses a database, I'll store everything in there: easier for backup and moving data around.
For small programs without database access I store my data in the .net settings, which are stored in an xml file - of course this is a feature of c#, so it might not apply to you.
Anyway, I make sure to store all data in one place. Usually a database.
Have you considered sqlite ? It's file-based, which addresses your feeling that "just a file might do" (zero configuration), but it's a perfectly good database and scales remarkably well. It supports a number of APIs and there are numerous front ends for administering it.
If these are small config-like data, i use some simple and common format. ini, json and yaml are usually ok. Java and .NET fans also like XML. in short, use something that you can easily read to an in-memory object and forget about it.
I would add it to the database in the main table:
Backup and recovery (you do want to recover this text file, right?)
Adhoc querying (since you can do it will a SQL tool and join it to the other database data)
If the database column is empty the store requirements for it should be minimal (nothing if it's a NULL column at the end of the table in Oracle)
It will be easier if you want to have multiple application servers as you will not need to keep multiple copies of some extra config file around
Putting it into a little child table only complicates the design without giving any real benefits
You may well already be going to that same row in the database as part of your processing anyway, so performance is not likely to be a problem. If you are not, you could cache it in memory.

Storing a file in a database as opposed to the file system?

Generally, how bad of a performance hit is storing a file in a database (specifically mssql) as opposed to the file system? I can't come up with a reason outside of application portability that I would want to store my files as varbinaries in SQL Server.
Have a look at this answer:
Storing Images in DB - Yea or Nay?
Essentially, the space and performance hit can be quite big, depending on the number of users. Also, keep in mind that Web servers are cheap and you can easily add more to balance the load, whereas the database is the most expensive and hardest to scale part of a web architecture usually.
There are some opposite examples (e.g., Microsoft Sharepoint), but usually, storing files in the database is not a good idea.
Unless possibly you write desktop apps and/or know roughly how many users you will ever have, but on something as random and unexpectable like a public web site, you may pay a high price for storing files in the database.
If you can move to SQL Server 2008, you can take advantage of the FILESTREAM support which gives you the best of both - the files are stored in the filesystem, but the database integration is much better than just storing a filepath in a varchar field. Your query can return a standard .NET file stream, which makes the integration a lot simpler.
Getting Started with FILESTREAM Storage
I'd say, it depends on your situation. For example, I work in local government, and we have lots of images like mugshots, etc. We don't have a high number of users, but we need to have good security and auditing around the data. The database is a better solution for us since it makes this easier and we aren't going to run into scaling problems.
What's the question here?
Modern DBMS SQL2008 have a variety of ways of dealing with BLOBs which aren't just sticking in them in a table. There are pros and cons, of course, and you might need to think about it a little deeper.
This is an interesting paper, by the late (?) Jim Gray
To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem
In my own experience, it is always better to store files as files. The reason is that the filesystem is optimised for file storeage, whereas a database is not. Of course, there are some exceptions (e.g. the much heralded next-gen MS filesystem is supposed to be built on top of SQL server), but in general that's my rule.
While performance is an issue, I think modern database designs have made it much less of an issue for small files.
Performance aside, it also depends on just how tightly-coupled the data is. If the file contains data that is closely related to the fields of the database, then it conceptually belongs close to it and may be stored in a blob. If it contains information which could potentially relate to multiple records or may have some use outside of the context of the database, then it belongs outside. For example, an image on a web page is fetched on a separate request from the page that links to it, so it may belong outside (depending on the specific design and security considerations).
Our compromise, and I don't promise it's the best, has been to store smallish XML files in the database but images and other files outside it.
We made the decision to store as varbinary for http://www.freshlogicstudios.com/Products/Folders/ halfway expecting performance issues. I can say that we've been pleasantly surprised at how well it's worked out.
I agree with #ZombieSheep.
Just one more thing - I generally don't think that databases actually need be portable because you miss all the features your DBMS vendor provides. I think that migrating to another database would be the last thing one would consider. Just my $.02
The overhead of having to parse a blob (image) into a byte array and then write it to disk in the proper file name and then reading it is enough of an overhead hit to discourage you from doing this too often, especially if the files are rather large.
Not to be vague or anything but I think the type of 'file' you will be storing is one of the biggest determining factors. If you essentially talking about a large text field which could be stored as file my preference would be for db storage.
Interesting topic.
There is no absolutely one correct answer to this question.
There are few key elements to consider:
What’s your database engine?
What’s the route of file from database to end user and/or backwards?
What are the security requirements?
If files are meant for public audience and accessible via website, you shouldn’t even consider storing files in database. Use some smart indexing for files instead.
If files are containing highly sensitive information, then it might be worth of storing these into database. But you have to implement proper safe gateways too.
If performance is crucial, it’s better do not store files in database.
Backup and restoring and migrating of database might become a nightmare if database grows big just because of files. If you are DBA, then you would like to kill the person who “invented” an idea to put files into database.
I recommend to use storing files into database at last option, when there is absolutely no any better alternative available.

Resources