Store / read binary files: which is the most performant approach? - database

I know SQL or NoSQL questions have been asked and discussed over and over again.
I have a very specific case on which I want to decide upon one or the other way.
I need to store documents and retrieve them in an efficient way. The structure / flexibility of the structure is not an issue here - only the storage and more importantly, the reading of the binary file itself.
Is one or the other method a preferred choice for the specific use-case and, if so, could someone explain why?
UPDATE
I realize the question is improperly formulated. The question is not related to structure, so SQL vs NoSQL isn't the right topic here. Apologies.
Say I want to create a database as a file storage. As said, storing absolute file paths isn't an option here. The question is whether there is a known DB server optimized for that.

SQL and NoSQL differ in how they structure and query the data.
However, binary files to not have any structure, and are written and queried as a single blob.
Therefore, the differences between SQL and NoSQL do not matter.
In other words, there might be big differences how specific databases handle blobs, but those differences would not be related to whether a DB uses SQL or not.
(But it should be noted that there are certain NoSQL databases that are particularly optimized for storing binary files; they are called "file systems".)

Related

What is the most effective method for handling large scale dynamic data for recommendation system?

We re thinking on a recommendation system based on large scale data but also looking for a professional way to keeping a dynamic DB structure for working in faster manner. We consider some of the alternative approaches. One is to keep in a normal SQL database but it would be slower compared to using normal file structure. Second is to use nosql graph model DB but it is also not compatible with the algorithms we use since we continuously pull al the data into a matrix. Final approach we think is to use normal files to keep the data but it is harder to keep track and watch the changes since no query method or the editor. Hence there are different methods and the pros and cons. What ll be the your choice and why?
I'm not sure why you mention "files" and "file structure" so many times, so maybe I'm missing something, but for efficient data processing, you obviously don't want to store things in files. It is expensive to read/write data to disk and it's hard to find something to query files in a file system that is efficient and flexible.
I suppose I'd start with a product that already does recommendations:
http://mahout.apache.org/
You can pick from various algorithms to run on your data for producing recommendations.
If you want to do it yourself, maybe a hybrid approach would work? You could still use a graph database to represent relationships, but then each node/vertex could be a pointer to a document database or a relational database where a more "full" representation of the data would exist.

Flexible way/database to keep a very large amount of data

I have a very large SQLite database (~10MB). Currently I'm using this database with a custom PHP interface at localhost, but I'm looking for a different storage and a different interface as well. The data inside does not change too often, so reading data has the priority.
I'm not very happy with the current setup, because the database schema is "custom" and I don't want to need to change it in the future (because maybe I will find a "better way"). I think it's not standard and that this may bother me in the future.
My question is: what is a good way to store data that is very flexible? I'm looking for a standard data storage that I will to fill just one time. Is SQLite still the best choice?
What is inside this database:
I'm going to describe in general how's the structure.
Table A is full of questions;
Table B is full of answers. Each question has different answers;
Tables from C are full of data regarding answer X of question Y. Each answer has a lot of different metadata available.
What is extremely important:
fast queries;
portability through *NIX without having to install a lot of software;
not too complex queries, bit still I will need some joins;
don't have to change the schema every two months.
What is not important:
frequent updates (I update the database once a year, more or less).
I thought of going with files and folders, but I'm not sure it's the best thing.
Your problem can be well modelled as a Graph Problem. What about a Graph Based Database like Orient DB or Neo4j? In a portion of Graph there can be 2 nodes: One for the question say A, one for the answer say B and the edge connecting them can have your table c data or metadata. Each answer node can have multiple incoming edges from several question nodes and there can be multiple answers to a single question. You dont require any schema, the databases I have mentioned are NoSQL databases. Search would be extremely fast as even they can be integrated with Lucene Search Engine for full text searches and I believe since you have Q&A so you would require Full Text Searches.
Here is an example:

Database design to manage data from Serial Port?

I'm writing codes to capture Serial Port data (e.g. micrometer) and do the following
real time display and graphing
delete/modify/replace existing data by focus and re-measurement,
save data somewhere for additional statistical analysis (e.g. in Excel). So out put as .csv is also an option
Because each measurement may capture hundreds to thousands of data (measurement points), I'm not sure how shall go about to design my database - shall I create a new row for each data received, or shall I push all data into an array and store as a super long string separated by a comma into the database? For such application, do I need Server 2008 or would Server 2008 Express will be sufficient. What are their pros/cons in terms of performance?
Is it possible to create such as application where client won't need to install sql server?
To answer your last question first, look at SQLite because it is free open source, and I'm pretty sure their license is null. Plus nothing to install for your users. SQLite can be compiled with your code.
To answer your primary question, I would encourage using a database only if it really makes sense. Are you going to be running queries against the data, or are you just looking to store and retrieve entries by some sort of identifier? I would discourage storing it as a super long comma separated string, but instead look into the BLOB type. With BLOBs, you can put datatypes into the database so you can easily get them back out, plus I believe it is much more efficient that way. I would only suggest using TEXT type if you need to do some sort of text query on it. For example fulltext searches
If you decide that the giant string approach makes sense for your application, you would probably just be better off using text files. Relational databases provide benefit only when your data is structured. If you go with a database it should be because you are storing discreet values in rows and columns.
An out of process database strikes me as a mismatch for this project. Have you considered SQLite or some other compile-or-link in solution? You'll probably get faster throughput of those large measurement vectors with something that runs in process.
What do you want (or what do you think you want) a relational database for?
Based on the simple requirements you expressed here (and assuming there are none other), I would suggest flat files (this could be either plain text, one sample per line, or binary). But I really don't think I have enough information to make a decision on your behalf. Some key questions are:
Do you need to share the data with others, who uses the data, where, over the network? (this will dictate the format you save the data in and a few other things)
How much data do you need to store, how fast does it get measured and over what period of time to measure and store this data? (this will tell you whether to use compression, or some other space saving scheme or not, this will also give you pointers to resource requirements)
How easily retrievable does it need to be? (this will give pointers as to what data management you need: how to name files, where to store them...)
What sort of analysis do you need to do on it? (do you want to analyse several files together, just one file, data from a given day, data for a given port... ?)
... and many more such questions.
Depending on answers to these questions you might be happy with an old 386 computer, or you might need a modern 8-core.

database vs. flat files

The company I work for is trying to switch a product that uses flat file format to a database format. We're handling pretty big files of data (ie: 25GB/file) and they get updated really quick. We need to run queries that randomly access the data, as well as in a contiguous way. I am trying to convince them of the advantages of using a database, but some of my colleagues seem reluctant to this. So I was wondering if you guys can help me out here with some reasons or links to posts of why we should use databases, or at least clarify why flat files are better (if they are).
Databases can handle querying
tasks, so you don't have to walk
over files manually. Databases can
handle very complicated queries.
Databases can handle indexing tasks,
so if tasks like get record with id
= x can be VERY fast
Databases can handle multiprocess/multithreaded access.
Databases can handle access from
network
Databases can watch for data
integrity
Databases can update data easily
(see 1) )
Databases are reliable
Databases can handle transactions
and concurrent access
Databases + ORMs let you manipulate
data in very programmer friendly way.
This is an answer I've already given some time ago:
It depends entirely on the
domain-specific application needs. A
lot of times direct text file/binary
files access can be extremely fast,
efficient, as well as providing you
all the file access capabilities of
your OS's file system.
Furthermore, your programming language
most likely already has a built-in
module (or is easy to make one) for
specific parsing.
If what you need is many appends
(INSERTS?) and sequential/few access
little/no concurrency, files are the
way to go.
On the other hand, when your
requirements for concurrency,
non-sequential reading/writing,
atomicity, atomic permissions, your
data is relational by the nature etc.,
you will be better off with a
relational or OO database.
There is a lot that can be
accomplished with SQLite3, which
is extremely light (under 300kb), ACID
compliant, written in C/C++, and
highly ubiquitous (if it isn't already
included in your programming language
-for example Python-, there is surely one available). It can be useful even
on db files as big as 140 terabytes, or 128 tebibytes (Link to Database Size), possible
more.
If your requirements where bigger,
there wouldn't even be a discussion,
go for a full-blown RDBMS.
As you say in a comment that "the system" is merely a bunch of scripts, then you should take a look at pgbash.
Don't build it if you can buy it.
I heard this quote recently, and it really seems fitting as a guide line. Ask yourself this... How much time was spent working on the file handling portion of your app? I suspect a fair amount of time was spent optimizing this code for performance. If you had been using a relational database all along, you would have spent considerably less time handling this portion of your application. You would have had more time for the true "business" aspect of your app.
They're faster; unless you're loading the entire flat file into memory, a database will allow faster access in almost all cases.
They're safer; databases are easier to safely backup; they have mechanisms to check for file corruption, which flat files do not. Once corruption in your flat file migrates to your backups, you're done, and you might not even know it yet.
They have more features; databases can allow many users to read/write at the same time.
They're much less complex to work with, once they're setup.
What types of files is not mentioned. If they're media files, go ahead with flat files. You probably just need a DB for tags and some way to associate the "external BLOBs" to the records in the DB. But if full text search is something you need, there's no other way to go but migrate to a full DB.
Another thing, your filesystem might provide the ceiling as far as number of physical files are concerned.
Databases all the way.
However, if you still have a need for storing files, don't have the capacity to take on a new RDBMS (like Oracle, SQLServer, etc), than look into XML.
XML is a structure file format which offers you the ability to store things as a file but give you query power over the file and data within it. XML Files are easier to read than flat files and can be easily transformed applying an XSLT for even better human-readability. XML is also a great way to transport data around if you must.
I strongly suggest a DB, but if you can't go that route, XML is an ok second.
What about a non-relational (NoSQL) database such as Amazon's SimpleDB, Tokio Cabinet, etc? I've heard that Google, Facebook, LinkedIn are using these to store their huge datasets.
Can you tell us if your data is structured, if your schema is fixed, if you need easy replicability, if access times are important, etc?
Difference between database and flat files are given below:
Database provide more flexibility whereas flat file provide less flexibility.
Database system provide data consistency whereas flat file can not provide data consistency.
Database is more secure over flat files.
Database support DML and DDL whereas flat files can not support these.
Less data redundancy in database whereas more data redundancy in flat files.
SQL ad hoc query abilities are enough of a reason for me. With a good schema and indexing on the tables, this is fast and effective and will have good performance.
Unless you are loading the files into memory each time you boot, use a database. Simple as that.
That is assuming that your colleges already have the program to handle queries to the files. If not, then use a database.
Although other answers are good, I would like to emphasize a point that was not really well talked about:
The developer's ease of use. databases are much simpler to work with! If you don't have any strong reason(s) for using files, use a database.

Storing a file in a database as opposed to the file system?

Generally, how bad of a performance hit is storing a file in a database (specifically mssql) as opposed to the file system? I can't come up with a reason outside of application portability that I would want to store my files as varbinaries in SQL Server.
Have a look at this answer:
Storing Images in DB - Yea or Nay?
Essentially, the space and performance hit can be quite big, depending on the number of users. Also, keep in mind that Web servers are cheap and you can easily add more to balance the load, whereas the database is the most expensive and hardest to scale part of a web architecture usually.
There are some opposite examples (e.g., Microsoft Sharepoint), but usually, storing files in the database is not a good idea.
Unless possibly you write desktop apps and/or know roughly how many users you will ever have, but on something as random and unexpectable like a public web site, you may pay a high price for storing files in the database.
If you can move to SQL Server 2008, you can take advantage of the FILESTREAM support which gives you the best of both - the files are stored in the filesystem, but the database integration is much better than just storing a filepath in a varchar field. Your query can return a standard .NET file stream, which makes the integration a lot simpler.
Getting Started with FILESTREAM Storage
I'd say, it depends on your situation. For example, I work in local government, and we have lots of images like mugshots, etc. We don't have a high number of users, but we need to have good security and auditing around the data. The database is a better solution for us since it makes this easier and we aren't going to run into scaling problems.
What's the question here?
Modern DBMS SQL2008 have a variety of ways of dealing with BLOBs which aren't just sticking in them in a table. There are pros and cons, of course, and you might need to think about it a little deeper.
This is an interesting paper, by the late (?) Jim Gray
To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem
In my own experience, it is always better to store files as files. The reason is that the filesystem is optimised for file storeage, whereas a database is not. Of course, there are some exceptions (e.g. the much heralded next-gen MS filesystem is supposed to be built on top of SQL server), but in general that's my rule.
While performance is an issue, I think modern database designs have made it much less of an issue for small files.
Performance aside, it also depends on just how tightly-coupled the data is. If the file contains data that is closely related to the fields of the database, then it conceptually belongs close to it and may be stored in a blob. If it contains information which could potentially relate to multiple records or may have some use outside of the context of the database, then it belongs outside. For example, an image on a web page is fetched on a separate request from the page that links to it, so it may belong outside (depending on the specific design and security considerations).
Our compromise, and I don't promise it's the best, has been to store smallish XML files in the database but images and other files outside it.
We made the decision to store as varbinary for http://www.freshlogicstudios.com/Products/Folders/ halfway expecting performance issues. I can say that we've been pleasantly surprised at how well it's worked out.
I agree with #ZombieSheep.
Just one more thing - I generally don't think that databases actually need be portable because you miss all the features your DBMS vendor provides. I think that migrating to another database would be the last thing one would consider. Just my $.02
The overhead of having to parse a blob (image) into a byte array and then write it to disk in the proper file name and then reading it is enough of an overhead hit to discourage you from doing this too often, especially if the files are rather large.
Not to be vague or anything but I think the type of 'file' you will be storing is one of the biggest determining factors. If you essentially talking about a large text field which could be stored as file my preference would be for db storage.
Interesting topic.
There is no absolutely one correct answer to this question.
There are few key elements to consider:
What’s your database engine?
What’s the route of file from database to end user and/or backwards?
What are the security requirements?
If files are meant for public audience and accessible via website, you shouldn’t even consider storing files in database. Use some smart indexing for files instead.
If files are containing highly sensitive information, then it might be worth of storing these into database. But you have to implement proper safe gateways too.
If performance is crucial, it’s better do not store files in database.
Backup and restoring and migrating of database might become a nightmare if database grows big just because of files. If you are DBA, then you would like to kill the person who “invented” an idea to put files into database.
I recommend to use storing files into database at last option, when there is absolutely no any better alternative available.

Resources