Which is faster , interacting with a database or using a file system for input output - database

I was wondering what threshold of data volume may determine whether to use a database or a simple file I/O, assuming that fresh data needs to be handled quite frequently.
Edit: There is no multi-threading in my application. Data needs to be stored and then retrieved sequentially and at this point I am not really worried about anyone else accessing the data/data safety.
Given this backdrop is there still any advantage to using databases over files?

It depends and you probably should consider other factors as well.
If you use a database, there is an overhead for transactions, security, index management etc. on the one hand. On the other hand you can get caching (which could significantly speed up your application) and better performance for random access, if you have a lot of data. In a multithreaded environment I suggest using a database because of a property implemented locking mechanism.
Flat files are OK for really simple and small data. Do you really need to open and close them so often?

If you have indexes on your table correctly then I think it would be better to use database instead of file system to get a better performance. Also to include that if your data in the database is going to be million of records then also the performance will not be affected when compared to file system with that much amount of data.

Probably a database is prefered and in this case id suggest to use sqlite database insted of sql server and mysql as data is small.

In this case I would say DB. You are writing and reading and thats what DBs are good at.
On the flip side if you are holding a tiny amount of data thats alot of over head for not much data
also depends on licensing etc. a file will be alot quicker

Related

large databases in sqlite - file size considerations?

I'm using a sqlite db which is very convenient and seems to meet all of my needs at this point.
Currently my db size is <50MB, but I now need to add a new table which will store large text blobs, which will cause the db to reach up to 5GB within the next year.
Would sqlite be able to deal with a 5GB db size? Any caveats to that, compared with say mysql?
I'm not a huge expert on databases, but most of the DB-related work I've done used SQLite. In my experience, making the database larger in-itself shouldn't incur a large performance hit. Naturally you'll have more data, so prepare to spend more time querying it!
Consider this thought experiment: you have a table named mydata you use all the time in the DB. Now, you add an unrelated table otherdata. Your queries for mydata don't depend on the information in otherdata. Even if you shove GBs of data into otherdata, you won't feel any real performance hit in your usage of mydata.
AFAIK, the architecture of SQLite supports this claim.
SQLite should be just fine for what you want to do. Size really isn't a concern. As long as your data file can reside on the same computer that's making the call, you should be just fine. If you put it on the network, that's ok, but multi-user access is subject to the bugs of the operating system when it comes to locking records, etc. Per comparing with mysql, since you've eliminated the server, you've also eliminated the network traffic associated with the data retrieval. this should speed things up.
-don
As stated in Sqlite FAQS , FAQ
look at point 12 , it says max limit of sqlite db can be upto 140 TB!!
I find using indexes will save your time a lot, you can have a try!

Why would someone need an in-memory database?

I read that a few databases can be used in-memory but can't think of reason why someone would want to use this feature. I always use a database to persist data and memory caches for fast access.
Cache is also a kind of database, like a file system is. 'Memory cache' is just a specific application of an in-memory database and some in-memory databases are specialized as memory caches.
Other uses of in-memory databases have already been included in other answers, but let me enumerate the uses too:
Memory cache. Usually a database system specialized for that use (and probably known as 'a memory cache' rather than 'a database') will be used.
Testing database-related code. In this case often an 'in-memory' mode of some generic database system will be used, but also a dedicated 'in-memory' database may be used to replace other 'on-disk' database for faster testing.
Sophisticated data manipulation. In-memory SQL databases are often used this way. SQL is a great tool for data manipulation and sometimes there is no need to write the data on disk while computing the final result.
Storing of transient runtime state. There are application that need to store their state in some kind of database but do not need to persist that over application restart. Think of some kind of process manager – it needs to keep track of sub-processes running, but that data is only valid as long as the application and the sub-processes run.
A common use case is to run unit/integration tests.
You don't really care about persisting data between each test run and you want tests to run as quickly as possible (to encourage people to do them often). Hosting a database in process gives you very quick access to the data.
Does your memory cache have SQL support?
How about you consider the in-memory database as a really clever cache?
That does leave questions of how the in-memory database gets populated and how updated are managed and consistency is preserved across multiple instances.
Searching for something among 100000 elements is slow if you don't use tricks like indexes. Those tricks are already implemented in a database engine (be it persistent or in-memory).
A in-memory database might offer a more efficient search feature than what you might be able to implement yourself quickly over self-written structures.
In-memory databases are roughly at least an order of magnitude faster than traditional RDBMS for general purpose (read side) queries. Most are disk backed providing the very same consistency as a normal RDBMS - only catch the entire dataset must fit into RAM.
The core idea is disk backed storage has huge random access penalties which does not apply to DRAM. Data can be index/organized in a random access optimized way not feasible using traditional RDBMS data caching schemes.
Applications, which require real time responses would like to use an in memory database, perhaps application to control aircraft, plants where the response time is critical
An in memory database is also useful in game programming. You can store data in an in memory database which is much faster than permanent databases.
They are used as an advanced data structure to store, query and modify runtime data.
You may need a database if several different applications are going to access the dataset. A database has a consistent interface for accessing / modifying data, which your hash table (or whatever else you use) won't have.
If a single program is dealing with the data, then it's reasonable to just use a data structure in whatever language you are using though.
In-memory database is better than performing database caching.
Database caching works similar to in-memory databases when it comes to READ operations.
On the other hand, when it comes to WRITE operations, in-memory databases are faster when compared to database caches, where the data is persisted onto disk (which leads to IO overhead).
Also, with database caching you can end with cache misses but you will never end up with cache misses when using in-memory databases.
Given their speed and the declining price of RAM, it’s likely that in-memory databases will become the dominant technology in the future. There are already some that have developed sophisticated features like SQL queries, secondary indexes, and engines for processing datasets larger than RAM.

File Read/Write vs Database Read/Write

Which is more expensive to do in terms of resources and efficiency, File read/write operation or Database Read/Write operation
I was initially going to say database read/write, hands down, as it would include the requisite file io on top of the DB overhead, but then realized its not that simple. If you have your entire DB loaded into memory, reads would be nearly instantaneous as there's no file IO involved.
Writes would, in general, be faster too, as the DB engine doesn't have to wait for the file IO to complete before returning since they can take a "lazy write" approach.
A poorly tuned database, on the other hand, will be orders of magnitude slower than any file based IO. DB tuning matters. A lot.
This is kind of a loaded question. What size files are we talking about? Gigabytes? Also, what type and size of DB? I often use a combination. Do you want to control any data level integrity? If so, you might want to leave that to the DB otherwise you have to control all that at the application level.
There are so many factors to make a good decision on this. For example, when I am creating temporary data that I don't want persisted I use File, but if I am using data I want persisted or backed up, then I use a DB.
This coupled with the architecture is important. If hardware, licensing, or facility is an issue then maybe you don't need the infrastructure of DB servers etc. But if you have the resources then adding a DB layer might be the right choice.
There's no simple answer. With any database you have the overhead of having it running all the time. But then when you access it is generally much faster than accessing a file. If you are talking about just a handful of accesses you won't notice much of difference. But when it gets to hundreds, thousands, and millions of accesses per minute the database will be much faster. And as Tim noted above, a poorly tuned database can be much slower than accessing a flat file.

database vs. flat files

The company I work for is trying to switch a product that uses flat file format to a database format. We're handling pretty big files of data (ie: 25GB/file) and they get updated really quick. We need to run queries that randomly access the data, as well as in a contiguous way. I am trying to convince them of the advantages of using a database, but some of my colleagues seem reluctant to this. So I was wondering if you guys can help me out here with some reasons or links to posts of why we should use databases, or at least clarify why flat files are better (if they are).
Databases can handle querying
tasks, so you don't have to walk
over files manually. Databases can
handle very complicated queries.
Databases can handle indexing tasks,
so if tasks like get record with id
= x can be VERY fast
Databases can handle multiprocess/multithreaded access.
Databases can handle access from
network
Databases can watch for data
integrity
Databases can update data easily
(see 1) )
Databases are reliable
Databases can handle transactions
and concurrent access
Databases + ORMs let you manipulate
data in very programmer friendly way.
This is an answer I've already given some time ago:
It depends entirely on the
domain-specific application needs. A
lot of times direct text file/binary
files access can be extremely fast,
efficient, as well as providing you
all the file access capabilities of
your OS's file system.
Furthermore, your programming language
most likely already has a built-in
module (or is easy to make one) for
specific parsing.
If what you need is many appends
(INSERTS?) and sequential/few access
little/no concurrency, files are the
way to go.
On the other hand, when your
requirements for concurrency,
non-sequential reading/writing,
atomicity, atomic permissions, your
data is relational by the nature etc.,
you will be better off with a
relational or OO database.
There is a lot that can be
accomplished with SQLite3, which
is extremely light (under 300kb), ACID
compliant, written in C/C++, and
highly ubiquitous (if it isn't already
included in your programming language
-for example Python-, there is surely one available). It can be useful even
on db files as big as 140 terabytes, or 128 tebibytes (Link to Database Size), possible
more.
If your requirements where bigger,
there wouldn't even be a discussion,
go for a full-blown RDBMS.
As you say in a comment that "the system" is merely a bunch of scripts, then you should take a look at pgbash.
Don't build it if you can buy it.
I heard this quote recently, and it really seems fitting as a guide line. Ask yourself this... How much time was spent working on the file handling portion of your app? I suspect a fair amount of time was spent optimizing this code for performance. If you had been using a relational database all along, you would have spent considerably less time handling this portion of your application. You would have had more time for the true "business" aspect of your app.
They're faster; unless you're loading the entire flat file into memory, a database will allow faster access in almost all cases.
They're safer; databases are easier to safely backup; they have mechanisms to check for file corruption, which flat files do not. Once corruption in your flat file migrates to your backups, you're done, and you might not even know it yet.
They have more features; databases can allow many users to read/write at the same time.
They're much less complex to work with, once they're setup.
What types of files is not mentioned. If they're media files, go ahead with flat files. You probably just need a DB for tags and some way to associate the "external BLOBs" to the records in the DB. But if full text search is something you need, there's no other way to go but migrate to a full DB.
Another thing, your filesystem might provide the ceiling as far as number of physical files are concerned.
Databases all the way.
However, if you still have a need for storing files, don't have the capacity to take on a new RDBMS (like Oracle, SQLServer, etc), than look into XML.
XML is a structure file format which offers you the ability to store things as a file but give you query power over the file and data within it. XML Files are easier to read than flat files and can be easily transformed applying an XSLT for even better human-readability. XML is also a great way to transport data around if you must.
I strongly suggest a DB, but if you can't go that route, XML is an ok second.
What about a non-relational (NoSQL) database such as Amazon's SimpleDB, Tokio Cabinet, etc? I've heard that Google, Facebook, LinkedIn are using these to store their huge datasets.
Can you tell us if your data is structured, if your schema is fixed, if you need easy replicability, if access times are important, etc?
Difference between database and flat files are given below:
Database provide more flexibility whereas flat file provide less flexibility.
Database system provide data consistency whereas flat file can not provide data consistency.
Database is more secure over flat files.
Database support DML and DDL whereas flat files can not support these.
Less data redundancy in database whereas more data redundancy in flat files.
SQL ad hoc query abilities are enough of a reason for me. With a good schema and indexing on the tables, this is fast and effective and will have good performance.
Unless you are loading the files into memory each time you boot, use a database. Simple as that.
That is assuming that your colleges already have the program to handle queries to the files. If not, then use a database.
Although other answers are good, I would like to emphasize a point that was not really well talked about:
The developer's ease of use. databases are much simpler to work with! If you don't have any strong reason(s) for using files, use a database.

Storing a file in a database as opposed to the file system?

Generally, how bad of a performance hit is storing a file in a database (specifically mssql) as opposed to the file system? I can't come up with a reason outside of application portability that I would want to store my files as varbinaries in SQL Server.
Have a look at this answer:
Storing Images in DB - Yea or Nay?
Essentially, the space and performance hit can be quite big, depending on the number of users. Also, keep in mind that Web servers are cheap and you can easily add more to balance the load, whereas the database is the most expensive and hardest to scale part of a web architecture usually.
There are some opposite examples (e.g., Microsoft Sharepoint), but usually, storing files in the database is not a good idea.
Unless possibly you write desktop apps and/or know roughly how many users you will ever have, but on something as random and unexpectable like a public web site, you may pay a high price for storing files in the database.
If you can move to SQL Server 2008, you can take advantage of the FILESTREAM support which gives you the best of both - the files are stored in the filesystem, but the database integration is much better than just storing a filepath in a varchar field. Your query can return a standard .NET file stream, which makes the integration a lot simpler.
Getting Started with FILESTREAM Storage
I'd say, it depends on your situation. For example, I work in local government, and we have lots of images like mugshots, etc. We don't have a high number of users, but we need to have good security and auditing around the data. The database is a better solution for us since it makes this easier and we aren't going to run into scaling problems.
What's the question here?
Modern DBMS SQL2008 have a variety of ways of dealing with BLOBs which aren't just sticking in them in a table. There are pros and cons, of course, and you might need to think about it a little deeper.
This is an interesting paper, by the late (?) Jim Gray
To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem
In my own experience, it is always better to store files as files. The reason is that the filesystem is optimised for file storeage, whereas a database is not. Of course, there are some exceptions (e.g. the much heralded next-gen MS filesystem is supposed to be built on top of SQL server), but in general that's my rule.
While performance is an issue, I think modern database designs have made it much less of an issue for small files.
Performance aside, it also depends on just how tightly-coupled the data is. If the file contains data that is closely related to the fields of the database, then it conceptually belongs close to it and may be stored in a blob. If it contains information which could potentially relate to multiple records or may have some use outside of the context of the database, then it belongs outside. For example, an image on a web page is fetched on a separate request from the page that links to it, so it may belong outside (depending on the specific design and security considerations).
Our compromise, and I don't promise it's the best, has been to store smallish XML files in the database but images and other files outside it.
We made the decision to store as varbinary for http://www.freshlogicstudios.com/Products/Folders/ halfway expecting performance issues. I can say that we've been pleasantly surprised at how well it's worked out.
I agree with #ZombieSheep.
Just one more thing - I generally don't think that databases actually need be portable because you miss all the features your DBMS vendor provides. I think that migrating to another database would be the last thing one would consider. Just my $.02
The overhead of having to parse a blob (image) into a byte array and then write it to disk in the proper file name and then reading it is enough of an overhead hit to discourage you from doing this too often, especially if the files are rather large.
Not to be vague or anything but I think the type of 'file' you will be storing is one of the biggest determining factors. If you essentially talking about a large text field which could be stored as file my preference would be for db storage.
Interesting topic.
There is no absolutely one correct answer to this question.
There are few key elements to consider:
What’s your database engine?
What’s the route of file from database to end user and/or backwards?
What are the security requirements?
If files are meant for public audience and accessible via website, you shouldn’t even consider storing files in database. Use some smart indexing for files instead.
If files are containing highly sensitive information, then it might be worth of storing these into database. But you have to implement proper safe gateways too.
If performance is crucial, it’s better do not store files in database.
Backup and restoring and migrating of database might become a nightmare if database grows big just because of files. If you are DBA, then you would like to kill the person who “invented” an idea to put files into database.
I recommend to use storing files into database at last option, when there is absolutely no any better alternative available.

Resources