File Read/Write vs Database Read/Write - database

Which is more expensive in terms of resources and efficiency: a file read/write operation or a database read/write operation?

I was initially going to say database read/write, hands down, as it would include the requisite file IO on top of the DB overhead, but then realized it's not that simple. If you have your entire DB loaded into memory, reads would be nearly instantaneous as there's no file IO involved.
Writes would, in general, be faster too, as the DB engine doesn't have to wait for the file IO to complete before returning, since it can take a "lazy write" approach.
A poorly tuned database, on the other hand, can be orders of magnitude slower than plain file-based IO. DB tuning matters. A lot.
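To make the "lazy write" and tuning points concrete, here is a minimal sketch using SQLite through Python's standard library. The file name, table and batch size are invented for the example, and relaxing the synchronous setting is only one possible tuning knob; whether the durability trade-off is acceptable depends on the application.

```python
import os
import sqlite3
import tempfile

def open_tuned(path, lazy=True):
    """Open a SQLite database, optionally trading durability for write speed."""
    conn = sqlite3.connect(path)
    if lazy:
        # synchronous=OFF: SQLite hands writes to the OS and does not wait for
        # them to reach the disk. Faster, but an OS crash or power loss can
        # lose recent writes or corrupt the database.
        conn.execute("PRAGMA synchronous = OFF")
    return conn

# Hypothetical usage: batch many inserts into a single transaction.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = open_tuned(db_path)
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, payload BLOB)")
with conn:  # one commit for the whole batch instead of one per row
    conn.executemany(
        "INSERT INTO samples (payload) VALUES (?)",
        ((b"x" * 100,) for _ in range(1000)),
    )
sync_mode = conn.execute("PRAGMA synchronous").fetchone()[0]  # 0 means OFF
row_count = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
conn.close()
```

Committing once per batch rather than once per row is usually the first thing to check before concluding that the database itself is slow.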

This is kind of a loaded question. What size files are we talking about? Gigabytes? Also, what type and size of DB? I often use a combination. Do you want to control any data-level integrity? If so, you might want to leave that to the DB; otherwise you have to control all of that at the application level.
There are many factors that go into a good decision here. For example, when I am creating temporary data that I don't want persisted I use a file, but if I am using data I want persisted or backed up, then I use a DB.
This, coupled with the architecture, is important. If hardware, licensing, or facilities are an issue, then maybe you don't need the infrastructure of DB servers etc. But if you have the resources, then adding a DB layer might be the right choice.

There's no simple answer. With any database you have the overhead of having it running all the time. But accessing it is then generally much faster than accessing a file. If you are talking about just a handful of accesses you won't notice much of a difference. But when it gets to hundreds, thousands, or millions of accesses per minute, the database will be much faster. And as Tim noted above, a poorly tuned database can be much slower than accessing a flat file.

Related

Write performance between Filesystem and Database

I have a very simple program for data acquisition. The data comes frequently (around 5200 Hz). One piece of data has around 24 kB, so it is around 122 MB/s.
What would be more efficient purely for storing this data: saving it in raw binary files, or using a database? If a database, then which one? SQLite, or maybe some other?
The database, of course, is more tempting, because when saving it to file I would have to separate them by delimiters (data can have different sizes), also processing data would be much easier with the database. I'm not sure about database performance compared to files though, I couldn't find any specific pieces of information about it.
[EDIT]
I am using Linux based OS and SSD disk which supports writing up to 350 MB/s. Data will be acquired with that frequency all the time (with a small service break every day to transfer the data to another machine)
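For the raw-file option, variable-sized records do not strictly need delimiters: a common alternative is to prefix each record with its length. The sketch below assumes a 4-byte little-endian length header, which is a choice made for this example and not something from the question; the in-memory buffer stands in for a file opened in binary append mode.

```python
import io
import struct

HEADER = struct.Struct("<I")  # assumed format: 4-byte little-endian length prefix

def write_record(stream, payload: bytes) -> None:
    """Append one record as <length><payload>."""
    stream.write(HEADER.pack(len(payload)))
    stream.write(payload)

def read_records(stream):
    """Yield payloads back in the order they were written."""
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return  # clean end of file
        if len(header) < HEADER.size:
            raise ValueError("truncated length header")
        (length,) = HEADER.unpack(header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload

# Hypothetical usage with records of different sizes.
buf = io.BytesIO()
for sample in (b"abc", b"", b"\x00" * 24_000):
    write_record(buf, sample)
buf.seek(0)
sizes = [len(p) for p in read_records(buf)]
```

Because the payload is never scanned for a delimiter, binary data needs no escaping, and a reader can skip a record by seeking past its length.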
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database.
Another point is understanding the relational model meaning how you design your database, so that data doesn't need to be repeated over and over.
Moreover, understanding types is important as well. If you have a txt file, you'll need to parse numbers, dates, etc.
From the performance point of view, I would say that DBs are slower to start (it is usually faster to open a file than to open a connection to a DB). However, once they are open, a DB is generally faster than XML or whatever file format you are thinking of using. BTW this is the main purpose of a database: managing huge amounts of data. Filesystems are made for storing files.
A last point for DBs is that they can usually handle multi-threading and concurrency problems, which a plain file cannot. And last but not least, in a database you cannot delete a file by mistake and lose your data.
So my choice would be a DB. Anyway, I hope this information helps you decide what is best for you.
-- UPDATE --
Since your needs are more specific now, I tried to dig deeper. I found some solutions that could be interesting for you; however, I don't have experience with any of them, so I can't give a personal recommendation:
SharedHashFile: SharedHashFile is a lightweight NoSQL key value store / hash table, a zero-copy IPC queue, & a multiplexed IPC logging library written in C for Linux. There is no server process. Data is read and written directly from/to shared memory or SSD; no sockets are used between SharedHashFile and the application program. APIs for C, C++, & nodejs. However, keep an eye out for issues, because this project seems to be no longer maintained on GitHub.
WhiteDB: another NoSQL database that claims to be really fast; see the speed section of their website.
Symas LMDB: an extraordinarily fast, memory-efficient database.
Take a look at them, and if you ever use one, please leave feedback here for the community.

Database vs File system storage

A database ultimately stores its data in files, and a file system also stores data in files. In that case, what is the difference between a DB and a file system? Is it in the way data is retrieved, or something else?
A database is generally used for storing related, structured data, with well defined data formats, in an efficient manner for insert, update and/or retrieval (depending on application).
On the other hand, a file system is a more unstructured data store for storing arbitrary, probably unrelated data. The file system is more general, and databases are built on top of the general data storage services provided by file systems. [Quora]
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database.
For very complex operations, the filesystem is likely to be very slow.
Main RDBMS advantages:
Tables are related to each other
SQL query/data processing language
Transaction processing addition to SQL (Transact-SQL)
Server-client implementation with server-side objects like stored procedures, functions, triggers, views, etc.
Advantages of the file system over a database management system:
When handling small data sets with arbitrary, probably unrelated data, a file is more efficient than a database.
For simple operations such as read and write, file operations are faster and simpler.
You can find many more differences on the internet.
"They're the same"
Yes, storing data is just storing data. At the end of the day, you have files. You can store lots of stuff in lots of files & folders, and there are situations where this will be the way. There is a well-known versioning solution (svn) that finally ended up using a filesystem-based model to store data, ditching their BerkeleyDB. Rare, but it happens.
"They're quite different"
In a database, you have options you don't have with files. Imagine a textfile (something like tsv/csv) with 99999 rows. Now try to:
Insert a column. It's painful, you have to alter each row and read+write the whole file.
Find a row. You either scan the whole file or build an index yourself.
Delete a row. Find row, then read+write everything after it.
Reorder columns. Again, full read+write.
Sort rows. Full read, some kind of sort - then do it next time all over.
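The operations in the list above can be sketched against SQLite using Python's standard library. The table name, column names and row contents are made up for illustration; the query plan is inspected only to show that the lookup switches from a scan to an index search once an index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rows (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)")
with conn:
    conn.executemany(
        "INSERT INTO rows (name, score) VALUES (?, ?)",
        ((f"name{i}", i % 100) for i in range(99999)),
    )

def plan(sql, params=()):
    """Return the detail text of SQLite's query plan for a statement."""
    cur = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return " | ".join(row[-1] for row in cur)

query = "SELECT id, name, score FROM rows WHERE name = ?"
before = plan(query, ("name500",))  # no index yet: the table is scanned
conn.execute("CREATE INDEX idx_rows_name ON rows (name)")
after = plan(query, ("name500",))   # now the lookup goes through the index

# Insert a column and delete a row without rewriting the data by hand.
conn.execute("ALTER TABLE rows ADD COLUMN note TEXT")
with conn:
    conn.execute("DELETE FROM rows WHERE name = ?", ("name500",))
remaining = conn.execute("SELECT COUNT(*) FROM rows").fetchone()[0]

# Sorting is a clause on the query, not a rewrite of the stored data.
top = conn.execute("SELECT name FROM rows ORDER BY score DESC, name LIMIT 3").fetchall()
```

With a flat tsv/csv file, each of these steps would be a full read and rewrite of the file, or a hand-built index.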
There are lots of other good points but these are the first mountains you're trying to climb when you think of a file based db alternative. Those guys programmed all this for you, it's yours to use; think of the likely (most frequent) scenarios, enumerate all possible actions you want to perform on your data, and decide which one works better for you. Think in benefits, not fashion.
Again, if you're storing JPG pictures and only ever look for them by one key (their id maybe?), a well-thought filesystem storage is better. Filesystems, btw, are close to databases today, as many of them use a balanced tree approach, so on a BTRFS you can just put all your pictures in one folder - and the OS will silently implement something like an early SQL query each time you access your files.
So, database or files?...
Let's see a few typical examples of when one is better than the other. (These are not complete lists; surely you can stuff in a lot more on both sides.)
DB tables are much better when:
You want to store many rows with the exact same structure (no block waste)
You need lightning-fast lookup / sorting by more than one value (indexed tables)
You need atomic transactions (data safety)
Your users will read/write the same data all the time (better locking)
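As an illustration of the atomic-transaction point, here is a minimal sketch with SQLite from Python's standard library. The accounts table, the names and the CHECK constraint are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
with conn:
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])

def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the failed transfer the credit to bob had already been executed when the debit violated the constraint, yet the rollback leaves neither change in place. Getting the same guarantee with plain files means writing that recovery logic yourself.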
Filesystem is way better if:
You like to use version control on your data (a nightmare with dbs)
You have big chunks of data that grow frequently (typically, logfiles)
You want other apps to access your data without API (like text editors)
You want to store lots of binary content (pictures or mp3s)
TL;DR
Programming rarely says "never" or "always". Those who say "database always wins" or "files always win" probably just don't know enough. Think of the possible actions (now + future), consider both ways, and choose the fastest / most efficient for the case. That's it.
Something one should be aware of is that Unix has what is called an inode limit. If you are storing millions of records then this can be a serious problem. You should run df -i to view the % used as effectively this is a filesystem file limit - EVEN IF you have plenty of disk space.
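The same numbers that df -i prints can be read programmatically. This is a small sketch for Unix-like systems using os.statvfs from Python's standard library; the function name is made up, and some filesystems report no fixed inode count, which the sketch treats as "unknown".

```python
import os

def inode_usage_percent(path="/"):
    """Return the percentage of inodes in use on the filesystem holding path.

    Returns None when the filesystem does not report a fixed inode count.
    """
    st = os.statvfs(path)
    if st.f_files == 0:
        return None
    used = st.f_files - st.f_ffree
    return 100.0 * used / st.f_files

usage = inode_usage_percent("/")
```

A check like this is worth running before choosing a one-file-per-record layout for millions of records.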
The differences between a file processing system and a database management system are as follows:
A file processing system is a collection of programs that store and manage files on a computer's hard disk. A database management system, on the other hand, is a collection of programs that enables you to create and maintain a database.
A file processing system has more data redundancy, whereas a DBMS has less.
A file processing system provides less flexibility in accessing data, whereas a DBMS provides more.
A file processing system does not provide data consistency, whereas a DBMS provides data consistency through normalization.
A file processing system is less complex, whereas a DBMS is more complex.
Context: I've written a filesystem that has been running in production for 7 years now. [1]
The key difference between a filesystem and a database is that the filesystem API is part of the OS, thus filesystem implementations have to implement that API and thus follow certain rules, whereas databases are built by 3rd parties having complete freedom.
Historically, databases were created when the filesystems provided by the OS were not good enough for the problem at hand. Just think about it: if you had special requirements, you couldn't just call Microsoft or Apple to redesign their filesystem API. You would either go ahead and write your own storage software or you would look around for existing alternatives. So the need created a market for 3rd party data storage software which ended up being called databases. That's about it.
While it may seem that filesystems have certain rules like having files and directories, this is not true. The biggest operating systems work like that but there are many small OSs that work differently. It's certainly not a hard requirement. (Just remember, to build a new filesystem, you also need to write a new OS, which will make adoption quite a bit harder. Why not focus on just the storage engine and call it a database instead?)
In the end, both databases and filesystems come in all shapes and sizes. Transactional, relational, hierarchical, graph, tabled; whatever you can think of.
[1] I've worked on the Boomla Filesystem which is the storage system behind the Boomla OS & Web Application Platform.
The main differences between database and file system storage are:
A database is a software application used to insert, update and delete data, while a file system is software used to add, update and delete files.
Saving and retrieving files is simpler in a file system, while SQL needs to be learned to perform any query on a database to get (SELECT), add (INSERT) and update (UPDATE) data.
A database provides a proper data recovery process, while a file system does not.
In terms of security, a database is (usually) more secure than a file system.
Migration is very easy with a file system (just copy and paste into the target), while for a database this task is not as simple.

Which is faster, interacting with a database or using a file system for input/output?

I was wondering what threshold of data volume may determine whether to use a database or a simple file I/O, assuming that fresh data needs to be handled quite frequently.
Edit: There is no multi-threading in my application. Data needs to be stored and then retrieved sequentially and at this point I am not really worried about anyone else accessing the data/data safety.
Given this backdrop is there still any advantage to using databases over files?
It depends and you probably should consider other factors as well.
If you use a database, there is an overhead for transactions, security, index management etc. on the one hand. On the other hand you can get caching (which could significantly speed up your application) and better performance for random access, if you have a lot of data. In a multithreaded environment I suggest using a database because of its properly implemented locking mechanism.
Flat files are OK for really simple and small data. Do you really need to open and close them so often?
If your table is indexed correctly, I think a database will give better performance than the file system. Also, even if the database grows to millions of records, performance should hold up better than a file system holding that much data.
A database is probably preferred, and in this case I'd suggest SQLite instead of SQL Server or MySQL, as the data is small.
In this case I would say DB. You are writing and reading, and that's what DBs are good at.
On the flip side, if you are holding a tiny amount of data, that's a lot of overhead for not much data.
It also depends on licensing etc. A file will be a lot quicker.

large databases in sqlite - file size considerations?

I'm using a sqlite db which is very convenient and seems to meet all of my needs at this point.
Currently my db size is <50MB, but I now need to add a new table which will store large text blobs, which will cause the db to reach up to 5GB within the next year.
Would sqlite be able to deal with a 5GB db size? Any caveats to that, compared with say mysql?
I'm not a huge expert on databases, but most of the DB-related work I've done used SQLite. In my experience, making the database larger in itself shouldn't incur a large performance hit. Naturally you'll have more data, so prepare to spend more time querying it!
Consider this thought experiment: you have a table named mydata you use all the time in the DB. Now, you add an unrelated table otherdata. Your queries for mydata don't depend on the information in otherdata. Even if you shove GBs of data into otherdata, you won't feel any real performance hit in your usage of mydata.
AFAIK, the architecture of SQLite supports this claim.
SQLite should be just fine for what you want to do. Size really isn't a concern. As long as your data file can reside on the same computer that's making the call, you should be just fine. If you put it on the network, that's OK, but multi-user access is subject to the bugs of the operating system when it comes to locking records, etc. As for comparing with MySQL: since you've eliminated the server, you've also eliminated the network traffic associated with data retrieval. This should speed things up.
-don
As stated in the SQLite FAQ (see point 12), the maximum size of an SQLite database can be up to 140 TB.
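For a particular database, the size ceiling is the page size multiplied by the maximum page count, both of which SQLite exposes as pragmas. The sketch below reads them for a fresh in-memory connection with whatever defaults the local SQLite build uses, so the number it computes can differ from the FAQ figure; the helper name is made up.

```python
import sqlite3

def max_db_bytes(conn):
    """Upper bound on database size for this connection's current settings."""
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    max_pages = conn.execute("PRAGMA max_page_count").fetchone()[0]
    return page_size * max_pages

conn = sqlite3.connect(":memory:")
limit = max_db_bytes(conn)
fits_5gb = limit > 5 * 1024**3  # is a 5 GB database within the limit?
conn.close()
```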
I find that using indexes saves a lot of time; give it a try!

Why would someone need an in-memory database?

I read that a few databases can be used in-memory but can't think of reason why someone would want to use this feature. I always use a database to persist data and memory caches for fast access.
Cache is also a kind of database, like a file system is. 'Memory cache' is just a specific application of an in-memory database and some in-memory databases are specialized as memory caches.
Other uses of in-memory databases have already been included in other answers, but let me enumerate the uses too:
Memory cache. Usually a database system specialized for that use (and probably known as 'a memory cache' rather than 'a database') will be used.
Testing database-related code. In this case often an 'in-memory' mode of some generic database system will be used, but also a dedicated 'in-memory' database may be used to replace other 'on-disk' database for faster testing.
Sophisticated data manipulation. In-memory SQL databases are often used this way. SQL is a great tool for data manipulation and sometimes there is no need to write the data on disk while computing the final result.
Storing of transient runtime state. There are applications that need to store their state in some kind of database but do not need to persist it across application restarts. Think of some kind of process manager: it needs to keep track of the sub-processes running, but that data is only valid as long as the application and the sub-processes run.
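The last two uses in this list can be sketched together: transient state held in an in-memory SQLite database and summarized with SQL. The table layout and the process data are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the data vanishes when the connection closes
conn.execute(
    "CREATE TABLE procs (pid INTEGER PRIMARY KEY, name TEXT, state TEXT, rss_kb INTEGER)"
)
with conn:
    conn.executemany(
        "INSERT INTO procs VALUES (?, ?, ?, ?)",
        [
            (101, "worker", "running", 2048),
            (102, "worker", "running", 4096),
            (103, "worker", "exited", 0),
            (104, "logger", "running", 512),
        ],
    )

# Per-name count and total memory of the processes still running.
summary = conn.execute(
    "SELECT name, COUNT(*), SUM(rss_kb) FROM procs "
    "WHERE state = 'running' GROUP BY name ORDER BY name"
).fetchall()
```

Nothing is ever written to disk, yet filtering, grouping and aggregation come for free instead of being hand-coded over dictionaries and lists.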
A common use case is to run unit/integration tests.
You don't really care about persisting data between each test run and you want tests to run as quickly as possible (to encourage people to do them often). Hosting a database in process gives you very quick access to the data.
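A minimal sketch of that testing pattern, using SQLite's in-memory mode from Python's standard library. The schema and the add_user function are hypothetical stand-ins for the code under test.

```python
import sqlite3

def create_schema(conn):
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE NOT NULL)")

def add_user(conn, email):
    """Code under test: insert a user and return its id."""
    with conn:
        cur = conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
    return cur.lastrowid

def fresh_db():
    """Each test gets its own empty database, so no cleanup is needed."""
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    return conn

def check_duplicate_email_rejected():
    conn = fresh_db()
    first_id = add_user(conn, "a@example.com")
    try:
        add_user(conn, "a@example.com")
        duplicate_rejected = False
    except sqlite3.IntegrityError:
        duplicate_rejected = True
    count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    conn.close()
    return first_id, duplicate_rejected, count
```

One caveat with this approach: if production runs on a different database engine, SQL dialect differences can make such tests pass while the real system fails.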
Does your memory cache have SQL support?
How about you consider the in-memory database as a really clever cache?
That does leave questions of how the in-memory database gets populated, how updates are managed, and how consistency is preserved across multiple instances.
Searching for something among 100000 elements is slow if you don't use tricks like indexes. Those tricks are already implemented in a database engine (be it persistent or in-memory).
An in-memory database might offer a more efficient search feature than what you might be able to implement yourself quickly over self-written structures.
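To show what such a "trick" looks like when written by hand, here is a small sketch comparing a linear scan with a sorted index searched by bisection. The record layout and key names are made up; a database engine maintains structures of this kind for you, including keeping them up to date on every insert and delete.

```python
import bisect

# Made-up data: 100000 (key, value) records in arbitrary storage order.
records = [(f"user{i:06d}", i * 2) for i in range(100_000)]

def linear_find(key):
    """O(n): look at every record until the key matches."""
    for k, v in records:
        if k == key:
            return v
    return None

# A hand-built index: keys in sorted order plus their positions in records.
index = sorted((k, pos) for pos, (k, _) in enumerate(records))
index_keys = [k for k, _ in index]

def indexed_find(key):
    """O(log n): binary search in the sorted key list."""
    i = bisect.bisect_left(index_keys, key)
    if i < len(index_keys) and index_keys[i] == key:
        return records[index[i][1]][1]
    return None
```

Both functions return the same answers; the indexed version simply inspects about 17 keys instead of up to 100000.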
In-memory databases are roughly an order of magnitude faster than a traditional RDBMS for general-purpose (read-side) queries. Most are disk backed, providing the very same consistency as a normal RDBMS; the only catch is that the entire dataset must fit into RAM.
The core idea is that disk-backed storage has huge random-access penalties which do not apply to DRAM. Data can be indexed and organized in a random-access-optimized way that is not feasible with traditional RDBMS data caching schemes.
Applications that require real-time responses are candidates for an in-memory database, for example applications that control aircraft or industrial plants, where response time is critical.
An in-memory database is also useful in game programming. You can store data in an in-memory database, which is much faster than a persistent database.
They are used as an advanced data structure to store, query and modify runtime data.
You may need a database if several different applications are going to access the dataset. A database has a consistent interface for accessing / modifying data, which your hash table (or whatever else you use) won't have.
If a single program is dealing with the data, then it's reasonable to just use a data structure in whatever language you are using though.
An in-memory database is better than database caching.
Database caching works similarly to an in-memory database when it comes to READ operations.
On the other hand, when it comes to WRITE operations, in-memory databases are faster than database caches, where the data is persisted to disk (which leads to IO overhead).
Also, with database caching you can end up with cache misses, whereas an in-memory database holds the whole dataset, so there is no cache to miss.
Given their speed and the declining price of RAM, it’s likely that in-memory databases will become the dominant technology in the future. There are already some that have developed sophisticated features like SQL queries, secondary indexes, and engines for processing datasets larger than RAM.
