Why would someone need an in-memory database?

I read that a few databases can be used in-memory, but I can't think of a reason why someone would want to use this feature. I always use a database to persist data and memory caches for fast access.

A cache is also a kind of database, just as a file system is. A 'memory cache' is just a specific application of an in-memory database, and some in-memory databases are specialized as memory caches.
Other uses of in-memory databases have already been included in other answers, but let me enumerate the uses too:
Memory cache. Usually a database system specialized for that use (and probably known as 'a memory cache' rather than 'a database') will be used.
Testing database-related code. In this case an 'in-memory' mode of some generic database system is often used, but a dedicated in-memory database may also be used in place of the usual on-disk database for faster testing.
Sophisticated data manipulation. In-memory SQL databases are often used this way. SQL is a great tool for data manipulation, and sometimes there is no need to write the data to disk while computing the final result (see the sketch after this list).
Storing transient runtime state. There are applications that need to store their state in some kind of database but do not need to persist it across restarts. Think of some kind of process manager – it needs to keep track of running sub-processes, but that data is only valid as long as the application and the sub-processes run.
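To make the 'sophisticated data manipulation' item concrete, here is a minimal sketch using the in-memory mode of SQLite from Python's standard library (the table and data are invented for illustration); you get full SQL for joins and aggregation, yet nothing ever touches disk:

    import sqlite3

    # ":memory:" creates a throwaway database that lives only in this process
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
    )

    # full SQL is available even though nothing is written to disk
    for customer, total in con.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    ):
        print(customer, total)

    con.close()  # the data simply disappears with the connection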

A common use case is to run unit/integration tests.
You don't really care about persisting data between test runs, and you want the tests to run as quickly as possible (to encourage people to run them often). Hosting the database in-process gives you very fast access to the data.
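For example, each test case can point the code under test at a fresh in-memory SQLite database; a minimal sketch, where the schema and the save_user function are invented purely for illustration:

    import sqlite3
    import unittest

    def save_user(con, name):
        # hypothetical function under test
        con.execute("INSERT INTO users (name) VALUES (?)", (name,))

    class SaveUserTest(unittest.TestCase):
        def setUp(self):
            # a brand-new, empty database for every test; no cleanup needed
            self.con = sqlite3.connect(":memory:")
            self.con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

        def test_save_user(self):
            save_user(self.con, "alice")
            count = self.con.execute("SELECT COUNT(*) FROM users").fetchone()[0]
            self.assertEqual(count, 1)

    if __name__ == "__main__":
        unittest.main()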

Does your memory cache have SQL support?
How about you consider the in-memory database as a really clever cache?
That does leave questions of how the in-memory database gets populated and how updates are managed and consistency is preserved across multiple instances.

Searching for something among 100000 elements is slow if you don't use tricks like indexes. Those tricks are already implemented in a database engine (be it persistent or in-memory).
An in-memory database might offer a more efficient search than what you could quickly implement yourself over hand-written structures.
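A quick way to see the effect, using an in-memory SQLite database (row counts and names are illustrative only): the same lookup is a full scan without an index and a B-tree search with one.

    import sqlite3, time

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE items (id INTEGER, label TEXT)")
    con.executemany("INSERT INTO items VALUES (?, ?)",
                    ((i, f"label-{i}") for i in range(100_000)))

    def lookup():
        start = time.perf_counter()
        con.execute("SELECT id FROM items WHERE label = ?", ("label-99999",)).fetchone()
        return time.perf_counter() - start

    without_index = lookup()                               # scans all 100,000 rows
    con.execute("CREATE INDEX idx_label ON items(label)")
    with_index = lookup()                                  # uses the index
    print(f"scan: {without_index:.6f}s  indexed: {with_index:.6f}s")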

In-memory databases are typically at least an order of magnitude faster than a traditional RDBMS for general-purpose (read-side) queries. Most are disk-backed, providing the same consistency as a normal RDBMS; the only catch is that the entire dataset must fit into RAM.
The core idea is that disk-backed storage has huge random-access penalties that do not apply to DRAM. Data can be indexed and organized in a random-access-optimized way that is not feasible with traditional RDBMS data-caching schemes.

Applications that require real-time responses may want to use an in-memory database, for example software controlling aircraft or industrial plants, where response time is critical.

An in-memory database is also useful in game programming. You can store data in an in-memory database, which is much faster than a persistent database.

They are used as an advanced data structure to store, query and modify runtime data.

You may need a database if several different applications are going to access the dataset. A database has a consistent interface for accessing / modifying data, which your hash table (or whatever else you use) won't have.
If a single program is dealing with the data, though, it's reasonable to just use a data structure in whatever language you are using.

An in-memory database is better than performing database caching.
Database caching works similarly to in-memory databases when it comes to READ operations.
On the other hand, when it comes to WRITE operations, in-memory databases are faster than database caches, where the data still has to be persisted to disk (which incurs I/O overhead).
Also, with database caching you can end up with cache misses, but you will never have a cache miss with an in-memory database, since the entire dataset is already in memory.
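To make the cache-miss point concrete, here is a sketch of the usual cache-aside read path; the dict stands in for memcached, and load_user_from_disk is a hypothetical slow, disk-backed call. An in-memory database skips the miss branch entirely because the data is already resident.

    cache = {}  # stands in for memcached / Redis

    def load_user_from_disk(user_id):
        # hypothetical slow, disk-backed lookup
        return {"id": user_id, "name": "example"}

    def get_user(user_id):
        user = cache.get(user_id)
        if user is None:                        # cache miss: pay the disk/DB penalty
            user = load_user_from_disk(user_id)
            cache[user_id] = user               # populate for next time
        return user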

Given their speed and the declining price of RAM, it’s likely that in-memory databases will become the dominant technology in the future. There are already some that have developed sophisticated features like SQL queries, secondary indexes, and engines for processing datasets larger than RAM.

Related

Which is faster, interacting with a database or using a file system for input/output?

I was wondering what threshold of data volume may determine whether to use a database or a simple file I/O, assuming that fresh data needs to be handled quite frequently.
Edit: There is no multi-threading in my application. Data needs to be stored and then retrieved sequentially and at this point I am not really worried about anyone else accessing the data/data safety.
Given this backdrop is there still any advantage to using databases over files?
It depends and you probably should consider other factors as well.
If you use a database, there is an overhead for transactions, security, index management etc. on the one hand. On the other hand you can get caching (which could significantly speed up your application) and better performance for random access if you have a lot of data. In a multithreaded environment I suggest using a database because of its properly implemented locking mechanism.
Flat files are OK for really simple and small data. Do you really need to open and close them so often?
If your tables are indexed correctly, then I think it would be better to use a database instead of the file system to get better performance. Note also that even if the data grows to millions of records, database performance will not suffer the way file-system access would with that much data.
A database is probably preferred, and in this case I'd suggest using an SQLite database instead of SQL Server or MySQL, since the data is small.
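As a rough sketch of that suggestion (the file name and schema are invented for the example): SQLite needs no server at all, and storing then reading the data back sequentially is a few lines.

    import sqlite3

    con = sqlite3.connect("samples.db")  # a single file, no server to install
    con.execute("CREATE TABLE IF NOT EXISTS samples (ts REAL, value REAL)")

    # store fresh data as it arrives
    con.executemany("INSERT INTO samples VALUES (?, ?)", [(1.0, 0.42), (2.0, 0.57)])
    con.commit()

    # retrieve it sequentially later
    for ts, value in con.execute("SELECT ts, value FROM samples ORDER BY ts"):
        print(ts, value)
    con.close()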
In this case I would say DB. You are writing and reading, and that's what DBs are good at.
On the flip side, if you are holding a tiny amount of data, that's a lot of overhead for not much data.
It also depends on licensing etc.; for a tiny dataset a file will be a lot quicker.

When to use SQL Server or MySQL and when to manage data myself

If I have a small database with fixed table layouts can I get any kind of performance benefit from handling the data myself rather than using a backend like MySQL? What are the benefits to using such a database backend?
Handling the data yourself usually means saving it in flat files and accessing it with custom code. In that case it is easier to migrate and deploy, sometimes with better performance but not always, and performance should not be much of a concern for a small data set anyway.
Using a database makes it harder to corrupt the data when multiple users change it at the same time, easier to filter/sort the data, and easier for other programmers to maintain; there are more data tools to integrate with (charts, data forms, reports, etc.), the data is less likely to be tampered with by users, and your hosting company may back up a database more frequently than files.
Databases provide ACID features (http://en.wikipedia.org/wiki/ACID) which are hard to deliver when rolling your own solution.
In addition, databases tend to scale to larger volumes of data with better performance characteristics.
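For instance, an atomic transfer between two rows is trivial with a database transaction but surprisingly hard to get right with hand-rolled files; a minimal SQLite sketch (the accounts table is invented for illustration):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    con.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])

    try:
        with con:  # BEGIN ... COMMIT, or ROLLBACK if the block raises
            con.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
            con.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'b'")
    except sqlite3.Error:
        pass  # on failure neither update is visible: atomicity comes for free

    print(con.execute("SELECT * FROM accounts").fetchall())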
The main benefit is automatic data handling: storing, retrieving, querying.
If you only have a very small amount of simple data then you can of course just use INI files, the Windows registry, or a plain text file. This saves you from having to handle MySQL installation, configuration, and management.
Theoretically, you could get more performance from specialized data structures.
On the other hand, a database gives you more flexibility with respect to data structures.
You might want to look into the NoSQL options available.

What are the best ways to mitigate database I/O bottlenecks for large web sites?

For large web sites (traffic-wise) that have a lot of incoming reads and updates that end up as database I/O, what are the best ways to mitigate the performance impact? One solution I can think of is, for writes, to cache and then do a delayed write (using a separate job); for reads, use the memcached concept. Any other, better solutions?
Here are the most common solutions to database performance:
Caching (Memcache, etc)
Add memory to your database
More database servers (master/slave or sharding)
Use a different database type (NoSQL, Redis, etc)
Indexes to speed up read perf. (careful, too many will affect write performance)
SSDs (fast SSDs will help a lot)
RAID
Optimize/tune SQL queries
Don't forget to optimize your queries. Most of the time it is not disk I/O but poorly written queries that turn out to be the bottleneck.
You can also cache query results and also entire web pages if the content isn't going to change too often.
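One cheap habit along those lines is checking the query plan before buying hardware; a sketch using SQLite's EXPLAIN QUERY PLAN (MySQL's EXPLAIN plays the same role, and the table/index names here are made up):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, body TEXT)")

    query = "SELECT id FROM posts WHERE author_id = ?"
    print(con.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())  # full table scan

    con.execute("CREATE INDEX idx_posts_author ON posts(author_id)")
    print(con.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())  # now uses the index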
It very much depends on the usage pattern and data type. There are really different things to do depending on whether transactions are going to be supported, whether you are interested in full consistency or "eventual consistency", how big the data is (will it all fit in memory?), how complex the data and queries are; the list could go on and on. There are lots of variables, and only after listing all the constraints/requirements will you be able to make a proper decision. Two general pieces of advice, though:
Use SSDs
Use distributed architecture with distributed "NoSQL" (key/value) approach (only if you do not have to use complex relations and transactions)
10 years ago, the standard answer - besides optimizing your particular database - was to scale out using MySQL in two ways.
Reads can be scaled out in two ways. The first is through caching, which introduces possible inconsistencies and creates a separate cache layer. Reads can also be scaled in MySQL by creating "read replicas", where any replica can be queried. Any write must be applied to all servers, so replication doesn't help write throughput.
Writes are scaled through sharding. For example, imagine all users with the last name 'a' are assigned to a certain server. Now imagine a more complicated shard algorithm, where a particular row's primary ID is hashed using a hash function, and distributed to one of a pool of servers.
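A hedged sketch of that hash-based routing (the server list and key choice are invented; real systems usually use consistent hashing so that adding a server does not reshuffle everything, which is exactly the problem raised below):

    import hashlib

    SHARDS = ["db-0.internal", "db-1.internal", "db-2.internal"]  # hypothetical servers

    def shard_for(primary_id: int) -> str:
        # hash the primary ID and map it onto one of the servers in the pool
        digest = hashlib.md5(str(primary_id).encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    print(shard_for(12345))  # every caller routes this row to the same server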
Facebook is one of the most advanced proponents of a sharded MySQL architecture. You can have individual tables "joined" but you have to write custom code, because you might have to hop from server to server - imagine you want to get your friend's timeline posts, you can't simply join it, you have to write some application code.
Once you shard your database, you can't do joins, and range lookups become difficult. The subset of operations that remains is sometimes called CRUD operations, and for that subset MySQL is overkill. Many Chinese social networks realized this, and use sharded Redis (which is much quicker than MySQL), having written their own shard layer and application logic layers.
Now imagine the next problem in sharding - you want to add a new server and start assigning some users to it, which with a naive hash scheme means re-hashing and moving a lot of existing data.
Another approach is to use a distributed database, which generally comes under the names NoSQL or NewSQL; these take a variety of approaches. Some, like MongoDB, have a sharding system to manage this mapping, but require manual steps to add servers. Cassandra has a more flexible ring-based clustering scheme built on consistent hashing. Systems like Couchbase and Aerospike use a random distribution mechanism that removes the need for a shard layer. Some of these databases can exceed 100,000 to 200,000 requests per second per server, with the lateral scale to add new servers - enough for very large operations. With this style of clustering, you can often get a higher level of redundancy and reliability.
Other distributed approaches represent data in a more efficient way, like a graph database. If you have a problem that is better represented as a graph, then a clustered graph database may be more appropriate.

Very large database, very small portion being retrieved in real time

I have an interesting database problem. I have a DB that is 150GB in size. My memory buffer is 8GB.
Most of my data is rarely being retrieved, or mainly being retrieved by backend processes. I would very much prefer to keep them around because some features require them.
Some of it (namely some tables, and some identifiable parts of certain tables) is used very often in a user-facing manner.
How can I make sure that the latter is always being kept in memory? (there is more than enough space for these)
More info:
We are on Ruby on Rails. The database is MySQL; our tables are stored using InnoDB. We are sharding the data across 2 partitions. Because we are sharding it, we store most of our data as JSON blobs, while indexing only the primary keys.
Update 2
The tricky thing is that the data is actually used for both backend processes and user-facing features, but it is accessed far less often by the latter.
Update 3
Some people are commenting that 8 GB is a toy these days. I agree, but just increasing the buffer size is pure LAZINESS if there is a smarter, more efficient solution.
This is why we have Data Warehouses. Separate the two things into either (a) separate databases or (b) separate schemas within one database.
Data that is current, for immediate access, being updated.
Data that is historical fact, for analysis, not being updated.
150Gb is not very big and a single database can handle your little bit of live data and your big bit of history.
Use a "periodic" ETL process to get things out of active database, denormalize into a star schema and load into the historical data warehouse.
If the number of columns used in the customer-facing tables is small, you can create indexes that include all the columns used in the queries (covering indexes). This doesn't mean that all the data stays in memory, but it can make the queries much faster. It's trading space for response time.
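A sketch of the covering-index idea (table, columns, and index invented for illustration): if the index contains every column the query touches, the engine can answer from the index alone, which is a much smaller structure to keep hot in memory. Shown here with SQLite, but the same principle applies to InnoDB secondary indexes.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE events
                   (id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT, created_at TEXT)""")
    # index every column the user-facing query needs, so the table itself is never read
    con.execute("CREATE INDEX idx_events_user ON events (user_id, kind, created_at)")

    plan = con.execute(
        "EXPLAIN QUERY PLAN SELECT kind, created_at FROM events WHERE user_id = ?", (7,)
    ).fetchall()
    print(plan)  # the plan should report a COVERING INDEX search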
This calls for memcached! I'd recommend using cache-money, a great ActiveRecord write-through caching library. The ngmoco branch has support for enabling caching per-model, so you could only cache those things you knew you wanted to keep in memory.
You could also do the caching by hand using $cache.set/get/expire calls in controller actions or model hooks.
With MySQL, proper use of the Query Cache will keep frequently queried data in memory. You can provide a hint to MySQL not to cache certain queries (e.g. from the backend processes) with the SQL_NO_CACHE keyword.
If the backend processes are accessing historical data, or accessing data for reporting purposes, certainly follow S. Lott's suggestion to create a separate data warehouse and query that instead. If a data warehouse is too much to accomplish in the short term, you can replicate your transactional database to a different server and perform queries there (a Data Warehouse gives you MUCH more flexibility and capability, so go down that path if possible)
UPDATE:
See documentation of SELECT and scroll down to SQL_NO_CACHE.
Read about the Query Cache
Ensure query_cache_type is set appropriately for your needs.
UPDATE 2:
I confirmed with MySQL support that there is no mechanism to selectively cache certain tables etc. in the innodb buffer pool.
So, what is the problem?
First, 150 GB is not very large today. It was 10 years ago.
Second, any non-total-crap database system will utilize your memory as a cache. If the cache is big enough (compared to the amount of data that is in use) it will be efficient. If not, the only thing you CAN do is get more memory (because, sorry, 8 GB of memory is VERY low for a modern server - it was low 2 years ago).
You should not have to do anything for the memory to be used efficiently. At least not on a commercial-level database - maybe MySQL sucks, but I would not assume that.

File Read/Write vs Database Read/Write

Which is more expensive in terms of resources and efficiency: a file read/write operation or a database read/write operation?
I was initially going to say database read/write, hands down, as it would include the requisite file I/O on top of the DB overhead, but then realized it's not that simple. If you have your entire DB loaded into memory, reads would be nearly instantaneous, as there's no file I/O involved.
Writes would, in general, be faster too, as the DB engine doesn't have to wait for the file I/O to complete before returning, since it can take a "lazy write" approach.
A poorly tuned database, on the other hand, will be orders of magnitude slower than any file based IO. DB tuning matters. A lot.
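For a rough feel, here is a sketch that times the two write paths with the standard library; absolute numbers depend entirely on hardware and tuning, so treat it as a way to measure on your own setup rather than a verdict.

    import os, sqlite3, tempfile, time

    rows = [(i, f"value-{i}") for i in range(10_000)]

    # plain file append
    path = os.path.join(tempfile.mkdtemp(), "data.txt")
    start = time.perf_counter()
    with open(path, "w") as f:
        for key, value in rows:
            f.write(f"{key},{value}\n")
    file_secs = time.perf_counter() - start

    # SQLite, one transaction (a commit per row would be far slower)
    con = sqlite3.connect(os.path.join(tempfile.mkdtemp(), "data.db"))
    con.execute("CREATE TABLE kv (k INTEGER, v TEXT)")
    start = time.perf_counter()
    with con:
        con.executemany("INSERT INTO kv VALUES (?, ?)", rows)
    db_secs = time.perf_counter() - start

    print(f"file: {file_secs:.4f}s  sqlite: {db_secs:.4f}s")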
This is kind of a loaded question. What size files are we talking about? Gigabytes? Also, what type and size of DB? I often use a combination. Do you want to control any data level integrity? If so, you might want to leave that to the DB otherwise you have to control all that at the application level.
There are so many factors to make a good decision on this. For example, when I am creating temporary data that I don't want persisted I use File, but if I am using data I want persisted or backed up, then I use a DB.
This, coupled with the architecture, is important. If hardware, licensing, or facilities are an issue then maybe you don't need the infrastructure of DB servers etc. But if you have the resources then adding a DB layer might be the right choice.
There's no simple answer. With any database you have the overhead of keeping it running all the time. But then, when you access it, it is generally much faster than accessing a file. If you are talking about just a handful of accesses you won't notice much of a difference. But when it gets to hundreds, thousands, and millions of accesses per minute the database will be much faster. And as Tim noted above, a poorly tuned database can be much slower than accessing a flat file.

Resources