High-performance persistent key-value store for a huge number of records

The scenario involves about 1 billion records. Each record holds 1 KB of data and is stored on an SSD.
Which KV store provides the best random-read performance? It needs to limit disk access to one read per query, with the entire index held in memory.
Redis is fast but it's too expensive to store 1 TB data in memory.
LevelDB reads disk several times per query.
The closest one I found is fatcache but it's not persistent. It's an SSD-backed memcached.
Any suggestions?
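To make the requirement concrete (index entirely in memory, at most one disk read per lookup), here is a minimal sketch in Python. It is not how any of the stores mentioned here is implemented; the class name `AppendOnlyStore` and the record layout are made up for illustration, and `os.pread` assumes a Unix-like system.

```python
import os
import struct
import tempfile

# Hypothetical record layout: 4-byte key length, 4-byte value length, key, value.
HEADER = struct.Struct("<II")


class AppendOnlyStore:
    """Toy store: every key lives in an in-memory dict, values live on disk."""

    def __init__(self, path):
        self.index = {}  # key -> (offset of value in file, value length)
        self.file = open(path, "a+b")
        self._rebuild_index()

    def _rebuild_index(self):
        # Scan the file once at startup so the index survives restarts.
        self.file.seek(0)
        while True:
            header = self.file.read(HEADER.size)
            if len(header) < HEADER.size:
                break
            key_len, value_len = HEADER.unpack(header)
            key = self.file.read(key_len)
            if len(key) < key_len:
                break  # truncated tail record, ignore it
            # Later records overwrite earlier ones, so the newest value wins.
            self.index[key] = (self.file.tell(), value_len)
            self.file.seek(value_len, os.SEEK_CUR)

    def put(self, key, value):
        self.file.seek(0, os.SEEK_END)
        offset = self.file.tell()
        self.file.write(HEADER.pack(len(key), len(value)) + key + value)
        # flush() hands the data to the OS; a real store would also fsync.
        self.file.flush()
        self.index[key] = (offset + HEADER.size + len(key), len(value))

    def get(self, key):
        entry = self.index.get(key)
        if entry is None:
            return None  # a miss is answered from memory, with no disk access
        offset, length = entry
        # A single positioned read fetches the whole value.
        return os.pread(self.file.fileno(), length, offset)

    def close(self):
        self.file.close()


# Usage: write two records, then read one back with a single disk read.
demo_path = os.path.join(tempfile.mkdtemp(), "data.log")
store = AppendOnlyStore(demo_path)
store.put(b"user:1", b"alice")
store.put(b"user:2", b"bob")
print(store.get(b"user:2"))
store.close()
```

The trade-off in this design is that memory use grows with the number of keys rather than with the data size, so for 1 billion records the per-key overhead of the index is what decides whether it fits in RAM.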

RocksDB might be the right choice for you. It is optimized for fast storage such as memory and flash disks, and it's highly customizable. If your application is read-only after an initial bulk load, you can configure RocksDB to compact everything into one single big file. That way, reads are guaranteed to need at most a single I/O. However, if your application handles both reads and writes, then keeping reads to at most one I/O means sacrificing write performance, because you need to configure RocksDB to compact very often.
A tuning guide for RocksDB is also available.

You may want to try RocksDB; it's a Facebook library that is optimized for SSD storage. You can also try Ardb, a Redis-protocol-compatible NoSQL DB built on RocksDB/LevelDB/LMDB.

Have you looked at Aerospike? I haven't used it, but they claim good performance on SSDs.

LMDB is faster than RocksDB and uses one third as much memory. LMDB also requires no tuning, whereas RocksDB requires careful tuning of over 40 parameters to get performance that approaches LMDB's.
In addition, LMDB is fully transactional and 100% crash-proof; RocksDB is neither.


When should you use RocksDB or LevelDB

Consider a scenario where you have stored ~1 TB of key/value data and need to provide a production API to thousands of users.
Further, the store will be used for read operations only (after a one-time initial write).
In that case, use RocksDB and hold as much of the data in RAM as possible, serving responses from there (the more, the better).

Using Redis as Only and Primary Database

Here are the pros and cons of using Redis.
Pros:
Transactions (because it is single-threaded)
Cons:
No secondary key support
No reindexing support
No reliable change streams, unlike MongoDB or DynamoDB
Cost (since all the data has to be kept in memory, we need more and more RAM)
No reliable persistence.
(PS: People say that the RDB snapshot serves as a backup store for Redis, but help me understand what happens when we have 12 GB of RAM and put in 13 GB of data. Redis will hold only 12 GB of data, but if I bring up another Redis instance with 20 GB of RAM from the RDB snapshot, will it have all my data, or is 1 GB gone forever? How reliable is this mechanism?)
No auto-balancing of keys (application-level sharding is a must)
Considering the above pros and cons, I see only a few use cases where we can have Redis as the only and primary data store:
Session store.
Am I missing anything? What additional capabilities does Redis lack compared with other databases like MongoDB and MySQL? (Here I am only talking about production readiness and reliability, not starting a NoSQL vs SQL debate :) )
Redis is in fact not a data store, nor is it intended to do what a database (or maybe a file system) was designed to do. As you correctly point out, if you need to store more information than can easily fit into RAM, then an in-memory cache solution such as Redis is no longer suitable. The other big risk with trying to use Redis as you would a database is that, should Redis ever go down without persistence configured, you would lose all your state. Note that it is actually possible to configure Redis to take periodic snapshots to persistent storage (RDB files on disk). But even in this case, writes made between snapshots would still be vulnerable. To remedy this, you could increase the snapshot frequency, but in the limit your Redis cache would start to behave more like a database than a fast in-memory cache.
So for what might Redis be suitable? Redis might be suitable for storing things like application session state. This tends to be a reasonably small amount of information, and there isn't too much risk should Redis go down (perhaps in the worst case certain users would have to log in again).

In-memory databases with LMDB

I have a project which uses BerkeleyDB as a key-value store for up to hundreds of millions of small records.
The way it's used is all the values are inserted into the database, and then they are iterated over using both sequential and random access, all from a single thread.
With BerkeleyDB, I can create in-memory databases that are "never intended to be preserved on disk". If the database is small enough to fit in the BerkeleyDB cache, it will never be written to disk. If it is bigger than the cache, then a temporary file will be created to hold the overflow. This option can speed things up significantly, as it prevents my application from writing gigabytes of dead data to disk when closing the database.
I have found that BerkeleyDB's write performance is too poor, even on an SSD, so I would like to switch to LMDB. However, based on the documentation, there doesn't seem to be an option for creating a non-persistent database.
What configuration/combination of options should I use to get the best performance out of LMDB if I don't care about persistence or concurrent access at all? i.e. to make it act like an "in-memory database" with temporary backing disk storage?
Just use MDB_NOSYNC and never call mdb_env_sync() yourself. You could also use MDB_WRITEMAP in addition. The OS will still eventually flush dirty pages to disk; you can play with /proc/sys/vm/dirty_ratio etc. to control that behavior.
From this post: https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
vm.dirty_ratio is the absolute maximum amount of system memory that can be filled with dirty pages before everything must get committed to disk. When the system gets to this point all new I/O blocks until dirty pages have been written to disk.
If the dirty ratio is too small, then you will see frequent synchronous disk writes.
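As a small illustration of the knobs mentioned above, the following Python sketch reads the current `vm.dirty_ratio` and `vm.dirty_background_ratio` values from `/proc/sys/vm` on Linux. The helper names `read_vm_setting` and `dirty_limit_bytes` are made up for this example, and the byte calculation is only a rough approximation: the kernel applies the percentage to the memory it considers available, not necessarily to total RAM.

```python
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")  # Linux-only location of the vm.* sysctls


def read_vm_setting(name, base=VM_DIR):
    """Return the integer value of a vm.* sysctl, or None if unavailable."""
    try:
        return int((base / name).read_text().strip())
    except (OSError, ValueError):
        return None


def dirty_limit_bytes(ratio_percent, available_bytes):
    """Rough number of dirty bytes allowed for a given percentage."""
    return available_bytes * ratio_percent // 100


for name in ("dirty_ratio", "dirty_background_ratio"):
    # Prints None on systems where the file does not exist (e.g. non-Linux).
    print(name, "=", read_vm_setting(name))
```

Changing these values requires root (for example through `sysctl`), so the sketch only reads them.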

Storage capacity of in-memory database?

Is the storage capacity of an in-memory database limited to the size of RAM? If yes, are there any ways to increase its capacity other than adding RAM? If no, please explain.
As previously mentioned, in-memory storage capacity is limited by the addressable memory, not by the amount of physical memory in the system. Simon was also correct that the OS will swap memory to the page file, but you really want to avoid that. In the context of the DBMS, the OS will do a worse job of it than if you simply used a persistent database with as large of a cache as you have physical memory to support. IOW, the DBMS will manage its cache more intelligently than the OS would manage paged memory containing in-memory database content.
On a 32-bit system, each process is limited to a few GB of address space (typically 2 GB or 3 GB of the 4 GB total, depending on the OS and its configuration), whether you have 3 GB of physical RAM or 512 MB. If you have more data (including the in-memory DB) and code than will fit into physical RAM, then the page file on disk is used to swap out memory that is not currently in use. Swapping does slow everything down, though. There are some tricks for extending the limit, such as memory-mapped files and the /3GB switch, but these are not easy to implement.
On 64-bit machines, a process's memory limit is huge. I forget the exact figure, but it's up in the TB range.
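The address-space limits above follow directly from the pointer width. A quick Python sketch of the arithmetic, assuming 48-bit virtual addresses for the 64-bit case (common on current x86-64 CPUs; the operating system reserves part of that range for the kernel, so a single process sees less):

```python
GIB = 2**30
TIB = 2**40


def address_space_bytes(address_bits):
    """Size of the address space reachable with the given number of address bits."""
    return 2**address_bits


print(address_space_bytes(32) // GIB, "GiB")  # 4 GiB
print(address_space_bytes(48) // TIB, "TiB")  # 256 TiB
```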
VoltDB is an in-memory SQL database that runs on a cluster of 64-bit Linux servers. It has high performance durability to disk for recovery purposes, but tables, indexes and materialized views are stored 100% in-memory. A VoltDB cluster can be expanded on the fly to increase the overall available RAM and throughput capacity without any down time. In a high-availability configuration, individual nodes can also be stopped to perform maintenance such as increasing the server's RAM, and then rejoined to the cluster without any down time.
The design of VoltDB, led by Michael Stonebraker, was for a no-compromise approach to performance and scalability of OLTP transaction processing workloads with full ACID guarantees. Today these workloads are often described as Fast Data. By using main memory, and single-threaded SQL execution code distributed for parallel processing by core, the data can be accessed as fast as possible in order to minimize the execution time of transactions.
There are in-memory solutions that can work with data sets larger than RAM. Of course, this is accomplished by adding some operations on disk. Tarantool's Vinyl, for example, can work with data sets that are 10 to 1000 times the size of available RAM. Like other databases of recent vintage such as RocksDB and Bigtable, Vinyl's write algorithm uses LSM trees instead of B trees, which helps with its speed.

Will performance of a SQL server degrade if the DB can't fit in the memory?

Will the performance of a SQL server drastically degrade if the database is bigger than the RAM? Or does only the index have to fit in the memory? I know this is complex, but as a rule of thumb?
Only the working set or common data or currently used data needs to fit into the buffer cache (aka data cache). This includes indexes too.
There is also the plan cache, network buffers and other things. MS has put a lot of work into memory management in SQL Server and it works well, IMHO.
Generally, more RAM will help but it's not essential.
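The working-set point can be illustrated with a toy simulation. This is not how SQL Server's buffer cache works internally; it is a plain LRU cache in Python with made-up numbers (100,000 pages in total, 5,000 "hot" pages that receive 95% of accesses), showing that the hit ratio depends on whether the hot pages fit, not on whether the whole database fits.

```python
import random
from collections import OrderedDict


def lru_hit_ratio(accesses, cache_pages):
    """Fraction of accesses served from an LRU cache holding cache_pages pages."""
    cache = OrderedDict()
    hits = 0
    for page in accesses:
        if page in cache:
            hits += 1
            cache.move_to_end(page)  # mark as most recently used
        else:
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict the least recently used page
    return hits / len(accesses)


rng = random.Random(42)
TOTAL_PAGES = 100_000  # hypothetical database size in pages
HOT_PAGES = 5_000      # hypothetical working set

# 95% of accesses go to the hot pages, the rest are spread over all pages.
accesses = [
    rng.randrange(HOT_PAGES) if rng.random() < 0.95 else rng.randrange(TOTAL_PAGES)
    for _ in range(200_000)
]

for cache_pages in (1_000, 5_000, 10_000, 100_000):
    print(cache_pages, round(lru_hit_ratio(accesses, cache_pages), 3))
```

With an access pattern like this, growing the cache beyond the working set should add comparatively little, which matches the rule of thumb that only the commonly used data needs to fit.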
Yes, when indexes can't fit in memory or when doing full table scans. Running aggregate functions over data that is not in memory will also require many (and possibly random) disk reads.
For some benchmarks:
Query time will depend significantly on whether the affected data currently resides in memory or disk access is required. For disk intensive operations, the characteristics of the disk sequential and random I/O performance are also important.
Therefore, don't expect the same performance if your DB size > RAM size.
http://highscalability.com/ is full of examples like:
Once the database doesn't fit in RAM you hit a wall.
Or here:
Even if the DB size is just 10% bigger than RAM size this test shows a 2.6 times drop in performance.
Remember, though, that this applies to hot data, that is, data you want to query and cannot cache. If you can cache it, you can easily live with significantly less memory.
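One way to see why even a small shortfall hurts is the usual weighted-average estimate of access time. The Python sketch below uses made-up latencies (0.1 µs for RAM, 100 µs for an SSD read) purely for illustration; it is not meant to reproduce the benchmark figures quoted above.

```python
def average_access_time_us(hit_ratio, ram_us, disk_us):
    """Expected time per access when hit_ratio of accesses are served from RAM."""
    return hit_ratio * ram_us + (1.0 - hit_ratio) * disk_us


RAM_US = 0.1     # hypothetical RAM access latency in microseconds
DISK_US = 100.0  # hypothetical SSD read latency in microseconds

for hit_ratio in (1.0, 0.99, 0.9):
    print(hit_ratio, average_access_time_us(hit_ratio, RAM_US, DISK_US))
```

Because the disk term is so much larger than the RAM term, the misses dominate the average as soon as the hit ratio drops below 100%.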
All DB operations have to be backed by writes to disk, so having more RAM is helpful but not essential.
Loading the whole database into RAM is not practical. Databases can be up to terabytes in size these days, and there is little chance that anyone would buy that much RAM. I think performance will be optimal even if the available RAM is one tenth of the size of the database.
