LevelDB Benchmark in case of Huge Data Volume - benchmarking

In case of huge amount of data is stored in LevelDB, say 1TB data size and 1G records (1K bytes per record), is there a benchmark for random record query and random write?
We want to know if LevelDB performance degrades while DB size gets larger and larger.

Some data I found about leveldb


Clarification on the Snowflake Micro Partition Size?

I need the clarification on the size of Snowflake's Micro partition size. In the snowflake official document, It is mentioned as below.
Each micro-partition contains between 50 MB and 500 MB of uncompressed data (note that the actual size in Snowflake is smaller because data is always stored compressed).
However in some places i see below statement on the micro partition size.
Snowflake also stores multiple rows together in micro-partitions, which are variable-length data blocks of around 16 Mb size
What is the size of the data that Micro-partition can hold 16 MB or (50 -500 MB), or else does each Micro-partition has data block which is of 16 MB?
The key point is compression:
Benefits of Micro-partitioning
As the name suggests, micro-partitions are small in size (50 to 500 MB, before compression), which enables extremely efficient DML and fine-grained pruning for faster queries.
Columns are also compressed individually within micro-partitions. Snowflake automatically determines the most efficient compression algorithm for the columns in each micro-partition.
The size of 50-500MB is for uncompressed data and micropartition itself holds around 16MB(after compression).

Why LMDB database taking more than actual data size?

I put around 11K key&values in LMDB database .
LMDB database file size become 21Mb.
For the same data the leveldb is taking 8Mb only (with snappy compression).
LMDB env info ,
TO check why LMDB file size is more ,I iterated through all keys & values inside
the database. The total size of all key & value is 10Mb.
But the actual size of the file is 21Mb.
Remaining file size of 11Mb (21Mb - 10Mb) used for what purpose???!!.
If i compress data before put operation ,only 2Mb got reduced
Why LMDB database file size is more than actual data size?
Any way to shrink it ?
The database is bigger than the original file because lmdb requires to do some bookeeping to keep the data sorted. Also, there is an overhead because even if your record (key + value) is say 1kb lmdb allocates a fixed size of space to store those. I don't know the actual value. But this overhead is always expected.
Compression doesn't work well on small records.
lmdb doesn't support prefix or block compression. Your best bet is to use a key-value store that does, like wiredtiger.

how to calculate row size of an unstructured data?

In classical RDBMS it' relatively easy to calculate maximum row size by adding max size of each field defined within a table. This value multiplied by predicted number of rows will give max table size excluding indexes, logs etc.
Today in the era of structured way of storing unstructured data it's relatively hard to tell what will be the optimal table size.
Is there any way to calculate or predict table or even database growth and storage requirements without sample data load ?
What are your ways of calculating row size and planning storage capacity for unstructured database ?
It is pretty much the same. Find the average size of data you need to persist and multiply it with your estimated transaction count per time unit.
Database engines may allocate datafile chunks exponentially (first 16mb then 32mb etc.) so you need to know about the workings of your dbms engine to translate the data size to physical storage space size.

Which NoSQL Database for Mostly Writing

I'm working on a system that will generate and store large amounts of data to disk. A previously developed system at the company used ordinary files to store its data but for several reasons it became very hard to manage.
I believe NoSQL databases are good solutions for us. What we are going to store is generally documents (usually around 100K but occasionally can be much larger or smaller) annotated with some metadata. Query performance is not top priority. The priority is writing in a way that I/O becomes as small a hassle as possible. The rate of data generation is about 1Gbps, but we might be moving on 10Gbps (or even more) in the future.
My other requirement is the availability of a (preferably well documented) C API. I'm currently testing MongoDB. Is this a good choice? If not, what other database system can I use?
The rate of data generation is about 1Gbps,... I'm currently testing MongoDB. Is this a good choice?
OK, so just to clarify, your data rate is ~1 gigaBYTE per 10 seconds. So you are filling a 1TB hard drive every 20 minutes or so?
MongoDB has pretty solid write rates, but it is ideally used in situations with a reasonably low RAM to Data ratio. You want to keep at least primary indexes in memory along with some data.
In my experience, you want about 1GB of RAM for every 5-10GB of Data. Beyond that number, read performance drops off dramatically. Once you get to 1GB of RAM for 100GB of data, even adding new data can be slow as the index stops fitting in RAM.
The big key here is:
What queries are you planning to run and how does MongoDB make running these queries easier?
Your data is very quickly going to occupy enough space that basically every query will just be going to disk. Unless you have a very specific indexing and sharding strategy, you end up just doing disk scans.
Additionally, MongoDB does not support compression. So you will be using lots of disk space.
If not, what other database system can I use?
Have you considered compressed flat files? Or possibly a big data Map/Reduce system like Hadoop (I know Hadoop is written in Java)
If C is key requirement, maybe you want to look at Tokyo/Kyoto Cabinet?
EDIT: more details
MongoDB does not support full-text search. You will have to look to other tools (Sphinx/Solr) for such things.
Larges indices defeat the purpose of using an index.
According to your numbers, you are writing 10M documents / 20 mins or about 30M / hour. Each document needs about 16+ bytes for an index entry. 12 bytes for ObjectID + 4 bytes for pointer into the 2GB file + 1 byte for pointer to file + some amount of padding.
Let's say that every index entry needs about 20 bytes, then your index is growing at 600MB / hour or 14.4GB / day. And that's just the default _id index.
After 4 days, your main index will no longer fit into RAM and your performance will start to drop off dramatically. (this is well-documented under MongoDB)
So it's going to be really important to figure out which queries you want to run.
Have a look at Cassandra. It executes writes are much faster than reads. Probably, that's what you're looking for.

How is data compression more effective than indexing for search performance?

For our application, we keep large amounts of data indexed by three integer columns (source, type and time). Loading significant chunks of that data can take some time and we have implemented various measures to reduce the amount of data that has to be searched and loaded for larger queries, such as storing larger granularities for queries that don't require a high resolution (time-wise).
When searching for data in our backup archives, where the data is stored in bzipped text files, but has basically the same structure, I noticed that it is significantly faster to untar to stdout and pipe it through grep than to untar it to disk and grep the files. In fact, the untar-to-pipe was even noticeably faster than just grepping the uncompressed files (i. e. discounting the untar-to-disk).
This made me wonder if the performance impact of disk I/O is actually much heavier than I thought. So here's my question:
Do you think putting the data of multiple rows into a (compressed) blob field of a single row and search for single rows on the fly during extraction could be faster than searching for the same rows via the table index?
For example, instead of having this table
CREATE TABLE data ( `source` INT, `type` INT, `timestamp` INT, `value` DOUBLE);
I would have
CREATE TABLE quickdata ( `source` INT, `type` INT, `day` INT, `dayvalues` BLOB );
with approximately 100-300 rows in data for each row in quickdata and searching for the desired timestamps on the fly during decompression and decoding of the blob field.
Does this make sense to you? What parameters should I investigate? What strings might be attached? What DB features (any DBMS) exist to achieve similar effects?
This made me wonder if the performance impact of disk I/O is actually much heavier than I thought.
Definitely. If you have to go to disk, the performance hit is many orders of magnitude greater than memory. This reminds me of the classic Jim Gray paper, Distributed Computing Economics:
Computing economics are changing. Today there is rough price parity between (1) one database access, (2) ten bytes of network traffic, (3) 100,000 instructions, (4) 10 bytes of disk storage, and (5) a megabyte of disk bandwidth. This has implications for how one structures Internet-scale distributed computing: one puts computing as close to the data as possible in order to avoid expensive network traffic.
The question, then, is how much data do you have and how much memory can you afford?
And if the database gets really big -- as in nobody could ever afford that much memory, even in 20 years -- you need clever distributed database systems like Google's BigTable or Hadoop.
I made a similar discovery when working within Python on a database: the cost of accessing a disk is very, very high. It turned out to be much faster (ie nearly two orders of magnitude) to request a whole chunk of data and iterate through it in python than it was to create seven queries that were narrower. (One per day in question for the data)
It blew out even further when I was getting hourly data. 24x7 lots of queries it lots!