Which NoSQL Database for Mostly Writing (C)

I'm working on a system that will generate and store large amounts of data to disk. A previously developed system at the company used ordinary files to store its data, but for several reasons it became very hard to manage.
I believe NoSQL databases are a good solution for us. What we are going to store is generally documents (usually around 100K, but occasionally much larger or smaller) annotated with some metadata. Query performance is not the top priority; the priority is writing in a way that makes I/O as small a hassle as possible. The rate of data generation is about 1Gbps, and we might move to 10Gbps (or even more) in the future.
My other requirement is the availability of a (preferably well documented) C API. I'm currently testing MongoDB. Is this a good choice? If not, what other database system can I use?
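For context, my test write path with the MongoDB C driver (libmongoc) looks roughly like this; the connection string and the database/collection names are just placeholders:

#include <stdio.h>
#include <mongoc/mongoc.h>

int main(void)
{
    mongoc_init();
    mongoc_client_t *client = mongoc_client_new("mongodb://localhost:27017");
    mongoc_collection_t *coll = mongoc_client_get_collection(client, "mydb", "documents");

    /* a document body plus a little metadata */
    bson_t *doc = BCON_NEW("name", BCON_UTF8("sample"),
                           "size_bytes", BCON_INT32(100 * 1024));
    bson_error_t error;
    if (!mongoc_collection_insert_one(coll, doc, NULL, NULL, &error))
        fprintf(stderr, "insert failed: %s\n", error.message);

    bson_destroy(doc);
    mongoc_collection_destroy(coll);
    mongoc_client_destroy(client);
    mongoc_cleanup();
    return 0;
}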

The rate of data generation is about 1Gbps,... I'm currently testing MongoDB. Is this a good choice?
OK, so just to clarify, your data rate is ~1 gigabyte per 10 seconds. So you are filling a 1TB hard drive every three hours or so?
MongoDB has pretty solid write rates, but it is ideally used in situations where RAM is a reasonable fraction of the data size. You want to keep at least the primary indexes in memory, along with some data.
In my experience, you want about 1GB of RAM for every 5-10GB of data. Beyond that ratio, read performance drops off dramatically. Once you get to 1GB of RAM for 100GB of data, even adding new data can be slow as the index stops fitting in RAM.
The big key here is:
What queries are you planning to run and how does MongoDB make running these queries easier?
Your data is very quickly going to occupy enough space that basically every query will just be going to disk. Unless you have a very specific indexing and sharding strategy, you end up just doing disk scans.
Additionally, MongoDB does not support compression. So you will be using lots of disk space.
If not, what other database system can I use?
Have you considered compressed flat files? Or possibly a big data Map/Reduce system like Hadoop (I know Hadoop is written in Java)
If C is a key requirement, maybe you want to look at Tokyo/Kyoto Cabinet?
EDIT: more details
MongoDB does not support full-text search. You will have to look to other tools (Sphinx/Solr) for such things.
Large indices defeat the purpose of using an index.
According to your numbers, you are writing roughly a thousand 100K documents per second, or about 3.6M / hour. Each document needs about 16+ bytes for an index entry: 12 bytes for the ObjectID + 4 bytes for the pointer into the 2GB data file + 1 byte for the pointer to the file + some amount of padding.
Let's say that every index entry needs about 20 bytes; then your index is growing at roughly 70MB / hour, or about 1.7GB / day. And that's just the default _id index.
Within a few weeks (depending on how much RAM you have), your main index will no longer fit into RAM and your performance will start to drop off dramatically. (This is well documented for MongoDB.)
So it's going to be really important to figure out which queries you want to run.

Have a look at Cassandra. Its writes are much faster than its reads, which is probably what you're looking for.

Related

Need a highly compressed datastore for Crawl data and log data

I have to store a lot of crawl and log data in a datastore with an efficient compression ratio.
So far I have tried and installed Cassandra, Couchbase, MySQL and a flat-file format, and read the architectural overviews of Google BigTable, Hypertable and the LevelDB file layout.
Cassandra and Couchbase come in at about 1/5 of the disk size of the uncompressed MySQL database, but I want better results.
So I need a simple data store with strong compression features, as in the Vertica, Teradata, Oracle and SQL Server products (page-level compression).
The actual flat-file dataset looks like:
/oil_type/gas_station/2014-03/2014-03-05-23.csv
/oil_type/gas_station/2014-03/2014-03-06-00.csv
/oil_type/gas_station/2014-03/2014-03-06-01.csv
Each file holds about 400 highly redundant entries of roughly 5 KB each.
A file can be compressed from 1722 KB down to 39 KB, so a compression ratio of 44:1, and up to 100:1 depending on the compression chunk size, should be possible.
Defining the use case:
I have to poll all relevant gas_station webpages/APIs every 30 seconds to get up-to-the-minute pricing information. Because it is not possible to write a parser for every gas station, a generic solution is required for index creation. With a database holding all crawled gas station pages, a generic parser can easily be developed and backtested, and with this raw data model, data loss through broken station-specific converters should be avoided.
With keys like "oil_type-gas_station-timestamp-content", it is easy and efficient to compare two gas stations' pricing over time. For reading a time series that is smaller than the compression chunk size, only 2 to 4 chunks should need to be decompressed.
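To illustrate what I mean by chunk size, here is a rough C/zlib sketch of the kind of layout I have in mind: fixed-size chunks compressed independently, plus a small offset table so a range read only decompresses the chunks it covers. The sizes and the index struct are just placeholders, not any particular database's format:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <zlib.h>

#define CHUNK_SIZE (256 * 1024)   /* uncompressed chunk size: the main tuning knob */

typedef struct { int64_t first_key; long file_offset; uLong comp_len; } ChunkIndexEntry;

/* Compress one chunk of entries and record where it landed, so that a later
   time-range read only has to decompress the chunks covering that range. */
static int write_chunk(FILE *out, ChunkIndexEntry *idx,
                       const unsigned char *raw, uLong raw_len, int64_t first_key)
{
    uLongf comp_len = compressBound(raw_len);
    unsigned char *comp = malloc(comp_len);
    if (!comp || compress2(comp, &comp_len, raw, raw_len, Z_BEST_COMPRESSION) != Z_OK)
    {
        free(comp);
        return -1;
    }
    idx->first_key = first_key;
    idx->file_offset = ftell(out);
    idx->comp_len = comp_len;
    int ok = fwrite(comp, 1, comp_len, out) == comp_len;
    free(comp);
    return ok ? 0 : -1;
}

Bigger chunks get closer to the 100:1 end of the ratio range because more redundancy falls into one compression window, but they also waste more decompression work per random read.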
So the following features are optimal:
SSTables
Configurable compression options (level, compression engine, chunk size from 64 KB to 10 MB)
Range Scans
Java Bindings
column datastore for better compression
Nice to have:
Replication
Multi Master
write quorum of 1
Forward and backward iteration over the data. (to compare two time series)
configurable replica distribution
few dependencies
Question:
Which free database can hold archives of highly redundant crawl data (only a few bytes change between entries), compress it well, and not take too much time to query a random record?
(As opposed to the MySQL ARCHIVE format, which has to decompress the whole table up to the requested row.)
Maybe there is a log database that can index a lot of log lines and compress them internally? (In the scope of logstash, fluentd, flume.)
If someone knows benchmarks or numbers on this topic, it would help a lot in evaluating the right technology.
I would be glad for any help!
Assuming you are in a multithreaded, possibly multi-process environment, LevelDB is NOT a good idea.
Cassandra is written in Java, therefore you'll see excessive memory consumption when handling a big load of big files, at least without tweaking the JVM. Also, since it's written in Java, it probably won't be fast enough for really good compression.
I use HyperTable on my Linux-box to store photos and movies.
You can use HyperTable from any language with Thrift-support.
Additionally, if you need it, you can use the C++ drivers, for extra-speed.
One thing that is nice about HyperTable is that it doesn't add a dependency on Java, since it's written in C++, which also means it's blazing fast and not garbage-collected (no memory overhead).
Hypertable does have a Java client however, out of the box.
I use my own C# Thrift-client, which I ported from Java.
See >here< for the code.
Since HyperTable operates on byte arrays, you can simply pass your file to the Thrift client as a byte array, and HyperTable will compress it for you automagically, if you have told it to do so in the column definition.
You could also try MongoDb, if you absolutely want to.
Mongo actually derives from humongous, by the way.
However, I must say I have never "really" used it.

How to store 7.3 billion rows of market data (optimized to be read)?

I have a dataset of 1 minute data of 1000 stocks since 1998, that total around (2012-1998)*(365*24*60)*1000 = 7.3 Billion rows.
Most (99.9%) of the time I will perform only read requests.
What is the best way to store this data in a db?
1 big table with 7.3B rows?
1000 tables (one for each stock symbol) with 7.3M rows each?
any recommendation of database engine? (I'm planning to use Amazon RDS' MySQL)
I'm not used to dealing with datasets this big, so this is an excellent opportunity for me to learn. I will appreciate your help and advice a lot.
Edit:
This is a sample row:
'XX', 20041208, 938, 43.7444, 43.7541, 43.735, 43.7444, 35116.7, 1, 0, 0
Column 1 is the stock symbol, column 2 is the date, column 3 is the minute, the rest are open-high-low-close prices, volume, and 3 integer columns.
Most of the queries will be like "Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"
About the hardware: I plan to use Amazon RDS so I'm flexible on that
Databases are for situations where you have a large, complicated schema that is constantly changing. You only have one "table" with a handful of simple numeric fields. I would do it this way:
Prepare a C/C++ struct to hold the record format:
struct StockPrice
{
    char   ticker_code[4];   /* e.g. "XX", NUL-terminated */
    int    date;             /* e.g. 20041208 */
    int    minute;           /* minute of the trading day */
    double open, high, low, close;
    double volume;
    /* ...plus the three extra integer columns */
};
Then calculate sizeof(StockPrice) * N, where N is the number of records (use a 64-bit system). It should only be a few hundred gig, and fit on a $50 HDD.
Then truncate a file to that size and mmap (on linux, or use CreateFileMapping on windows) it into memory:
// roughly, using the POSIX calls (needs <fcntl.h>, <sys/mman.h>, <unistd.h>)
int fd = open("my.data", O_RDWR | O_CREAT, 0644);
ftruncate(fd, sizeof(StockPrice) * N);
void* p = mmap(NULL, sizeof(StockPrice) * N,
               PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
Cast the mmapped pointer to StockPrice*, and make a pass over your data filling out the array. Unmap and close the file, and you will have your data in one big binary array in a file that can be mmapped again later.
StockPrice* stocks = (StockPrice*) p;
for (size_t i = 0; i < N; i++)
{
    stocks[i] = ParseNextStock(stock_indata_file);
}
munmap(p, sizeof(StockPrice) * N);
close(fd);
You can now mmap it again read-only from any program and your data will be readily available:
int fd = open("my.data", O_RDONLY);
StockPrice* stocks = (StockPrice*) mmap(NULL, sizeof(StockPrice) * N,
                                        PROT_READ, MAP_PRIVATE, fd, 0);
// do stuff with stocks;
So now you can treat it just like an in-memory array of structs. You can create various kinds of index data structures depending on what your "queries" are. The kernel will deal with swapping the data to/from disk transparently so it will be insanely fast.
If you expect to have a certain access pattern (for example, contiguous date ranges), it is best to sort the array in that order so it will hit the disk sequentially.
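For example, here's a rough sketch (assuming the array is sorted by date as above) of finding the start of a date range with a lower-bound binary search; everything after that index is one contiguous, sequential read:

// First record with date >= target_date in the sorted, mmapped array.
size_t lower_bound_by_date(const StockPrice* stocks, size_t n, int target_date)
{
    size_t lo = 0, hi = n;
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        if (stocks[mid].date < target_date)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}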
I have a dataset of 1 minute data of 1000 stocks [...] most (99.9%) of the time I will perform only read requests.
Storing once and reading many times time-based numerical data is a use case termed "time series". Other common time series are sensor data in the Internet of Things, server monitoring statistics, application events etc.
This question was asked in 2012, and since then, several database engines have been developing features specifically for managing time series. I've had great results with InfluxDB, which is open source, written in Go, and MIT-licensed.
InfluxDB has been specifically optimized to store and query time series data. Much more so than Cassandra, which is often touted as great for storing time series:
Optimizing for time series involved certain tradeoffs. For example:
Updates to existing data are a rare occurrence and contentious updates never happen. Time series data is predominantly new data that is never updated.
Pro: Restricting access to updates allows for increased query and write performance
Con: Update functionality is significantly restricted
In open sourced benchmarks,
InfluxDB outperformed MongoDB in all three tests with 27x greater write throughput, while using 84x less disk space, and delivering relatively equal performance when it came to query speed.
Queries are also very simple. If your rows look like <symbol, timestamp, open, high, low, close, volume>, with InfluxDB you can store just that, then query easily. Say, for the range in your example:
SELECT open, close FROM market_data WHERE symbol = 'AAPL' AND time > '2012-04-12 12:15' AND time < '2012-04-13 12:52'
There are no IDs, no keys, and no joins to make. You can do a lot of interesting aggregations. You don't have to vertically partition the table as with PostgreSQL, or contort your schema into arrays of seconds as with MongoDB. Also, InfluxDB compresses really well, while PostgreSQL won't be able to perform any compression on the type of data you have.
Tell us about the queries, and your hardware environment.
I would be very very tempted to go NoSQL, using Hadoop or something similar, as long as you can take advantage of parallelism.
Update
Okay, why?
First of all, notice that I asked about the queries. You can't -- and we certainly can't -- answer these questions without knowing what the workload is like. (I'll coincidentally have an article about this appearing soon, but I can't link it today.) But the scale of the problem makes me think about moving away from a Big Old Database because
My experience with similar systems suggests the access will either be big sequential (computing some kind of time series analysis) or very very flexible data mining (OLAP). Sequential data can be handled better and faster sequentially; OLAP means computing lots and lots of indices, which either will take lots of time or lots of space.
If you're effectively doing big runs over lots of data in an OLAP world, however, a column-oriented approach might be best.
If you want to do random queries, especially making cross-comparisons, a Hadoop system might be effective. Why? Because
you can better exploit parallelism on relatively small commodity hardware.
you can also better implement high reliability and redundancy
many of those problems lend themselves naturally to the MapReduce paradigm.
But the fact is, until we know about your workload, it's impossible to say anything definitive.
Okay, so this is somewhat away from the other answers, but... it feels to me like if you have the data in a file system (one stock per file, perhaps) with a fixed record size, you can get at the data really easily: given a query for a particular stock and time range, you can seek to the right place, fetch all the data you need (you'll know exactly how many bytes), transform the data into the format you need (which could be very quick depending on your storage format) and you're away.
I don't know anything about Amazon storage, but if you don't have anything like direct file access, you could basically have blobs - you'd need to balance large blobs (fewer records, but probably reading more data than you need each time) with small blobs (more records giving more overhead and probably more requests to get at them, but less useless data returned each time).
Next you add caching - I'd suggest giving different servers different stocks to handle for example - and you can pretty much just serve from memory. If you can afford enough memory on enough servers, bypass the "load on demand" part and just load all the files on start-up. That would simplify things, at the cost of slower start-up (which obviously impacts failover, unless you can afford to always have two servers for any particular stock, which would be helpful).
Note that you don't need to store the stock symbol, date or minute for each record - because they're implicit in the file you're loading and the position within the file. You should also consider what accuracy you need for each value, and how to store that efficiently - you've given 6SF in your question, which you could store in 20 bits. Potentially store three 20-bit integers in 64 bits of storage: read it as a long (or whatever your 64-bit integer value will be) and use masking/shifting to get it back to three integers. You'll need to know what scale to use, of course - which you could probably encode in the spare 4 bits, if you can't make it constant.
You haven't said what the other three integer columns are like, but if you could get away with 64 bits for those three as well, you could store a whole record in 16 bytes. That's only ~110GB for the whole database, which isn't really very much...
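Here's a rough sketch of the packing I mean, assuming each of the three values fits in 20 bits after scaling, with the spare 4 bits carrying a scale code:

#include <stdint.h>

/* Pack three 20-bit values plus a 4-bit scale code into one 64-bit word. */
uint64_t pack3(uint32_t a, uint32_t b, uint32_t c, uint32_t scale)
{
    return ((uint64_t)(a & 0xFFFFF))
         | ((uint64_t)(b & 0xFFFFF) << 20)
         | ((uint64_t)(c & 0xFFFFF) << 40)
         | ((uint64_t)(scale & 0xF) << 60);
}

void unpack3(uint64_t w, uint32_t* a, uint32_t* b, uint32_t* c, uint32_t* scale)
{
    *a = (uint32_t)(w & 0xFFFFF);
    *b = (uint32_t)((w >> 20) & 0xFFFFF);
    *c = (uint32_t)((w >> 40) & 0xFFFFF);
    *scale = (uint32_t)(w >> 60);
}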
EDIT: The other thing to consider is that presumably the stock doesn't change over the weekend - or indeed overnight. If the stock market is only open 8 hours per day, 5 days per week, then you only need 40 values per week instead of 168. At that point you could end up with only about 28GB of data in your files... which sounds a lot smaller than you were probably originally thinking. Having that much data in memory is very reasonable.
EDIT: I think I've missed out the explanation of why this approach is a good fit here: you've got a very predictable aspect for a large part of your data - the stock ticker, date and time. By expressing the ticker once (as the filename) and leaving the date/time entirely implicit in the position of the data, you're removing a whole bunch of work. It's a bit like the difference between a String[] and a Map<Integer, String> - knowing that your array index always starts at 0 and goes up in increments of 1 up to the length of the array allows for quick access and more efficient storage.
It is my understanding that HDF5 was designed specifically with the time-series storage of stock data as one potential application. Fellow stackers have demonstrated that HDF5 is good for large amounts of data: chromosomes, physics.
I think any major RDBMS would handle this. At the atomic level, one table with correct partitioning seems reasonable (partition based on your data usage, if fixed - this is likely to be either symbol or date).
You can also look into building aggregated tables for faster access above the atomic level. For example, if your data is at the day level but you often want it back at the week or even month level, then this can be pre-calculated in an aggregate table. In some databases this can be done through a cached view (various names for different DB solutions, but basically it's a view on the atomic data that, once run, is cached/hardened into a fixed temp table and queried for subsequent matching queries; it can be dropped at intervals to free up memory/disk space).
I guess we could help you more with some idea as to the data usage.
Here is a free, open source attempt to create a market data server on top of the Microsoft SQL Server 2012 database, which should be good for OLAP analysis:
http://github.com/kriasoft/market-data
First, there aren't 365 trading days in the year: with holidays and 104 weekend days, call it 250 trading days, times only the hours the market is actually open, as someone said. So your estimate of 7.3 billion rows is vastly over-calculated; it's only about 1.7 million rows per symbol for 14 years. Also, using the symbol as the primary key is not a good idea, since symbols change and can look like A or GAC-DB-B.TO; use a numeric k_equity_id together with a symbol (char) column. Your tables of minute-level price info then key on:
k_equity_id
k_date
k_minute
and for the EOD table (which will be viewed 1000x more than the other data):
k_equity_id
k_date
Second, don't store your by-minute OHLC data in the same DB table as the EOD (end of day) table, since anyone wanting to look at a PnF or line chart over a year period has zero interest in the by-the-minute information.
Let me recommend that you take a look at Apache Solr, which I think would be ideal for your particular problem. Basically, you would first index your data (each row being a "document"). Solr is optimized for searching and natively supports range queries on dates. Your nominal query,
"Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"
would translate to something like:
?q=stock:AAPL AND date:[2012-04-12T12:15:00Z TO 2012-04-13T12:52:00Z]
Assuming "stock" is the stock name and "date" is a "DateField" created from the "date" and "minute" columns of your input data on indexing. Solr is incredibly flexible and I really can't say enough good things about it. So, for example, if you needed to maintain the fields in the original data, you can probably find a way to dynamically create the "DateField" as part of the query (or filter).
You should compare the slow solutions with a simple, optimized in-memory model. Uncompressed, it fits in a 256 GB RAM server. A snapshot (all symbols for one minute) fits in 32 KB, and you just index it positionally on datetime and stock. Then you can make specialized snapshots, as the open of one minute often equals the close of the previous one.
[edit] Why do you think it makes sense to use a database at all (RDBMS or NoSQL)? This data doesn't change, and it fits in memory. That is not a use case where a DBMS can add value.
If you have the hardware, I recommend MySQL Cluster. You get the MySQL/RDBMS interface you are so familiar with, and you get fast and parallel writes. Reads will be slower than regular MySQL due to network latency, but you have the advantage of being able to parallelize queries and reads due to the way MySQL Cluster and the NDB storage engine works.
Make sure that you have enough MySQL Cluster machines and enough memory/RAM for each of those though - MySQL Cluster is a heavily memory-oriented database architecture.
Or Redis, if you don't mind a key-value / NoSQL interface to your reads/writes. Make sure that Redis has enough memory - it's super-fast for reads and writes, and you can do basic queries with it (non-RDBMS though), but it is also an in-memory database.
Like others have said, knowing more about the queries you will be running will help.
You will want the data stored in a columnar table / database. Database systems like Vertica and Greenplum are columnar databases, and I believe SQL Server now allows for columnar tables. These are extremely efficient for SELECTing from very large datasets. They are also efficient at importing large datasets.
A free columnar database is MonetDB.
If your use case is to simply read rows without aggregation, you can use an Aerospike cluster. It's an in-memory database with file-system persistence, and it's also SSD-optimized.
If your use case needs aggregated data, go for a MongoDB cluster with date-range sharding. You can group the data year-wise into shards.

Recommended Document Structure for CouchDB

We are currently considering a change from Postgres to CouchDB for a usage monitoring application. Some numbers:
Approximately 2000 connections, polled every 5 minutes, for approximately 600,000 new rows per day. In Postgres, we store this data, partitioned by day:
t_usage {service_id, timestamp, data_in, data_out}
t_usage_20100101 inherits t_usage.
t_usage_20100102 inherits t_usage. etc.
We write data with an optimistic stored proc that presumes the partition exists and creates it if necessary. We can insert very quickly.
For reading of the data, our use cases, in order of importance and current performance are:
* Single Service, Single Day Usage : Good Performance
* Multiple Services, Month Usage : Poor Performance
* Single Service, Month Usage : Poor Performance
* Multiple Services, Multiple Months : Very Poor Performance
* Multiple Services, Single Day : Good Performance
This makes sense because the partitions are optimised for days, which is by far our most important use case. However, we are looking at methods of improving the secondary requirements.
We often need to parameterise the query by hours as well, for example, only giving results between 8am and 6pm, so summary tables are of limited use. (These parameters change with enough frequency that creating multiple summary tables of data is prohibitive).
With that background, the first question is: Is CouchDB appropriate for this data? If it is, given the above use cases, how would you best model the data in CouchDB documents? Some options I've put together so far, which we are in the process of benchmarking are (_id, _rev excluded):
One Document Per Connection Per Day
{
service_id:555
day:20100101
usage: {1265248762: {in:584,out:11342}, 1265249062: {in:94,out:1242}}
}
Approximately 60,000 new documents a month. Most new data would be updates to existing documents, rather than new documents.
(Here, the objects in usage are keyed on the timestamp of the poll, and the values are the bytes in and bytes out.)
One Document Per Connection Per Month
{
service_id:555
month:201001
usage: {1265248762: {in:584,out:11342}, 1265249062: {in:94,out:1242}}
}
Approximately 2,000 new documents a month. Moderate updates to existing documents required.
One Document Per Row of Data Collected
{
service_id:555
timestamp:1265248762
in:584
out:11342
}
{
service_id:555
timestamp:1265249062
in:94
out:1242
}
Approximately 15,000,000 new documents a month. All data would be an insert to a new document. Faster inserts, but I have questions about how efficient it's going to be after a year or 2 years with hundreds of millions of documents. The file IO would seem prohibitive (though I'm the first to admit I don't fully understand the mechanics of it).
I'm trying to approach this in a document-oriented way, though breaking the RDBMS habit is difficult :) The fact that you can only minimally parameterise views also has me a bit concerned. That said, which of the above would be the most appropriate? Are there other formats that I haven't considered which will perform better?
Thanks in advance,
Jamie.
I don't think it's a horrible idea.
Let's consider your Connection/Month scenario.
Given that an entry is ~40 (that's generous) characters long, and you get ~8,200 entries per month, your final document size will be ~350K long at the end of the month.
That means, going full bore, you'll be reading and writing 2,000 350K documents every 5 minutes.
I/O wise, this is less than 6 MB/s, considering read and write, averaged for the 5m window of time. That's well within even low end hardware today.
However, there is another issue. When you store that document, Couch is going to evaluate its contents in order to build its view, so Couch will be parsing a 350K document each time. My fear is that (at last check, but it's been some time) Couch didn't scale well across CPU cores, so this could easily pin the single CPU core that Couch will be using. I would like to hope that Couch can read, parse, and process 2 MB/s, but I frankly don't know. For all its benefits, Erlang isn't the best haul-ass-in-a-straight-line computer language.
The final concern is keeping up with the database. At the end of the month, with Couch's append-only architecture, you will be writing 700MB of data every 5 minutes, which is about 8.4GB per hour, or roughly 200GB after 24 hours.
After DB compaction, that crushes down to 700MB (for a single month), but during the process the file will be getting big, and quite quickly.
On the retrieve side, these large documents don't scare me. Loading up a 350K JSON document, yes it's big, but it's not that big, not on modern hardware. There are avatars on bulletin boards bigger than that. So anything you want to do regarding the activity of a connection over a month will be pretty fast, I think. Across connections, obviously the more you grab, the more expensive it will get (700MB for all 2000 connections). 700MB is a real number that has real impact. Plus, your process will need to be aggressive about throwing out the data you don't care about (the chaff), unless you want to load up 700MB of heap in your report process.
Given these numbers, Connection/Day may be a better bet, as you can control the granularity a bit better. However, frankly, I would go for the coarsest document you can, because I think that gives you the best value from the database, solely because today all the head seeks and platter rotations are what kill a lot of I/O performance, while most disks stream data very well. Larger documents (assuming well-located data, which shouldn't be a problem since Couch is regularly compacted) stream more than they seek. Seeking in memory is "free" compared to a disk.
By all means run your own tests on your hardware, but take all these considerations to heart.
EDIT:
After more experiments...
Couple of interesting observations.
During import of large documents, CPU is just as important as I/O speed. This is because of the amount of marshalling and CPU consumed by converting the JSON into the internal model for use by the views. Using the large (350K) documents, my CPUs were pretty much maxed out (350%). In contrast, with the smaller documents, they were humming along at 200%, even though, overall, it was the same information, just chunked up differently.
For I/O, during the 350K docs, I was charting 11MB/sec, but with the smaller docs, it was only 8MB/sec.
Compaction appeared to be almost I/O bound. It's hard for me to get good numbers on my I/O potential. A copy of a cached file pushes 40+MB/sec. Compaction ran at about 8MB/sec. But that's consistent with the raw load (assuming couch is moving stuff message by message). The CPU is lower, as it's doing less processing (it's not interpreting the JSON payloads, or rebuilding the views), plus it was a single CPU doing the work.
Finally, for reading, I tried to dump out the entire database. A single CPU was pegged for this, and my I/O was pretty low. I made it a point to ensure that the CouchDB file wasn't actually cached; my machine has a lot of memory, so a lot of stuff is cached. The raw dump through _all_docs was only about 1 MB/sec: that's almost all seek and rotational delay, more than anything else. When I did the same with the large documents, the I/O was hitting 3 MB/sec, which just shows the streaming effect I mentioned as a benefit of larger documents.
And it should be noted that there are techniques on the Couch website for improving performance that I was not following; notably, I was using random IDs. Finally, this wasn't done as a gauge of what Couch's performance is, but rather of where the load appears to end up. The large vs small document differences I thought were interesting.
Finally, ultimate performance isn't as important as simply performing well enough for your application on your hardware. As you mentioned, you're doing your own testing, and that's all that really matters.

Database scalability - performance vs. database size

I'm creating an app that will have to put at most 32 GB of data into my database. I am using B-tree indexing because the reads will be range queries (like 0 < time < 1hr).
At the beginning (database size = 0GB), I get between 60 and 70 writes per millisecond. After, say, 5GB, the three databases I've tested (H2, Berkeley DB, Sybase SQL Anywhere) have REALLY slowed down, to under 5 writes per millisecond.
Questions:
Is this typical?
Would I still see this scalability issue if I REMOVED indexing?
What are the causes of this problem?
Notes:
Each record consists of a few ints
Yes; indexing improves fetch times at the cost of insert times. Your numbers sound reasonable - without knowing more.
You can benchmark it. You'll need to have a reasonable amount of data stored. Consider whether or not to index based upon the queries - heavy fetch and light insert? index everywhere a where clause might use it. Light fetch, heavy inserts? Probably avoid indexes. Mixed workload; benchmark it!
When benchmarking, you want as real or realistic data as possible, both in volume and on data domain (distribution of data, not just all "henry smith" but all manner of names, for example).
It is typical for indexes to sacrifice insert speed for access speed. You can see this taken to an extreme in a database table (and I've seen these in the wild) that indexes every single column. There's nothing inherently wrong with that if the number of updates is small compared to the number of queries.
However, given that:
1/ You seem to be concerned that your writes slow down to 5/ms (that's still 5000/second),
2/ You're only writing a few integers per record; and
3/ Your queries are only based on time ranges,
you may want to consider bypassing a regular database and rolling your own sort-of-database (my thoughts are that you're collecting real-time data such as device readings).
If you're only ever writing sequentially-timed data, you can just use a flat file and periodically write the 'index' information separately (say at the start of every minute).
This will greatly speed up your writes but still allow a relatively efficient read process - worst case is you'll have to find the start of the relevant period and do a scan from there.
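A rough sketch of what I mean (the record layout is just a placeholder for whatever your readings look like): append fixed-size records to a data file and, whenever a new minute starts, append a (minute, offset) pair to a small index file.

#include <stdio.h>
#include <stdint.h>

typedef struct { int64_t timestamp_ms; int32_t values[4]; } Record;
typedef struct { int64_t minute_start_ms; int64_t file_offset; } IndexEntry;

/* Append one record; when a new minute starts, also append an index entry.
   Initialise *current_minute to -1 before the first call. */
void append_record(FILE* data, FILE* index, const Record* rec, int64_t* current_minute)
{
    int64_t minute = rec->timestamp_ms / 60000;
    if (minute != *current_minute)
    {
        IndexEntry e = { minute * 60000, (int64_t) ftell(data) };
        fwrite(&e, sizeof e, 1, index);
        *current_minute = minute;
    }
    fwrite(rec, sizeof *rec, 1, data);
}

A read then binary-searches the small index file for the start of the period and scans the data file forward from that offset.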
This of course depends on my assumption of your storage being correct:
1/ You're writing records sequentially based on time.
2/ You only need to query on time ranges.
Yes, indexes will generally slow inserts down, while significantly speeding up selects (queries).
Do keep in mind that not all inserts into a B-tree are equal. It's a tree; if all you do is insert into it, it has to keep growing. The data structure allows for some padding, but if you keep inserting into it numbers that are growing sequentially, it has to keep adding new pages and/or shuffle things around to stay balanced. Make sure that your tests are inserting numbers that are well distributed (assuming that's how they will come in real life), and see if you can do anything to tell the B-tree how many items to expect from the beginning.
Totally agree with @Richard-t - it is quite common in offline/batch scenarios to remove indexes completely before bulk updates to a corpus, only to reapply them when the update is complete.
The type of indices applied also influences insertion performance - for example, with a SQL Server clustered index, update I/O is used for data distribution as well as index update, whereas nonclustered indexes are updated in separate (and therefore more expensive) I/O operations.
As with any engineering project, the best advice is to measure with real datasets (skewed page distributions, tearing, etc.).
I think somewhere in the BDB docs they mention that page size greatly affects this behavior in B-trees. Assuming you aren't doing much in the way of concurrency and you have fixed record sizes, you should try increasing your page size.
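If memory serves, setting the page size through the Berkeley DB C API looks roughly like this; the 64 KB value is just an example, and it has to be set before the database file is first created:

#include <db.h>

int open_btree(DB** out)
{
    DB* dbp;
    int ret = db_create(&dbp, NULL, 0);
    if (ret != 0) return ret;
    dbp->set_pagesize(dbp, 64 * 1024);   /* a power of two, 512 bytes to 64 KB */
    ret = dbp->open(dbp, NULL, "records.db", NULL, DB_BTREE, DB_CREATE, 0664);
    if (ret != 0) { dbp->close(dbp, 0); return ret; }
    *out = dbp;
    return 0;
}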

How is data compression more effective than indexing for search performance?

For our application, we keep large amounts of data indexed by three integer columns (source, type and time). Loading significant chunks of that data can take some time and we have implemented various measures to reduce the amount of data that has to be searched and loaded for larger queries, such as storing larger granularities for queries that don't require a high resolution (time-wise).
When searching for data in our backup archives, where the data is stored in bzipped text files but has basically the same structure, I noticed that it is significantly faster to untar to stdout and pipe it through grep than to untar it to disk and grep the files. In fact, the untar-to-pipe was even noticeably faster than just grepping the uncompressed files (i.e. discounting the untar-to-disk).
This made me wonder if the performance impact of disk I/O is actually much heavier than I thought. So here's my question:
Do you think putting the data of multiple rows into a (compressed) blob field of a single row, and searching for single rows on the fly during extraction, could be faster than searching for the same rows via the table index?
For example, instead of having this table
CREATE TABLE data ( `source` INT, `type` INT, `timestamp` INT, `value` DOUBLE);
I would have
CREATE TABLE quickdata ( `source` INT, `type` INT, `day` INT, `dayvalues` BLOB );
with approximately 100-300 rows in data for each row in quickdata and searching for the desired timestamps on the fly during decompression and decoding of the blob field.
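What I have in mind for the extraction side is roughly this (assuming, just for illustration, that the blob holds zlib-compressed, fixed-size (timestamp, value) pairs; the actual row format would differ):

#include <stddef.h>
#include <stdint.h>
#include <zlib.h>

typedef struct { int32_t timestamp; double value; } Row;

/* Decompress one day's blob into scratch and visit the rows inside [t0, t1). */
int scan_blob(const unsigned char* blob, size_t blob_len,
              unsigned char* scratch, size_t scratch_cap,
              int32_t t0, int32_t t1)
{
    uLongf raw_len = (uLongf) scratch_cap;
    if (uncompress(scratch, &raw_len, blob, (uLong) blob_len) != Z_OK)
        return -1;

    const Row* rows = (const Row*) scratch;
    size_t n = raw_len / sizeof(Row), hits = 0;
    for (size_t i = 0; i < n; i++)
        if (rows[i].timestamp >= t0 && rows[i].timestamp < t1)
            hits++;   /* or copy the row out to the result set */
    return (int) hits;
}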
Does this make sense to you? What parameters should I investigate? What strings might be attached? What DB features (any DBMS) exist to achieve similar effects?
This made me wonder if the performance impact of disk I/O is actually much heavier than I thought.
Definitely. If you have to go to disk, the performance hit is many orders of magnitude greater than memory. This reminds me of the classic Jim Gray paper, Distributed Computing Economics:
Computing economics are changing. Today there is rough price parity between (1) one database access, (2) ten bytes of network traffic, (3) 100,000 instructions, (4) 10 bytes of disk storage, and (5) a megabyte of disk bandwidth. This has implications for how one structures Internet-scale distributed computing: one puts computing as close to the data as possible in order to avoid expensive network traffic.
The question, then, is how much data do you have and how much memory can you afford?
And if the database gets really big -- as in nobody could ever afford that much memory, even in 20 years -- you need clever distributed database systems like Google's BigTable or Hadoop.
I made a similar discovery when working with a database from Python: the cost of accessing disk is very, very high. It turned out to be much faster (i.e. nearly two orders of magnitude) to request a whole chunk of data and iterate through it in Python than it was to create seven queries that were narrower (one per day of the data in question).
It blew out even further when I was getting hourly data: 24x7 is a lot of queries!
