How is data compression more effective than indexing for search performance? - database

For our application, we keep large amounts of data indexed by three integer columns (source, type and time). Loading significant chunks of that data can take some time and we have implemented various measures to reduce the amount of data that has to be searched and loaded for larger queries, such as storing larger granularities for queries that don't require a high resolution (time-wise).
When searching for data in our backup archives, where the data is stored in bzipped text files, but has basically the same structure, I noticed that it is significantly faster to untar to stdout and pipe it through grep than to untar it to disk and grep the files. In fact, the untar-to-pipe was even noticeably faster than just grepping the uncompressed files (i. e. discounting the untar-to-disk).
This made me wonder if the performance impact of disk I/O is actually much heavier than I thought. So here's my question:
Do you think putting the data of multiple rows into a (compressed) blob field of a single row and search for single rows on the fly during extraction could be faster than searching for the same rows via the table index?
For example, instead of having this table
CREATE TABLE data ( `source` INT, `type` INT, `timestamp` INT, `value` DOUBLE);
I would have
CREATE TABLE quickdata ( `source` INT, `type` INT, `day` INT, `dayvalues` BLOB );
with approximately 100-300 rows in data for each row in quickdata and searching for the desired timestamps on the fly during decompression and decoding of the blob field.
Does this make sense to you? What parameters should I investigate? What strings might be attached? What DB features (any DBMS) exist to achieve similar effects?

This made me wonder if the performance impact of disk I/O is actually much heavier than I thought.
Definitely. If you have to go to disk, the performance hit is many orders of magnitude greater than memory. This reminds me of the classic Jim Gray paper, Distributed Computing Economics:
Computing economics are changing. Today there is rough price parity between (1) one database access, (2) ten bytes of network traffic, (3) 100,000 instructions, (4) 10 bytes of disk storage, and (5) a megabyte of disk bandwidth. This has implications for how one structures Internet-scale distributed computing: one puts computing as close to the data as possible in order to avoid expensive network traffic.
The question, then, is how much data do you have and how much memory can you afford?
And if the database gets really big -- as in nobody could ever afford that much memory, even in 20 years -- you need clever distributed database systems like Google's BigTable or Hadoop.

I made a similar discovery when working within Python on a database: the cost of accessing a disk is very, very high. It turned out to be much faster (ie nearly two orders of magnitude) to request a whole chunk of data and iterate through it in python than it was to create seven queries that were narrower. (One per day in question for the data)
It blew out even further when I was getting hourly data. 24x7 lots of queries it lots!

Related

Which NoSQL Database for Mostly Writing

I'm working on a system that will generate and store large amounts of data to disk. A previously developed system at the company used ordinary files to store its data but for several reasons it became very hard to manage.
I believe NoSQL databases are good solutions for us. What we are going to store is generally documents (usually around 100K but occasionally can be much larger or smaller) annotated with some metadata. Query performance is not top priority. The priority is writing in a way that I/O becomes as small a hassle as possible. The rate of data generation is about 1Gbps, but we might be moving on 10Gbps (or even more) in the future.
My other requirement is the availability of a (preferably well documented) C API. I'm currently testing MongoDB. Is this a good choice? If not, what other database system can I use?
The rate of data generation is about 1Gbps,... I'm currently testing MongoDB. Is this a good choice?
OK, so just to clarify, your data rate is ~1 gigaBYTE per 10 seconds. So you are filling a 1TB hard drive every 20 minutes or so?
MongoDB has pretty solid write rates, but it is ideally used in situations with a reasonably low RAM to Data ratio. You want to keep at least primary indexes in memory along with some data.
In my experience, you want about 1GB of RAM for every 5-10GB of Data. Beyond that number, read performance drops off dramatically. Once you get to 1GB of RAM for 100GB of data, even adding new data can be slow as the index stops fitting in RAM.
The big key here is:
What queries are you planning to run and how does MongoDB make running these queries easier?
Your data is very quickly going to occupy enough space that basically every query will just be going to disk. Unless you have a very specific indexing and sharding strategy, you end up just doing disk scans.
Additionally, MongoDB does not support compression. So you will be using lots of disk space.
If not, what other database system can I use?
Have you considered compressed flat files? Or possibly a big data Map/Reduce system like Hadoop (I know Hadoop is written in Java)
If C is key requirement, maybe you want to look at Tokyo/Kyoto Cabinet?
EDIT: more details
MongoDB does not support full-text search. You will have to look to other tools (Sphinx/Solr) for such things.
Larges indices defeat the purpose of using an index.
According to your numbers, you are writing 10M documents / 20 mins or about 30M / hour. Each document needs about 16+ bytes for an index entry. 12 bytes for ObjectID + 4 bytes for pointer into the 2GB file + 1 byte for pointer to file + some amount of padding.
Let's say that every index entry needs about 20 bytes, then your index is growing at 600MB / hour or 14.4GB / day. And that's just the default _id index.
After 4 days, your main index will no longer fit into RAM and your performance will start to drop off dramatically. (this is well-documented under MongoDB)
So it's going to be really important to figure out which queries you want to run.
Have a look at Cassandra. It executes writes are much faster than reads. Probably, that's what you're looking for.

How to store 7.3 billion rows of market data (optimized to be read)?

I have a dataset of 1 minute data of 1000 stocks since 1998, that total around (2012-1998)*(365*24*60)*1000 = 7.3 Billion rows.
Most (99.9%) of the time I will perform only read requests.
What is the best way to store this data in a db?
1 big table with 7.3B rows?
1000 tables (one for each stock symbol) with 7.3M rows each?
any recommendation of database engine? (I'm planning to use Amazon RDS' MySQL)
I'm not used to deal with datasets this big, so this is an excellent opportunity for me to learn. I will appreciate a lot your help and advice.
Edit:
This is a sample row:
'XX', 20041208, 938, 43.7444, 43.7541, 43.735, 43.7444, 35116.7, 1, 0, 0
Column 1 is the stock symbol, column 2 is the date, column 3 is the minute, the rest are open-high-low-close prices, volume, and 3 integer columns.
Most of the queries will be like "Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"
About the hardware: I plan to use Amazon RDS so I'm flexible on that
So databases are for situations where you have a large complicated schema that is constantly changing. You only have one "table" with a hand-full of simple numeric fields. I would do it this way:
Prepare a C/C++ struct to hold the record format:
struct StockPrice
{
char ticker_code[2];
double stock_price;
timespec when;
etc
};
Then calculate sizeof(StockPrice[N]) where N is the number of records. (On a 64-bit system) It should only be a few hundred gig, and fit on a $50 HDD.
Then truncate a file to that size and mmap (on linux, or use CreateFileMapping on windows) it into memory:
//pseduo-code
file = open("my.data", WRITE_ONLY);
truncate(file, sizeof(StockPrice[N]));
void* p = mmap(file, WRITE_ONLY);
Cast the mmaped pointer to StockPrice*, and make a pass of your data filling out the array. Close the mmap, and now you will have your data in one big binary array in a file that can be mmaped again later.
StockPrice* stocks = (StockPrice*) p;
for (size_t i = 0; i < N; i++)
{
stocks[i] = ParseNextStock(stock_indata_file);
}
close(file);
You can now mmap it again read-only from any program and your data will be readily available:
file = open("my.data", READ_ONLY);
StockPrice* stocks = (StockPrice*) mmap(file, READ_ONLY);
// do stuff with stocks;
So now you can treat it just like an in-memory array of structs. You can create various kinds of index data structures depending on what your "queries" are. The kernel will deal with swapping the data to/from disk transparently so it will be insanely fast.
If you expect to have a certain access pattern (for example contiguous date) it is best to sort the array in that order so it will hit the disk sequentially.
I have a dataset of 1 minute data of 1000 stocks [...] most (99.9%) of the time I will perform only read requests.
Storing once and reading many times time-based numerical data is a use case termed "time series". Other common time series are sensor data in the Internet of Things, server monitoring statistics, application events etc.
This question was asked in 2012, and since then, several database engines have been developing features specifically for managing time series. I've had great results with the InfluxDB, which is open sourced, written in Go, and MIT-licensed.
InfluxDB has been specifically optimized to store and query time series data. Much more so than Cassandra, which is often touted as great for storing time series:
Optimizing for time series involved certain tradeoffs. For example:
Updates to existing data are a rare occurrence and contentious updates never happen. Time series data is predominantly new data that is never updated.
Pro: Restricting access to updates allows for increased query and write performance
Con: Update functionality is significantly restricted
In open sourced benchmarks,
InfluxDB outperformed MongoDB in all three tests with 27x greater write throughput, while using 84x less disk space, and delivering relatively equal performance when it came to query speed.
Queries are also very simple. If your rows look like <symbol, timestamp, open, high, low, close, volume>, with InfluxDB you can store just that, then query easily. Say, for the last 10 minutes of data:
SELECT open, close FROM market_data WHERE symbol = 'AAPL' AND time > '2012-04-12 12:15' AND time < '2012-04-13 12:52'
There are no IDs, no keys, and no joins to make. You can do a lot of interesting aggregations. You don't have to vertically partition the table as with PostgreSQL, or contort your schema into arrays of seconds as with MongoDB. Also, InfluxDB compresses really well, while PostgreSQL won't be able to perform any compression on the type of data you have.
Tell us about the queries, and your hardware environment.
I would be very very tempted to go NoSQL, using Hadoop or something similar, as long as you can take advantage of parallelism.
Update
Okay, why?
First of all, notice that I asked about the queries. You can't -- and we certainly can't -- answer these questions without knowing what the workload is like. (I'll co-incidentally have an article about this appearing soon, but I can't link it today.) But the scale of the problem makes me think about moving away from a Big Old Database because
My experience with similar systems suggests the access will either be big sequential (computing some kind of time series analysis) or very very flexible data mining (OLAP). Sequential data can be handled better and faster sequentially; OLAP means computing lots and lots of indices, which either will take lots of time or lots of space.
If You're doing what are effectively big runs against many data in an OLAP world, however, a column-oriented approach might be best.
If you want to do random queries, especially making cross-comparisons, a Hadoop system might be effective. Why? Because
you can better exploit parallelism on relatively small commodity hardware.
you can also better implement high reliability and redundancy
many of those problems lend themselves naturally to the MapReduce paradigm.
But the fact is, until we know about your workload, it's impossible to say anything definitive.
Okay, so this is somewhat away from the other answers, but... it feels to me like if you have the data in a file system (one stock per file, perhaps) with a fixed record size, you can get at the data really easily: given a query for a particular stock and time range, you can seek to the right place, fetch all the data you need (you'll know exactly how many bytes), transform the data into the format you need (which could be very quick depending on your storage format) and you're away.
I don't know anything about Amazon storage, but if you don't have anything like direct file access, you could basically have blobs - you'd need to balance large blobs (fewer records, but probably reading more data than you need each time) with small blobs (more records giving more overhead and probably more requests to get at them, but less useless data returned each time).
Next you add caching - I'd suggest giving different servers different stocks to handle for example - and you can pretty much just serve from memory. If you can afford enough memory on enough servers, bypass the "load on demand" part and just load all the files on start-up. That would simplify things, at the cost of slower start-up (which obviously impacts failover, unless you can afford to always have two servers for any particular stock, which would be helpful).
Note that you don't need to store the stock symbol, date or minute for each record - because they're implicit in the file you're loading and the position within the file. You should also consider what accuracy you need for each value, and how to store that efficiently - you've given 6SF in your question, which you could store in 20 bits. Potentially store three 20-bit integers in 64 bits of storage: read it as a long (or whatever your 64-bit integer value will be) and use masking/shifting to get it back to three integers. You'll need to know what scale to use, of course - which you could probably encode in the spare 4 bits, if you can't make it constant.
You haven't said what the other three integer columns are like, but if you could get away with 64 bits for those three as well, you could store a whole record in 16 bytes. That's only ~110GB for the whole database, which isn't really very much...
EDIT: The other thing to consider is that presumably the stock doesn't change over the weekend - or indeed overnight. If the stock market is only open 8 hours per day, 5 days per week, then you only need 40 values per week instead of 168. At that point you could end up with only about 28GB of data in your files... which sounds a lot smaller than you were probably originally thinking. Having that much data in memory is very reasonable.
EDIT: I think I've missed out the explanation of why this approach is a good fit here: you've got a very predictable aspect for a large part of your data - the stock ticker, date and time. By expressing the ticker once (as the filename) and leaving the date/time entirely implicit in the position of the data, you're removing a whole bunch of work. It's a bit like the difference between a String[] and a Map<Integer, String> - knowing that your array index always starts at 0 and goes up in increments of 1 up to the length of the array allows for quick access and more efficient storage.
It is my understanding that HDF5 was designed specifically with the time-series storage of stock data as one potential application. Fellow stackers have demonstrated that HDF5 is good for large amounts of data: chromosomes, physics.
I think any major RDBMS would handle this. At the atomic level, a one table with correct partitioning seems reasonable (partition based on your data usage if fixed - this is ikely to be either symbol or date).
You can also look into building aggregated tables for faster access above the atomic level. For example if your data is at day, but you often get data back at the wekk or even month level, then this can be pre-calculated in an aggregate table. In some databases this can be done though a cached view (various names for different DB solutions - but basically its a view on the atomic data, but once run the view is cached/hardened intoa fixed temp table - that is queried for subsequant matching queries. This can be dropped at interval to free up memory/disk space).
I guess we could help you more with some idea as to the data usage.
Here is an attempt to create a Market Data Server on top of the Microsoft SQL Server 2012 database which should be good for OLAP analysis, a free open source project:
http://github.com/kriasoft/market-data
First, there isn't 365 trading days in the year, with holidays 52 weekends (104) = say 250 x the actual hours of day market is opened like someone said, and to use the symbol as the primary key is not a good idea since symbols change, use a k_equity_id (numeric) with a symbol (char) since symbols can be like this A , or GAC-DB-B.TO , then in your data tables of price info, you have, so your estimate of 7.3 billion is vastly over calculated since it's only about 1.7 million rows per symbol for 14 years.
k_equity_id
k_date
k_minute
and for the EOD table (that will be viewed 1000x over the other data)
k_equity_id
k_date
Second, don't store your OHLC by minute data in the same DB table as and EOD table (end of day) , since anyone wanting to look at a pnf, or line chart, over a year period , has zero interest in the by the minute information.
Let me recommend that you take a look at apache solr, which I think would be ideal for your particular problem. Basically, you would first index your data (each row being a "document"). Solr is optimized for searching and natively supports range queries on dates. Your nominal query,
"Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"
would translate to something like:
?q=stock:AAPL AND date:[2012-04-12T12:15:00Z TO 2012-04-13T12:52:00Z]
Assuming "stock" is the stock name and "date" is a "DateField" created from the "date" and "minute" columns of your input data on indexing. Solr is incredibly flexible and I really can't say enough good things about it. So, for example, if you needed to maintain the fields in the original data, you can probably find a way to dynamically create the "DateField" as part of the query (or filter).
You should compare the slow solutions with a simple optimized in memory model. Uncompressed it fits in a 256 GB ram server. A snapshot fits in 32 K and you just index it positionally on datetime and stock. Then you can make specialized snapshots, as open of one often equals closing of the previous.
[edit] Why do you think it makes sense to use a database at all (rdbms or nosql)? This data doesn't change, and it fits in memory. That is not a use case where a dbms can add value.
If you have the hardware, I recommend MySQL Cluster. You get the MySQL/RDBMS interface you are so familiar with, and you get fast and parallel writes. Reads will be slower than regular MySQL due to network latency, but you have the advantage of being able to parallelize queries and reads due to the way MySQL Cluster and the NDB storage engine works.
Make sure that you have enough MySQL Cluster machines and enough memory/RAM for each of those though - MySQL Cluster is a heavily memory-oriented database architecture.
Or Redis, if you don't mind a key-value / NoSQL interface to your reads/writes. Make sure that Redis has enough memory - its super-fast for reads and writes, you can do basic queries with it (non-RDBMS though) but is also an in-memory database.
Like others have said, knowing more about the queries you will be running will help.
You will want the data stored in a columnar table / database. Database systems like Vertica and Greenplum are columnar databases, and I believe SQL Server now allows for columnar tables. These are extremely efficient for SELECTing from very large datasets. They are also efficient at importing large datasets.
A free columnar database is MonetDB.
If your use case is to simple read rows without aggregation, you can use Aerospike cluster. It's in memory database with support of file system for persistence. It's also SSD optimized.
If your use case needs aggregated data, go for Mongo DB cluster with date range sharding. You can club year vise data in shards.

Database that can handle >500 millions rows

I am looking for a database that could handle (create an index on a column in a reasonable time and provide results for select queries in less than 3 sec) more than 500 millions rows. Would Postgresql or Msql on low end machine (Core 2 CPU 6600, 4GB, 64 bit system, Windows VISTA) handle such a large number of rows?
Update: Asking this question, I am looking for information which database I should use on a low end machine in order to provide results to select questions with one or two fields specified in where clause. No joins. I need to create indices -- it can not take ages like on mysql -- to achieve sufficient performance for my select queries. This machine is a test PC to perform an experiment.
The table schema:
create table mapper {
key VARCHAR(1000),
attr1 VARCHAR (100),
attr1 INT,
attr2 INT,
value VARCHAR (2000),
PRIMARY KEY (key),
INDEX (attr1),
INDEX (attr2)
}
MSSQL can handle that many rows just fine. The query time is completely dependent on a lot more factors than just simple row count.
For example, it's going to depend on:
how many joins those queries do
how well your indexes are set up
how much ram is in the machine
speed and number of processors
type and spindle speed of hard drives
size of the row/amount of data returned in the query
Network interface speed / latency
It's very easy to have a small (less than 10,000 rows) table which would take a couple minutes to execute a query against. For example, using lots of joins, functions in the where clause, and zero indexes on a Atom processor with 512MB of total ram. ;)
It takes a bit more work to make sure all of your indexes and foreign key relationships are good, that your queries are optimized to eliminate needless function calls and only return the data you actually need. Also, you'll need fast hardware.
It all boils down to how much money you want to spend, the quality of the dev team, and the size of the data rows you are dealing with.
UPDATE
Updating due to changes in the question.
The amount of information here is still not enough to give a real world answer. You are going to just have to test it and adjust your database design and hardware as necessary.
For example, I could very easily have 1 billion rows in a table on a machine with those specs and run a "select top(1) id from tableA (nolock)" query and get an answer in milliseconds. By the same token, you can execute a "select * from tablea" query and it take a while because although the query executed quickly, transferring all of that data across the wire takes awhile.
Point is, you have to test. Which means, setting up the server, creating some of your tables, and populating them. Then you have to go through performance tuning to get your queries and indexes right. As part of the performance tuning you're going to uncover not only how the queries need to be restructured but also exactly what parts of the machine might need to be replaced (ie: disk, more ram, cpu, etc) based on the lock and wait types.
I'd highly recommend you hire (or contract) one or two DBAs to do this for you.
Most databases can handle this, it's about what you are going to do with this data and how you do it. Lots of RAM will help.
I would start with PostgreSQL, it's for free and has no limits on RAM (unlike SQL Server Express) and no potential problems with licences (too many processors, etc.). But it's also my work :)
Pretty much every non-stupid database can handle a billion rows today easily. 500 million is doable even on 32 bit systems (albeit 64 bit really helps).
The main problem is:
You need to have enough RAM. How much is enough depends on your queries.
You need to have a good enough disc subsystem. This pretty much means if you want to do large selects, then a single platter for everything is totally out of the question. Many spindles (or a SSD) are needed to handle the IO load.
Both Postgres as well as Mysql can easily handle 500 million rows. On proper hardware.
What you want to look at is the table size limit the database software imposes. For example, as of this writing, MySQL InnoDB has a limit of 64 TB per table, while PostgreSQL has a limit of 32 TB per table; neither limits the number of rows per table. If correctly configured, these database systems should not have trouble handling tens or hundreds of billions of rows (if each row is small enough), let alone 500 million rows.
For best performance handling extremely large amounts of data, you should have sufficient disk space and good disk performance—which can be achieved with disks in an appropriate RAID—and large amounts of memory coupled with a fast processor(s) (ideally server-grade Intel Xeon or AMD Opteron processors). Needless to say, you'll also need to make sure your database system is configured for optimal performance and that your tables are indexed properly.
The following article discusses the import and use of a 16 billion row table in Microsoft SQL.
https://www.itprotoday.com/big-data/adventures-big-data-how-import-16-billion-rows-single-table.
From the article:
Here are some distilled tips from my experience:
The more data you have in a table with a defined clustered index, the
slower it becomes to import unsorted records into it. At some point,
it becomes too slow to be practical. If you want to export your table
to the smallest possible file, make it native format. This works best
with tables containing mostly numeric columns because they’re more
compactly represented in binary fields than character data. If all
your data is alphanumeric, you won’t gain much by exporting it in
native format. Not allowing nulls in the numeric fields can further
compact the data. If you allow a field to be nullable, the field’s
binary representation will contain a 1-byte prefix indicating how many
bytes of data will follow. You can’t use BCP for more than
2,147,483,647 records because the BCP counter variable is a 4-byte
integer. I wasn’t able to find any reference to this on MSDN or the
Internet. If your table consists of more than 2,147,483,647 records,
you’ll have to export it in chunks or write your own export routine.
Defining a clustered index on a prepopulated table takes a lot of disk
space. In my test, my log exploded to 10 times the original table size
before completion. When importing a large number of records using the
BULK INSERT statement, include the BATCHSIZE parameter and specify how
many records to commit at a time. If you don’t include this parameter,
your entire file is imported as a single transaction, which requires a
lot of log space. The fastest way of getting data into a table with a
clustered index is to presort the data first. You can then import it
using the BULK INSERT statement with the ORDER parameter.
Even that is small compared to the multi-petabyte Nasdaq OMX database, which houses tens of petabytes (thousands of terabytes) and trillions of rows on SQL Server.
Have you checked out Cassandra? http://cassandra.apache.org/
As mentioned pretty much all DB's today can handle this situation - what you want to concentrate on is your disk i/o subsystem. You need to configure a RAID 0 or RAID 0+1 situation throwing as many spindles to the problem as you can. Also, divide up your Log/Temp/Data logical drives for performance.
For example, let say you have 12 drives - in your RAID controller I'd create 3 RAID 0 partitions of 4 drives each. In Windows (let's say) format each group as a logical drive (G,H,I) - now when configuring SQLServer (let's say) assign the tempdb to G, the Log files to H and the data files to I.
I don't have much input on which is the best system to use, but perhaps this tip could help you get some of the speed you're looking for.
If you're going to be doing exact matches of long varchar strings, especially ones that are longer than allowed for an index, you can do a sort of pre-calculated hash:
CREATE TABLE BigStrings (
BigStringID int identity(1,1) NOT NULL PRIMARY KEY CLUSTERED,
Value varchar(6000) NOT NULL,
Chk AS (CHECKSUM(Value))
);
CREATE NONCLUSTERED INDEX IX_BigStrings_Chk ON BigStrings(Chk);
--Load 500 million rows in BigStrings
DECLARE #S varchar(6000);
SET #S = '6000-character-long string here';
-- nasty, slow table scan:
SELECT * FROM BigStrings WHERE Value = #S
-- super fast nonclustered seek followed by very fast clustered index range seek:
SELECT * FROM BigStrings WHERE Value = #S AND Chk = CHECKSUM(#S)
This won't help you if you aren't doing exact matches, but in that case you might look into full-text indexing. This will really change the speed of lookups on a 500-million-row table.
I need to create indices (that does not take ages like on mysql) to achieve sufficient performance for my select queries
I'm not sure what you mean by "creating" indexes. That's normally a one-time thing. Now, it's typical when loading a huge amount of data as you might do, to drop the indexes, load your data, and then add the indexes back, so the data load is very fast. Then as you make changes to the database, the indexes would be updated, but they don't necessarily need to be created each time your query runs.
That said, databases do have query optimization engines where they will analyze your query and determine the best plan to retrieve the data, and see how to join the tables (not relevant in your scenario), and what indexes are available, obviously you'd want to avoid a full table scan, so performance tuning, and reviewing the query plan is important, as others have already pointed out.
The point above about a checksum looks interesting, and that could even be an index on attr1 in the same table.

Database scalability - performance vs. database size

I'm creating an app that will have to put at max 32 GB of data into my database. I am using B-tree indexing because the reads will have range queries (like from 0 < time < 1hr).
At the beginning (database size = 0GB), I will get 60 and 70 writes per millisecond. After say 5GB, the three databases I've tested (H2, berkeley DB, Sybase SQL Anywhere) have REALLY slowed down to like under 5 writes per millisecond.
Questions:
Is this typical?
Would I still see this scalability issue if I REMOVED indexing?
What are the causes of this problem?
Notes:
Each record consists of a few ints
Yes; indexing improves fetch times at the cost of insert times. Your numbers sound reasonable - without knowing more.
You can benchmark it. You'll need to have a reasonable amount of data stored. Consider whether or not to index based upon the queries - heavy fetch and light insert? index everywhere a where clause might use it. Light fetch, heavy inserts? Probably avoid indexes. Mixed workload; benchmark it!
When benchmarking, you want as real or realistic data as possible, both in volume and on data domain (distribution of data, not just all "henry smith" but all manner of names, for example).
It is typical for indexes to sacrifice insert speed for access speed. You can find that out from a database table (and I've seen these in the wild) that indexes every single column. There's nothing inherently wrong with that if the number of updates is small compared to the number of queries.
However, given that:
1/ You seem to be concerned that your writes slow down to 5/ms (that's still 5000/second),
2/ You're only writing a few integers per record; and
3/ You're queries are only based on time queries,
you may want to consider bypassing a regular database and rolling your own sort-of-database (my thoughts are that you're collecting real-time data such as device readings).
If you're only ever writing sequentially-timed data, you can just use a flat file and periodically write the 'index' information separately (say at the start of every minute).
This will greatly speed up your writes but still allow a relatively efficient read process - worst case is you'll have to find the start of the relevant period and do a scan from there.
This of course depends on my assumption of your storage being correct:
1/ You're writing records sequentially based on time.
2/ You only need to query on time ranges.
Yes, indexes will generally slow inserts down, while significantly speeding up selects (queries).
Do keep in mind that not all inserts into a B-tree are equal. It's a tree; if all you do is insert into it, it has to keep growing. The data structure allows for some padding, but if you keep inserting into it numbers that are growing sequentially, it has to keep adding new pages and/or shuffle things around to stay balanced. Make sure that your tests are inserting numbers that are well distributed (assuming that's how they will come in real life), and see if you can do anything to tell the B-tree how many items to expect from the beginning.
Totally agree with #Richard-t - it is quite common in offline/batch scenarios to remove indexes completely before bulk updates to a corpus, only to reapply them when update is complete.
The type of indices applied also influence insertion performance - for example with SQL Server clustered index update I/O is used for data distribution as well as index update, where as nonclustered indexes are updated in seperate (and therefore more expensive) I/O operations.
As with any engineering project - best advice is to measure with real datasets (skews page distribution, tearing etc.)
I think somewhere in the BDB docs they mention that page size greatly affects this behavior in btree's. Assuming you arent doing much in the way of concurrency and you have fixed record sizes, you should try increasing your page size

What slows down growing database performance?

I'm creating a database, and prototyping and benchmarking first. I am using H2, an open-source, commercially free, embeddable, relational, java database. I am not currently indexing on any column.
After the database grew to about 5GB, its batch write speed doubled (the rate of writing was slowed 2x the original rate). I was writing roughly 25 rows per milliseconds with a fresh, clean database and now at 7GB I'm writing roughly 7 rows/ms. My rows consist of a short, an int, a float, and a byte[5].
I do not know much about database internals or even how H2 was programmed. I would also like to note I'm not badmouthing H2, since this is a problem with other DBMSs I've tested.
What factors might slow down the database like this if there's no indexing overhead? Does it mainly have something to do with the file system structure? From my results, I assume the way windows XP and ntfs handle files makes it slower to append data to the end of a file as the file grows.
One factor that can complicate inserts as a database grows is the number of indexes on the table, and the depth of those indexes if they are B-trees or similar. There's simply more work to do, and it may be that you're causing index nodes to split, or you may simply have moved from, say, a 5-level B-tree to a 6-level one (or. more generally, from N to N+1 levels).
Another factor could be disk space usage -- if you are using cooked files (that's the normal kind for most people most of the time; some DBMS use 'raw files' on Unix, but it is unlikely that your embedded system would do so, and you'd know if it did because you'd have to tell it to do so), it could be that your bigger tables are now fragmented across the disk, leading to worse performance.
If the problem was on SELECT performance, there could be many other factors also affecting your system's performance.
This sounds about right. Database performance usually drops significantly as the data can no longer be held in memory and operations become disk bound. If you are using normal insert operations, and want a significant performance improvement, I suggest using some sort of a bulk load API if H2 supports it (like Oracle sqlldr, Sybase BCP, Mysql 'load data infile'). This type of API writes data directly to the data-file bypassing many of the database subsystems.
This is most likely caused by variable width fields. I don't know if H2 allows this, but in MySQL, you have to create your table with all fixed width fields, then explicitly declare it as a fixed width field table. This allows MySQL to calculate exactly where it needs to go in the database file to do the insert. If you aren't using a fixed width table, then it has to read through the table to find the end of the last row.
Appending data (if done right) is an O(n) operation, where n is the length of the data to be written. It doesn't depend on the file length, there are seek operations to skip over that easily.
For most databases, appending to a database file is definitely slower than pre-growing the file and then adding rows. See if H2 supports pre-growing the file.
Another cause is whether the entire database is held in memory or if the OS has to do a lot of disk swapping to find the location to store the record.
I would blame it on I/O, specially if you're running your database on a normal PC with a normal hard disk (by that I mean not in server with super fast hard drives, etc).
Many database engines create an implicit integer primary key for each update, so even if you haven't declared any indexes, your table is still indexed. This may be a factor.
Using H2 for 7G datafile is a wrong choice from technological point of view. As you said, embeddable. What kind of "embedded" application do you have, if you need to store so much data.
Are you performing incremental commits? Since H2 is an ACID compliant database, if you are not performing incremental commits, then there is some type of redo log so that in the case of some accidental failure (say, power outage) or rollback, the deletes can be rolled back.
In that case, your redo log may be growing large and overflowing memory buffers and needing to write out your redo log to disk, as well as your actual data, adding to your I/O overhead.

Resources