Scaling a database for millions of records

We are developing an application that processes some code and outputs a large number of rows each time (millions!). We want to save these rows in a database because the processing itself may take a couple of hours to complete.
1. What is the best way to save these records?
2. Is a NoSQL solution usable here?
Assume that we are saving five million records per day, and may only be retrieving from the database once in a while.

It depends very much on how you intend to use the data after it is generated. If you will only be looking it up by primary key then NoSQL will probably be fine, but if you ever want to search or sort the data (or join rows together) then an SQL database will probably work better.
Basically, NoSQL is really good at stuffing opaque data into a store and retrieving any individual item very quickly. Relational databases are really good at indexing data that may be joined together or searched.
Any modern SQL database will easily handle 5 million rows per day - disk space is more likely to be your bottleneck, depending on how big your rows are. I haven't done a lot with NoSQL, but I'd be surprised if 5 million items per day would cause a problem.
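As a rough illustration (hypothetical names, PostgreSQL syntax), a plain table with a primary key plus an index on whatever you search by is usually all that 5 million rows a day needs, and it keeps the search/join options open:

    -- names made up; one row per output record
    CREATE TABLE results (
        result_id  bigserial   PRIMARY KEY,
        run_id     integer     NOT NULL,            -- which processing run produced the row
        created_at timestamptz NOT NULL DEFAULT now(),
        payload    text        NOT NULL             -- or proper columns, if the rows have structure
    );
    CREATE INDEX ON results (run_id, created_at);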

It depends on exactly what kind of data you want to store - could you elaborate on that? If the data is neatly structured into tables then you don't necessarily need a NoSQL approach. If, however, your data has a graph or network-like structure to it, then you should consider a NoSQL solution. If the latter is true for you, then maybe the following will be helpful to give you an overview of some of the NoSQL databases: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis

Related

Database and design assistance for a large number of simple records

I'm hoping to get some help choosing a database and layout well suited to a web application I have to write (outlined below). I'm a bit stumped given the large number of records and the fact that they need to be queryable in any manner.
The web app will basically allow querying of a large number of records using any combination of the criteria that make up a record; date is the only mandatory item. A record consists of only eight items (below), but there will be about three million new records a day, with very few duplicates. Data will be constantly inserted into the database in real time for the current day.
I know the biggest interest will be in the last 6 months to 1 year's worth of data, but the rest will still need to be available for the same type of queries.
I'm not sure what database is best suited for this, nor how to structure it. The database will be on a reasonably powerful server. I basically want to start with a good DB design and see how the queries perform; I can then judge whether I'd rather do optimizations or throw more powerful hardware at it. I just don't want to have to redo the base DB design, and it's fine if we end up doing a lot of optimization early on: we have time but not money.
We need to use something open source, not something like Oracle. Right now I'm leaning towards PostgreSQL.
A record consists of:
1. Date
2. Unsigned integer
3. Unsigned integer
4. Unsigned integer
5. Unsigned integer
6. Unsigned integer
7. Text, 16 chars
8. Text, 255 chars
I'm planning on creating yearly schemas, monthly tables, and indexing the record tables on date for sure.
I'll probably be able to add another index or two after I analyze usage patterns to see what the most popular queries are. I can do lots of tricks on the app side as far as caching popular queries and whatnot; it's really the DB side I need assistance with. Field 8 will have some duplicate values, so I'm planning on having that column be an ID into a lookup table to join on. Beyond that, I guess the remaining fields will all be in one monthly table.
I could break it into weekly tables as well, I suppose, and use a view for queries so the app doesn't have to deal with assembling a complex query.
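Roughly, the layout I'm picturing looks something like this (a sketch only; names are made up, and fields 2-6 would become plain integers since PostgreSQL has no unsigned type):

    -- lookup table for the 255-char text values (field 8)
    CREATE TABLE long_text_lookup (
        id  serial PRIMARY KEY,
        txt varchar(255) UNIQUE
    );
    -- one monthly table, indexed on date
    CREATE TABLE records_2011_01 (
        record_date  date    NOT NULL,
        val1         integer NOT NULL,
        val2         integer NOT NULL,
        val3         integer NOT NULL,
        val4         integer NOT NULL,
        val5         integer NOT NULL,
        short_text   varchar(16),
        long_text_id integer REFERENCES long_text_lookup (id)
    );
    CREATE INDEX ON records_2011_01 (record_date);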
Anyway, thanks very much for any feedback or assistance!
Some brief advice ...
3 million records a day is a lot! (At least I think so; others might not even blink at that.) I would try to write a tool to insert dummy records and see how something like Postgres performs with one month's worth of data.
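For example, something quick and dirty like this (hypothetical names; PostgreSQL's generate_series makes it easy to fake a month at roughly 3 million rows a day):

    CREATE TABLE test_records (
        record_date date NOT NULL,
        val1 int, val2 int, val3 int, val4 int, val5 int,
        short_text varchar(16),
        long_text  varchar(255)
    );
    -- one month of dummy data (this will take a while and a few GB of disk)
    INSERT INTO test_records
    SELECT date '2011-01-01' + (g / 3000000),
           (random()*1000000)::int, (random()*1000000)::int, (random()*1000000)::int,
           (random()*1000000)::int, (random()*1000000)::int,
           substr(md5(random()::text), 1, 16),
           md5(random()::text)
    FROM generate_series(0, 89999999) AS g;
    CREATE INDEX ON test_records (record_date);
    -- then time some representative queries
    EXPLAIN ANALYZE
    SELECT count(*) FROM test_records
    WHERE record_date BETWEEN '2011-01-10' AND '2011-01-17' AND val1 < 1000;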
It might be best to look into NoSQL solutions, which give you the open source plus the scalability. Look at Couchbase and Mongo to start. If you are keeping a month's worth of data online for real-time querying, I'm not sure how Postgres will handle 90 million records. Maybe great, but maybe not.
Consider having "offline" databases in whatever system you decide on. You keep the real-time stuff on the best machines, ready to go, but move older data out to another server that is cheaper (read: slower). This way you can always answer queries, but some are faster than others.
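A crude sketch of that idea in SQL (hypothetical names; the archive table could just as well live in a separate database on the cheaper server, loaded by a nightly job):

    -- nightly job: move rows older than six months from the "hot" table to the archive
    BEGIN;
    INSERT INTO records_archive
        SELECT * FROM records
        WHERE record_date < current_date - interval '6 months';
    DELETE FROM records
        WHERE record_date < current_date - interval '6 months';
    COMMIT;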
In my experience, using primarily Oracle with a similar record insert frequency (several ~billion row tables), you can achieve good web app query performance by carefully partitioning your data (probably by date, in your case) and indexing your tables. How exactly you approach your database architecture will depend on a lot of factors, but there are plenty of good resources on the web for getting help with this stuff.
It sounds like your database is relatively flat, so perhaps another database solution would be better, but Oracle has always worked well for me.

Database storage requirements and management for lots of numerical data

I'm trying to figure out how to manage and serve a lot of numerical data. Not sure an SQL database is the right approach. Scenario as follows:
10000 sets of time series data collected per hour
5 floating point values per set
About 5000 hours worth of data collected
So that gives me about 250 million values in total. I need to query this set of data by set ID and by time. If possible, also filter by one or two of the values themselves. I'm also continuously adding to this data.
This seems like a lot of data. Assuming 4 bytes per value, the raw values come to about 1 GB. I don't know what a general "overhead multiplier" for an SQL database is; let's say it's 2, then that's 2 GB of disk space.
What are good approaches to handling this data? Some options I can see:
Single PostgreSQL table with indices on ID, time
Single SQLite table -- this seemed to be unbearably slow
One SQLite file per set -- lots of .sqlite files in this case
Something like MongoDB? Don't even know how this would work ...
Appreciate commentary from those who have done this before.
Mongo is a document store; it might work for your data, but I don't have much experience with it.
I can tell you that PostgreSQL will be a good choice. It will be able to handle that kind of data. SQLite is definitely not optimized for those use-cases.
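As a rough sketch of what that could look like in PostgreSQL (names made up), a single table keyed by set and time covers the query patterns you describe, and the composite primary key doubles as the index for the ID-plus-time lookups:

    CREATE TABLE samples (
        set_id integer     NOT NULL,
        ts     timestamptz NOT NULL,
        v1 real, v2 real, v3 real, v4 real, v5 real,   -- the five floats per set
        PRIMARY KEY (set_id, ts)
    );
    -- query by set and time range, optionally filtering on a value
    SELECT * FROM samples
    WHERE set_id = 42
      AND ts BETWEEN '2011-01-01' AND '2011-02-01'
      AND v1 > 0.5;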

What's a good way to store large time/value datasets?

I'm working on an application that stores a lot of quite large time/value datasets (chart data; basically values taken from a sensor every day, hour or 15 minutes for a year or more). Currently we're storing them in two MySQL tables: a datasets table that stores the info (ID, name, etc.) for a dataset, and a table containing (dataset ID, timestamp, value) triplets. This second table is already well over a million rows, and the amount of data to be stored is expected to become many times larger.
The common operations such as retrieving all points for a particular dataset in a range are running quickly enough, but some other more complex operations can be painful.
Is this the best way to organize the data? Is a relational database even particularly suited to this sort of thing? Or do I just need to learn to define better indexes and optimize the queries?
A relational database is definitely what you need for this kind of large structured dataset. If individual queries are causing problems, it's worth profiling each one to find out if different indexes are required or whatever.
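For example (hypothetical table and column names, MySQL syntax), a composite index on the triplet table is what the range queries want, and running EXPLAIN on the painful queries shows whether they are actually using it:

    ALTER TABLE datapoints ADD INDEX idx_dataset_ts (dataset_id, ts);
    -- check whether a slow query uses the index
    EXPLAIN
    SELECT ts, val FROM datapoints
    WHERE dataset_id = 7 AND ts BETWEEN '2011-01-01' AND '2011-12-31';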

Scaling an MS SQL Server 2008 database

I'm trying to work out the best way to scale my site, and I have a question about how MS SQL Server will scale.
The table currently looks like this:
cache_id - int - identifier
cache_name - nvarchar(256) - used for lookup along with cache_event_id
cache_event_id - int - basically a way of grouping
cache_creation_date - datetime
cache_data - varbinary(MAX) - data size will be from 2 KB to 5 KB
The data stored is a byte array that is basically a compressed, cached instance of a page on my site.
The different ways I can see of storing it are:
1) One large table; it would contain tens of millions of records and easily become several gigabytes in size.
2) Multiple tables containing the data above, so each table would hold 200k to a million records.
The data in this table will be used to render web pages, so anything over 200 ms to get a record is bad in my eyes (I know some people think a 1-2 second page load is OK, but I think that's slow and want to do my best to keep it lower).
So it boils down to: what is it that slows down the SQL server?
Is it the size of the table (disk space)?
Is it the number of rows?
At what point does it stop being cost effective to use multiple database servers?
If it's close to impossible to predict these things, I'll accept that as a reply too. I'm not a DBA, and I'm basically trying to design my DB so I don't have to redesign it later when it contains a huge amount of data.
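For concreteness, the single-table option might look something like this (a sketch only, SQL Server syntax, approximate names), with an index matching the name-plus-event lookup pattern; whether it stays under 200 ms is something to measure rather than assume:

    CREATE TABLE dbo.page_cache (
        cache_id            int IDENTITY(1,1) PRIMARY KEY,
        cache_name          nvarchar(256)  NOT NULL,
        cache_event_id      int            NOT NULL,
        cache_creation_date datetime       NOT NULL,
        cache_data          varbinary(MAX) NOT NULL
    );
    -- supports "look up by name within an event" without scanning the large rows
    CREATE NONCLUSTERED INDEX IX_page_cache_lookup
        ON dbo.page_cache (cache_event_id, cache_name);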
This is all a 'rule of thumb' view:
Load (and therefore, to a considerable extent, performance) of a DB is largely a factor of two issues: data volume and transaction load, with the second, IMHO, generally being the more relevant.
As regards data volume, one can hold many gigabytes of data and get acceptable access times through normalisation, indexing, partitioning, fast I/O systems, appropriate buffer cache sizes, etc. Some of these, e.g. normalisation, are considered at DB design time; others, e.g. adding or dropping indexes and sizing the buffer cache, during system tuning.
The transactional load is largely a factor of code design and the total number of users. Code design includes factors like getting transaction size right (small and fast is the general goal, but like most things it is possible to take this too far, with transactions so small that they fail to retain integrity or add load in themselves).
When scaling, I advise scaling up first (a bigger, faster server) and then out (multiple servers). The administrative overhead of a multi-server setup is significant, and I suggest it is only worth considering for a site with the OS, network and DBA skills and processes to match.
Normalize and index.
How, we can't tell you, because you haven't told us what your table is trying to model or how you're trying to use it.
1 million rows is not at all uncommon. Again, we can't tell you much in the absence of context that only you can, but don't, provide.
The only possible answer is to set it up, and be prepared for a long iterative process of learning things only you will know because only you will live in your domain. Any technical advice you see here will be naive and insufficiently informed until you have some practical experience to share.
Test every single one of your guesses, compare the results, and see what works. And keep looking for more testable ideas. (And don't be afraid to back out changes that end up not helping. It's a basic requirement to have any hope of sustained simplicity.)
And embrace the fact that your database design will evolve. It's not as fearsome as your comment suggests you think it is. It's much easier to change a database than the software that goes around it.

How to gain performance when maintaining historical and current data?

I want to maintain the last ten years of stock market data in a single table. Certain analyses need only the last month of data, and when I do this short-term analysis it takes a long time to complete.
To overcome this, I created another table to hold the current year's data alone. When I run the analysis against this table, it is 20 times faster than against the previous one.
Now my questions are:
Is a separate table the right way to handle this kind of problem? (Or should we use a separate database instead of a table?)
If I have a separate table, is there any way to update the secondary table automatically?
Or can we use something like a materialized view to gain performance?
Note: I'm using a PostgreSQL database.
You want table partitioning. This will automatically split the data between multiple tables, and will in general work much better than doing it by hand.
I'm working on nearly the exact same issue.
Table partitioning is definitely the way to go here. I would segment by more than year, though; it gives you a greater degree of control. Just set up your partitions and then constrain them by month (or some other date range). In your postgresql.conf you'll need to set constraint_exclusion = on to really get the benefit. The additional benefit here is that you only have to index the exact tables you actually pull information from. If you're batch-importing large amounts of data into this table, you may get slightly better results from a rule than from a trigger, and for partitioning I find rules easier to maintain; but for smaller transactions, triggers are much faster. The PostgreSQL manual has a great section on partitioning via inheritance.
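A minimal sketch of that inheritance-based setup (hypothetical names; see the partitioning chapter of the PostgreSQL manual for the full pattern, including the trigger or rule that routes inserts to the right child):

    -- parent table holds no data; each child carries a CHECK constraint on the date
    CREATE TABLE ticks (
        symbol     text    NOT NULL,
        trade_date date    NOT NULL,
        price      numeric NOT NULL
    );
    CREATE TABLE ticks_2011_01 (
        CHECK (trade_date >= DATE '2011-01-01' AND trade_date < DATE '2011-02-01')
    ) INHERITS (ticks);
    CREATE INDEX ON ticks_2011_01 (trade_date);

    -- with constraint_exclusion = on, this touches only the January partition
    SELECT avg(price) FROM ticks
    WHERE trade_date BETWEEN '2011-01-10' AND '2011-01-20';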
I'm not sure about PostgreSQL specifically, but I can confirm that you are on the right track. When dealing with large data volumes, partitioning data across multiple tables and then using some kind of query generator to build your queries is absolutely the right way to go. This approach is well established in data warehousing, and specifically for your case, stock market data.
However, I'm curious: why do you need to update your historical data? If you're dealing with stock splits, it's common to implement that using a separate multiplier table that is used in conjunction with the raw historical data to give an accurate price per share.
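As a rough illustration of the multiplier-table idea (hypothetical names, reusing the ticks sketch from the other answer):

    -- cumulative split adjustment per symbol, effective from a given date
    CREATE TABLE split_multipliers (
        symbol     text    NOT NULL,
        valid_from date    NOT NULL,
        multiplier numeric NOT NULL,
        PRIMARY KEY (symbol, valid_from)
    );
    -- adjusted price = raw price * the multiplier in force on the trade date
    SELECT t.symbol, t.trade_date,
           t.price * coalesce((SELECT s.multiplier
                               FROM split_multipliers s
                               WHERE s.symbol = t.symbol
                                 AND s.valid_from <= t.trade_date
                               ORDER BY s.valid_from DESC
                               LIMIT 1), 1) AS adjusted_price
    FROM ticks t;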
It is perfectly sensible to use a separate table for historical records. A separate database is much more problematic, as it's not simple to write cross-database queries.
Automatic updates - that's a job for cron.
You can use partial indexes for such things - they do a wonderful job.
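For example (hypothetical names), a partial index covers just the recent slice that the short-term analysis touches:

    -- index only the rows the one-month analysis actually reads
    CREATE INDEX ticks_recent_idx ON ticks (symbol, trade_date)
        WHERE trade_date >= DATE '2011-01-01';

The predicate has to be an immutable expression (it can't use now()), so in practice you rebuild such an index periodically as the window moves forward.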
Frankly, you should check your execution plans and try fixing your queries or indexing before taking more radical steps.
Indexing comes at very little cost (unless you do a lot of insertions) and your existing code will be faster (if you index properly) without modifying it.
Other measures, such as partitioning, come after that.
