database table design for storing different sized datasets - arrays

I am designing a Microsoft Access database to store results from lab equipment. They are in the form of hundreds of lists of frequency vs. response curves which I have previously stored rather easily, but inefficiently in Excel.
The difficulty comes from the fact that the frequency can vary from 1 - 50E9 Hz, the step size between data points can vary from 1 - 1E9, Hz, and the number of points can vary from ~ 100 - 40,000. This has brought up a challenge when it comes to table design because everything I try seems to be very inefficient.
I have considered using links to external text files to store the data points which solves the table design, but seems to violate good database design. I've considered using tables of arrays (i.e. Start Freq, Stop Freq, Freq Step Size, and Array of Responses), but the array sizes could vary greatly which seems just as inefficient.
Is there a recommended practice for storing this type of data? It seems like a common task when storing instrument data, but I can't seem to find anything in web searches. Any assistance will be greatly appreciated.

Looks like a classic 1:N relationship to me. "1" is the measurement session and "N" is all the measurements (i.e. data points) taken in that session. This is modeled by two tables and one foreign key between them, similar to this:
Tweak the fields to suit your needs, but this general design should be more than able to handle large amounts of data and varying numbers of measurements per session.
That being said, MS Access has historically had significant limitations on the size of the data that can be stored in a single database. If you hit these limits, consider using a "real" DBMS.

Related

Recommended approach to store multi-dimensional data (e.g. spectra) in InfluxDB

I am trying to incorporate the time series database with the laboratory real time monitoring equipment. For scalar data such as temperature the line protocol works well:
temperature,site=reactor temperature=20.0 1556892576842902000
For 1D (e.g., IR Spectrum) or higher dimensional data, I came up two approaches to write data.
Write each element of the spectrum as field set as shown below. This way I can query individual frequency and perform analysis or visualization using the existing software. However, each record will easily contain thousands of field sets due to the high resolution of the spectrometer. My concern is whether the line protocol is too chunky and the storage can get inefficient or not.
ir_spectrum,site=reactor w1=10.0,w2=11.2,w3=11.3,......,w4000=2665.2 1556892576842902000
Store the vector as a serialized string (e.g., JSON). This way I may need some plugins to adapt the data to the visualization tools such as Grafana. But the protocol will look cleaner. I am not sure whether the storage layout is better than the first approach or not.
ir_spectrum,site=reactor data="[10.0, 11.2, 11.3, ......, 2665.2]" 1556892576842902000
I wonder whether there is any recommended way to store the high dimensional data? Thanks!
The first approach is better from the performance and disk space usage PoV. InfluxDB stores each field in a separate column. If a column contains similar numeric values, then it may be compressed better compared to the column with JSON strings. This also improves query speed when selecting only a subset of fields or filtering on a subset of fields.
P.S. InfluxDB may need high amounts of RAM for big number of fields and big number of tag combinations (aka high cardinality). In this case there are alternative solutions, which support InfluxDB line protocol and require lower amounts of RAM for high cardinality time series. See, for example, VictoriaMetrics.

Is there a disadvantage to having large columns in your database?

My database stores user stats on a variety of questions. There is no table of question types, so instead of using a join table on the question types, I've just stored the user stats for each type of question the user has done in a serialized hash-map in the user table. Obviously this has led to some decently sized user rows - the serialized stats for my own user is around 950 characters, and I can imagine them easily growing to 5 kb on power users.
I have never read an example of a column this large in any book. Will performance be greatly hindered by having such large/variable columns in my table? Should I add in a table for question types, and make the user stats a separate table as well?
I am currently using PostgreSQL, if that's relevant.
I've seen this serialized approach on systems like ProcessMaker, which is a web workflow and BPM app and stores its data in a serialized fashion. It performs quite well, but building reports based on this data is really tricky.
You can (and should) normalize your database, which is OK if your information model doesn´t change so often.
Otherwise, you may want to try non-relational databases like RavenDB, MongoDB, etc.
The big disadvantage has to do with what happens with a select *. If you have a specific field list, you are not likely to have a big problem but with select * with a lot of TOASTed columns, you have a lot of extra random disk I/O unless everything fits in memory. Selecting fewer columns makes things better.
In an object-relational database like PostgreSQL, database normalization poses different tradeoffs than in a purely relational model. In general it is still a good thing (as I say push the relational model as far as it can comfortably go before doing OR stuff in your db), but it isn't the absolute necessity that you might think of it as being in a purely relational db. Additionally you can add functions to process that data with regexps, extract elements from JSON, etc, and pull those back into your relational queries. So for data that cannot comfortably be normalized, big amorphous "docdb" fields are not that big of a problem.
Depends on the predominant queries you need:
If you need queries that select all (or most) of the columns, then this is the optimal design.
If, however, you select mostly on a subset of columns, then it might be worth trying to "vertically partition"1 the table, so you avoid I/O for the "unneeded" columns and increase the cache efficiency.2
Of course, all this is under assumption that the serialized data behaves as "black box" from the database perspective. If you need to search or constrain that data in some fashion, then just storing a dummy byte array would violate the principle of atomicity and therefore the 1NF, so you'd need to consider normalizing your data...
1 I.e. move the rarely used columns to a second table, which is in 1:1 relationship to the original table. If you are using BLOBs, similar effect could be achieved by declaring what portion of the BLOB should be kept "in-line" - the remainder of any BLOB that exceeds that limit will be stored to a set of pages separate from the table's "core" pages.
2 DBMSes typically implement caching at the page level, so the wider the rows, the less of them will fit into a single page on disk, and therefore into a single page in cache.
You can't search in serialzed arrays.

How to store 7.3 billion rows of market data (optimized to be read)?

I have a dataset of 1 minute data of 1000 stocks since 1998, that total around (2012-1998)*(365*24*60)*1000 = 7.3 Billion rows.
Most (99.9%) of the time I will perform only read requests.
What is the best way to store this data in a db?
1 big table with 7.3B rows?
1000 tables (one for each stock symbol) with 7.3M rows each?
any recommendation of database engine? (I'm planning to use Amazon RDS' MySQL)
I'm not used to deal with datasets this big, so this is an excellent opportunity for me to learn. I will appreciate a lot your help and advice.
Edit:
This is a sample row:
'XX', 20041208, 938, 43.7444, 43.7541, 43.735, 43.7444, 35116.7, 1, 0, 0
Column 1 is the stock symbol, column 2 is the date, column 3 is the minute, the rest are open-high-low-close prices, volume, and 3 integer columns.
Most of the queries will be like "Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"
About the hardware: I plan to use Amazon RDS so I'm flexible on that
So databases are for situations where you have a large complicated schema that is constantly changing. You only have one "table" with a hand-full of simple numeric fields. I would do it this way:
Prepare a C/C++ struct to hold the record format:
struct StockPrice
{
char ticker_code[2];
double stock_price;
timespec when;
etc
};
Then calculate sizeof(StockPrice[N]) where N is the number of records. (On a 64-bit system) It should only be a few hundred gig, and fit on a $50 HDD.
Then truncate a file to that size and mmap (on linux, or use CreateFileMapping on windows) it into memory:
//pseduo-code
file = open("my.data", WRITE_ONLY);
truncate(file, sizeof(StockPrice[N]));
void* p = mmap(file, WRITE_ONLY);
Cast the mmaped pointer to StockPrice*, and make a pass of your data filling out the array. Close the mmap, and now you will have your data in one big binary array in a file that can be mmaped again later.
StockPrice* stocks = (StockPrice*) p;
for (size_t i = 0; i < N; i++)
{
stocks[i] = ParseNextStock(stock_indata_file);
}
close(file);
You can now mmap it again read-only from any program and your data will be readily available:
file = open("my.data", READ_ONLY);
StockPrice* stocks = (StockPrice*) mmap(file, READ_ONLY);
// do stuff with stocks;
So now you can treat it just like an in-memory array of structs. You can create various kinds of index data structures depending on what your "queries" are. The kernel will deal with swapping the data to/from disk transparently so it will be insanely fast.
If you expect to have a certain access pattern (for example contiguous date) it is best to sort the array in that order so it will hit the disk sequentially.
I have a dataset of 1 minute data of 1000 stocks [...] most (99.9%) of the time I will perform only read requests.
Storing once and reading many times time-based numerical data is a use case termed "time series". Other common time series are sensor data in the Internet of Things, server monitoring statistics, application events etc.
This question was asked in 2012, and since then, several database engines have been developing features specifically for managing time series. I've had great results with the InfluxDB, which is open sourced, written in Go, and MIT-licensed.
InfluxDB has been specifically optimized to store and query time series data. Much more so than Cassandra, which is often touted as great for storing time series:
Optimizing for time series involved certain tradeoffs. For example:
Updates to existing data are a rare occurrence and contentious updates never happen. Time series data is predominantly new data that is never updated.
Pro: Restricting access to updates allows for increased query and write performance
Con: Update functionality is significantly restricted
In open sourced benchmarks,
InfluxDB outperformed MongoDB in all three tests with 27x greater write throughput, while using 84x less disk space, and delivering relatively equal performance when it came to query speed.
Queries are also very simple. If your rows look like <symbol, timestamp, open, high, low, close, volume>, with InfluxDB you can store just that, then query easily. Say, for the last 10 minutes of data:
SELECT open, close FROM market_data WHERE symbol = 'AAPL' AND time > '2012-04-12 12:15' AND time < '2012-04-13 12:52'
There are no IDs, no keys, and no joins to make. You can do a lot of interesting aggregations. You don't have to vertically partition the table as with PostgreSQL, or contort your schema into arrays of seconds as with MongoDB. Also, InfluxDB compresses really well, while PostgreSQL won't be able to perform any compression on the type of data you have.
Tell us about the queries, and your hardware environment.
I would be very very tempted to go NoSQL, using Hadoop or something similar, as long as you can take advantage of parallelism.
Update
Okay, why?
First of all, notice that I asked about the queries. You can't -- and we certainly can't -- answer these questions without knowing what the workload is like. (I'll co-incidentally have an article about this appearing soon, but I can't link it today.) But the scale of the problem makes me think about moving away from a Big Old Database because
My experience with similar systems suggests the access will either be big sequential (computing some kind of time series analysis) or very very flexible data mining (OLAP). Sequential data can be handled better and faster sequentially; OLAP means computing lots and lots of indices, which either will take lots of time or lots of space.
If You're doing what are effectively big runs against many data in an OLAP world, however, a column-oriented approach might be best.
If you want to do random queries, especially making cross-comparisons, a Hadoop system might be effective. Why? Because
you can better exploit parallelism on relatively small commodity hardware.
you can also better implement high reliability and redundancy
many of those problems lend themselves naturally to the MapReduce paradigm.
But the fact is, until we know about your workload, it's impossible to say anything definitive.
Okay, so this is somewhat away from the other answers, but... it feels to me like if you have the data in a file system (one stock per file, perhaps) with a fixed record size, you can get at the data really easily: given a query for a particular stock and time range, you can seek to the right place, fetch all the data you need (you'll know exactly how many bytes), transform the data into the format you need (which could be very quick depending on your storage format) and you're away.
I don't know anything about Amazon storage, but if you don't have anything like direct file access, you could basically have blobs - you'd need to balance large blobs (fewer records, but probably reading more data than you need each time) with small blobs (more records giving more overhead and probably more requests to get at them, but less useless data returned each time).
Next you add caching - I'd suggest giving different servers different stocks to handle for example - and you can pretty much just serve from memory. If you can afford enough memory on enough servers, bypass the "load on demand" part and just load all the files on start-up. That would simplify things, at the cost of slower start-up (which obviously impacts failover, unless you can afford to always have two servers for any particular stock, which would be helpful).
Note that you don't need to store the stock symbol, date or minute for each record - because they're implicit in the file you're loading and the position within the file. You should also consider what accuracy you need for each value, and how to store that efficiently - you've given 6SF in your question, which you could store in 20 bits. Potentially store three 20-bit integers in 64 bits of storage: read it as a long (or whatever your 64-bit integer value will be) and use masking/shifting to get it back to three integers. You'll need to know what scale to use, of course - which you could probably encode in the spare 4 bits, if you can't make it constant.
You haven't said what the other three integer columns are like, but if you could get away with 64 bits for those three as well, you could store a whole record in 16 bytes. That's only ~110GB for the whole database, which isn't really very much...
EDIT: The other thing to consider is that presumably the stock doesn't change over the weekend - or indeed overnight. If the stock market is only open 8 hours per day, 5 days per week, then you only need 40 values per week instead of 168. At that point you could end up with only about 28GB of data in your files... which sounds a lot smaller than you were probably originally thinking. Having that much data in memory is very reasonable.
EDIT: I think I've missed out the explanation of why this approach is a good fit here: you've got a very predictable aspect for a large part of your data - the stock ticker, date and time. By expressing the ticker once (as the filename) and leaving the date/time entirely implicit in the position of the data, you're removing a whole bunch of work. It's a bit like the difference between a String[] and a Map<Integer, String> - knowing that your array index always starts at 0 and goes up in increments of 1 up to the length of the array allows for quick access and more efficient storage.
It is my understanding that HDF5 was designed specifically with the time-series storage of stock data as one potential application. Fellow stackers have demonstrated that HDF5 is good for large amounts of data: chromosomes, physics.
I think any major RDBMS would handle this. At the atomic level, a one table with correct partitioning seems reasonable (partition based on your data usage if fixed - this is ikely to be either symbol or date).
You can also look into building aggregated tables for faster access above the atomic level. For example if your data is at day, but you often get data back at the wekk or even month level, then this can be pre-calculated in an aggregate table. In some databases this can be done though a cached view (various names for different DB solutions - but basically its a view on the atomic data, but once run the view is cached/hardened intoa fixed temp table - that is queried for subsequant matching queries. This can be dropped at interval to free up memory/disk space).
I guess we could help you more with some idea as to the data usage.
Here is an attempt to create a Market Data Server on top of the Microsoft SQL Server 2012 database which should be good for OLAP analysis, a free open source project:
http://github.com/kriasoft/market-data
First, there isn't 365 trading days in the year, with holidays 52 weekends (104) = say 250 x the actual hours of day market is opened like someone said, and to use the symbol as the primary key is not a good idea since symbols change, use a k_equity_id (numeric) with a symbol (char) since symbols can be like this A , or GAC-DB-B.TO , then in your data tables of price info, you have, so your estimate of 7.3 billion is vastly over calculated since it's only about 1.7 million rows per symbol for 14 years.
k_equity_id
k_date
k_minute
and for the EOD table (that will be viewed 1000x over the other data)
k_equity_id
k_date
Second, don't store your OHLC by minute data in the same DB table as and EOD table (end of day) , since anyone wanting to look at a pnf, or line chart, over a year period , has zero interest in the by the minute information.
Let me recommend that you take a look at apache solr, which I think would be ideal for your particular problem. Basically, you would first index your data (each row being a "document"). Solr is optimized for searching and natively supports range queries on dates. Your nominal query,
"Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"
would translate to something like:
?q=stock:AAPL AND date:[2012-04-12T12:15:00Z TO 2012-04-13T12:52:00Z]
Assuming "stock" is the stock name and "date" is a "DateField" created from the "date" and "minute" columns of your input data on indexing. Solr is incredibly flexible and I really can't say enough good things about it. So, for example, if you needed to maintain the fields in the original data, you can probably find a way to dynamically create the "DateField" as part of the query (or filter).
You should compare the slow solutions with a simple optimized in memory model. Uncompressed it fits in a 256 GB ram server. A snapshot fits in 32 K and you just index it positionally on datetime and stock. Then you can make specialized snapshots, as open of one often equals closing of the previous.
[edit] Why do you think it makes sense to use a database at all (rdbms or nosql)? This data doesn't change, and it fits in memory. That is not a use case where a dbms can add value.
If you have the hardware, I recommend MySQL Cluster. You get the MySQL/RDBMS interface you are so familiar with, and you get fast and parallel writes. Reads will be slower than regular MySQL due to network latency, but you have the advantage of being able to parallelize queries and reads due to the way MySQL Cluster and the NDB storage engine works.
Make sure that you have enough MySQL Cluster machines and enough memory/RAM for each of those though - MySQL Cluster is a heavily memory-oriented database architecture.
Or Redis, if you don't mind a key-value / NoSQL interface to your reads/writes. Make sure that Redis has enough memory - its super-fast for reads and writes, you can do basic queries with it (non-RDBMS though) but is also an in-memory database.
Like others have said, knowing more about the queries you will be running will help.
You will want the data stored in a columnar table / database. Database systems like Vertica and Greenplum are columnar databases, and I believe SQL Server now allows for columnar tables. These are extremely efficient for SELECTing from very large datasets. They are also efficient at importing large datasets.
A free columnar database is MonetDB.
If your use case is to simple read rows without aggregation, you can use Aerospike cluster. It's in memory database with support of file system for persistence. It's also SSD optimized.
If your use case needs aggregated data, go for Mongo DB cluster with date range sharding. You can club year vise data in shards.

Storing time-series data, relational or non?

I am creating a system which polls devices for data on varying metrics such as CPU utilisation, disk utilisation, temperature etc. at (probably) 5 minute intervals using SNMP. The ultimate goal is to provide visualisations to a user of the system in the form of time-series graphs.
I have looked at using RRDTool in the past, but rejected it as storing the captured data indefinitely is important to my project, and I want higher level and more flexible access to the captured data. So my question is really:
What is better, a relational database (such as MySQL or PostgreSQL) or a non-relational or NoSQL database (such as MongoDB or Redis) with regard to performance when querying data for graphing.
Relational
Given a relational database, I would use a data_instances table, in which would be stored every instance of data captured for every metric being measured for all devices, with the following fields:
Fields: id fk_to_device fk_to_metric metric_value timestamp
When I want to draw a graph for a particular metric on a particular device, I must query this singular table filtering out the other devices, and the other metrics being analysed for this device:
SELECT metric_value, timestamp FROM data_instances
WHERE fk_to_device=1 AND fk_to_metric=2
The number of rows in this table would be:
d * m_d * f * t
where d is the number of devices, m_d is the accumulative number of metrics being recorded for all devices, f is the frequency at which data is polled for and t is the total amount of time the system has been collecting data.
For a user recording 10 metrics for 3 devices every 5 minutes for a year, we would have just under 5 million records.
Indexes
Without indexes on fk_to_device and fk_to_metric scanning this continuously expanding table would take too much time. So indexing the aforementioned fields and also timestamp (for creating graphs with localised periods) is a requirement.
Non-Relational (NoSQL)
MongoDB has the concept of a collection, unlike tables these can be created programmatically without setup. With these I could partition the storage of data for each device, or even each metric recorded for each device.
I have no experience with NoSQL and do not know if they provide any query performance enhancing features such as indexing, however the previous paragraph proposes doing most of the traditional relational query work in the structure by which the data is stored under NoSQL.
Undecided
Would a relational solution with correct indexing reduce to a crawl within the year? Or does the collection based structure of NoSQL approaches (which matches my mental model of the stored data) provide a noticeable benefit?
Definitely Relational. Unlimited flexibility and expansion.
Two corrections, both in concept and application, followed by an elevation.
Correction
It is not "filtering out the un-needed data"; it is selecting only the needed data. Yes, of course, if you have an Index to support the columns identified in the WHERE clause, it is very fast, and the query does not depend on the size of the table (grabbing 1,000 rows from a 16 billion row table is instantaneous).
Your table has one serious impediment. Given your description, the actual PK is (Device, Metric, DateTime). (Please don't call it TimeStamp, that means something else, but that is a minor issue.) The uniqueness of the row is identified by:
(Device, Metric, DateTime)
The Id column does nothing, it is totally and completely redundant.
An Id column is never a Key (duplicate rows, which are prohibited in a Relational database, must be prevented by other means).
The Id column requires an additional Index, which obviously impedes the speed of INSERT/DELETE, and adds to the disk space used.
You can get rid of it. Please.
Elevation
Now that you have removed the impediment, you may not have recognised it, but your table is in Sixth Normal Form. Very high speed, with just one Index on the PK. For understanding, read this answer from the What is Sixth Normal Form ? heading onwards.
(I have one index only, not three; on the Non-SQLs you may need three indices).
I have the exact same table (without the Id "key", of course). I have an additional column Server. I support multiple customers remotely.
(Server, Device, Metric, DateTime)
The table can be used to Pivot the data (ie. Devices across the top and Metrics down the side, or pivoted) using exactly the same SQL code (yes, switch the cells). I use the table to erect an unlimited variety of graphs and charts for customers re their server performance.
Monitor Statistics Data Model.
(Too large for inline; some browsers cannot load inline; click the link. Also that is the obsolete demo version, for obvious reasons, I cannot show you commercial product DM.)
It allows me to produce Charts Like This, six keystrokes after receiving a raw monitoring stats file from the customer, using a single SELECT command. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. Of course, there is no limit to the number of stats matrices, and thus the charts. (Used with the customer's kind permission.)
Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the IDEF1X Notation helpful.
One More Thing
Last but not least, SQL is a IEC/ISO/ANSI Standard. The freeware is actually Non-SQL; it is fraudulent to use the term SQL if they do not provide the Standard. They may provide "extras", but they are absent the basics.
Found very interesting the above answers.
Trying to add a couple more considerations here.
1) Data aging
Time-series management usually need to create aging policies. A typical scenario (e.g. monitoring server CPU) requires to store:
1-sec raw samples for a short period (e.g. for 24 hours)
5-min detail aggregate samples for a medium period (e.g. 1 week)
1-hour detail over that (e.g. up to 1 year)
Although relational models make it possible for sure (my company implemented massive centralized databases for some large customers with tens of thousands of data series) to manage it appropriately, the new breed of data stores add interesting functionalities to be explored like:
automated data purging (see Redis' EXPIRE command)
multidimensional aggregations (e.g. map-reduce jobs a-la-Splunk)
2) Real-time collection
Even more importantly some non-relational data stores are inherently distributed and allow for a much more efficient real-time (or near-real time) data collection that could be a problem with RDBMS because of the creation of hotspots (managing indexing while inserting in a single table). This problem in the RDBMS space is typically solved reverting to batch import procedures (we managed it this way in the past) while no-sql technologies have succeeded in massive real-time collection and aggregation (see Splunk for example, mentioned in previous replies).
You table has data in single table. So relational vs non relational is not the question. Basically you need to read a lot of sequential data. Now if you have enough RAM to store a years worth data then nothing like using Redis/MongoDB etc.
Mostly NoSQL databases will store your data on same location on disk and in compressed form to avoid multiple disk access.
NoSQL does the same thing as creating the index on device id and metric id, but in its own way. With database even if you do this the index and data may be at different places and there would be a lot of disk IO.
Tools like Splunk are using NoSQL backends to store time series data and then using map reduce to create aggregates (which might be what you want later). So in my opinion to use NoSQL is an option as people have already tried it for similar use cases. But will a million rows bring the database to crawl (maybe not , with decent hardware and proper configurations).
Create a file, name it 1_2.data. weired idea? what you get:
You save up to 50% of space because you don't need to repeat the fk_to_device and fk_to_metric value for every data point.
You save up even more space because you don't need any indices.
Save pairs of (timestamp,metric_value) to the file by appending the data so you get a order by timestamp for free. (assuming that your sources don't send out of order data for a device)
=> Queries by timestamp run amazingly fast because you can use binary search to find the right place in the file to read from.
if you like it even more optimized start thinking about splitting your files like that;
1_2_january2014.data
1_2_february2014.data
1_2_march2014.data
or use kdb+ from http://kx.com because they do all this for you:) column-oriented is what may help you.
There is a cloud-based column-oriented solution popping up, so you may want to have a look at: http://timeseries.guru
You should look into Time series database. It was created for this purpose.
A time series database (TSDB) is a software system that is optimized for handling time series data, arrays of numbers indexed by time (a datetime or a datetime range).
Popular example of time-series database InfluxDB
I think that the answer for this kind of question should mainly revolve about the way your Database utilize storage.
Some Database servers use RAM and Disk, some use RAM only (optionally Disk for persistency), etc.
Most common SQL Database solutions are using memory+disk storage and writes the data in a Row based layout (every inserted raw is written in the same physical location).
For timeseries stores, in most cases the workload is something like: Relatively-low interval of massive amount of inserts, while reads are column based (in most cases you want to read a range of data from a specific column, representing a metric)
I have found Columnar Databases (google it, you'll find MonetDB, InfoBright, parAccel, etc) are doing terrific job for time series.
As for your question, which personally I think is somewhat invalid (as all discussions using the fault term NoSQL - IMO):
You can use a Database server that can talk SQL on one hand, making your life very easy as everyone knows SQL for many years and this language has been perfected over and over again for data queries; but still utilize RAM, CPU Cache and Disk in a Columnar oriented way, making your solution best fit Time Series
5 Millions of rows is nothing for today's torrential data. Expect data to be in the TB or PB in just a few months. At this point RDBMS do not scale to the task and we need the linear scalability of NoSql databases. Performance would be achieved for the columnar partition used to store the data, adding more columns and less rows kind of concept to boost performance. Leverage the Open TSDB work done on top of HBASE or MapR_DB, etc.
I face similar requirements regularly, and have recently started using Zabbix to gather and store this type of data. Zabbix has its own graphing capability, but it's easy enough to extract the data out of Zabbix's database and process it however you like. If you haven't already checked Zabbix out, you might find it worth your time to do so.

Scaling a MS SQL Server 2008 database

Im trying to work out the best way scale my site, and i have a question on how mssql will scale.
The way the table currently is:
cache_id - int - identifier
cache_name - nvchar 256 - Used for lookup along with event_id
cache_event_id - int - Basicly a way of grouping
cache_creation_date - datetime
cache_data - varbinary(MAX) - Data size will be from 2k to 5k
The data stored is a byte array, thats basically a cached instance (compressed) of a page on my site.
The different ways i see storing i see are:
1) 1 large table, it would contain tens millions of records and easily become several gigabytes in size.
2) Multiple tables to contain the data above, meaning each table would 200k to a million records.
The data will be used from this table to show web pages, so anything over 200ms to get a record is bad in my eyes ( I know some ppl think 1-2 seconds page load is ok, but i think thats slow and want to do my best to keep it lower).
So it boils down to, what is it that slows down the SQL server?
Is it the size of the table ( disk space )
Is the the number of rows
At what point does it stop becoming cost effective to use multiple database servers?
If its close to impossible to predict these things, il accept that as a reply to. Im not a DBA, and im basically trying to design my DB so i dont have to redesign it later when its it contains huge amount of data.
So it boils down to, what is it that slows down the SQL server?
Is it the size of the table ( disk space )
Is the the number of rows
At what point does it stop becoming cost effective to use multiple
database servers?
This is all a 'rule of thumb' view;
Load (and therefore to a considerable extent performance) of a DB is largely a factor of 2 issues data volumes and transaction load, with IMHO the second generally being more relevant.
With regards the data volume one can hold many gigabytes of data and get acceptable access times by way of Normalising, Indexing, Partitioning, Fast IO systems, appropriate buffer cache sizes, etc. Many of these, e.g. Normalisation are the issues that one considers at DB design time, others during system tuning, e.g. additional/less indexes, buffer cache size.
The transactional load is largely a factor of code design and total number of users. Code design includes factors like getting transaction size right (small and fast is the general goal, but like most things it is possible to take it to far and have transactions that are too small to retain integrity or so small as to in itself add load).
When scaling I advise first scale up (bigger, faster server) then out (multiple servers). The admin issues of a multiple server instance are significant and I suggest only worth considering for a site with OS, Network and DBA skills and processes to match.
Normalize and index.
How, we can't tell you, because you haven't told use what your table is trying to model or how you're trying to use it.
1 million rows is not at all uncommon. Again, we can't tell you much in the absence of context only you can, but don't, provide.
The only possible answer is to set it up, and be prepared for a long iterative process of learning things only you will know because only you will live in your domain. Any technical advice you see here will be naive and insufficiently informed until you have some practical experience to share.
Test every single one of your guesses, compare the results, and see what works. And keep looking for more testable ideas. (And don't be afraid to back out changes that end up not helping. It's a basic requirement to have any hope of sustained simplicity.)
And embrace the fact that your database design will evolve. It's not as fearsome as your comment suggests you think it is. It's much easier to change a database than the software that goes around it.

Resources