Database design for large amounts of data

I would like to store stock trading data for 1000 symbols. The data is actually converted from text files so there is no need for inserts and updates; only read-only access will be required.
The data is basically grouped like this: each symbol has many records: {timestamp, price, quantity}, each record represents a trade.
An approximate upper bound for one symbol is 5 records/second over an 8-hour working day, i.e. 5x60x60x8 = 144K records per day. So 1K symbols would generate about 144M records per day.
Most operations over the data would be something like:
give me all records for a symbol for the period Date D1, Time T1 to Date D2, Time T2
find the min/max/avg of price or quantity for the period [D1, T1...D2, T2]
Now the question: what would be the best design for a database in this case?
Can I store all trades for a symbol in a single table? Such tables would quickly grow too big, though.
Shall I create a separate table per day/week/month, e.g. 2013-10-25_ABC (where ABC is the symbol name)? In this case we may get 1K new tables per day/week/month.
Or maybe plain text files would be enough in this case? E.g., having each symbol's data as a file under a 2013-10-15 folder, resulting in 1K files per folder.
The database may be either MS SQL or MySQL. The total time period is up to 5 years.
Thank you!

That's a whole lot of data. Do look at NoSQL.
Using SQL, here are some basic ideas:
Put all price data in one table, using the smallest data types possible: a SymbolId (int, or smaller) to reference the symbol, the smallest datetime type that fits, and the smallest monetary type that fits.
Do denormalize. Make a second table with min/max/avg per day and SymbolId.
Research horizontal partitioning and use indexes.
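A minimal sketch of these ideas in MySQL-flavoured SQL (table and column names are my own, not from the question):

-- Compact fact table: small integer key for the symbol, DATETIME for the trade
-- time, DECIMAL sized to the tick data instead of a wide monetary type.
CREATE TABLE trade (
    symbol_id SMALLINT UNSIGNED NOT NULL,   -- references a symbol lookup table; 1K symbols fit easily
    traded_at DATETIME          NOT NULL,
    price     DECIMAL(10,4)     NOT NULL,
    quantity  INT UNSIGNED      NOT NULL,
    KEY ix_symbol_time (symbol_id, traded_at)
);

-- Denormalized per-day summary, refreshed once per day after the text files are loaded.
CREATE TABLE trade_day_summary (
    symbol_id  SMALLINT UNSIGNED NOT NULL,
    trade_date DATE              NOT NULL,
    min_price  DECIMAL(10,4)     NOT NULL,
    max_price  DECIMAL(10,4)     NOT NULL,
    avg_price  DECIMAL(10,4)     NOT NULL,
    PRIMARY KEY (symbol_id, trade_date)
);

INSERT INTO trade_day_summary
SELECT symbol_id, DATE(traded_at), MIN(price), MAX(price), AVG(price)
FROM trade
WHERE traded_at >= '2013-10-25' AND traded_at < '2013-10-26'
GROUP BY symbol_id, DATE(traded_at);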

The third option is the best one. You need high read performance with almost negligible writes.
Your requirements are best suited to NoSQL databases: a single table with no relationships. MySQL would be overkill. More info: NoSQL Databases

Since you'll be running queries from one datetime to another, I wouldn't split the tables up at all. Instead, learn more about sharding. Below is the schema I would use:
CREATE TABLE symbols (
    id   VARCHAR(6) PRIMARY KEY,  -- MSFT, GOOG, etc.
    name VARCHAR(50)              -- Microsoft, Google, etc.
    -- ...
);

CREATE TABLE trades (
    id        BIGINT UNSIGNED PRIMARY KEY,
    symbol_id VARCHAR(6) NOT NULL,
    qwhen     DATETIME   NOT NULL,
    price     DOUBLE,
    quantity  DOUBLE,
    -- ...
    FOREIGN KEY (symbol_id) REFERENCES symbols (id)
);
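Against that schema, the range queries from the question mostly need a composite index on (symbol_id, qwhen); roughly like this (the symbol and dates are only examples):

CREATE INDEX ix_trades_symbol_qwhen ON trades (symbol_id, qwhen);

-- All trades for one symbol between two points in time
SELECT qwhen, price, quantity
FROM trades
WHERE symbol_id = 'MSFT'
  AND qwhen BETWEEN '2013-10-01 09:00:00' AND '2013-10-02 17:00:00'
ORDER BY qwhen;

-- Min/max/avg over the same window
SELECT MIN(price), MAX(price), AVG(price), SUM(quantity)
FROM trades
WHERE symbol_id = 'MSFT'
  AND qwhen BETWEEN '2013-10-01 09:00:00' AND '2013-10-02 17:00:00';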

Related

SQL Server table design with non-fixed columns

I need your help in designing one table.
I have some group tables, and we need to load data into those group tables from XML files that contain the column names and the data.
The column name is actually an index of some main column, like activity_col1, activity_col2, and so on, and it is not fixed: the same file may contain 1000 columns one time and only 10 column values another time. There is also a defined maximum, so no file will contain more than 2000 columns per group.
So I need to design a table that is the best possible solution for this. I also need to aggregate the column values: the files contain minute-level data, which I need to store in a minute table and then aggregate into hourly, daily, weekly, and monthly tables.
If I create the maximum number of columns in every table, data will not arrive for all of them every time, so most of the values will be NULL; this design does not seem good.
If I store the column names as rows in a column_name column, with the values against each in a value column, then aggregation will be a tedious task for me and will hurt performance.
Please suggest.
One option would be EAV, but it's more complicated to build, to query and to insert, and readability is very low.
You require a schema-less design allowing an effectively unlimited number of columns, so your best bet is probably a NoSQL solution, even though the weaknesses of EAV relative to relational databases also apply to the NoSQL alternatives.
Also take a look here:
Benefits of NoSQL
Recommendations (in order of priority):
1. EAV, if you are using a relational database. This is where you turn either the whole table, or a portion of it (in another table), on its side. It is a good choice if you already have a relational database in-house that you can't move away from easily (see the sketch after this list).
2. NoSQL, if the kind of DBMS doesn't matter to you. It is very flexible and fast, although not all of the report writers out there support this style of storage. There are many NoSQL database implementations; the one that seems most popular right now is MongoDB.
3. And the last option, which I don't recommend: standard tables with XML columns, if you don't need to query them and just want the data stored and retrieved as plain text for some occasional use.
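A minimal EAV sketch in T-SQL, assuming the minute data is numeric and the roll-up is a SUM (all names and the aggregate function are assumptions, not taken from the question):

-- One row per (group row, column index, value) instead of up to 2000 physical columns
CREATE TABLE activity_value_minute (
    group_id     INT           NOT NULL,
    recorded_at  DATETIME      NOT NULL,
    column_index SMALLINT      NOT NULL,   -- which activity_colN the value belongs to
    col_value    DECIMAL(18,6) NULL,
    PRIMARY KEY (group_id, recorded_at, column_index)
);

-- Hourly roll-up becomes a single GROUP BY instead of 2000 per-column aggregates
-- (assumes an activity_value_hour table of the same shape)
INSERT INTO activity_value_hour (group_id, recorded_hour, column_index, col_value)
SELECT group_id,
       DATEADD(HOUR, DATEDIFF(HOUR, 0, recorded_at), 0),
       column_index,
       SUM(col_value)
FROM activity_value_minute
GROUP BY group_id, DATEADD(HOUR, DATEDIFF(HOUR, 0, recorded_at), 0), column_index;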
I hope this is helpful for you :)

Very wide denormalized data (around 40000 columns). Which database to use?

I have a very specific problem: my data has around 40000 columns. The data is denormalized because processing it in real time would take too long.
PostgreSQL has a limit of 1600 columns per table. Can anybody suggest a database which doesn't have this limitation?
Or, if not a database, a method for storing very wide data?
Partitioning into smaller tables proved to be a tedious task, because joining them when a specific query is executed with specific filters can get really messy. I have tried that already.
Thanks!
Edit:
This is the dataset: census.gov/programs-surveys/acs/data.html
Example table:
Number of people on some street.
Columns: number of people, number of people aged under 18, number of people aged under 22, number of people aged 22 to 30, etc.
And these combinations get higher and higher. Include race, gender, nationality, etc., and there are your 40000 columns. These columns cannot be calculated on the fly; they need to be precalculated and stored for faster reading.
All databases that I can readily think of have limits in the low thousands (at least SQL Server, MS Access, Oracle, MySQL, Postgres, Teradata, and DB2). You might have better luck with columnar databases but these are rather specialized.
This leaves you with various options:
You can use key-value pairs for the data. However, if the data is dense, then you might end up with a very large amount of data.
You can use other data structures, such as JSON, XML, arrays (in Postgres), or BLOBs (binary large objects).
You can use NOSQL technologies for storing the data.
You can use statistics tools, such as R, SAS, and SPSS.
Ultimately, the question of how you want to store the data depends on what you want to do with it. For instance, if you have a system that has lots of relational data and functionality but has time-series data, then you can store the timeseries in its own table (one row per whatever and per time unit) or you might store the series as a BLOB because you are returning it to the application for further processing.
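For the JSON route on Postgres specifically, a rough sketch (table and key names are made up) would keep the 40000 measures in one JSONB column:

CREATE TABLE acs_block (
    block_id BIGINT PRIMARY KEY,
    measures JSONB NOT NULL      -- e.g. {"pop_total": 812, "pop_under_18": 194, ...}
);

-- Pull out individual measures and filter on one of them
SELECT block_id,
       (measures->>'pop_total')::INT    AS pop_total,
       (measures->>'pop_under_18')::INT AS pop_under_18
FROM acs_block
WHERE (measures->>'pop_total')::INT > 1000;

-- A GIN index keeps key/containment lookups fast
CREATE INDEX ix_acs_block_measures ON acs_block USING GIN (measures);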
Maybe I'm getting this wrong, but there seems to be a problem of understanding and data storage here...
You say you want columns like number of people, people aged under 18, between 18 and 21... But that is not the proper way to store the data...
The REAL data here is the age of everybody, their gender... Then all the other columns are just calculated...
Then your query needs to be parametrized so you can select properly.
Example using PHP and MySQL:
//These vars come from user input
$ageMin = 18;
$ageMax = 21;
$gender = "male";
$query = "SELECT COUNT(*) FROM MyTable WHERE
Age >= {$ageMin} AND
Age <= {$ageMax} AND
Gender = '{$gender}' ... "
And if you say it cannot be calculated on the fly, then store those results in a table as rows:
Table Calculated
IDCalculation (INTEGER)
Name (CHAR(30))
Criteria
Result
So you just transform your 40000 columns into rows.
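As a sketch of that precalculated-rows idea (the types for Criteria and Result are not given above, so the ones here are assumptions):

CREATE TABLE Calculated (
    IDCalculation INTEGER PRIMARY KEY,
    Name          CHAR(30),       -- e.g. 'people_male_18_21'
    Criteria      VARCHAR(200),   -- e.g. 'age between 18 and 21 and gender = male'
    Result        BIGINT
);

-- Reading one precomputed figure is then a single-row lookup
SELECT Result FROM Calculated WHERE Name = 'people_male_18_21';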

Database design: storing many large reports for frequent historical analysis

I'm a long time programmer who has little experience with DBMSs or designing databases.
I know there are similar posts regarding this, but am feeling quite discombobulated tonight.
I'm working on a project which will require that I store large reports, multiple times per day, and have not dealt with storage or tables of this magnitude. Allow me to frame my problem in a generic way:
The process:
A script collects roughly 300 rows of information, set A, 2-3 times per day.
The structure of these rows never changes. The rows contain two columns, both integers.
The script also collects roughly 100 rows of information, set B, at the same time. The
structure of these rows does not change either. The rows contain eight columns, all strings.
I need to store all of this data. Set A will be used frequently, and daily for analytics. Set B will be used frequently on the day that it is collected and then sparingly in the future for historical analytics. I could theoretically store each row with a timestamp for later query.
If stored linearly, each set of data in its own table in a DBMS, the data will reach ~300k rows per year. Having little experience with DBMSs, this sounds like a lot for two tables to manage.
I feel as though throwing this information into a database with each pass of the script will lead to slow read times and poor general responsiveness. For example, generating an Access database and tossing this information into two tables seems like too easy a solution.
I suppose my question is: how many rows is too many rows for a table in terms of performance? I know that it would be in very poor taste to create tables for each day or month.
Of course, this only leads into my next, similar issue: audit logs...
300 rows about 50 times a day for 6 months is not a big blocker for any DB. Which DB are you going to use? Most will handle this load very easily. There are a couple of techniques for handling data fragmentation if the row counts exceed a few hundred million per table, but with effective indexing and cleaning you can achieve the performance you desire. I myself deal with heavy data tables with more than 200 million rows every week.
Make sure you have indexes in place for the queries you will issue to fetch that data: whatever you have in the WHERE clause should have an appropriate index in the DB.
If your row counts per table exceed many millions, look at partitioning the tables. DBs actually store data in the filesystem as files, so partitioning helps by grouping the data into smaller sets of files based on some predicate, e.g. a date or some other column. You would still see it as a single table, but on the file system the DB would store the data in different file groups.
Then you can also try table sharding, which is actually what you mentioned: different tables based on some predicate like date.
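For illustration, the kind of date-based range partitioning described above looks roughly like this in MySQL (hypothetical table and column names); each partition becomes its own set of files on disk:

CREATE TABLE report_rows (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    metric_a   INT  NOT NULL,
    metric_b   INT  NOT NULL,
    created_on DATE NOT NULL,
    PRIMARY KEY (id, created_on)   -- the partitioning column must be part of every unique key
)
PARTITION BY RANGE (TO_DAYS(created_on)) (
    PARTITION p2016h1 VALUES LESS THAN (TO_DAYS('2016-07-01')),
    PARTITION p2016h2 VALUES LESS THAN (TO_DAYS('2017-01-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);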
Hope this helps.
You are overthinking this. 300k rows is not significant; just about any relational database or NoSQL database will handle it without any problems.
Your design sounds fine; however, I highly advise that you use the database's facility to add a primary key for each row, using whatever mechanism is available to you. Typically this involves AUTO_INCREMENT or a sequence, depending on the database. If you use a NoSQL database like Mongo, it will add an id for you. Relational theory depends on having a primary key, and it's often helpful to have one for diagnostics.
So your basic design would be:
Table A tableA_id | A | B | CreatedOn
Table B tableB_id | columns… | CreatedOn
The CreatedOn will facilitate date range queries that limit data for summarization purposes and allow you to GROUP BY on date boundaries (Days, Weeks, Months, Years).
Make sure you have an index on CreatedOn, if you will be doing this type of grouping.
Also, use the smallest data types you can for any of the columns. For example, if the range of the integers falls below a particular limit, or is non-negative, you can usually choose a datatype that will reduce the amount of storage required.
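A sketch of that layout for set A in MySQL (column names are placeholders for the two integer columns):

CREATE TABLE table_a (
    tablea_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    a          SMALLINT NOT NULL,   -- pick the smallest integer type that fits the actual range
    b          SMALLINT NOT NULL,
    created_on DATETIME NOT NULL,
    KEY ix_created_on (created_on)  -- supports date-range filters and GROUP BY on date boundaries
);

-- Daily roll-up for analytics
SELECT DATE(created_on) AS day, COUNT(*) AS rows_collected, AVG(a) AS avg_a, SUM(b) AS total_b
FROM table_a
WHERE created_on >= '2016-01-01'
GROUP BY DATE(created_on);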

Database Implementation Help : Time-Series data

This is the re-submission of my previous question:
I have a collection of ordered time-series data (stock minute price information). My current database structure, using PostgreSQL, is below:
symbol_table - where I keep the list of symbols, with symbol_id as the primary key (serial).
time_table, date_table - time/date values are stored there; time_id/date_id are the primary keys (serial).
My main minute_table contains the minute pricing information, where date_id | time_id | symbol_id form the composite primary key (they are also foreign keys to the corresponding tables).
Using this main minute_table I perform different statistical analyses and keep the results in separate tables, like one_minute_std, where one-minute standard deviation measures are kept.
Every night I update the tables with the current price information from the last day's closing prices.
With the current implementation my tables contain all the symbols, with around 50m records each.
Primary keys are indexed.
If I want to query for all the symbols where closing price > x and one_minute_std > 2 and one_minute_std < 4 for a specific date, the search takes about 3-4 minutes.
To speed up the process I was thinking of separating each symbol into its own table, but I'm not 100% sure this is the 'proper' way of doing it.
Could you advise me on how I can speed up the query process?
It sounds like you want a combination of approaches.
First, you should look into table partitioning. This stores a single table across multiple storage units ("files") but still gives you the flexibility of a single table. (Here is the Postgres documentation: http://www.postgresql.org/docs/current/interactive/ddl-partitioning.html.)
You would want to partition either by day or by ticker symbol. My first reaction would be by time (day/week/month), since that is the unit of updates. However, if your analyses are only ever on a single ticker and often span multiple days, then there is an argument for partitioning by ticker instead.
After partitioning, you may want to consider indexes. However, I suspect that partitioning will solve your performance problems.
Since your updates run at night, you should fold your summarization process in with the updates. For instance, one_minute_std should be calculated during this process. You might find it best to load the nightly data into a temporary table, do the calculation for summaries such as one_minute_std, and then load the data into the final partitioned table scheme.
With so many rows containing so few columns, you are probably better off with a good partitioning scheme than an indexing scheme. In particular, indexes have a space overhead, and the smaller the record in each row, the closer the cost of using the index gets to that of scanning the entire table.
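A rough sketch of day-based partitioning on a recent PostgreSQL (10+), assuming the minute data carries a plain timestamp column rather than the date_id/time_id pair:

CREATE TABLE minute_price (
    symbol_id INT           NOT NULL,
    ts        TIMESTAMP     NOT NULL,
    price     NUMERIC(10,4) NOT NULL
) PARTITION BY RANGE (ts);

CREATE TABLE minute_price_2013_10_25 PARTITION OF minute_price
    FOR VALUES FROM ('2013-10-25') TO ('2013-10-26');

-- A query restricted to one date only touches the matching partition
SELECT symbol_id, price
FROM minute_price
WHERE ts >= '2013-10-25' AND ts < '2013-10-26';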

How to store plant data, and where to store it?

Plant data is real-time data from a plant process, such as pressure, temperature, gas flow, and so on. The data model is typically like this:
(point name, timestamp, value (float or integer), state (int))
We have thousands of points and a long time period to store. Importantly, we want to be able to search the data easily and quickly when we need to.
A typically search request is like:
get data order by time stamp
from database
where Point name is P001_Press
between 2010-01-01 and 2010-01-02
A database similar to MySQL is not suitable for us, because there are too many records and the queries are too slow.
So, how should we store data like the above, and where should we store it? Any NoSQL databases? Thanks!
This data and query pattern actually fits pretty well into a flat table in a SQL database, which means implementing it with NoSQL will be significantly more work than fixing your query performance in SQL.
If your data is inserted in real time, you may be able to drop the ORDER BY clause, since rows typically come back in insertion (timestamp) order anyway, though that ordering is not guaranteed without it. An index on point name and timestamp should give you good performance on the rest of the query.
If you are really getting to the limits of what a SQL table can hold (many millions of records), you have the option of sharding: a table for each data point may work fairly well.
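A sketch of that flat-table approach (MySQL-flavoured; column names follow the data model in the question):

CREATE TABLE plant_data (
    point_name VARCHAR(50) NOT NULL,    -- e.g. 'P001_Press'
    ts         DATETIME    NOT NULL,
    value      FLOAT       NOT NULL,
    state      INT         NOT NULL,
    KEY ix_point_ts (point_name, ts)    -- covers the point + time-range lookup
);

-- The search request from the question (half-open range covers both days in full)
SELECT ts, value, state
FROM plant_data
WHERE point_name = 'P001_Press'
  AND ts >= '2010-01-01' AND ts < '2010-01-03'
ORDER BY ts;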

Resources