Very wide denormalized data (around 40000 columns). Which database to use?

I have a very specific problem. I have data with around 40000 columns. The data is denormalized because processing it in real time would take too long.
PostgreSQL has a limit of 1600 columns per table. Can anybody suggest a database which doesn't have this limitation?
Or, if not a database, a method by which very wide data can be stored?
Partitioning into smaller tables proved to be a tedious task, because joining them when a specific query is executed with specific filters can get really messy. I have tried that already.
Thanks!
Edit:
census.gov/programs-surveys/acs/data.html This is the dataset.
Example table:
Nr. of people in some street:
Columns: number of people, number of people aged under 18, number aged under 22, number aged 22 to 30, etc.
These combinations keep multiplying: include race, gender, nationality, etc., and there are your 40000 columns. These columns cannot be calculated on the fly; they need to be precalculated and stored for faster reading. – Forsythe

All databases that I can readily think of have column limits in the low thousands (at least SQL Server, MS Access, Oracle, MySQL, Postgres, Teradata, and DB2). You might have better luck with columnar databases, but these are rather specialized.
This leaves you with various options:
You can use key-value pairs for the data. However, if the data is dense, the key-value representation can become very large.
You can use other data structures, such as JSON, XML, arrays (in Postgres), or BLOBs (binary large objects).
You can use NoSQL technologies for storing the data.
You can use statistics tools, such as R, SAS, and SPSS.
Ultimately, the question of how you want to store the data depends on what you want to do with it. For instance, if you have a system with lots of relational data and functionality but also some time-series data, you can store the time series in its own table (one row per entity per time unit), or you might store the series as a BLOB because you are just returning it to the application for further processing.
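As a minimal sketch of the key-value option, assuming a hypothetical long-format table (the table and measure names here are illustrative, not from the original answer), the 40000 columns become one row per area and measure:

-- One row per (area, measure) instead of one column per measure
CREATE TABLE area_measure (
    area_id  bigint NOT NULL,   -- e.g. a street, block or census tract
    measure  text   NOT NULL,   -- e.g. 'pop_total', 'pop_age_lt_18_male'
    value    bigint NOT NULL,
    PRIMARY KEY (area_id, measure)
);

-- Pulling a handful of the "columns" back out for one area is a simple lookup:
SELECT measure, value
FROM area_measure
WHERE area_id = 12345
  AND measure IN ('pop_total', 'pop_age_lt_18_male', 'pop_age_22_30_female');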

Maybe I'm getting this wrong, but there seems to be a problem of understanding and data storage here...
You say you want to display columns such as number of people, people aged under 18, between 18 and 21... But this is not a proper way to store the data...
The REAL data here is the age of everybody, their gender... All the other columns are just calculated...
Your query then needs to be parameterized so you can select properly.
Example using PHP and MySQL, with the user input bound as parameters rather than concatenated into the SQL string:
// These vars come from user input
$ageMin = 18;
$ageMax = 21;
$gender = "male";

// $pdo is an already-open PDO connection to the database
$stmt = $pdo->prepare(
    "SELECT COUNT(*) FROM MyTable
     WHERE Age >= :ageMin AND Age <= :ageMax AND Gender = :gender"
    // ... further criteria are bound the same way
);
$stmt->execute(['ageMin' => $ageMin, 'ageMax' => $ageMax, 'gender' => $gender]);
$count = $stmt->fetchColumn();
And since you say it cannot be calculated on the fly, store those results in a table as rows:
Table Calculated
IDCalculation (INTEGER)
Name (CHAR(30))
Criteria
Result
So you just transform your 40000 columns into rows.
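A minimal sketch of that Calculated table in MySQL could look like this (the exact types and the example names are assumptions, purely illustrative):

CREATE TABLE Calculated (
    IDCalculation INT AUTO_INCREMENT PRIMARY KEY,
    Name          CHAR(30),       -- e.g. 'people_18_21_male'
    Criteria      VARCHAR(255),   -- e.g. 'age BETWEEN 18 AND 21 AND gender = male'
    Result        BIGINT
);

-- Reading one precalculated value is then a single-row lookup:
SELECT Result FROM Calculated WHERE Name = 'people_18_21_male';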

Related

Save data efficiently to SQL table(s)?

Every 15 minutes we read and parse an XML file that includes KPI scores of one element. This element has 5 sub-elements and each sub-element has 400 KPI scores. There are around 250 elements (250 files), which means that every 15 minutes I'll be storing 500K KPI scores (250 elements * 5 sub-elements * 400 KPI scores = 500K KPIs).
This data will be used for reporting, mostly aggregating all this data hourly and daily. In other words, most of the KPIs will eventually be grouped. But the first step is to somehow store the individual counters.
The first, most common, thought was to create a table with the columns being the KPIs. But this was done with similar data, and performance was sub-par, to say the least.
So my question is, what would be the best way to store this raw data?
I was considering creating a small table that would include the following columns: [Date], [Hour], [Minute], [KPI], [Score]. The problem here (I think) will be the difficulty of querying the data; with a "regular" table, I can simply SELECT KPI1, KPI2, KPI29 FROM TABLE GROUP BY whatever. With this new format, grouping several KPIs with just one query would be slightly more difficult.
Thanks.
If you need the raw data as such, then the best way is to store it in an XML-type column within a table (plus date/time and maybe other data to identify the load).
@Jayvee is completely correct (+1). I would add: querying raw data is usually slow. Store aggregates as soon as possible and query the aggregates instead of the raw data. The raw data should usually only be used/queried for historical and diagnostic purposes.
What I mean by "store aggregates as soon as possible": you could calculate them on import (maybe slowing your import), or (better) you could (also) insert them into a (2nd) "buffer" table and run a separate job to aggregate the buffered records and then clear the buffer. That way, you don't have to query the massive history table to aggregate your new data.
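On the asker's concern about querying the tall [Date], [Hour], [Minute], [KPI], [Score] layout: a sketch using conditional aggregation (the table name RawScores is an assumption) shows that several KPIs can still be grouped in a single query:

-- Hourly aggregate of a few named KPIs from the tall table
SELECT [Date], [Hour],
       SUM(CASE WHEN [KPI] = 'KPI1'  THEN [Score] END) AS KPI1,
       SUM(CASE WHEN [KPI] = 'KPI2'  THEN [Score] END) AS KPI2,
       SUM(CASE WHEN [KPI] = 'KPI29' THEN [Score] END) AS KPI29
FROM RawScores
GROUP BY [Date], [Hour];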

Database design: storing many large reports for frequent historical analysis

I'm a long time programmer who has little experience with DBMSs or designing databases.
I know there are similar posts regarding this, but am feeling quite discombobulated tonight.
I'm working on a project which will require that I store large reports, multiple times per day, and have not dealt with storage or tables of this magnitude. Allow me to frame my problem in a generic way:
The process:
A script collects roughly 300 rows of information, set A, 2-3 times per day.
The structure of these rows never change. The rows contain two columns, both integers.
The script also collects roughly 100 rows of information, set B, at the same time. The structure of these rows does not change either. The rows contain eight columns, all strings.
I need to store all of this data. Set A will be used frequently, and daily for analytics. Set B will be used frequently on the day that it is collected and then sparingly in the future for historical analytics. I could theoretically store each row with a timestamp for later query.
If stored linearly, both sets of data in their own table, using a DBMS, the data will reach ~300k rows per year. Having little experience with DBMSs, this sounds high for two tables to manage.
I feel as though throwing this information into a database with each pass of the script will lead to slow read times and general responsiveness. For example, generating an Access database and tossing this information into two tables seems like too easy of a solution.
I suppose my question is: how many rows is too many rows for a table in terms of performance? I know that it would be in very poor taste to create tables for each day or month.
Of course this only melts into my next, but similar, issue, audit logs...
300 rows about 50 times a day for 6 months is not a big blocker for any DB. Which DB are you gonna use? Most will handle this load very easily. There are a couple of techniques for handling data fragmentation if the rows exceed a few hundred million per table. But with effective indexing and cleaning you can achieve the performance you desire. I myself deal with heavy tables of more than 200 million rows every week.
Make sure you have indexes in place to match the queries you will issue to fetch that data. Whatever you have in the WHERE clause should have an appropriate index in the DB.
If your row counts per table exceed many millions, you should look at partitioning the tables. DBs actually store data in the filesystem as files, so partitioning helps by splitting the data into smaller groups of data files based on some predicate, e.g. a date or some unique column. You would still see it as a single table, but on the file system the DB would store the data in different file groups.
You can also try table sharding, which is actually what you mentioned: different tables based on some predicate like date.
Hope this helps.
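As a rough sketch of what the index plus date-based partitioning could look like in MySQL (table and column names here are assumptions, not from the question):

CREATE TABLE report_rows (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    metric_a   INT NOT NULL,
    metric_b   INT NOT NULL,
    created_on DATE NOT NULL,
    PRIMARY KEY (id, created_on),        -- the partition column must be part of every unique key
    KEY idx_created_on (created_on)      -- matches the WHERE clause used for reporting
)
PARTITION BY RANGE (TO_DAYS(created_on)) (
    PARTITION p2013 VALUES LESS THAN (TO_DAYS('2014-01-01')),
    PARTITION p2014 VALUES LESS THAN (TO_DAYS('2015-01-01')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);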
You are overthinking this. 300k rows is not significant. Just about any relational or NoSQL database will handle it without problems.
Your design sounds fine; however, I highly advise that you use the database's facility to add a primary key to each row. Typically this involves AUTO_INCREMENT or a sequence, depending on the database. If you use a NoSQL store like Mongo, it will add an id for you. Relational theory depends on having a primary key, and it's often helpful to have one for diagnostics.
So your basic design would be:
Table A tableA_id | A | B | CreatedOn
Table B tableB_id | columns… | CreatedOn
The CreatedOn will facilitate date range queries that limit data for summarization purposes and allow you to GROUP BY on date boundaries (Days, Weeks, Months, Years).
Make sure you have an index on CreatedOn, if you will be doing this type of grouping.
Also, use the smallest data types you can for any of the columns. For example, if the range of the integers falls below a particular limit, or is non-negative, you can usually choose a datatype that will reduce the amount of storage required.
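A minimal sketch of the two tables along those lines (column names and exact sizes are assumptions):

CREATE TABLE set_a (
    tablea_id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    a          SMALLINT NOT NULL,        -- smallest integer type the value range allows
    b          SMALLINT NOT NULL,
    created_on DATETIME NOT NULL,
    KEY idx_set_a_created_on (created_on)
);

CREATE TABLE set_b (
    tableb_id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    col1       VARCHAR(100),             -- eight string columns in the real schema
    col2       VARCHAR(100),
    created_on DATETIME NOT NULL,
    KEY idx_set_b_created_on (created_on)
);

-- Daily rollup of set A for analytics:
SELECT DATE(created_on) AS report_day, SUM(a) AS total_a, SUM(b) AS total_b
FROM set_a
GROUP BY DATE(created_on);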

Database design for large amounts of data

I would like to store stock trading data for 1000 symbols. The data is actually converted from text files so there is no need for inserts and updates; only read-only access will be required.
The data is basically grouped like this: each symbol has many records: {timestamp, price, quantity}, each record represents a trade.
An approximate upperbound of data for one symbol is 5 records/second, 8 hours for each working day, i.e. 5x60x60x8 = 144K per day. I.e. 1K symbols would generate 144M records per day.
Most of operations over the data would be something like:
give me all records for a symbol for the period Date D1, Time T1 to Date D2, Time T2
find an min/max/avg of price or quantity for the period [D1, T1...D2, T2]
Now the question: what would be the best design for a database in this case?
Can I store all trades for symbol in a single table? Tables would quickly grow too big in this case though.
Shall I create a separate table per day/week/month? I.e. 2013-10-25_ABC (ABC - symbol name). In this case we may get 1K new tables per day/week/month.
Or, may be plain text files would be enough in such case? E.g., having all symbols data as files under 2013-10-15 folder, resulting in 1K files in each folder
The database may be either MS SQL or MySQL. The total time period - up to 5 years.
Thank you!
That's a whole lot of data. Do look at NoSQL.
Using SQL, here are some basic ideas:
Put all price data in one table, using the smallest data types possible. Use a SymbolId (int) to reference the symbol, the smallest datetime type needed, and the smallest monetary type needed.
Do denormalize. Make a second table with min/max/avg per day and SymbolId.
Research horizontal partitioning and use indexes.
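A sketch of that layout in MySQL syntax (names and type sizes are assumptions, not from the answer):

-- Raw trades, one row per trade, with the smallest types that fit
CREATE TABLE trade (
    symbol_id INT UNSIGNED  NOT NULL,    -- references a symbols lookup table
    traded_at DATETIME      NOT NULL,
    price     DECIMAL(10,4) NOT NULL,
    quantity  INT UNSIGNED  NOT NULL,
    KEY idx_trade_symbol_time (symbol_id, traded_at)
);

-- Denormalized daily summary, refreshed by the import job
CREATE TABLE trade_day_summary (
    symbol_id  INT UNSIGNED NOT NULL,
    trade_date DATE         NOT NULL,
    min_price  DECIMAL(10,4),
    max_price  DECIMAL(10,4),
    avg_price  DECIMAL(10,4),
    PRIMARY KEY (symbol_id, trade_date)
);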
The third option is the best one. You need high read performance with almost negligible writes.
Your requirements are best suited to NoSQL databases: a single table with no relationships; MySQL would be overkill. More info --> NoSQL Databases
Since you'll be running queries from one datetime to another I wouldn't split tables up at all. Instead, learn more about sharding. Below is the schema I would use:
symbols
id varchar(6) // MSFT, GOOG, etc.
name varchar(50) // Microsoft, Google, etc.
...
trades
id unsigned bigint(P)
symbol_id varchar(6)(F symbols.id)
qwhen datetime
price double
quantity double
...
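With that schema, the two query patterns from the question would look roughly like this (the BETWEEN bounds are placeholder values):

-- All trades for one symbol in a date/time range
SELECT qwhen, price, quantity
FROM trades
WHERE symbol_id = 'MSFT'
  AND qwhen BETWEEN '2013-10-01 09:00:00' AND '2013-10-02 17:00:00'
ORDER BY qwhen;

-- Min/max/avg price and total quantity for the same range
SELECT MIN(price), MAX(price), AVG(price), SUM(quantity)
FROM trades
WHERE symbol_id = 'MSFT'
  AND qwhen BETWEEN '2013-10-01 09:00:00' AND '2013-10-02 17:00:00';

An index on (symbol_id, qwhen) is what makes both of these fast.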

How to store the plant data and where to store?

Plant data is real-time data from the plant process, such as pressure, temperature, gas flow and so on. The data model of this data is typically:
(Point Name, Timestamp, Value (float or integer), State (int))
We have thousands of points and a long time period to store. Importantly, we want to search the data easily and quickly when we need it.
A typical search request looks like:
get data order by time stamp
from database
where Point name is P001_Press
between 2010-01-01 and 2010-01-02
A database like MySQL seems unsuitable for us, because there are too many records and queries are too slow.
So, how should we store data like the above, and where should we store it? Any NoSQL databases? Thanks!
This data and query pattern actually fits pretty well into a flat table in a SQL database, which means implementing it with NoSQL will be significantly more work than fixing your query performance in SQL.
If your data is inserted in real time, you can remove the ORDER BY clause, as the data will already be sorted by timestamp and there is no need to waste time re-sorting it. An index on point name and timestamp should give you good performance on the rest of the query.
If you are really getting to the limits of what a SQL table can hold (many millions of records) you have the option of sharding - a table for each data point may work fairly well.
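A flat-table sketch along those lines (column names follow the question; exact types are assumptions):

CREATE TABLE plant_point (
    point_name VARCHAR(32) NOT NULL,   -- e.g. 'P001_Press'
    ts         DATETIME    NOT NULL,
    value      FLOAT       NOT NULL,
    state      INT         NOT NULL,
    KEY idx_point_ts (point_name, ts)  -- the index recommended above
);

-- The search request from the question, served by the index above
SELECT ts, value, state
FROM plant_point
WHERE point_name = 'P001_Press'
  AND ts BETWEEN '2010-01-01' AND '2010-01-02';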

What's a good way to store large time/value datasets?

I'm working on an application that stores a lot of quite large time/value datasets (chart data, basically values taken from a sensor every day, hour or 15 minutes for a year+). Currently we're storing them in 2 MySQL tables: a datasets table that stores the info (ID, name, etc) for a dataset, and a table containing (dataset ID, timestamp, value) triplets. This second table is already well over a million rows, and the amount of data to be stored is expected to become many times larger.
The common operations such as retrieving all points for a particular dataset in a range are running quickly enough, but some other more complex operations can be painful.
Is this the best way to organize the data? Is a relational database even particularly suited to this sort of thing? Or do I just need to learn to define better indexes and optimize the queries?
A relational database is definitely what you need for this kind of large structured dataset. If individual queries are causing problems, it's worth profiling each one to find out if different indexes are required or whatever.
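If the painful operations are range scans over one dataset, a composite index is usually the first thing to profile for; here is a sketch with assumed table and column names based on the description:

-- Composite index so "all points for dataset X in a range" is an index range scan
CREATE INDEX idx_points_dataset_ts ON datapoints (dataset_id, ts);

-- Typical range query that this index serves
SELECT ts, value
FROM datapoints
WHERE dataset_id = 42
  AND ts BETWEEN '2013-01-01' AND '2013-12-31'
ORDER BY ts;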
