I have a question about creating a data warehouse.
We have a system that generates more than 50 million records per day. I do some pre-processing on these records and then load them into a table in the database.
The problem is the size of that single table and how to manage it (after about 15 days of loading ~50M records/day), given that I need to keep the most recent 60 days of records.
Now my question is: what is the best way to design my data warehouse?
Use a different table for every day (or, say, every week),
OR use a single table with many partitions,
OR some other approach that you find better for my case?
I need a starting point for my DWH design. I'm using Oracle 11g as my database.
Use partitioning if it's available.
Partitioning gets you the best of both worlds. You can access all your data at once, in one simple table; if the query predicates or the partition-name syntax are used correctly, the table will act as if it were magically much smaller than it really is. And you can still manage the data by day: bulk operations like loading and dropping data can be done in a way that affects only a single day's worth of data.
Interval partitioning makes things even easier. You don't even have to specify the partitions. Just tell Oracle, "make each day a new partition". There are a few new things to learn. But it's a small price to pay for a significant boost in performance and manageability.
If you're using Enterprise Edition and have already licensed the partitioning option then there's no reason not to use it.
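As a rough sketch of what interval partitioning looks like in Oracle 11g (the table and column names here are made up for illustration):

```sql
-- Hypothetical daily-partitioned table: Oracle creates one partition per day automatically.
CREATE TABLE daily_records (
    record_id  NUMBER         NOT NULL,
    load_date  DATE           NOT NULL,
    payload    VARCHAR2(255)
)
PARTITION BY RANGE (load_date)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(
    -- Only the first partition has to be declared; the rest appear as data arrives.
    PARTITION p_initial VALUES LESS THAN (DATE '2015-01-01')
);

-- Ageing out a day that has passed the 60-day window is a cheap metadata operation:
ALTER TABLE daily_records DROP PARTITION FOR (DATE '2015-01-05') UPDATE INDEXES;
```

Loading a day and later dropping it then never touches the other 59 days of data.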
Here is my scenario with a SQL Server 2008 R2 database table
(Update: migration to SQL Server 2014 SP1 is in progress, so SQL Server 2014 can be used here).
A. Maintain daily history in the table (which is a fact table)
B. Create Tableau graphs using the fact and dimension tables
A few steps to follow to create the table:
A copy of the table from the source database will be pushed to my SQL Server DAILY; it contains approximately 120,000 to 130,000 rows and about 20 columns.
a. 1st day, we get 120,000 records.
[Sample source-system data omitted; modified or new records were highlighted in yellow.]
b. 2nd day, we get, say, 122,000 records (2,000 newly inserted, 1,000 modified/updated versions of the previous day's data, and 119,000 unchanged from the previous day)
c. 3rd day, we get, say, 123,000 records (1,000 newly inserted, 1,000 modified/updated versions of the 2nd day's data, and 121,000 unchanged from the 2nd day)
Since the daily history has to be maintained in the Fact table, within a week the table will have 1 million rows,
for 2 weeks - 2 million rows
for 1 month - 5 million rows
for 1 year - say 65 - 70 million rows
for 12 years - say 1 billion rows (1,000 million)
12 years of history has to be maintained.
What would be the right strategy for storing data in this table to handle the scenario, while also providing sufficient performance when generating reports?
Fact table approaches considered:
Approach-1: Partition the table by month (each partition would then hold roughly 5 million rows).
Approach-2: Copy only the differential data (new and modified rows) into the table daily. However, it is not possible to create the Tableau reports with Approach-2.
Tableau graphs have to be created using the fact and dimension tables for scenarios like:
Weekly Bar graph for Sample Count
Weekly (week no. on X-axis) plotter graph for average Sample values (on Y-axis)
Weekly (week no. on x-axis) average sample values (on Y-axis) by quality
How should this scenario be handled?
Please provide references on the approach to follow.
Should we create any indexes on the fact table?
A data warehouse can handle millions of rows these days without a lot of difficulty. Many have tens of billions of rows, and that is when things get a little difficult. You should look at both table partitioning over time and at columnstore and page compression to see what is out there; large warehouses often use both. 2008 R2 is quite old at this point, and note that huge progress has been made in this area in current versions of SQL Server.
Use a standard fact-dimensional design, and try to avoid tweaking the actual schema with workarounds just to conserve space - that generally will bite you in the long run.
For proven, time-tested designs in warehousing I like the Kimball Group's patterns, e.g. The Data Warehouse Lifecycle Toolkit book.
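As a rough sketch of the partitioning-plus-columnstore combination on SQL Server 2014 (all object names here are hypothetical, and the boundary values would need to be generated for the full date range):

```sql
-- Hypothetical monthly partitioning for the fact table.
CREATE PARTITION FUNCTION pf_fact_month (date)
    AS RANGE RIGHT FOR VALUES ('2015-01-01', '2015-02-01', '2015-03-01');  -- extend as needed

CREATE PARTITION SCHEME ps_fact_month
    AS PARTITION pf_fact_month ALL TO ([PRIMARY]);

CREATE TABLE dbo.FactSample (
    SampleDate   date          NOT NULL,
    SampleKey    bigint        NOT NULL,
    QualityKey   int           NOT NULL,
    SampleValue  decimal(18,4) NOT NULL
) ON ps_fact_month (SampleDate);

-- Clustered columnstore index (available from SQL Server 2014) gives strong compression
-- and good scan performance for the weekly aggregate reports.
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSample ON dbo.FactSample;
```

Monthly partitions also make it easy to switch out or purge old months without touching the rest of the 12-year history.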
There are a few different requirements in your case. Because of that, I suggest splitting the requirements according to the standard data warehouse three-tier model.
DWH model (delta-driven, historized, high performance)
Presentation model (Again, high performance, should fit Tableau)
Front end
DWH model
Basically, you have three different approaches here, all with their pros and cons.
3NF
Can become cumbersome down the road. Highly flexible if used right. Time-to-market is long (depending on complexity). Historization can become complicated.
Star Schema (for DWH storage!)
Has a very, very fast time-to-market. Will become extremely complicated to maintain when business rules or business structure changes. Helpful for a very small business but not in the case of businesses which want to expand their Business Intelligence infrastructure. Historization can become a mess if the star schema is the DWH main model.
Data Vault
Has a medium time-to-market. Is easier to understand than 3NF but can be puzzling for people used to a star schema. Automatically historized, parallelizable and very flexible for changing business needs, because the business rules are implemented downstream. Scales quickly.
Anchor Modelling
Another highly flexible approach, which I haven't used yet. It is in some respects similar to Data Vault, but with some differences.
Presentation model
Now, to represent the never-touched-again data in the DWH layer, nothing fits better than Star Schema. Also, while creating the star schema, you can implement business logic.
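As a rough illustration of that presentation layer, here is a minimal, hypothetical star-schema sketch in generic SQL (all table and column names are made up, loosely following the weekly sample reports described in the question):

```sql
-- Dimension tables.
CREATE TABLE dim_date (
    date_key      INT PRIMARY KEY,   -- e.g. 20160824
    calendar_date DATE NOT NULL,
    week_no       INT  NOT NULL,
    month_no      INT  NOT NULL,
    year_no       INT  NOT NULL
);

CREATE TABLE dim_quality (
    quality_key   INT PRIMARY KEY,
    quality_name  VARCHAR(50) NOT NULL
);

-- Fact table, populated from the DWH layer.
CREATE TABLE fact_sample (
    date_key      INT NOT NULL REFERENCES dim_date (date_key),
    quality_key   INT NOT NULL REFERENCES dim_quality (quality_key),
    sample_count  INT NOT NULL,
    sample_value  DECIMAL(18,4) NOT NULL
);

-- A "weekly average sample value by quality" graph then maps to a simple query:
SELECT d.week_no, q.quality_name, AVG(f.sample_value) AS avg_sample_value
FROM fact_sample f
JOIN dim_date d    ON d.date_key = f.date_key
JOIN dim_quality q ON q.quality_key = f.quality_key
GROUP BY d.week_no, q.quality_name;
```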
Front end
Shouldn't matter, take the tool you like.
In your case, it would be smart to implement a DWH (using one of those models) and put the presentation model on top of it. If any problems are in the star schema, you could always re-generate it with the new changes.
NOTE: If you use a star schema as the DWH model, you cannot re-create the star schema in the presentation layer without some complex transformation logic to begin with.
NOTE: Also, sometimes the star schema itself is presented as the DWH. I don't think that is a good fit for any requirement which could become more complex.
EDIT
To clarify my last note, see this blog post: http://www.tobiasmaasland.de/2016/08/24/why-your-data-warehouse-is-not-a-data-warehouse/
I'm hoping to get some help choosing a database and layout well suited to a web application I have to write (outlined below). I'm a bit stumped given the large number of records and the fact that they need to be queryable in any manner.
The web app will basically allow querying of a large number of records using any combination of the criteria that make up a record; date is the only mandatory item. A record consists of only eight items (below), but there will be about three million new records a day, with very few duplicates. Data will be inserted into the database in real time throughout the current day.
I know the biggest interest will be in the last 6 months to 1 year's worth of data, but the rest will still need to be available for the same type of queries.
I'm not sure which database is best suited for this, nor how to structure it. The database will be on a reasonably powerful server. I basically want to start with a good DB design and see how the queries perform. I can then judge whether I'd rather do optimizations or throw more powerful hardware at it. I just don't want to have to redo the base DB design; it's fine if we end up doing a lot of optimization initially, since we have time but not money.
We need to use something open source, not something like Oracle. Right now I'm leaning towards Postgres.
A record consists of:
1 Date
2 unsigned integer
3 unsigned integer
4 unsigned integer
5 unsigned integer
6 unsigned integer
7 Text 16 chars
8 Text 255 chars
I'm planning on creating yearly schemas, monthly tables, and indexing the record tables on date for sure.
I'll probably be able to add another index or two after I analyze usage patterns to see what the most popular queries are. I can do lots of tricks on the app side, such as caching popular queries; it's really the DB side I need assistance with. Field 8 will have some duplicate values, so I'm planning on having that column be an ID into a lookup table to join on. Beyond that, I guess the remaining fields will all be in one monthly table...
I could break it into weekly tables as well, I suppose, and use a view for queries so the app doesn't have to deal with assembling a complex query...
anyway, thanks very much for any feedback or assistance!
Some brief advice ...
3 million records a day is a lot! (At least I think so; others might not even blink at that.) I would try to write a tool to insert dummy records and see how something like Postgres performs with a month's worth of data.
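For example, in Postgres you could generate a day's worth of dummy rows with something like the following (a rough sketch; the table and column names are assumptions based on the record layout above):

```sql
-- Hypothetical test table matching the eight-field record described in the question.
CREATE TABLE records_test (
    rec_date date        NOT NULL,
    f2       integer     NOT NULL,
    f3       integer     NOT NULL,
    f4       integer     NOT NULL,
    f5       integer     NOT NULL,
    f6       integer     NOT NULL,
    f7       varchar(16) NOT NULL,
    f8_id    integer     NOT NULL   -- lookup id standing in for the 255-char text field
);

-- One day's worth of random data (3 million rows); repeat per day (or loop) for a month.
INSERT INTO records_test
SELECT DATE '2011-01-01',
       (random() * 1000000)::int,
       (random() * 1000000)::int,
       (random() * 1000000)::int,
       (random() * 1000000)::int,
       (random() * 1000000)::int,
       substr(md5(random()::text), 1, 16),
       (random() * 10000)::int
FROM generate_series(1, 3000000);
```

Running the intended queries against 30 days of that should tell you fairly quickly whether plain Postgres with sensible indexes is enough.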
It might be best to look into NoSQL solutions, which give you open source plus scalability. Look at Couchbase and Mongo to start. If you are keeping a month's worth of data online for real-time querying, I'm not sure how Postgres will handle 90 million records. Maybe great, maybe not.
Consider having "offline" databases in whatever system you decide on. You keep the real-time data on the best machines, ready to go, but move older data out to another server that is cheaper (read: slower). This way you can always answer queries, but some are faster than others.
In my experience, using primarily Oracle with a similar record insert frequency (several ~billion row tables), you can achieve good web app query performance by carefully partitioning your data (probably by date, in your case) and indexing your tables. How exactly you approach your database architecture will depend on a lot of factors, but there are plenty of good resources on the web for getting help with this stuff.
It sounds like your database is relatively flat, so perhaps another database solution would be better, but Oracle has always worked well for me.
We are developing an application that processes some codes and outputs a large number of rows each time (millions!). We want to save these rows in a database because the processing itself may take a couple of hours to complete.
1. What is the best way to save these records?
2. Is a NoSQL solution usable here?
Assume that we are saving five million records per day and only retrieving from them once in a while.
It depends very much on how you intend to use the data after it is generated. If you will only be looking it up by primary key then NoSQL will probably be fine, but if you ever want to search or sort the data (or join rows together) then an SQL database will probably work better.
Basically, NoSQL is really good at stuffing opaque data into a store and retrieving any individual item very quickly. Relational databases are really good at indexing data that may be joined together or searched.
Any modern SQL database will easily handle 5 million rows per day - disk space is more likely to be your bottleneck, depending on how big your rows are. I haven't done a lot with NoSQL, but I'd be surprised if 5 million items per day would cause a problem.
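For a rough sense of scale (assuming, purely for illustration, an average row size of around 200 bytes): 5,000,000 rows/day × 200 bytes ≈ 1 GB/day, or roughly 365 GB/year before indexes and storage overhead, so plan disk capacity around your actual row size and retention period.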
It depends on exactly what kind of data you want to store - could you elaborate on that? If the data is neatly structured into tables then you don't necessarily need a NoSQL approach. If, however, your data has a graph or network-like structure to it, then you should consider a NoSQL solution. If the latter is true for you, then maybe the following will be helpful to give you an overview of some of the NoSQL databases: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
I have a database that increases every month. The schema remains the same, so I think I use one of these two methods:
Use only one table; new data will be appended to this table and identified by a date column. The data grows by about 20,000 rows per month, but in the long term I think searching and analyzing this data could become a problem.
Dynamically create one table per month, with the table name indicating which data it contains (for example, Usage-20101125). This forces us to use dynamic SQL, but in the long term it seems fine.
I must confess that I have no experience designing this kind of database. Which one should I use in the real world?
Thank you so much
20 000 rows per month is not a lot. Go with your first option. You didn't mention which database you'll be using, but SQL Server, Oracle, Sybase and PostgreSQL, to name just a few, can handle millions of rows comfortably.
You will need to investigate a proper maintenance plan, including indexing and statistics, but that will come with lots of reading and experience.
Look into partitioning your table.
That way you can physically store the data on different disks for performance, while logically it remains one table, so your database stays well designed.
I want to maintain the last ten years of stock market data in a single table. Certain analyses need only the last one month of data, and when I do this short-term analysis it takes a long time to complete.
To overcome this I created another table holding only the current year's data. When I perform the analysis against this table it is 20 times faster than against the previous one.
Now my question is:
Is a separate table the right way to handle this kind of problem? (Or should we use a separate database instead of a separate table?)
If I have a separate table, is there any way to update the secondary table automatically?
Or can we use something like a materialized view to gain performance?
Note: I'm using a PostgreSQL database.
You want table partitioning. This will automatically split the data between multiple tables, and will in general work much better than doing it by hand.
I'm working on near the exact same issue.
Table partitioning is definitely the way to go here. I would segment by more than year, though; it gives you a greater degree of control. Just set up your partitions and then constrain them by month (or some other date range). In your postgresql.conf you'll need to turn constraint_exclusion = on to really get the benefit. The additional benefit is that you index only the exact tables you really want to pull information from. If you're batch importing large amounts of data into this table, you may get slightly better results with a rule than with a trigger, and for partitioning I find rules easier to maintain; but for smaller transactions, triggers are much faster. The PostgreSQL manual has a great section on partitioning via inheritance.
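A minimal sketch of that inheritance-based setup (pre-declarative-partitioning PostgreSQL; all table and column names are assumptions):

```sql
-- Parent table; the child tables hold the actual rows.
CREATE TABLE ticks (
    trade_date date    NOT NULL,
    symbol     text    NOT NULL,
    price      numeric NOT NULL
);

-- One child per month, with a CHECK constraint the planner can use to skip it.
CREATE TABLE ticks_2013_01 (
    CHECK (trade_date >= DATE '2013-01-01' AND trade_date < DATE '2013-02-01')
) INHERITS (ticks);

CREATE TABLE ticks_2013_02 (
    CHECK (trade_date >= DATE '2013-02-01' AND trade_date < DATE '2013-03-01')
) INHERITS (ticks);

-- Index each child on the column you filter by.
CREATE INDEX ticks_2013_01_date_idx ON ticks_2013_01 (trade_date);
CREATE INDEX ticks_2013_02_date_idx ON ticks_2013_02 (trade_date);

-- Trigger that routes inserts on the parent into the right child.
CREATE OR REPLACE FUNCTION ticks_insert_router() RETURNS trigger AS $$
BEGIN
    IF NEW.trade_date >= DATE '2013-02-01' AND NEW.trade_date < DATE '2013-03-01' THEN
        INSERT INTO ticks_2013_02 VALUES (NEW.*);
    ELSE
        INSERT INTO ticks_2013_01 VALUES (NEW.*);
    END IF;
    RETURN NULL;  -- the row is stored in a child, not in the parent
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ticks_insert_trigger
    BEFORE INSERT ON ticks
    FOR EACH ROW EXECUTE PROCEDURE ticks_insert_router();
```

With constraint_exclusion enabled, a query such as SELECT avg(price) FROM ticks WHERE trade_date >= DATE '2013-02-01' scans only the matching child table.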
I'm not sure about PostgreSQL, but I can confirm that you are on the right track. When dealing with large data volumes partitioning data into multiple tables and then using some kind of query generator to build your queries is absolutely the right way to go. This approach is well established in Data Warehousing, and specifically in your case stock market data.
However, I'm curious why you need to update your historical data. If you're dealing with stock splits, it's common to implement that using a separate multiplier table that is used in conjunction with the raw historical data to give an accurate price per share.
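As a rough illustration of that multiplier approach (hypothetical names, reusing a ticks table like the one sketched above, and assuming non-overlapping validity ranges so at most one multiplier applies per trade):

```sql
-- Split adjustments live in their own table; raw prices are never rewritten.
CREATE TABLE split_multiplier (
    symbol     text    NOT NULL,
    valid_from date    NOT NULL,
    valid_to   date    NOT NULL,
    multiplier numeric NOT NULL   -- cumulative adjustment for raw prices in this range
);

-- Adjusted price per share = raw price * applicable multiplier (1.0 if none applies).
SELECT t.symbol,
       t.trade_date,
       t.price * COALESCE(m.multiplier, 1.0) AS adjusted_price
FROM ticks t
LEFT JOIN split_multiplier m
       ON m.symbol = t.symbol
      AND t.trade_date >= m.valid_from
      AND t.trade_date <  m.valid_to;
```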
It is perfectly sensible to use a separate table for historical records. It's much more problematic with a separate database, as it's not simple to write cross-database queries.
Automatic updates - that's a job for cron.
You can use partial indexes for such things - they do a wonderful job (see the sketch below).
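For instance, assuming a single stock_history(symbol, trade_date, price) table (a hypothetical name), a partial index covering only the recent slice you analyze most could look like this:

```sql
-- Index only the rows from the period the short-term analysis actually touches.
CREATE INDEX stock_history_recent_idx
    ON stock_history (symbol, trade_date)
    WHERE trade_date >= DATE '2013-01-01';

-- Queries constrained to the same period can then use the small partial index:
SELECT symbol, avg(price)
FROM stock_history
WHERE trade_date >= DATE '2013-02-01'
GROUP BY symbol;
```

The partial index stays small because it covers only the recent rows, so it is cheap to maintain even though the table holds ten years of data.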
Frankly, you should check your execution plans and try fixing your queries or indexing before taking more radical steps.
Indexing comes at very little cost (unless you do a lot of insertions), and your existing code will be faster (if you index properly) without modifying it.
Other measures such as partitioning come after that...