I have time series data in a relational database (Postgres). Data is imported into the database every 5 minutes, but the input gets overwritten during the day, meaning that at the end of the day there is only one record per day for a specific id (id and date form a composite primary key).
The current process is like this: data comes in and is evaluated the same way, 1:1 (data lands in every table exactly as it is in the source, so there is a lot of redundancy).
3 problems:
Currently, getting data out of the database (reading) is fast.
The frontend queries this database and shows the data, and the queries return very quickly. If I normalize, reads become slower, but writing and updating become easier.
How can I optimize this database?
Missing data (ignore this problem).
If we were able to store more records per day (the history of one ID at different points in time each day), we could show a comparison of two points in time within a day. Can the database support a huge amount of data every day?
DWH
There is just one source; all data comes from it. Can we have a DWH for it, or is there no need for one since there is only a single source?
Edit:
How can I optimise this database?
Currently there is only one schema in the database. Data comes in and is evaluated the same way, 1:1. Writing is hard since we have redundancy.
my solution:
I want to create 3 schemas for this database.
Schema 1: for inserting data into tables whose structure is based on the data source. (I assume data remains here temporarily and is then transferred to the second schema.)
Schema 2: incoming data is stored here, structured in 3NF.
Schema 3: the data is denormalised again because we need fast queries (fast reading is required).
Your three schema model is exactly how this has been done for many years.
Schema 1:
Names: Staging/Landing/Ingestion
The schema matches the source system, but it is cleared and reloaded for every load batch. It typically has a "looser" schema definition to allow for import and capture of bad data.
Schema 2:
Names: Replica/ODS/Persisted data store
Schema 2 is never cleared, it's permanent. Following a data load, this layer should look exactly like your source systems. Data in schema 1 is "merged" into schema 2 each time. For example, on a daily load cycle, schema 1 just contains that day's data but schema 2 contains the entire history of data loaded. Reference data is merged on a known primary key. Transactional data might be merged on a key or it might be merged on a "windowing" basis - i.e. delete the last day's data from schema 2 and load schema 1 in.
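The key-based merge described above can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Postgres (the INSERT ... ON CONFLICT statement has the same shape there), and the table names staging_reading and ods_reading, the columns, and the sample rows are all made up for the example.

```python
import sqlite3


def merge_staging(conn):
    """Merge all staging rows into the permanent table, then clear staging."""
    conn.execute(
        """
        INSERT INTO ods_reading (id, day, value)
        SELECT id, day, value FROM staging_reading WHERE true
        ON CONFLICT (id, day) DO UPDATE SET value = excluded.value
        """
    )
    # Staging only ever holds the current batch.
    conn.execute("DELETE FROM staging_reading")
    conn.commit()


def load_batch(conn, rows):
    """Simulate one load cycle: fill staging, then merge it."""
    conn.executemany("INSERT INTO staging_reading VALUES (?, ?, ?)", rows)
    merge_staging(conn)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- hypothetical staging table: loose definition, no keys
        CREATE TABLE staging_reading (id INTEGER, day TEXT, value REAL);
        -- hypothetical permanent table with the composite key (id, day)
        CREATE TABLE ods_reading (
            id INTEGER NOT NULL,
            day TEXT NOT NULL,
            value REAL,
            PRIMARY KEY (id, day)
        );
        """
    )
    # First batch: two new rows.
    load_batch(conn, [(1, "2024-01-01", 10.0), (2, "2024-01-01", 20.0)])
    # Second batch: one new row and one correction to an existing key.
    load_batch(conn, [(1, "2024-01-02", 11.0), (2, "2024-01-01", 25.0)])
    return conn.execute(
        "SELECT id, day, value FROM ods_reading ORDER BY id, day"
    ).fetchall()
```

After the two batches the permanent table holds three rows: the corrected row replaced its earlier version, and staging is empty again. A "windowing" merge would instead delete the affected date range from the permanent table before inserting.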
Some people like to have a "point in time view" where they can recreate what the source system looked like at a historical point in time. I've never seen anyone use that though.
Schema 3:
Names: Business Layer/Star Schema/Reporting Layer/Datamart/Semantic Layer
Layer 2 is usually a replica of an OLTP data model (OLTP is optimised for entering data). In this layer it is transformed into a data model that is optimised for reporting.
The tried and tested data model here is a star schema. It's been around for decades. If you research any reporting tool (e.g. Power BI), they all say that the preferred data model to report from is a star schema. Yes, a star schema is denormalised, and it has other benefits beyond performance: for example, it is more easily understood by a business user, supports slowly changing dimensions, etc.
All these concepts are explained further online, but if you have any specific questions I'm happy to expand further.
(table: order_items)
I'm not sure if this is the correct way to implement an order history table in my database. Normally, I'm trying to reduce the redundancy. But because the user can change data in his/her offer, I need to save the minimum information of the order.
Goal: Buyer can see his/her old orders with correct title/pictures/origin path/allergens (long story...)
What speaks against my approach?
The only "fear" is that the table is going to be bloated with a lot of redundant information.
This started out as a comment but it's getting too long, so...
What database are you working with?
SQL Server, for instance, introduced the concept of temporal tables in the 2016 version. Basically you have two tables identical in structure, where one is the main table where you can use DML just as you would with a normal table, and the other is a read-only table that stores the historical data - so when you update a record in the main table, what actually happens is that the record gets copied into the history table first, and updated afterwards.
Something similar might exist in other databases as well, and it can also be implemented manually quite easily using triggers in case your database does not provide it out of the box.
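As a rough sketch of the manual, trigger-based variant: the example below uses Python's built-in sqlite3 module only because it runs anywhere, and the table names offer and offer_history, the trigger name, and the columns are invented for illustration. Trigger syntax differs between database products, so treat it as the general shape rather than something to copy as is.

```python
import sqlite3

# Hypothetical schema: a main table plus a history table with the same
# columns and a timestamp recording when the old version was replaced.
SCHEMA = """
CREATE TABLE offer (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    price REAL NOT NULL
);
CREATE TABLE offer_history (
    id INTEGER NOT NULL,
    title TEXT NOT NULL,
    price REAL NOT NULL,
    changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Copy the old version of the row into the history table before each update.
CREATE TRIGGER offer_keep_history BEFORE UPDATE ON offer
BEGIN
    INSERT INTO offer_history (id, title, price)
    VALUES (OLD.id, OLD.title, OLD.price);
END;
"""


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO offer VALUES (1, 'Apple pie', 4.5)")
    conn.execute("UPDATE offer SET title = 'Apple tart' WHERE id = 1")
    conn.execute("UPDATE offer SET price = 5.0 WHERE id = 1")
    current = conn.execute("SELECT id, title, price FROM offer").fetchall()
    history = conn.execute(
        "SELECT id, title, price FROM offer_history ORDER BY rowid"
    ).fetchall()
    return current, history
```

Here the main table ends up with only the latest version of the offer, while the history table holds the two versions that were overwritten, in the order they were replaced.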
Of course, you could use the technique called "soft delete", where instead of actually deleting the data you simply mark it as deleted, and instead of updating the data you create a new record with the updated data, and change the status of the existing record to Inactive.
The major advantage of this approach over temporal tables is that you still only have one table for your entity instead of two - but on the other hand, the advantage of temporal tables is that the active data is kept in a separate table from the historical data, therefore the active data is stored in a relatively small table and, as a result, all CRUD operations are more efficient.
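For comparison, the single-table "soft delete" variant might look something like this. Again this is only a sketch using Python's sqlite3 module; the table offer_version, its status values, and the helper functions save_offer and delete_offer are hypothetical names chosen for the example.

```python
import sqlite3

# Hypothetical single table holding every version of every offer.
SCHEMA = """
CREATE TABLE offer_version (
    version_id INTEGER PRIMARY KEY AUTOINCREMENT,
    offer_id INTEGER NOT NULL,
    title TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'active'
        CHECK (status IN ('active', 'inactive', 'deleted'))
);
"""


def save_offer(conn, offer_id, title):
    """Create or 'update' an offer: retire the active row, insert a new one."""
    with conn:  # one transaction for both statements
        conn.execute(
            "UPDATE offer_version SET status = 'inactive' "
            "WHERE offer_id = ? AND status = 'active'",
            (offer_id,),
        )
        conn.execute(
            "INSERT INTO offer_version (offer_id, title) VALUES (?, ?)",
            (offer_id, title),
        )


def delete_offer(conn, offer_id):
    """Soft delete: mark the active row as deleted instead of removing it."""
    with conn:
        conn.execute(
            "UPDATE offer_version SET status = 'deleted' "
            "WHERE offer_id = ? AND status = 'active'",
            (offer_id,),
        )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    save_offer(conn, 7, "Apple pie")
    save_offer(conn, 7, "Apple tart")
    delete_offer(conn, 7)
    return conn.execute(
        "SELECT title, status FROM offer_version ORDER BY version_id"
    ).fetchall()
```

Every query for live data then has to filter on status = 'active', which is the price paid for keeping everything in one table.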
The "fear" of having a bloated table in this day and age, when memory and storage are so cheap, seems a bit strange to me.
Consider a database with several (3-4) tables with a lot of columns (from 15 to 40). In each table we have several thousand records generated per year and about a dozen of changes made for each record.
Right now we need to add the following functionality to our system: every time a user makes a change to a record in one of our tables, the system needs to keep track of it - we need to have a complete history of changes and also be able to restore row data to a selected point.
For some reasons we cannot keep "final" and "historic" data in the same table (so we cannot add some columns to our tables to keep some kind of versioning information, i.e. like wordpress does when it comes to keeping edit history of posts).
What would be best approach to this problem? I was thinking about two solutions:
For each tracked table we have a mirror table with the same columns, and with additional columns where we keep information about versions (i.e. timestamps, id of "original" row etc...)
Pros:
we have data stored exactly in the same way it was in original tables
whenever we need to add a new column to the original table, we can do the same to mirror table
Cons:
we need to create one additional mirror table for each tracked table.
We create one table for "history" revisions. We keep some revisioning information like timestamps etc., and we also keep track of which table the data originates from. But the original data row is stored in a large text column as JSON.
Pros:
we have only one history table for all tracked tables
we don't need to create new mirror tables every time we add a new tracked table
Cons:
there can be some backward compatibility issues while trying to restore data after structure of the original table was changed (i.e. new column was added)
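To make the second option and its backward compatibility concern more concrete, here is a small sketch. It is written in Python with sqlite3 rather than Laravel and MySQL purely so that it is self-contained; the revision table, its columns, and the functions record_revision and restore are hypothetical. The restore step shows one possible policy for schema drift: columns added after the revision was taken come back as None, and columns that no longer exist are dropped.

```python
import json
import sqlite3


def record_revision(conn, table, row):
    """Store a snapshot of one row (a dict with an 'id' key) as JSON."""
    cur = conn.execute(
        "INSERT INTO revision (source_table, source_id, payload) "
        "VALUES (?, ?, ?)",
        (table, row["id"], json.dumps(row)),
    )
    return cur.lastrowid


def restore(conn, revision_id, current_columns):
    """Rebuild a row dict that matches the table's current column list."""
    (payload,) = conn.execute(
        "SELECT payload FROM revision WHERE revision_id = ?", (revision_id,)
    ).fetchone()
    old = json.loads(payload)
    # Columns missing from the old snapshot become None; columns that were
    # removed from the table since then are silently ignored.
    return {col: old.get(col) for col in current_columns}


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE revision (
            revision_id INTEGER PRIMARY KEY,
            source_table TEXT NOT NULL,
            source_id INTEGER NOT NULL,
            payload TEXT NOT NULL,
            created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    # Snapshot taken while the tracked table only had id and name.
    rev = record_revision(conn, "customer", {"id": 1, "name": "Ann"})
    # Later an email column was added to the tracked table.
    return restore(conn, rev, ["id", "name", "email"])
```

Whether None is an acceptable value for a column added later (for example a NOT NULL column) is exactly the kind of issue listed under the cons above, and it has to be decided per column.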
Maybe some other solution?
What would be the best way of keeping the history of versions in such system?
Additional information:
each of the tracked tables can change in the future (i.e. new columns added),
number of tracked tables can change in the future (i.e. new tables added).
FYI: we are using Laravel 5.3 and a MySQL database.
How often do you need access to the auditing data? Is cost of storage ever a concern? Do you need it in the same system that you need the normal data?
Basically, having a table called foo and a second table called foo_log isn't uncommon. It also lets you store foo_log somewhere different, possibly even in a secondary DB. If foo_log is on a spindle disk and foo is on flash, you still get fast reads, but you get somewhat cheaper storage for the history.
If you don't ever need to display this data, and just need it for legal reasons, or to figure out how something went wrong, the single-table isn't a terrible plan.
But if the issue is backups, which it sounds like it might be, why not just backup the MySQL database on a regular basis and store the backups elsewhere?
Here is my scenario with a SQL Server 2008 R2 database table
(Update: Migration to SQL Server 2014 SP1 is in progress, so SQL Server 2014 can be used here).
A. Maintain daily history in the table (which is a fact table)
B. Create tableau graphs using the fact and dimension tables
A few steps to follow to create the table
A copy of the table from the source database will be pushed to my SQL Server DAILY. It contains 120,000 to 130,000 rows with approximately 20 columns.
a. 1st day, we get 120,000 records, sample structure is below.
(Modified or New records are highlighted in Yellow)
Source System Data:
b. 2nd day, we get, say, 122,000 records (2,000 are newly inserted, 1,000 are modified/updated versions of the previous day's data and 119,000 are unchanged from the previous day)
c. 3rd day, we get, say, 123,000 records (1,000 are newly inserted, 1,000 are modified/updated versions of the 2nd day's data and 121,000 are unchanged from the 2nd day)
Since the daily history has to be maintained in the Fact table, within a week the table will have 1 million rows,
for 2 weeks - 2 million rows
for 1 month - 5 million rows
for 1 year - say 65 - 70 million rows
for 12 years - say 1 billion rows (1,000 million)
12 years history has to be maintained
What could be right strategy to store data in the table to handle this scenario, which should also provide sufficient performance while generating reports ?
Partitioning the table by month (each partition will contain approx. 5 million rows)?
I thought of copying only the differential data into the table daily (new and modified rows only), but it is not possible to create Tableau reports with Approach-2.
Fact Table Approaches:
Tableau graphs have to be created using the fact and dimension tables for scenarios like
Weekly Bar graph for Sample Count
Weekly (week no. on X-axis) plotter graph for average Sample values (on Y-axis)
Weekly (week no. on x-axis) average sample values (on Y-axis) by quality
How to handle this scenario ?
Please provide references on the approach to follow.
Should we create any indexes on the fact table ?
A data warehouse can handle millions of rows these days without a lot of difficulty. Many have tens of billions of rows, and then things get a little difficult. You should look at both table partitioning over time and at columnstore compression and page compression in terms of seeing what is out there. Large warehouses often use both. 2008 R2 is quite old at this point, and note that huge progress has been made in this area in current versions of SQL Server.
Use a standard fact-dimensional design, and try to avoid tweaking the actual schema with workarounds just to conserve space - that generally will bite you in the long run.
For proven, time tested designs in warehousing I like the Kimball group's patterns, e.g. The Data Warehouse Lifecycle Toolkit book.
There are a few different requirements in your case. Because of that, I suggest splitting the requirements according to the standard data warehouse three-tier model.
DWH model (delta-driven, historized, high performance)
Presentation model (Again, high performance, should fit Tableau)
Front end
DWH model
Basically, you have several different approaches here, all with their pros and cons.
3NF
Can become cumbersome down the road. Is highly flexible if being used right. Time-to-market is long (depending on complexity). Historization can become complicated.
Star Schema (for DWH storage!)
Has a very, very fast time-to-market. Will become extremely complicated to maintain when business rules or business structure changes. Helpful for a very small business but not in the case of businesses which want to expand their Business Intelligence infrastructure. Historization can become a mess if the star schema is the DWH main model.
Data Vault
Has a medium time-to-market. Is easier to understand than 3NF but can be puzzling for people used to a star schema. Automatically historized, parallelizable and very flexible for changing business needs, because the business rules are implemented downstream. Scales quickly.
Anchor Modelling
Another highly flexible approach, which I haven't used yet. It is in some ways the same approach as Data Vault, but with some differences.
Presentation model
Now, to represent the never-touched-again data in the DWH layer, nothing fits better than Star Schema. Also, while creating the star schema, you can implement business logic.
Front end
Shouldn't matter, take the tool you like.
In your case, it would be smart to implement a DWH (using one of those models) and put the presentation model on top of it. If any problems are in the star schema, you could always re-generate it with the new changes.
NOTE: If you would use a star schema as a DWH model, you cannot re-create the star schema in the presentation layer without using some complex transformation logic to begin with.
NOTE: Also, sometimes the star schema is seen as a DWH. I don't think that this is a good use for it for any requirement which could become more complex.
EDIT
To clarify my last note, see this blog post: http://www.tobiasmaasland.de/2016/08/24/why-your-data-warehouse-is-not-a-data-warehouse/
I think the question in the title speaks it all and is general.
I can give a concrete example as well:
I have tagged articles and want to find similar articles with the tags associated with them.
The score function will look at two articles and count the number of tags in common.
Since the score is not stored anywhere, I'll have to calculate the score every time I need to find similar articles given an article.
But this is too expensive.
What is the common work-around to this kind of problem in general?
Is there a better approach for my specific tag problem? (e.g. solr's moreLikeThis)
edit
I'm using postgres, if that matters.
I'm looking for a general solution that people have used successfully, such as "you should batch calculate the score and save it somewhere", etc.
The answer will vary wildly by database product and version. For example, in some database products, it may be the case that a view or an indexed view might be faster than the more common solution...
Typically the way to handle a situation like this is by precalculating the result. You can do that in a handful of ways:
a. You can use something like triggers (added in the SQL 99 standard) that update the counts as rows are added, updated or removed from the source table. In this solution, you are making a (presumably) small sacrifice on inserts, updates and deletes of the source table in order to make significant gains in retrieving the information.
b. You can use a data warehouse where you accept some level of latency between live data and reported data. That means you accept that the data queried from the data warehouse will be stale by some accepted number of minutes, hours, days, or weeks. The data warehouse works by periodically querying the live OLTP (Online Transaction Processing) data and updating the OLAP (Online Analytical Processing) database, which contains the precalculated results. You then run your reports off the OLAP data or a combination of OLTP and OLAP data. A formal data warehouse isn't required to achieve equivalent results. You could write a procedure, executed on a timer, that periodically updates a table with the latest results.
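For the tag example from the question, the periodic batch variant could be sketched like this. The sketch uses Python's sqlite3 module as a stand-in for Postgres; the tables article_tag and article_similarity and the functions refresh_scores and similar are hypothetical names, and how refresh_scores gets scheduled (cron, a job queue, ...) is left open.

```python
import sqlite3


def refresh_scores(conn):
    """Recompute the shared-tag count for every pair of articles."""
    with conn:  # replace the old results in a single transaction
        conn.execute("DELETE FROM article_similarity")
        conn.execute(
            """
            INSERT INTO article_similarity (article_a, article_b, score)
            SELECT a.article_id, b.article_id, COUNT(*)
            FROM article_tag AS a
            JOIN article_tag AS b
              ON a.tag = b.tag AND a.article_id < b.article_id
            GROUP BY a.article_id, b.article_id
            """
        )


def similar(conn, article_id):
    """Read precalculated scores; no tag counting happens at query time."""
    return conn.execute(
        """
        SELECT CASE WHEN article_a = ? THEN article_b ELSE article_a END
                   AS other, score
        FROM article_similarity
        WHERE article_a = ? OR article_b = ?
        ORDER BY score DESC, other
        """,
        (article_id, article_id, article_id),
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE article_tag (
            article_id INTEGER NOT NULL,
            tag TEXT NOT NULL,
            PRIMARY KEY (article_id, tag)
        );
        CREATE TABLE article_similarity (
            article_a INTEGER NOT NULL,
            article_b INTEGER NOT NULL,
            score INTEGER NOT NULL,
            PRIMARY KEY (article_a, article_b)
        );
        """
    )
    conn.executemany(
        "INSERT INTO article_tag VALUES (?, ?)",
        [
            (1, "sql"), (1, "postgres"), (1, "index"),
            (2, "sql"), (2, "postgres"),
            (3, "index"), (3, "python"),
        ],
    )
    refresh_scores(conn)  # in practice this would run on a timer
    return similar(conn, 1)
```

Pairs with no tag in common are simply not stored. Recomputing every pair grows quadratically with the number of articles, so for a large corpus one would probably refresh only the pairs touched by recently changed articles, or fall back to the trigger approach in (a).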
Plant data is real-time data from a plant process, such as pressure, temperature, gas flow and so on. The data model for this data typically looks like this:
(Point name, Time stamp, value (float or integer), state (int))
We have thousands of points and need to store them for a long time. Importantly, we want to be able to search them easily and quickly when we need to.
A typically search request is like:
get data order by time stamp
from database
where Point name is P001_Press
between 2010-01-01 and 2010-01-02
A database like MySQL is not suitable for us, because there are too many records and the queries are too slow.
So, how should we store data like the above, and where? Any NoSQL databases? Thanks!
This data and query pattern actually fits pretty well into a flat table in a SQL database, which means implementing it with NoSQL will be significantly more work than fixing your query performance in SQL.
If your data is inserted in real time, you can remove the order by clause, as the data will already be sorted by timestamp and there is no need to waste time re-sorting it (strictly speaking, though, SQL does not guarantee any row order without an ORDER BY). An index on point name and timestamp should get you good performance on the rest of the query.
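A minimal sketch of that flat table and index, using Python's built-in sqlite3 module as a stand-in for whichever SQL database is used. The table name sample, the index name idx_sample_point_ts, the P002_Temp point and the sample values are made up; P001_Press and the date range come from the question. The query keeps its ORDER BY here, since with an index on (point name, timestamp) the rows can be read in timestamp order without a separate sort step.

```python
import sqlite3

# Range query from the question, written with a half-open time interval.
QUERY = """
SELECT ts, value, state
FROM sample
WHERE point_name = ? AND ts >= ? AND ts < ?
ORDER BY ts
"""


def build():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sample (
            point_name TEXT NOT NULL,
            ts TEXT NOT NULL,      -- ISO 8601 text sorts chronologically
            value REAL,
            state INTEGER
        );
        CREATE INDEX idx_sample_point_ts ON sample (point_name, ts);
        """
    )
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?, ?, ?)",
        [
            # deliberately inserted out of time order
            ("P001_Press", "2010-01-01 12:00:00", 1.5, 0),
            ("P001_Press", "2010-01-01 00:00:00", 1.2, 0),
            ("P002_Temp", "2010-01-01 01:00:00", 80.0, 0),
            ("P001_Press", "2010-01-02 06:00:00", 1.7, 1),
        ],
    )
    return conn


def query_range(conn, point, start, end):
    return conn.execute(QUERY, (point, start, end)).fetchall()


def query_plan(conn, point, start, end):
    """Return the text of SQLite's query plan for the range query."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN " + QUERY, (point, start, end)
    ).fetchall()
    return " ".join(str(row[-1]) for row in rows)
```

Calling query_range(build(), "P001_Press", "2010-01-01", "2010-01-02") returns the two readings from 1 January in time order, and query_plan shows whether the index is being used. Other databases have their own EXPLAIN output, so the plan text is SQLite-specific.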
If you are really getting to the limits of what a SQL table can hold (many millions of records) you have the option of sharding - a table for each data point may work fairly well.