Can a temporal table be used to replace SCD Type 2 in a data warehouse?
I am using temporal tables in Azure SQL Database.
Typically no. Temporal tables are a good fit for staging tables, and can be used as a source to create slowly-changing dimensions if needed in a dimensional model.
The whole point of a dimensional model is to make writing queries easy. With an SCD, the fact table still has a simple single-column foreign key to the dimension table, so you get historically accurate dimension values for each fact row without complicating the queries.
To get the same result from a temporal table you'd have to join both the main table and the archive table, and filter them on both the business key of the dimension and the validfrom and validto dates.
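As a rough sketch of the difference (the table and column names below are invented for illustration, not taken from your database):

    -- SCD Type 2 dimension: the fact row already carries the right surrogate key,
    -- so the join is a single equality and needs no date filtering.
    SELECT f.SalesAmount, d.CustomerSegment
    FROM FactSales AS f
    JOIN DimCustomer AS d
      ON d.CustomerKey = f.CustomerKey;

    -- Temporal table instead: every query must pick the row version that was
    -- valid when the fact occurred, joining on the business key plus the period.
    SELECT f.SalesAmount, c.CustomerSegment
    FROM FactSales AS f
    JOIN dbo.Customer FOR SYSTEM_TIME ALL AS c   -- current table + history table
      ON  c.CustomerID = f.CustomerID            -- business key
      AND f.OrderDate >= c.ValidFrom
      AND f.OrderDate <  c.ValidTo;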
Also, temporal tables only support system versioning, which means that validfrom and validto always refer to the clock of the database server. In a data warehouse you might want to use some other temporal reference in your data to model your SCD.
I want to design a data warehouse (data mart) with one fact table and two dimension tables, where the data mart takes some slowly changing dimensions into consideration, with surrogate keys. I'm wondering how I can model this so that data insertion into the dimension tables can be made independent of the fact table (i.e., dimension rows can be inserted before the corresponding fact rows exist). The data will be streamed from Pub/Sub to BigQuery via Dataflow, so some of the dimension data might arrive earlier and need to be inserted into the dimension tables before the fact data.
I don't completely understand your question. Dimensions are always (or rather, almost always) populated before fact tables are, since fact table records refer to dimensions (and not the other way around).
If you're worried about being able to destroy and rebuild your dimension table without having to also rebuild your fact table, then you'll need to use some sort of surrogate key pipeline to maintain your surrogate key to natural key relationships. But again, I'm not sure that this is what you're asking.
BigQuery does not perform referential integrity checks, which means it will not check whether a parent row exists in the dimension table when inserting a child row into the fact table, and you don't need such checks in a data analytics setup. You can keep appending records to the fact table and the dimension tables independently in BigQuery.
Flatten / denormalise the data and keep the dimension attributes in the fact table - repeated records are not going to be an issue in BigQuery, and you can make use of clustering and partitioning (see the sketch after these options).
Another option: if you have the dimensions in an RDBMS, upload the dimension tables as files to Cloud Storage / rows to Cloud SQL and join them in Dataflow. In that case you can skip having multiple sinks and write a flattened schema into a single BigQuery table sink.
Insertion order does not matter in BigQuery; you can reference event records based on the Pub/Sub message publish time, source event time, etc.
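A rough sketch of the flattened option in BigQuery standard SQL (the dataset, table and column names are made up):

    -- Denormalised fact table: dimension attributes repeated on every row,
    -- partitioned and clustered so the repeated data still prunes well.
    CREATE TABLE my_dataset.fact_events
    (
      event_ts       TIMESTAMP,
      event_amount   NUMERIC,
      customer_id    STRING,    -- dimension attributes kept inline
      customer_name  STRING,
      product_name   STRING
    )
    PARTITION BY DATE(event_ts)
    CLUSTER BY customer_id, product_name;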
Say I want to implement an SCD Type 2 history dimension table (or should I say a table with SCD Type 2 attributes) in the DWH, which so far I have been implementing as a "usual table" with a natural key + primary surrogate key + datefrom + dateto + iscurrent additional columns,
where
the primary surrogate key is needed in order to use it as a foreign key in all fact tables and
datefrom + dateto + iscurrent columns are needed in order to track history.
Now I want to use a system-versioned temporal table in the fact-dimension DWH design, but MSDN says that:
A temporal table must have a primary key defined in order to correlate
records between the current table and the history table, and the
history table cannot have a primary key defined.
So it looks like I should use a view that generates a primary surrogate key on the fly, or another ETL process, but I don't like either idea...
Maybe there is another way?
You would use a temporal table in the persistent staging area of your data warehouse. Then you can simply apply changes from the source systems and not lose any historical versions.
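For example, a persistent staging table might be declared as a system-versioned temporal table like this (all names here are illustrative only):

    CREATE TABLE stg.Employee
    (
        EmployeeID  int            NOT NULL PRIMARY KEY,   -- natural/business key
        Department  varchar(50)    NOT NULL,
        Salary      decimal(10,2)  NOT NULL,
        ValidFrom   datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
        ValidTo     datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
        PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = stg.EmployeeHistory));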
Then when you are querying, or when building a dimensional datamart, you can join facts to the current or to the historical version of a dimension. Note that you do not need surrogate keys to do this, but you can produce them to simplify and optimize querying the dimensional model. You can generate the surrogate key with an expression like
ROW_NUMBER () OVER (ORDER BY EmployeeID, ValidTo) AS EmployeeKey
And then join the dimension table when loading the fact table, as usual.
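Put together, a sketch of such a dimension query over the staging table from the example above (again, the names are illustrative):

    -- All row versions (current + history) of the staging table become
    -- SCD2-style dimension rows, each with a generated surrogate key.
    SELECT
        ROW_NUMBER() OVER (ORDER BY e.EmployeeID, e.ValidTo) AS EmployeeKey,  -- surrogate key
        e.EmployeeID,                                                         -- business key
        e.Department,
        e.Salary,
        e.ValidFrom,
        e.ValidTo
    FROM stg.Employee FOR SYSTEM_TIME ALL AS e;

    -- During the fact load, look the key up by business key and event date:
    --   JOIN dim_Employee AS d
    --     ON d.EmployeeID = src.EmployeeID
    --    AND src.EventDate >= d.ValidFrom
    --    AND src.EventDate <  d.ValidTo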
But the interesting thing is that this lets you defer your dimensional modeling, and your choice of SCD types, until you really need them. Reducing and deferring data mart design and implementation helps you deliver incremental progress faster. You can confidently deliver an initial set of reports using views over your persistent staging area (or 'data lake', if you prefer that term) while your design thinking for the data marts evolves.
I've read a lot of tips and tutorials about normalization but I still find it hard to understand how and when we need normalization. So right now I need to know if this database design for an electricity monitoring system needs to be normalized or not.
So far I have one table with fields:
monitor_id
appliance_name
brand
ampere
uptime
power_kWh
price_kWh
status (ON/OFF)
This monitoring system monitors multiple appliances (TV, Fridge, washing machine) separately.
So does it need to be normalized further? If so, how?
Honestly, you can get away without normalizing every database. Normalization pays off when the database is a project that affects many people, or when there are performance issues and the database handles OLTP workloads. Database normalization largely boils down to having a larger number of tables, each with fewer columns. Denormalization means having fewer tables with more columns.
I've never seen a real database with only one table, but that's okay. Some people denormalize their databases for reporting purposes, so it isn't always necessary to normalize a database.
How do you normalize it? You need a primary key (a column that is unique, or a combination of two or more columns that is unique in its combined form). You would then create another table and set up a foreign key relationship. A foreign key is a column (or set of columns) in one table that references the primary key of another table; the columns need to share the same data type. Foreign keys act as a map from one table to another, and the tables are usually separated by real-world purpose.
For example, you could have a table with status, uptime and monitor_id, with a foreign key relationship back to the first table on monitor_id; your original table could then drop the uptime and status columns. You could have a third table with brands, models and the things that all units of a model have in common (e.g., power_kWh, ampere, etc.), with a foreign key relationship to the first table based on model. The brand column could then be dropped from the first table, since the third table provides it via the model name.
To create the new tables you invoke the DDL command CREATE TABLE, with a foreign key on the column that is shared between the new table and the original table. With foreign key constraints, the new tables share a column with the original. Highly normalized tables each hold less information (fewer columns), but there are more tables to store all the data. This way you can update one table without locking everything else, as you would in a denormalized database with one big table.
Once the new tables contain the data from the original table's columns, you can drop those columns from the original table (except for the foreign key column). Dropping columns is also DDL, e.g. ALTER TABLE originalTable DROP COLUMN brand.
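A sketch of what that split could look like in SQL (the table names and column types are guesses, not taken from the question):

    -- Attributes shared by every unit of a model
    CREATE TABLE model_spec (
        model      VARCHAR(100) PRIMARY KEY,
        brand      VARCHAR(100),
        ampere     DECIMAL(6,2),
        power_kWh  DECIMAL(10,4)
    );

    -- One row per monitored appliance
    CREATE TABLE appliance (
        monitor_id      INT PRIMARY KEY,
        appliance_name  VARCHAR(100),
        model           VARCHAR(100) REFERENCES model_spec (model),    -- foreign key
        price_kWh       DECIMAL(10,4)
    );

    -- The frequently changing readings
    CREATE TABLE appliance_status (
        monitor_id  INT PRIMARY KEY REFERENCES appliance (monitor_id), -- foreign key
        status      VARCHAR(3),       -- 'ON' / 'OFF'
        uptime      INT
    );

    -- After copying the data across, drop the duplicated columns from the
    -- original wide table, e.g.:
    -- ALTER TABLE monitoring DROP COLUMN brand;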
In many ways, performance will improve if you do many reads and writes (commit many transactions) against a normalized database. But if you use the table for reporting and want to present all the data just as it sits in one table, normalization will hurt performance.
By the way, normalizing the database can prevent redundant data. This can make the database consume less storage space and use less memory.
It is nice to have your database normalized. It gives you efficient data because you prevent redundancy, and it also saves on storage and memory usage. When normalizing tables, each table needs a primary key, which you use to connect it to other tables; when a table's primary key (unique in that table) appears in another table, it is called a foreign key (used to connect the two tables).
For example, say you already have this table :
Table name : appliances_tbl
-inside here you have
-appliance_id : as the primary key
-appliance_name
-brand
-model
and so on about this appliance...
Next you have another table :
Table name : appliance_info_tbl (anything for a table name and must be related to its fields)
-appliance_info_id : primary key
-appliance_price
-appliance_uptime
-appliance_description
-appliance_id : foreign key (so you can get the name of the appliance by using only its id)
and so on....
You can add more tables like that; just make sure that you have a primary key in each table. You can also note the cardinalities to make your normalization easier to understand.
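Expressed in SQL, those two tables might look like this (the column types are assumptions):

    CREATE TABLE appliances_tbl (
        appliance_id    INT PRIMARY KEY,     -- primary key
        appliance_name  VARCHAR(100),
        brand           VARCHAR(100),
        model           VARCHAR(100)
    );

    CREATE TABLE appliance_info_tbl (
        appliance_info_id      INT PRIMARY KEY,                              -- primary key
        appliance_price        DECIMAL(10,2),
        appliance_uptime       INT,
        appliance_description  VARCHAR(255),
        appliance_id           INT REFERENCES appliances_tbl (appliance_id)  -- foreign key
    );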
We're a manufacturing company, and we've hired a couple of data scientists to look for patterns and correlation in our manufacturing data. We want to give them a copy of our reporting database (SQL 2014), but it must be in a 'sanitized' form. This means that all table names get converted to 'Table1', 'Table2' etc., and column names in each table become 'Column1', 'Column2' etc. There will be roughly 100 tables, some having 30+ columns, and some tables have 2B+ rows.
I know there is a hard way to do this. This would be to manually create each table, with the sanitized table name and column names, and then use something like SSIS to bulk insert the rows from one table to another. This would be rather time consuming and tedious because of the manual SSIS column mapping required, and manual setup of each table.
I'm hoping someone has done something like this before and has a much faster, more efficient way.
By the way, the 'sanitized' database will have no indexes or foreign keys. Also, it may not seem to make any sense why we would want to do this, but this is what was agreed between our Director of Manufacturing and the data scientists for the first round of analysis (there will be many rounds).
You basically want to scrub the data and objects, correct? Here is what I would do.
1. Restore a backup of the db.
2. Drop all objects not needed (indexes, constraints, stored procedures, views, functions, triggers, etc.).
3. Create a mapping table with two columns and populate it: each row holds the original table name and the new table name.
4. Write a script that iterates through that table, row by row, and renames your tables. Better yet, put the data into Excel and add a third column that builds the T-SQL you want to run, then cut/paste and execute it in SSMS.
5. Repeat step 4, but for all columns. It's best to query sys.columns to get all the objects you need, put them in Excel, and build your T-SQL (a sketch of such a script follows below).
6. Repeat again for any other objects that need it.
Backup/restore will be quicker than dabbling in SSIS and data transfer.
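A sketch of the script described in steps 4-5, using sys.tables and sys.columns to generate the sp_rename statements (run the generated column renames before the table renames, since both scripts are built from the original table names):

    -- Generate column renames first: Column1, Column2, ... per table.
    SELECT 'EXEC sp_rename ''' + s.name + '.' + t.name + '.' + c.name
           + ''', ''Column'
           + CAST(ROW_NUMBER() OVER (PARTITION BY t.object_id ORDER BY c.column_id) AS varchar(10))
           + ''', ''COLUMN'';'
    FROM sys.columns AS c
    JOIN sys.tables  AS t ON t.object_id = c.object_id
    JOIN sys.schemas AS s ON s.schema_id = t.schema_id;

    -- Then generate table renames: Table1, Table2, ...
    SELECT 'EXEC sp_rename ''' + s.name + '.' + t.name
           + ''', ''Table'
           + CAST(ROW_NUMBER() OVER (ORDER BY t.name) AS varchar(10)) + ''';'
    FROM sys.tables  AS t
    JOIN sys.schemas AS s ON s.schema_id = t.schema_id;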
They can see the data but they can't see the column names? What can that possibly accomplish? What are you protecting by not revealing the table or column names? How is a data scientist supposed to evaluate data without context? Without a foreign key, all I see is a bunch of numbers in a column named colx. What are you expecting to accomplish? Get a confidentiality agreement instead. Consider FK columns: a customerID versus a materialID. The patterns have widely different meanings and analyses. I would correlate a quality measure with a materialID or shiftID, but not with a customerID.
Oh look, there is a correlation between tableA.colB and tableX.colY. Well yes, that customer is a college team and they use aluminum bats.
On top of that, you strip the indexes (on tables with 2B+ rows), so the analysis they run will be slow. What does that accomplish?
As for the question as stated: do a backup/restore. Using the system tables, drop all triggers, foreign keys, indexes, and constraints. Don't forget the triggers and constraints - their definitions may disclose trade secrets. Then rename the columns, and then the tables.
I am looking for ideas to populate a fact table in a data mart. Let's say I have the following dimensions:
Physician
Patient
date
geo_location
patient_demography
test
I have used two ETL tools to populate the dimension tables: Pentaho and Oracle Warehouse Builder. The date, patient demography and geo location dimensions do not pull data from the operational store. All dimension tables have their own new surrogate keys.
I now want to populate the fact table with the details of a visit by a patient. When a patient visits a physician on a particular date, a test is ordered; this is the info in the fact table. There are other measures too that I am omitting for simplicity.
I can create a single join with all the required columns for the fact table from the source system, but I need to store the keys from the dimension tables for Patient, Physician, test, etc. What is the best way to achieve this?
Can ETL tools help in this?
Thank You
Krishna
Each dimension table should have a BusinessKey that uniquely identifies the object (person, date, location) that a table row describes. During loading of the fact table, you have to look up the PrimaryKey from the dimension table based on the BusinessKey. You can either look up the dimension table directly, or create a key-lookup table for each dimension just before loading the fact table.
Pentaho Kettle has the "Database Value Lookup" transformation step for this purpose. You may also want to look at the "Delivering Fact Tables" section of Kimball's Data Warehouse ETL Toolkit.
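In SQL terms, the lookup during the fact load amounts to something like this sketch (all table and column names are invented for illustration):

    -- stg_visits holds the rows extracted from the operational store;
    -- each dim_* table maps a BusinessKey to its surrogate key.
    INSERT INTO fact_visit (patient_key, physician_key, date_key, test_key, visit_count)
    SELECT dp.patient_key,
           dph.physician_key,
           dd.date_key,
           dt.test_key,
           1
    FROM stg_visits AS v
    JOIN dim_patient   AS dp  ON dp.patient_business_key    = v.patient_id
    JOIN dim_physician AS dph ON dph.physician_business_key = v.physician_id
    JOIN dim_date      AS dd  ON dd.calendar_date           = v.visit_date
    JOIN dim_test      AS dt  ON dt.test_business_key       = v.test_code;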