Can a SCD 2 table have non SCD columns? - database

Let's say there is a table whose columns are a, b, c, d, start_time, end_time, current_status.
Can we have a, b, c as SCD columns and leave d out of the SCD logic, so that a change to d does not create a new SCD row?

Yes. Treat d as a Type 1 (overwrite) column: when only d changes, the existing record is overwritten in place instead of a new version row being created.

Type 2 SCD is also called row versioning: you can track changes as version records with a current flag, active dates, and other metadata.
Do not forget that after you have implemented your chosen dimension type, you need to point your fact records at the relevant business or surrogate key. Surrogate keys in the SCD type 2 relate to a specific historical version of the record, removing join complexity from later data structures.
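As a minimal sketch of that mixed behaviour, the Python snippet below uses the standard-library sqlite3 module. The table name dim_example, the business_key column, and the surrogate key sk are assumptions added for illustration; only a, b, c, d, start_time, end_time and current_status come from the question. A change to a, b or c closes the current row and inserts a new version, while a change to d alone overwrites the current row.

```python
import sqlite3


def apply_change(conn, key, a, b, c, d, now):
    """Apply one incoming source row to the hypothetical dim_example table."""
    row = conn.execute(
        "SELECT sk, a, b, c FROM dim_example "
        "WHERE business_key = ? AND current_status = 1",
        (key,),
    ).fetchone()

    if row is not None and row[1:] == (a, b, c):
        # Tracked columns are unchanged: overwrite d in place (Type 1).
        conn.execute("UPDATE dim_example SET d = ? WHERE sk = ?", (d, row[0]))
        return

    if row is not None:
        # A tracked column changed: close the current version (Type 2).
        conn.execute(
            "UPDATE dim_example SET end_time = ?, current_status = 0 WHERE sk = ?",
            (now, row[0]),
        )

    # Insert the new current version (also covers a brand-new business key).
    conn.execute(
        "INSERT INTO dim_example "
        "(business_key, a, b, c, d, start_time, end_time, current_status) "
        "VALUES (?, ?, ?, ?, ?, ?, NULL, 1)",
        (key, a, b, c, d, now),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE dim_example (
           sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- assumed surrogate key
           business_key TEXT,                     -- assumed natural key
           a TEXT, b TEXT, c TEXT, d TEXT,
           start_time TEXT, end_time TEXT, current_status INTEGER)"""
)

apply_change(conn, "K1", "a1", "b1", "c1", "d1", "2024-01-01")  # first load
apply_change(conn, "K1", "a1", "b1", "c1", "d2", "2024-02-01")  # only d changed
apply_change(conn, "K1", "a2", "b1", "c1", "d2", "2024-03-01")  # a changed

rows = conn.execute(
    "SELECT a, d, start_time, end_time, current_status "
    "FROM dim_example ORDER BY sk"
).fetchall()
```

After the three calls the table holds two rows, not three: the change to d was absorbed by the first version. Whether the overwrite should also be applied to older, closed versions is a design decision this sketch does not make; it only touches the current row.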

Related

Temporal tables VS SCD 2

Can a temporal table be used to replace SCD Type 2 in a data warehouse?
I use temporal tables in Azure SQL Database.
Typically no. Temporal tables are a good fit for staging tables, and can be used as a source to create slowly-changing dimensions if needed in a dimensional model.
The whole point of a dimensional model is to make writing queries easy. With an SCD, the fact table still has a simple single-column foreign key to the dimension table, so you get historically accurate dimension values for each fact row without complicating the queries.
To get the same result from a temporal table you'd have to join both the main table and the archive table, and filter them on both the business key of the dimension and the validfrom and validto dates.
Also, temporal tables only support system versioning, which means that validfrom and validto always refer to the clock of the database server. In a data warehouse you might want to use some other temporal reference in your data to model your SCD.
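To make the difference in query shape concrete, here is a small sketch using Python's sqlite3 module. All table and column names (dim_customer, fact_sale, customer_key, customer_id, valid_from, valid_to) are made up for illustration, and a single table stands in for the union of a temporal table's current and history tables. It is not SQL Server temporal-table syntax, only the join logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One row per version; customer_key is the SCD surrogate key.
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        customer_id  TEXT,   -- business key
        city         TEXT,
        valid_from   TEXT,
        valid_to     TEXT);
    INSERT INTO dim_customer VALUES
        (1, 'C1', 'Oslo',   '2023-01-01', '2024-01-01'),
        (2, 'C1', 'Bergen', '2024-01-01', '9999-12-31');

    -- The fact row carries both keys so that both query styles can be shown.
    CREATE TABLE fact_sale (
        customer_key INTEGER, customer_id TEXT, sale_date TEXT, amount REAL);
    INSERT INTO fact_sale VALUES
        (1, 'C1', '2023-06-15', 100.0),
        (2, 'C1', '2024-02-10', 250.0);
    """
)

# SCD style: a single-column join on the surrogate key.
scd_query = """
    SELECT f.sale_date, d.city
    FROM fact_sale f
    JOIN dim_customer d ON d.customer_key = f.customer_key
    ORDER BY f.sale_date"""

# Versioned-table style: business key plus a validity-range filter.
range_query = """
    SELECT f.sale_date, d.city
    FROM fact_sale f
    JOIN dim_customer d
      ON d.customer_id = f.customer_id
     AND f.sale_date >= d.valid_from
     AND f.sale_date <  d.valid_to
    ORDER BY f.sale_date"""

scd_rows = conn.execute(scd_query).fetchall()
range_rows = conn.execute(range_query).fetchall()
```

Both queries return the same city for each sale, but the second one has to repeat the range predicate in every query that touches the dimension, which is the complication the answer describes.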

How are fact tables formed in relation to the dimension tables?

I am trying to understand how fact tables are formed in relation to the dimension tables.
E.g. Sale Fact Table
If there is a query for sales of a product by year/month/week/day, do I create a dimension for each type of period (Dim_Year, Dim_Month, Dim_Week and Dim_Day), each with its own key?
Or is it possible to just use one dimension for all periods, Dim_Date, and only have one date key?
Another thing I am confused about is why some fact tables do not contain their own ID. E.g. the Sale fact table does not have a SaleID column.
Sale Fact Table Textbook Example
DATES
Your date dimension needs to correspond to the grain of your fact table. So if you had daily sales you would have a Dim_Day, weekly sales you would have a Dim_Week, etc.
You would normally have multiple date dimensions (at different grains) in your data warehouse as you would have facts at different date grains.
Each date dimension would hold attributes applicable to levels higher up in the date hierarchy. So a Dim_Day might hold day, week, month, year attributes; Dim_Month might hold month, quarter and year attributes, etc.
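As an illustration of a day-grain dimension carrying the higher-level attributes, the following Python sketch generates Dim_Day rows. The column names and the YYYYMMDD integer key are assumptions (a common convention, not something the answer prescribes), and the week attribute uses ISO week numbering.

```python
from datetime import date, timedelta


def build_dim_day(start, end):
    """Return one dict per calendar day from start to end, inclusive."""
    rows = []
    current = start
    while current <= end:
        _iso_year, iso_week, _iso_weekday = current.isocalendar()
        rows.append(
            {
                "date_key": int(current.strftime("%Y%m%d")),  # assumed key format
                "full_date": current.isoformat(),
                "day": current.day,
                "iso_week": iso_week,
                "month": current.month,
                "quarter": (current.month - 1) // 3 + 1,
                "year": current.year,
            }
        )
        current += timedelta(days=1)
    return rows


dim_day = build_dim_day(date(2024, 1, 1), date(2024, 12, 31))
```

With these attributes on every row, a daily fact table can be grouped by week, month, quarter or year through the single date_key join. A Dim_Month for month-grain facts could be built the same way with only month, quarter and year.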
PRIMARY KEYS
Primary keys are rarely (never?) a technical requirement when creating tables in a database i.e. you can create a table without defining a PK. So you need to consider why we normally (at least in OLTP DBs) include PKs. Common reasons include:
To easily identify an individual record
To ensure that duplicate records (those with the same PK value) are not created
So there are good reasons for creating PKs, however there are cost overheads e.g. the PK needs to be checked every time a new record is inserted into the table.
In a dimensional model where you are performing bulk inserts/updates, having PKs would cause a significant performance hit. Additionally, the insert logic/checks should always be implemented in your ETL processes so there is no need to include these types of checks/constraints in the DB itself.
Fact tables do have a primary key, but it is often implicit rather than explicit: a group of the FKs in the fact table uniquely identifies each record. This compound PK may be documented but it is never enabled/implemented.
Occasionally a fact table will have an explicit, single column, PK. This is normally used when the fact table needs to be updated and its implicit PK involves a large number of columns. There is normally logic required to identify the record to be updated using its FKs but this returns the PK; then the update statement just has a clause like this:
WHERE table_pk = 12345678
rather than having to include all the columns in the implicit PK:
WHERE table_sk1 = 1234
AND table_sk2 = 5678
AND table_sk3 = 9876
....
Hope this helps?

Is it possible to use system-versioned temporal table in the fact-dimension DWH design?

Say I want to implement an SCD Type 2 history dimension table (or should I say a table with SCD Type 2 attributes) in the DWH system, which until now I have been implementing as a "usual table" with a natural key + primary surrogate key + datefrom + dateto + iscurrent additional columns.
where
the primary surrogate key is needed in order to use it as a foreign key in all fact tables and
datefrom + dateto + iscurrent columns are needed in order to track a history.
Now I want to use a system-versioned temporal table in the fact-dimension DWH design, but MSDN says that:
A temporal table must have a primary key defined in order to correlate
records between the current table and the history table, and the
history table cannot have a primary key defined.
So it looks like I should use a view that generates a primary surrogate key "on the fly", or another ETL process, but I do not like either idea...
Maybe there is another way?
You would use a temporal table in the persistent staging area of your data warehouse. Then you can simply apply changes from the source systems and not lose any historical versions.
Then when you are querying, or when building a dimensional datamart, you can join facts to the current or to the historical version of a dimension. Note that you do not need surrogate keys to do this, but you can produce them to simplify and optimize querying the dimensional model. You can generate the surrogate key with an expression like
ROW_NUMBER () OVER (ORDER BY EmployeeID, ValidTo) AS EmployeeKey
and then join the dimension table when loading the fact table as usual.
But the interesting thing is that this can defer your dimensional modeling, and your choice of SCD types, until you really need them. And reducing and deferring data mart design and implementation helps you deliver incremental progress faster. You can confidently deliver an initial set of reports using views over your persistent staging area (or 'data lake' if you prefer that term), while your design thinking for the datamarts evolves.
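The ROW_NUMBER expression mentioned in this answer can be tried out with a small sketch in Python's sqlite3 module (window functions need SQLite 3.25 or later). The employee_versions table, its Department column, and the helper employee_key_at are hypothetical; a single table again stands in for the current and history rows of a temporal table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-in for the combined current + history rows of a temporal table.
    CREATE TABLE employee_versions (
        EmployeeID INTEGER, Department TEXT, ValidFrom TEXT, ValidTo TEXT);
    INSERT INTO employee_versions VALUES
        (7, 'Sales',   '2023-01-01', '2023-09-01'),
        (7, 'Finance', '2023-09-01', '9999-12-31'),
        (3, 'IT',      '2022-05-01', '9999-12-31');

    -- Dimension view with a surrogate key generated on the fly.
    CREATE VIEW dim_employee AS
    SELECT ROW_NUMBER() OVER (ORDER BY EmployeeID, ValidTo) AS EmployeeKey,
           EmployeeID, Department, ValidFrom, ValidTo
    FROM employee_versions;
    """
)


def employee_key_at(conn, employee_id, event_date):
    """Look up the key of the version that was valid on event_date."""
    row = conn.execute(
        "SELECT EmployeeKey FROM dim_employee "
        "WHERE EmployeeID = ? AND ValidFrom <= ? AND ? < ValidTo",
        (employee_id, event_date, event_date),
    ).fetchone()
    return row[0] if row else None


dim_rows = conn.execute(
    "SELECT EmployeeKey, EmployeeID, Department "
    "FROM dim_employee ORDER BY EmployeeKey"
).fetchall()
key_in_march_2023 = employee_key_at(conn, 7, "2023-03-15")
key_in_2024 = employee_key_at(conn, 7, "2024-01-01")
```

One thing to keep in mind with this approach: the generated keys depend on the full set of versions, so a new version for an employee with a lower EmployeeID shifts the keys that follow. In practice that probably means rebuilding the dimension and the facts that reference it together, or persisting the keys once assigned.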

Database joins / FK - basic questions

There are 2 tables, A and B. Each has, say, 10 columns.
Table A has 8 columns that are FKs to other tables. Table B uses enums and standard columns without any FK.
So which table is faster / better to use?
If I do any action with table A, I assume I only have to touch the columns the action relates to, and do not have to join all the FK tables even if I only need 1 FK column?
If I need to perform any action on an FK, like writing, updating or deleting a value, do I need to join to the parent table?
If I understand correctly, the EAV model is better than an expanded column table, because if I need to display two text values from the table then I need an inner join for the column table, but for an EAV table I can use a regular select with no join?
For only a few values and if the amount of values doesn't change, ENUM can be faster and takes up less space. However, to later add possible values, you'll need to alter the entire table, which is not good design. Table A is in most cases the better option.
Of course, you only join table A with the tables you need.
No, you can just modify the table containing the value, unless you change the PK. You should, however, design your tables in such a way that changing the PK is rarely needed: use artificial PKs (autoincrements are perfect). Even countries cease to exist or change names...
No, for your EAV you'll need the join. However, joining on keys is extremely fast... this is what relational databases are all about, it's their strong point.

Ideas on Populating the Fact Table in a Data Mart

I am looking for ideas to populate a fact table in a data mart. Let's say I have the following dimensions:
Physician
Patient
date
geo_location
patient_demography
test
I have used two ETL tools to populate the dimension tables- Pentaho and Oracle Warehouse Builder. The date, patient demography and geo locations do not pull data from the operational store. All dimension tables have their own NEW surrogate key.
I now want to populate the fact table with the details of a visit by a patient. When a patient visits a physician on a particular date, he orders a test. This is the info in the fact table. There are other measures too that I am omitting for simplicity.
I can create a single join with all the required columns in the fact table from the source system. But I need to store the keys from the dimension tables for Patient, Physician, test, etc. What is the best way to achieve this?
Can ETL tools help in this?
Thank You
Krishna
Each dimension table should have a BusinessKey that uniquely identifies the object (person, date, location) that a table row describes. During loading of the fact table, you have to look up the PrimaryKey from the dimension table, based on the BusinessKey. You can choose to look up the dimension table directly, or create a key-lookup table for each dimension just before loading the fact table.
Pentaho Kettle has the "Database Value Lookup" (transformation step) for the purpose. You may also want to look at the "Delivering Fact Tables" section of Kimball's Data Warehouse ETL Toolkit.
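Independent of the ETL tool, the lookup described above boils down to joining the staged source rows to each dimension on its business key. Below is a minimal sketch with Python's sqlite3 module; every table and column name in it (stg_visit, dim_patient, patient_bk and so on) is invented for illustration, and only three of the dimensions from the question are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimensions: surrogate key plus the business key from the source system.
    CREATE TABLE dim_patient   (patient_key   INTEGER PRIMARY KEY, patient_bk   TEXT UNIQUE);
    CREATE TABLE dim_physician (physician_key INTEGER PRIMARY KEY, physician_bk TEXT UNIQUE);
    CREATE TABLE dim_test      (test_key      INTEGER PRIMARY KEY, test_bk      TEXT UNIQUE);
    INSERT INTO dim_patient   VALUES (101, 'PAT-1'), (102, 'PAT-2');
    INSERT INTO dim_physician VALUES (201, 'DOC-9');
    INSERT INTO dim_test      VALUES (301, 'CBC'), (302, 'XRAY');

    -- Staged visits as extracted from the source, still keyed by business keys.
    CREATE TABLE stg_visit (patient_bk TEXT, physician_bk TEXT, test_bk TEXT, visit_date TEXT);
    INSERT INTO stg_visit VALUES
        ('PAT-1', 'DOC-9', 'CBC',  '2024-03-01'),
        ('PAT-2', 'DOC-9', 'XRAY', '2024-03-02');

    CREATE TABLE fact_visit (
        patient_key INTEGER, physician_key INTEGER, test_key INTEGER, date_key INTEGER);

    -- Fact load: swap each business key for the dimension's surrogate key.
    INSERT INTO fact_visit
    SELECT p.patient_key,
           d.physician_key,
           t.test_key,
           CAST(REPLACE(s.visit_date, '-', '') AS INTEGER)  -- assumed YYYYMMDD date key
    FROM stg_visit s
    JOIN dim_patient   p ON p.patient_bk   = s.patient_bk
    JOIN dim_physician d ON d.physician_bk = s.physician_bk
    JOIN dim_test      t ON t.test_bk      = s.test_bk;
    """
)

fact_rows = conn.execute("SELECT * FROM fact_visit ORDER BY date_key").fetchall()
```

Note that the inner joins silently drop staged rows whose business key has no match in a dimension. A real load would likely need a policy for those rows, for example a left join with a default "unknown" dimension member, which this sketch leaves out.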
