When implementing a process to load data into an SCD2 dimension table, what is the most practical method for a scenario where there are multiple records in the staging table per BusinessKey in the dimension table?
The first issue in this scenario is that you have 2 or more records in your staging table that can each update the IsCurrentFlag and EffectiveToDate.
Is implementing a post-process to recalculate IsCurrentRecord and EffectiveToDate after the data is loaded the only solution?
Scenario example:
The dimension table is populated from 1 source system.
The source system table (Customer) from which the data is extracted contains history. Multiple updates can be made for the same customer in 1 day, resulting in multiple records in the source system table.
Dimension table:
Staging Table:
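For illustration, here is a minimal T-SQL sketch of one way to avoid a post-process: sequence all staged rows per BusinessKey first, so each version's EffectiveToDate and IsCurrentFlag can be derived in a single pass before the dimension load. Table and column names (stg.Customer, ChangeDate, and so on) are assumptions, not from the original post.

```sql
-- Sequence all staged changes per BusinessKey so effective dates and the
-- current flag come out correct in one pass (names are illustrative).
;WITH Ordered AS (
    SELECT
        BusinessKey,
        CustomerName,
        ChangeDate AS EffectiveFromDate,
        -- The next change for the same key closes this version.
        LEAD(ChangeDate) OVER (PARTITION BY BusinessKey
                               ORDER BY ChangeDate) AS NextChangeDate
    FROM stg.Customer
)
SELECT
    BusinessKey,
    CustomerName,
    EffectiveFromDate,
    ISNULL(NextChangeDate, '9999-12-31') AS EffectiveToDate,
    CASE WHEN NextChangeDate IS NULL THEN 1 ELSE 0 END AS IsCurrentFlag
FROM Ordered;
-- The existing current row in dim.Customer would then be expired up to the
-- earliest staged EffectiveFromDate for that BusinessKey, and these rows
-- inserted after it.
```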
During the CDC process, ODI creates two views, JV$ and JV$D. Both have the same structure, so why does ODI need two views if they are doing the same work?
The next paragraphs show the differences (extracted from the link).
The JV$ view is the view that is used in the mappings where you select the option Journalized data only. Records from the J$ table are filtered so that only the following records are returned:
Only locked records: JRN_CONSUMED = '1';
If the same PK appears multiple times, only the last entry for that PK (based on the JRN_DATE) is taken into account. Again the logic here is that we want to replicate values as they are currently in the source database. We are not interested in the history of intermediate values that could have existed.
An additional filter is added in the mappings at design time so that only the records for the selected subscriber are consumed from the J$ table, as we saw in figure 5.
Similarly to the JV$ view, the JV$D view joins the J$ table with the source table on the primary key. This view shows all changed records, locked or not, but applies the same filter on the JRN_DATE column so that only the last entry is taken into account when the same record has been modified multiple times since the last consumption cycle. It lists the changes for all subscribers.
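As a rough sketch of the filtering logic described above (not the actual DDL that ODI generates), the JV$ view for a hypothetical CUSTOMER table might look like the following; the journal table J$CUSTOMER and the CUSTOMER_ID primary key are assumptions.

```sql
-- Sketch of the JV$ filtering logic: only locked records, and only the latest
-- journal entry per primary key (based on JRN_DATE).
CREATE OR REPLACE VIEW JV$CUSTOMER AS
SELECT j.JRN_SUBSCRIBER, j.JRN_FLAG, j.JRN_DATE, s.*
FROM   J$CUSTOMER j
JOIN   CUSTOMER   s ON s.CUSTOMER_ID = j.CUSTOMER_ID
WHERE  j.JRN_CONSUMED = '1'                      -- only locked records
AND    j.JRN_DATE = (SELECT MAX(j2.JRN_DATE)     -- keep only the last change per PK
                     FROM   J$CUSTOMER j2
                     WHERE  j2.CUSTOMER_ID = j.CUSTOMER_ID);
-- JV$D would look the same but without the JRN_CONSUMED = '1' filter, so it lists
-- the latest change per PK for all subscribers, locked or not.
```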
I need to move data between two databases and wanted to see if SSIS would be a good tool. I've pieced together the following solution, but it is much more complex than I was hoping it would be - any insight on a better approach to tackling this problem would be greatly appreciated!
So what makes my situation unique: we have a large volume of data, so to keep the system performant we have split our customers across multiple database servers. These servers have databases with the same schema, but each is populated with unique data. Occasionally we need to move a customer's data from one server to another. Because of this, simply recreating the tables and moving the data in place won't work: a table in the database on server A could have 20 records while the same table in the database on server B has 30, so when moving record 20 from A to B it will need to be assigned ID 31. Getting past this wasn't difficult, but the trouble comes when moving the tables that have a foreign key reference to what is now record 31....
An example:
Here's a sample schema for a simple example:
There is a table to track manufacturers, and a table to track products, each of which references a manufacturer.
Example of data in the source database:
To handle moving this data while maintaining relational integrity, I've taken the approach of gathering the manufacturer records, looping through them, and for each manufacturer moving the associated products. Here's a high level look at the Control Flow in SSDT:
The first Data Flow grabs the records from the source database and pulls them into a Recordset Destination:
The OLE DB Source pulls all columns from the source database's Manufacturer table and places them into a recordset:
Back in the control flow, I then loop through the records in the Manufacturer recordset:
For each record in the Manufacturer recordset, I then execute a SQL task which determines the next available auto-incrementing ID in the destination database, inserts the record, and returns the result of a SELECT MAX(ManufacturerID) in the Execute SQL Task result set, so that the newly created ManufacturerID can be used when inserting the related products into the destination database:
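A sketch of what that Execute SQL Task might run is below; the Manufacturer table and its Name column are assumptions for illustration, and the "?" marker is the OLE DB placeholder that SSIS maps to the recordset column.

```sql
-- Insert the manufacturer from the current loop iteration, then return the new
-- key as a single-row result set the package can capture in a variable.
INSERT INTO dbo.Manufacturer (Name)
VALUES (?);

-- SELECT SCOPE_IDENTITY() would be more robust under concurrent inserts,
-- but the approach described above uses MAX(ManufacturerID).
SELECT MAX(ManufacturerID) AS NewManufacturerID
FROM dbo.Manufacturer;
```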
The above works; however, once you get more than a few layers deep in tables that reference one another, this is no longer very tenable. Is there a better way to do this?
You could always try this:
Populate your Manufacturers table.
Get your products data (ensure you have a reference, such as the manufacturer name, back to the manufacturer).
Use a lookup to get the new ID where the name (or whatever column you choose) matches.
Insert into the destination database.
This will keep your FK constraints and not require you to do all that max-key selection.
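A minimal sketch of the lookup-based insert, assuming Manufacturer has a unique Name column and that the source tables are reachable from the destination (for example via a linked server, or after being staged locally); the three-part names SourceDb.* and DestDb.* are illustrative, not from the original schema.

```sql
-- 1. Copy the manufacturers; the destination assigns new ManufacturerID values.
INSERT INTO DestDb.dbo.Manufacturer (Name)
SELECT s.Name
FROM   SourceDb.dbo.Manufacturer AS s;

-- 2. Copy the products, looking up each product's new ManufacturerID by name
--    instead of carrying the old surrogate key across servers.
INSERT INTO DestDb.dbo.Product (ManufacturerID, ProductName)
SELECT d.ManufacturerID, sp.ProductName
FROM   SourceDb.dbo.Product       AS sp
JOIN   SourceDb.dbo.Manufacturer  AS sm ON sm.ManufacturerID = sp.ManufacturerID
JOIN   DestDb.dbo.Manufacturer    AS d  ON d.Name = sm.Name;
```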
My basic table contains 2 million records of users, with 30 columns.
Once in a while a new activity is opened for participation by potentially 100K users (a different group for each activity).
Each user will self-authenticate, and his/her activity data will be saved for further use.
What is the best method to design the database?
Method 1: Copy the 100K users into a Users_In_Activity table with all the required details from the base table. A new PK (the Users_In_Activity primary key) will be created for each record.
In this method there is no join between the tables, and the search for a record is done by one PK (Users_In_Activity) against only 100K records.
Method 2: Copy the basic details of the 100K users needed for authentication into a Potential_Users_In_Activity table. A new PK will be created (including the user's PK), and a new Users_In_Activity PK will be created.
For each successful authentication, a full record will be created in an Actual_Users_In_Activity table.
The search for a record will be done by one PK (Users_In_Activity) against only 100K records.
In this method there is a join between the 2 tables on one PK (Users_In_Activity).
Method 3: For each successful authentication, a full record will be created in an Actual_Users_In_Activity table.
In this method there is no join, but the search is against all 2 million records.
To summarise:
Method 1: Create 100K records of 30 columns. Search over 100K records; no need to create new records during the activity. No join is needed. Only one table to work with.
Method 2: Create 100K records of 5 columns. Search over 100K records. Create new records (30 columns) during the activity (active users only). A join is needed. 2 tables to work with.
Method 3: Search over 2M records. Create new records (30 columns) during the activity (active users only). 2 tables to work with.
You didn't discuss the basic design:
User table = 2 million records, USERID is the PK. This table contains only user details.
Activity table = activity details, ActivityID is the PK (no relation to the user table here). This table gets a row of activity detail whenever a new activity is created.
User_Activity_Mapping = ActivityID, USERID (copy the 100K users here). This is the user-activity relationship table.
With proper indexing it will work fine.
Let me know.
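A minimal sketch of that three-table design, assuming SQL Server syntax; the data types, extra columns, and index choices are assumptions for illustration.

```sql
CREATE TABLE Users (
    UserID   INT PRIMARY KEY,
    UserName NVARCHAR(100) NOT NULL
    -- ... roughly 30 columns of user details ...
);

CREATE TABLE Activity (
    ActivityID   INT PRIMARY KEY,
    ActivityName NVARCHAR(100) NOT NULL
    -- other activity details
);

-- One row per (activity, invited user); only ~100K rows per activity.
CREATE TABLE User_Activity_Mapping (
    ActivityID      INT NOT NULL REFERENCES Activity (ActivityID),
    UserID          INT NOT NULL REFERENCES Users (UserID),
    AuthenticatedAt DATETIME NULL,   -- activity data captured after authentication
    PRIMARY KEY (ActivityID, UserID)
);

-- The authentication lookup touches only the mapping rows for that activity,
-- then joins to Users on its PK:
-- SELECT u.*
-- FROM   User_Activity_Mapping m
-- JOIN   Users u ON u.UserID = m.UserID
-- WHERE  m.ActivityID = @ActivityID AND m.UserID = @UserID;
```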
I have the following fact table: PlaceId, DateId, StatisticId, StatisticValue.
And I have a dimension with the statistic IDs and their names, as follows: StatisticId, StatisticName.
I want to load the fact table with data that contains 2 statistics. With this architecture, each row of my data will be represented by 2 rows in my fact table.
The data has the following attributes: Place, Date, Stat1_Value, Stat2_Value.
How do I load my fact table with the IDs of these measures and their corresponding values?
Thank You.
I would use SSIS to move your data into a holding table that has the same columns as your data. Then call a stored procedure that uses SQL to populate your fact table, using UNION to get all the Stat1_Values, and then all the Stat2_Values.
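A sketch of what that stored procedure's insert might look like, assuming a holding table dbo.StatsHolding (Place, Date, Stat1_Value, Stat2_Value) and place/date dimensions that can be looked up by name and date; the DimPlace and DimDate names, and the 'Stat1'/'Stat2' statistic names, are assumptions.

```sql
-- Unpivot the two statistic columns into two fact rows per source row.
-- UNION ALL is used instead of UNION so duplicate measure rows are not dropped.
INSERT INTO FactStatistics (PlaceId, DateId, StatisticId, StatisticValue)
SELECT p.PlaceId, d.DateId, s.StatisticId, h.Stat1_Value
FROM   dbo.StatsHolding h
JOIN   DimPlace     p ON p.PlaceName     = h.Place
JOIN   DimDate      d ON d.FullDate      = h.[Date]
JOIN   DimStatistic s ON s.StatisticName = 'Stat1'
UNION ALL
SELECT p.PlaceId, d.DateId, s.StatisticId, h.Stat2_Value
FROM   dbo.StatsHolding h
JOIN   DimPlace     p ON p.PlaceName     = h.Place
JOIN   DimDate      d ON d.FullDate      = h.[Date]
JOIN   DimStatistic s ON s.StatisticName = 'Stat2';
```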
I recently built a simple data warehouse with 2 dimension tables and 1 fact table.
The first dim holds the user input: queryId, DNA sequence, DNA database name, other parameters.
The second dim holds the database description: databaseId, other parameters.
The fact table will hold the result of the search: queryId, databaseID, hits found, other parameters describing the hit.
Now, where should I upload the data (the result)? To the fact table, or to the dimension tables?
Where should I upload queryId and databaseID, since they are in the dimensions and in the fact? Sorry for this question, but I am new to DW.
Thanks a lot,
You have to create an ETL process that loads like this (this assumes we rebuild the DW on each import; the steps are different for incremental loading):
Truncate fact table
Truncate dimensions
Populate the dimensions (your keys should be in the dimensions)
Populate the fact with your dimension keys and measures
Then, when querying, you'll join your dimensions onto your fact via the keys.
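A sketch of that load order and the final query, using hypothetical names (DimQuery, DimDatabase, FactSearchResult, staging.*) based on the tables described in the question; none of these names are from the original post.

```sql
TRUNCATE TABLE FactSearchResult;
TRUNCATE TABLE DimQuery;      -- if FKs exist on the fact, use DELETE or disable them first
TRUNCATE TABLE DimDatabase;

-- Dimensions first: they carry the keys the fact will reference.
INSERT INTO DimQuery (QueryId, DnaSequence, DnaDatabaseName)
SELECT QueryId, DnaSequence, DnaDatabaseName
FROM   staging.QueryInput;

INSERT INTO DimDatabase (DatabaseId)
SELECT DISTINCT DatabaseId
FROM   staging.DatabaseDescription;

-- Then the fact: dimension keys plus the attributes describing each hit.
INSERT INTO FactSearchResult (QueryId, DatabaseId, HitsFound)
SELECT QueryId, DatabaseId, HitsFound
FROM   staging.SearchResult;

-- When querying, join the dimensions onto the fact via the keys.
SELECT q.DnaDatabaseName, d.DatabaseId, f.HitsFound
FROM   FactSearchResult f
JOIN   DimQuery    q ON q.QueryId    = f.QueryId
JOIN   DimDatabase d ON d.DatabaseId = f.DatabaseId;
```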
Neither of the two.
You UPLOAD data into staging tables. Those are created for optimal upload speed. Staging tables may be flat, may be incomplete, and may require joining with other tables.
Then you use a loading process to load the data from staging into the data warehouse.
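A small sketch of the staging-first flow, assuming SQL Server and an existing staging schema; the file path and column list are illustrative only.

```sql
-- 1. Raw results land in a flat staging table built for fast loads
--    (no FKs, minimal indexes).
CREATE TABLE staging.SearchResult (
    QueryId    INT,
    DatabaseId INT,
    HitsFound  INT
);

BULK INSERT staging.SearchResult
FROM 'C:\load\search_results.csv'
WITH (FIELDTERMINATOR = ',', FIRSTROW = 2);

-- 2. A separate loading step then moves the staged rows into the warehouse
--    (dimensions first, then the fact), e.g. with INSERT ... SELECT statements
--    like those sketched above.
```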