Data Vault 2 - Hash diff and recurring data changes

I am having issues retrieving the latest value in a satellite table when some data is changed back to a former value.
The database is Snowflake.
As per Data Vault 2.0, I am currently using the hash diff function to assess whether to insert a new record in a satellite table, like this:
INSERT ALL
WHEN (SELECT COUNT(*) FROM SAT_ACCOUNT_DETAILS AD WHERE AD.MD5_HUB_ACCOUNT = MD5_Account AND AD.HASH_DIFF = AccHashDiff) = 0
THEN
INTO SAT_ACCOUNT_DETAILS (MD5_HUB_ACCOUNT
, HASH_DIFF
, ACCOUNT_CODE
, DESCRIPTION
, SOME_DETAIL
, LDTS)
VALUES (MD5_Account
, AccHashDiff
, AccountCode
, Description
, SomeDetail
, LoadDTS)
SELECT DISTINCT
MD5(AccountId) As MD5_Account
, MD5(UPPER(COALESCE(TO_VARCHAR(AccountCode), '')
|| '^' || COALESCE(TO_VARCHAR(Description), '')
|| '^' || COALESCE(TO_VARCHAR(SomeDetail), '')
)) AS AccHashDiff
, AccountCode
, Description
, SomeDetail
, LoadDTS
FROM source_table;
The first time, a new record with AccountCode = '100000' and SomeDetail = 'ABC' is added:
MD5_HUB_ACCOUNT  HASH_DIFF  ACCOUNT_CODE  DESCRIPTION  SOME_DETAIL  LDTS
c81e72...        8d9d43...  100000        An Account   ABC          2021-04-08 10:00
An hour later, an update changes the value of SomeDetail to 'DEF'. This is the resulting table:
MD5_HUB_ACCOUNT  HASH_DIFF  ACCOUNT_CODE  DESCRIPTION  SOME_DETAIL  LDTS
c81e72...        8d9d43...  100000        An Account   ABC          2021-04-08 10:00
c81e72...        a458b2...  100000        An Account   DEF          2021-04-08 11:00
A third update sets the value of SomeDetail back to 'ABC', but the record is not inserted in the satellite table, because the value of the hash diff is the same as the first inserted record (i.e. 8d9d43...).
If I query which is the latest record in the satellite table, the LDTS column tells me it's the one with 'DEF' which is not the desired result.
Instead, I should have a record with SomeDetail = 'ABC' and LDTS = '2021-04-08 12:00'.
What is the correct approach to this? If I add LoadDTS to the hash diff, a new record will be created each time an update is pushed, which is not the desired result either.

As you (and also the standard) mentioned, you need to compare against the latest effective record.
I'm not an expert with Snowflake, but it might look like this:
INSERT ALL
WHEN (SELECT COUNT(*) FROM SAT_ACCOUNT_DETAILS AD WHERE AD.MD5_HUB_ACCOUNT = MD5_Account AND AD.HASH_DIFF = AccHashDiff AND AD.LDTS = (SELECT MAX(LDTS) FROM SAT_ACCOUNT_DETAILS MAD WHERE MAD.MD5_HUB_ACCOUNT = AD.MD5_HUB_ACCOUNT)) = 0
THEN ....
By adding "AD.LDTS = (SELECT MAX(LDTS) FROM ...)" to the query, you make sure you test against the latest record for that hub key and not against historical data.
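To see the effect of comparing only against the latest record, here is a minimal sketch in Python, with the standard-library sqlite3 module standing in for Snowflake. The `load` and `hash_diff` helpers and the account id `2` are assumptions made for illustration; only the table and column names come from the question.

```python
import hashlib
import sqlite3


def hash_diff(*values):
    """MD5 over the upper-cased, '^'-delimited values, with NULL treated as ''."""
    joined = "^".join("" if v is None else str(v) for v in values)
    return hashlib.md5(joined.upper().encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SAT_ACCOUNT_DETAILS (
        MD5_HUB_ACCOUNT TEXT, HASH_DIFF TEXT, ACCOUNT_CODE TEXT,
        DESCRIPTION TEXT, SOME_DETAIL TEXT, LDTS TEXT)
""")


def load(account_id, code, description, detail, load_dts):
    """Insert a satellite row only if it differs from the latest row for the hub key."""
    hub_key = hashlib.md5(str(account_id).encode("utf-8")).hexdigest()
    new_diff = hash_diff(code, description, detail)
    latest = conn.execute(
        """SELECT HASH_DIFF FROM SAT_ACCOUNT_DETAILS
           WHERE MD5_HUB_ACCOUNT = ?
           ORDER BY LDTS DESC LIMIT 1""",
        (hub_key,),
    ).fetchone()
    if latest is not None and latest[0] == new_diff:
        return False  # same as the latest record: nothing to insert
    conn.execute(
        "INSERT INTO SAT_ACCOUNT_DETAILS VALUES (?, ?, ?, ?, ?, ?)",
        (hub_key, new_diff, code, description, detail, load_dts),
    )
    return True


results = [
    load(2, "100000", "An Account", "ABC", "2021-04-08 10:00"),
    load(2, "100000", "An Account", "DEF", "2021-04-08 11:00"),
    load(2, "100000", "An Account", "ABC", "2021-04-08 12:00"),  # back to ABC
    load(2, "100000", "An Account", "ABC", "2021-04-08 13:00"),  # no change
]
rows = conn.execute(
    "SELECT SOME_DETAIL, LDTS FROM SAT_ACCOUNT_DETAILS ORDER BY LDTS"
).fetchall()
print(results)
print(rows)
```

The third load is inserted even though its hash diff equals the one from the first load, because the latest record at that point is the 'DEF' one. The fourth load is skipped, so repeated pushes of the same values do not create new rows.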

Related

INSERT INTO, if already exist UPDATE. from an existing query. to be put on SQL Agent job

I will try to be clear.
I have a query to pull data:
select typ_no_store as IDstore, str_name as 'name', (CASE WHEN str_addr2 IS NULL THEN str_addr1 ELSE str_addr2 END) as adresse, str_postal as postalcode
from store_type inner join STRFIL on typ_no_store = str_store_no
where typ_code = 'a'
That will result in something like this:
001 Newy store 600 BLVD someht G11111
002 LA store 770 BLVD ests G22222
010 Texas store 112 dsntexists G33333
I need to put the ongoing changes of this result into a new table.
I've created a table named 'Webstore' with the same columns (IDstore, name, adresse, postalcode).
I need a query that will INSERT INTO or UPDATE (if the ID already exists) the result of the query into the new table. The query will run every 2 hours as a SQL Server Agent job.
I tried commands like IF EXISTS, ON DUPLICATE KEY and MERGE, but I can't get them to work. It seems that it doesn't work because I must pull the data from a query rather than from typed values or an entire table.
Any ideas?
(Sorry if my English is not clear. Don't hesitate to ask questions. Thank you!)

You can use MERGE in SQL Server.
;WITH cte
AS
(
SELECT typ_no_store as IDstore,
str_name as 'name',
(CASE WHEN str_addr2 IS NULL THEN str_addr1 ELSE str_addr2 END) as adresse,
str_postal as postalcode
FROM store_type
INNER JOIN STRFIL on typ_no_store = str_store_no
WHERE typ_code = 'a'
)
MERGE Webstore AS t
USING cte AS s
ON t.IDstore=s.IDstore
WHEN MATCHED THEN UPDATE
SET t.[name] = s.[name],
t.adresse = s.adresse,
t.postalcode = s.postalcode
WHEN NOT MATCHED BY TARGET THEN
INSERT (IDstore, [name], adresse, postalcode)
VALUES (s.IDstore, s.[name], s.adresse, s.postalcode)
;
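If MERGE keeps giving trouble, the same "update if the ID exists, otherwise insert" behaviour can also be written as an UPDATE followed by an INSERT ... WHERE NOT EXISTS. Below is a minimal sketch of that alternative, run through Python's sqlite3 as a stand-in for SQL Server. The `src` view, the `sync_webstore` function, the column types and the renamed store are assumptions for illustration; the remaining sample values are borrowed from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store_type (typ_no_store TEXT, typ_code TEXT);
    CREATE TABLE STRFIL (str_store_no TEXT, str_name TEXT,
                         str_addr1 TEXT, str_addr2 TEXT, str_postal TEXT);
    CREATE TABLE Webstore (IDstore TEXT PRIMARY KEY, name TEXT,
                           adresse TEXT, postalcode TEXT);
    -- The source query from the question, wrapped in a view for reuse.
    CREATE VIEW src AS
        SELECT typ_no_store AS IDstore, str_name AS name,
               CASE WHEN str_addr2 IS NULL THEN str_addr1 ELSE str_addr2 END AS adresse,
               str_postal AS postalcode
        FROM store_type INNER JOIN STRFIL ON typ_no_store = str_store_no
        WHERE typ_code = 'a';
""")


def sync_webstore():
    """Update rows whose IDstore already exists, then insert the missing ones."""
    with conn:  # commits on success, rolls back on error
        conn.execute("""
            UPDATE Webstore SET
                name       = (SELECT s.name       FROM src s WHERE s.IDstore = Webstore.IDstore),
                adresse    = (SELECT s.adresse    FROM src s WHERE s.IDstore = Webstore.IDstore),
                postalcode = (SELECT s.postalcode FROM src s WHERE s.IDstore = Webstore.IDstore)
            WHERE EXISTS (SELECT 1 FROM src s WHERE s.IDstore = Webstore.IDstore)
        """)
        conn.execute("""
            INSERT INTO Webstore (IDstore, name, adresse, postalcode)
            SELECT s.IDstore, s.name, s.adresse, s.postalcode
            FROM src s
            WHERE NOT EXISTS (SELECT 1 FROM Webstore w WHERE w.IDstore = s.IDstore)
        """)


# First run: two stores exist in the source.
conn.executemany("INSERT INTO store_type VALUES (?, ?)", [("001", "a"), ("002", "a")])
conn.executemany(
    "INSERT INTO STRFIL VALUES (?, ?, ?, ?, ?)",
    [("001", "Newy store", "600 BLVD someht", None, "G11111"),
     ("002", "LA store", "770 BLVD ests", None, "G22222")],
)
sync_webstore()

# Second run: one store is renamed and a new store appears.
conn.execute("UPDATE STRFIL SET str_name = 'New York store' WHERE str_store_no = '001'")
conn.execute("INSERT INTO store_type VALUES ('010', 'a')")
conn.execute("INSERT INTO STRFIL VALUES ('010', 'Texas store', '112 dsntexists', NULL, 'G33333')")
sync_webstore()

webstore = conn.execute("SELECT * FROM Webstore ORDER BY IDstore").fetchall()
print(webstore)
```

Either form can be pasted into a T-SQL job step; the point is only that the source can be any query, not just a table or literal values.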

SQL Merge with Partial Matching tables

I have investigated lots of posts relating to SQL merging of two tables, and am very familiar with merging in SQL, but this one has me stumped. Here is the scenario: I have a table of values which are stored by Grade and Currency, i.e.
GradeID = 3, Rate = 175, CurrencyID = 5.
There are five grades, a different value for each and these are the default values for each grade in GBP (CurrencyID = 5).
I have another table with an extra column of Fiscal Period, which is a 1st of the month date to which the rate applies, i.e.
GradeID = 3, Rate = 150, FiscalPeriodID = 95, CurrencyID = 5.
Not all values for each grade, fiscal period and currency will be different from the default, so users will only enter differences where necessary. I have a Windows form for the user to view and update these values, but am struggling to show the results of both tables together.
The ideal solution for a case where there is no stored data, i.e. only the default values apply, is that the screen will show the default values at the current fiscal period i.e. 95 = 01/11/2017, but where there are values which have been entered, the table will show the changed rows by grade but grouped by each fiscal period which has a non-default value.
The first part of setting the current fiscal period for the default values is relatively easy:
SELECT [RateID], [GradeID], [ChargeRate],
(SELECT [FiscalPeriodID] FROM [dbo].[FiscalPeriod] WHERE MONTH([FiscalPeriod]) =
MONTH(GETDATE()) AND YEAR([FiscalPeriod]) = YEAR(GETDATE())) AS [FiscalPeriodID]
,[CurrencyID]
FROM [dbo].[DefaultChargeRateByLevel]
But what I'd like to do is perform one query that does either this simple selection (when no results returned for any fiscal period) or a merge of the default and updated values to show groups of 5 values per fiscal period.
My initial thought was to populate two temporary tables with the results from each query and then perform a merge, but I can't figure out how to populate the fiscal period for each group of values. This is what I have so far, but it fails with the following error: Cannot insert the value NULL into column 'FiscalPeriodID', table 'tempdb.dbo.#TempTarget
IF OBJECT_ID('tempdb..#TempTarget') IS NOT NULL
DROP TABLE #TempTarget
IF OBJECT_ID('tempdb..#TempSource') IS NOT NULL
DROP TABLE #TempSource
SELECT
[GradeID]
,[ChargeRate]
,[FiscalPeriodID]
,[CurrencyID]
INTO #TempTarget
FROM (SELECT
[GradeID]
,[ChargeRate]
,[FiscalPeriodID]
,[CurrencyID]
FROM
[dbo].[ChargeRateByProjectAndLevel]) AS A
SELECT * FROM #TempTarget
SELECT
[GradeID]
,[ChargeRate]
,[FiscalPeriodID]
,[CurrencyID]
INTO #TempSource
FROM (SELECT
D.[GradeID]
,D.[ChargeRate]
,M.[FiscalPeriodID]
,D.[CurrencyID]
FROM
[dbo].[DefaultChargeRateByLevel] AS D
LEFT JOIN #TempTarget AS M ON M.[GradeID] = D.[GradeID]
AND M.[CurrencyID] = D.[CurrencyID]
) AS B
SELECT * FROM #TempSource
/* Now merge rates data from temp table to FxRates table */
MERGE #TempTarget AS [Target]
USING #TempSource AS [Source]
ON [Source].[GradeID] = [Target].[GradeID]
AND [Source].[CurrencyID] = [Target].[CurrencyID]
WHEN NOT MATCHED BY TARGET THEN
INSERT (
[GradeID]
,[ChargeRate]
,[FiscalPeriodID]
,[CurrencyID]
)
VALUES (
[Source].[GradeID]
,[Source].[ChargeRate]
,[Source].[FiscalPeriodID]
,[Source].[CurrencyID]
);
SELECT * FROM #TempTarget
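The NULL error appears to follow from the LEFT JOIN: a default row with no matching override gets a NULL FiscalPeriodID in #TempSource, and those unmatched rows are exactly the ones the MERGE then tries to insert into #TempTarget, whose FiscalPeriodID column presumably inherited NOT NULL through SELECT ... INTO. One possible way to avoid the temp tables altogether is to cross join the defaults with the set of fiscal periods that have overrides (falling back to the current period when there are none) and take the override rate where one exists. A minimal sketch, using Python's sqlite3 as a stand-in for SQL Server; the `periods` CTE, the `rates` helper, the column types and all default rates other than grade 3 are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DefaultChargeRateByLevel (
        GradeID INTEGER, ChargeRate INTEGER, CurrencyID INTEGER);
    CREATE TABLE ChargeRateByProjectAndLevel (
        GradeID INTEGER, ChargeRate INTEGER,
        FiscalPeriodID INTEGER NOT NULL, CurrencyID INTEGER);
    -- Grade 3 = 175 comes from the question, the other default rates are made up.
    INSERT INTO DefaultChargeRateByLevel VALUES
        (1, 100, 5), (2, 125, 5), (3, 175, 5), (4, 200, 5), (5, 250, 5);
""")

RATES_SQL = """
    WITH periods(FiscalPeriodID) AS (
        -- every fiscal period that has at least one non-default rate
        SELECT DISTINCT FiscalPeriodID FROM ChargeRateByProjectAndLevel
        UNION
        -- or just the current period when nothing has been entered yet
        SELECT :current_period
        WHERE NOT EXISTS (SELECT 1 FROM ChargeRateByProjectAndLevel)
    )
    SELECT d.GradeID,
           COALESCE(o.ChargeRate, d.ChargeRate) AS ChargeRate,
           p.FiscalPeriodID,
           d.CurrencyID
    FROM DefaultChargeRateByLevel AS d
    CROSS JOIN periods AS p
    LEFT JOIN ChargeRateByProjectAndLevel AS o
           ON o.GradeID = d.GradeID
          AND o.CurrencyID = d.CurrencyID
          AND o.FiscalPeriodID = p.FiscalPeriodID
    ORDER BY p.FiscalPeriodID, d.GradeID
"""


def rates(current_period):
    """Return five rows per fiscal period: overrides where present, defaults otherwise."""
    return conn.execute(RATES_SQL, {"current_period": current_period}).fetchall()


defaults_only = rates(95)  # no overrides stored yet
conn.execute("INSERT INTO ChargeRateByProjectAndLevel VALUES (3, 150, 95, 5)")
with_override = rates(95)  # grade 3 now shows the overridden rate
print(defaults_only)
print(with_override)
```

In SQL Server, the current period would come from the FiscalPeriod lookup shown earlier in the question rather than from a bound parameter.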

Move Data from One Table To Another Table With New Layout

Please note, I asked this previously, but realized that I left out some very important information and felt it better to remove the original question and post a new one. My apologies to all........
I have a table that has the following columns:
ID
Name
2010/Jan
2010/Jan_pct
2010/Feb
2010/Feb_pct
.....
.....
2017/Nov
2017/Nov_pct
And then a column like that for every month/year combination to the present (hopefully that makes sense). Please note though: it is NOT a given that every month / year combination is present. There might be a gap or a missing month/year. For instance, I know 2017/Jan, 2017/Feb are missing and there could be any number missing. I just didn't want to list out every column but give a general idea of the layout.
Added to that, there isn't one row in the database, but can have multiple rows for a Name / ID and the ID is not an identity, but can be any number.
To give an idea of how the table looks, here is some sample data (mind you, I only added two of the Year/Mon combinations, but there are dozens that do not necessarily have one for each month/year)
ID Name 2010/Jan 2010/Jan_Pct 2010/Feb 2010/Feb_Pct
10 Gold 81 0.00123 79 0.01242
134 Silver 82 0 75 0.21291
678 Iron 987 1.53252 1056 2.9897
As you can imagine, this isn't the best design, as you need to add two new columns every month. So I created a new table with the following definition:
ID - float,
Name - varchar(255),
Month - varchar(3),
Year - int,
Value - int,
Value_Pct - float
I am trying to figure out how to move the existing data from the old table into the new table design.
Any help would be greatly appreciated.....

You can work with the unpivot operator to get what you need, with one added step of combining extra rows returned by the unpivot operator.
Sample Data Setup:
Given that the destination table has a value column of int datatype and the value_pct of float data type, I followed the same datatype guidance for the existing data table.
create table dbo.data_table
(
ID float not null
, [Name] varchar(255) not null
, [2010/Jan] int null
, [2010/Jan_Pct] float null
, [2010/Feb] int null
, [2010/Feb_Pct] float null
)
insert into dbo.data_table
values (10, 'Gold', 81, 0.00123, 79, 0.01242)
, (134, 'Silver', 82, 0, 75, 0.21291)
, (678, 'Iron', 987, 1.53252, 1056, 2.9897)
Answer:
--combine what was the "value" row and the "value_pct" row
--into a single row via summation
select a.ID
, a.[Name]
, a.[Month]
, a.[Year]
, sum(a.value) as value
, sum(a.value_pct) as value_pct
from (
--get the data out of the unpivot with one row for value
--and one row for value_pct.
select post.ID
, post.[Name]
, substring(post.col_nm, 6, 3) as [Month]
, cast(substring(post.col_nm, 1, 4) as int) as [Year]
, iif(charindex('pct', post.col_nm, 0) = 0, post.value_prelim, null) as value
, iif(charindex('pct', post.col_nm, 0) > 0, post.value_prelim, null) as value_pct
from (
--cast the columns that are currently INT as Float so that
--all data points can fit in one common data type (will separate back out later)
select db.ID
, db.[Name]
, cast(db.[2010/Jan] as float) as [2010/Jan]
, db.[2010/Jan_Pct]
, cast(db.[2010/Feb] as float) as [2010/Feb]
, db.[2010/Feb_Pct]
from dbo.data_table as db
) as pre
unpivot (value_prelim for col_nm in (
[2010/Jan]
, [2010/Jan_Pct]
, [2010/Feb]
, [2010/Feb_Pct]
--List all the rest of the column names here
)
) as post
) as a
group by a.ID
, a.[Name]
, a.[Month]
, a.[Year]
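To make the reshaping easier to follow, here is a small Python sketch of the same idea applied to the sample rows: each "YYYY/Mon" or "YYYY/Mon_Pct" column is split into year, month and a pct flag, and the two halves are combined into one output row per ID, Name, month and year. The `unpivot` function and the column-name pattern are assumptions for illustration, not part of the T-SQL above.

```python
import re

wide_rows = [
    {"ID": 10, "Name": "Gold", "2010/Jan": 81, "2010/Jan_Pct": 0.00123,
     "2010/Feb": 79, "2010/Feb_Pct": 0.01242},
    {"ID": 134, "Name": "Silver", "2010/Jan": 82, "2010/Jan_Pct": 0,
     "2010/Feb": 75, "2010/Feb_Pct": 0.21291},
    {"ID": 678, "Name": "Iron", "2010/Jan": 987, "2010/Jan_Pct": 1.53252,
     "2010/Feb": 1056, "2010/Feb_Pct": 2.9897},
]

# Matches "2010/Jan" as well as "2010/Jan_Pct" (case-insensitive suffix).
COLUMN_PATTERN = re.compile(r"^(\d{4})/([A-Za-z]{3})(_pct)?$", re.IGNORECASE)


def unpivot(rows):
    """Turn one wide row per ID into one row per (ID, Name, Month, Year)."""
    merged = {}
    for row in rows:
        for column, value in row.items():
            match = COLUMN_PATTERN.match(column)
            if match is None or value is None:
                continue  # skip ID/Name and missing values
            year, month, is_pct = int(match.group(1)), match.group(2), match.group(3)
            key = (row["ID"], row["Name"], month, year)
            record = merged.setdefault(key, {"Value": None, "Value_Pct": None})
            field = "Value_Pct" if is_pct else "Value"
            # Sum, as the SQL does, in case several rows share the same key.
            record[field] = (record[field] or 0) + value
    return [
        (id_, name, month, year, rec["Value"], rec["Value_Pct"])
        for (id_, name, month, year), rec in merged.items()
    ]


long_rows = unpivot(wide_rows)
for long_row in long_rows:
    print(long_row)
```

Because the columns are discovered by pattern rather than listed by hand, gaps such as a missing 2017/Jan simply produce no rows for that month.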

Incremental load in T-SQL with recorded history

Please help me: I need to build an incremental load process for my dimensions in T-SQL that also stores historical data. I am trying to use the MERGE statement, but it doesn't work, because it deletes data that exists in the target but not in the source table.
Does anyone have a suggestion?
For example, I have the following source table, which is my STAGE:
Cod Descript State
AAA Desc1 MI
BBB Desc 2 TX
CCC Desc 3 MA
In the first load, my dimension will be equal to STAGE.
However, a value can change in the source table, for example:
AAA CHANGEDESCRIPTION Mi
So I need to update my dimension like this:
Cod Descript State
AAA Desc1 Mi before
AAA CHANGEDESCRIPTION MI actual
BBB Desc 2 TX actual
CCC Desc 3 MA actual
This is my DW, and I need both the current information and all of the history.

Try this. The Aging column is always 0 for the current record and indicates the change generation:
SELECT * INTO tbl_Target FROM (VALUES
('AAA','Desc1','MI',0),('BBB','Desc 2','TX',0),('CCC','Desc 3','MA',0)) as X(Cod, Descript, State, Aging);
GO
SELECT * INTO tbl_Staging FROM (VALUES ('AAA','Desc4','MI')) as X(Cod, Descript, State);
GO
UPDATE t SET Aging += 1
FROM tbl_Target as t
INNER JOIN tbl_Staging as s on t.Cod = s.Cod;
GO
INSERT INTO tbl_Target(Cod, Descript, State, Aging)
SELECT Cod, Descript, State, 0
FROM tbl_Staging;
GO
SELECT * FROM tbl_Target;
Please note that if the staging table contains records that are unchanged, you'll get false changes. If so, you have to filter them out in both queries.
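One way to do that filtering is to work out the set of really changed (or new) staging rows first and drive both statements from it. A minimal sketch, run through Python's sqlite3 as a stand-in for SQL Server; the `changed` temp table and the extra BBB and DDD staging rows are assumptions, and the comparison assumes the columns are never NULL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_Target (Cod TEXT, Descript TEXT, State TEXT, Aging INTEGER);
    CREATE TABLE tbl_Staging (Cod TEXT, Descript TEXT, State TEXT);
    INSERT INTO tbl_Target VALUES
        ('AAA', 'Desc1', 'MI', 0), ('BBB', 'Desc 2', 'TX', 0), ('CCC', 'Desc 3', 'MA', 0);
    -- AAA is changed, BBB is unchanged, DDD is new (BBB and DDD rows are made up).
    INSERT INTO tbl_Staging VALUES
        ('AAA', 'Desc4', 'MI'), ('BBB', 'Desc 2', 'TX'), ('DDD', 'Desc 5', 'NY');
""")

with conn:  # commits on success, rolls back on error
    # 1. Keep only staging rows that differ from the current (Aging = 0) target row.
    conn.execute("""
        CREATE TEMP TABLE changed AS
        SELECT s.Cod, s.Descript, s.State
        FROM tbl_Staging AS s
        WHERE NOT EXISTS (
            SELECT 1 FROM tbl_Target AS t
            WHERE t.Cod = s.Cod AND t.Aging = 0
              AND t.Descript = s.Descript AND t.State = s.State)
    """)
    # 2. Age the history only for keys that really changed.
    conn.execute("""
        UPDATE tbl_Target SET Aging = Aging + 1
        WHERE Cod IN (SELECT Cod FROM changed)
    """)
    # 3. Insert the changed and brand-new rows as the current version.
    conn.execute("""
        INSERT INTO tbl_Target (Cod, Descript, State, Aging)
        SELECT Cod, Descript, State, 0 FROM changed
    """)

target = conn.execute(
    "SELECT Cod, Descript, State, Aging FROM tbl_Target ORDER BY Cod, Aging DESC"
).fetchall()
print(target)
```

The changed set has to be captured before the UPDATE runs, since the UPDATE moves the current rows away from Aging = 0 and the comparison would no longer find them.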

I just commented out the DELETE clause. Please tell me what you think.
MERGE DimTarget AS [Target] -- begin merge statement (a MERGE statement must end with a semicolon)
USING TableSource AS [Source]
ON [Target].ID = [Source].ID AND [Target].[IsCurrentRow] = 1
WHEN MATCHED AND -- record exists but values are different
(
[Target].Descript <> [Source].Descript
)
THEN UPDATE SET -- flag the existing row as no longer current
[Target].[IsCurrentRow] = 0
-- , [Target].[ValidTo] = GETDATE()
WHEN NOT MATCHED BY TARGET -- record does not exist
THEN INSERT -- insert record
(
Descript
, [IsCurrentRow]
)
VALUES
(
[Source].Descript
, 1
)
--WHEN NOT MATCHED BY SOURCE -- record exists in target but not source
--THEN DELETE -- delete from target
OUTPUT $action AS Action, [Source].*; -- output results

Loading Dimension Tables - Methodologies

Recently I have been working on a project where I need to populate Dim tables from EDW tables.
The EDW tables are Type II, so they maintain historical data. When loading a Dim table, the source may be multiple EDW tables, or a single table that needs multi-level pivoting (on attributes).
Meaning: there would be 10 records, one for each attribute, which need to be pivoted on domain_code to make a single row in the Dim. Out of these 10 records, some attributes would have the same domain_code but a different sub_domain_code, which needs further pivoting on the subdomain code.
Example:
If I have domain codes 01, 02, 03 => these are a straight pivot on domain code.
I would also have domain code 10 with subdomain code / version 2006, 2007, 2008, 2009.
That means I need to split my source table with the above attributes into two => one for domain code and the other for domain_code + version.
So far so good.
When it comes to loading the Dim table:
As per the design specs for Dimensions (originally written by a third party), what they want is:
For every single change in the EDW (attribute), it should assemble all the related records (for that NK), meaning the new one together with the other attribute values that are current => process them to create a new dim record and insert it.
That means if a single extract contains 100 updated records (one for each NK), it should assemble 100 + (100*9) records to insert into / update the dim table. How good is this approach?
The other way I tried is to do a lookup into the dim table for that NK, get the values of the most recent record (the attributes which have not changed), insert a new record with them and update the current one.
Which would be the better approach: assembling the records on the source side for one attribute change, or looking up the dim table's most recent record and processing it?
If this doesn't make sense, I would be happy to elaborate further.
Thanks
Here is the model of the tables:
http://img96.imageshack.us/img96/1203/modelzp.jpg

Have a look at this example.
It should be relatively straightforward.
It pivots the base data according to your rules.
It determines the change times for the denormalized "row"
It creates a triangular join to determine the start and end of each period (what I'm calling a snapshot)
Then it joins those windows to the base data to determine what the state of the data was at that time (the pivot is actually completed at this time)
I think you may need to look at the windowing mechanism. It's returning the right data, but I don't like the way the window overlap logic looks. It doesn't quite smell right, and I'm worried about the boundary conditions.
-- SO3014289
CREATE TABLE #src (
key1 varchar(4) NOT NULL
,key2 varchar(3) NOT NULL
,key3 varchar(3) NOT NULL
,AttribCode int NOT NULL
,AttribSubCode varchar(2)
,Value varchar(10) NOT NULL
,[Start] date NOT NULL
,[End] date NOT NULL
)
INSERT INTO #src VALUES
('9750', 'C04', '789', 1, NULL, 'AAA', '1/1/2000', '12/31/9999')
,('9750', 'C04', '789', 2, NULL, 'BBB', '1/1/2000', '12/31/9999')
,('9750', 'C04', '789', 3, 'V1', 'XXXX', '1/1/2000', '12/31/9999')
,('9750', 'C04', '789', 3, 'V2', 'YYYY', '1/1/2000', '1/2/2000')
,('9750', 'C04', '789', 3, 'V2', 'YYYYY', '1/2/2000', '12/31/9999')
;WITH basedata AS (
SELECT key1 + '-' + key2 + '-' + key3 AS NK
,CASE WHEN AttribCode = 1 THEN Value ELSE NULL END AS COL1
,CASE WHEN AttribCode = 2 THEN Value ELSE NULL END AS COL2
,CASE WHEN AttribCode = 3 AND AttribSubCode = 'V1' THEN Value ELSE NULL END AS COL3
,CASE WHEN AttribCode = 3 AND AttribSubCode = 'V2' THEN Value ELSE NULL END AS COL4
,[Start]
,[End]
FROM #src
)
,ChangeTimes AS (
SELECT NK, [Start] AS Dt
FROM basedata
UNION
SELECT NK, [End] AS Dt
FROM basedata
)
,Snapshots as (
SELECT s.NK, s.Dt AS [Start], MIN(e.Dt) AS [End]
FROM ChangeTimes AS s
INNER JOIN ChangeTimes AS e
ON e.NK = s.NK
AND e.Dt > s.Dt
GROUP BY s.NK, s.Dt
)
SELECT Snapshots.NK
,MAX(COL1) AS COL1
,MAX(COL2) AS COL2
,MAX(COL3) AS COL3
,MAX(COL4) AS COL4
,Snapshots.[Start]
,Snapshots.[End]
FROM Snapshots
INNER JOIN basedata
ON basedata.NK = Snapshots.NK
AND NOT (basedata.[End] <= Snapshots.[Start] OR basedata.[Start] >= Snapshots.[End])
GROUP BY Snapshots.NK
,Snapshots.[Start]
,Snapshots.[End]
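On the boundary conditions: the overlap test seems to behave if every source row is read as valid on the half-open interval [Start, End), because each snapshot runs between two consecutive change times and so a source row either covers it completely or not at all. Here is a minimal Python sketch of the same windowing on the sample data, under that half-open assumption; the `snapshots` function and the pre-pivoted `base` list are illustrative, not part of the SQL above.

```python
from datetime import date

END_OF_TIME = date(9999, 12, 31)

# (NK, pivoted column, value, start, end); each row is valid on [start, end).
base = [
    ("9750-C04-789", "COL1", "AAA", date(2000, 1, 1), END_OF_TIME),
    ("9750-C04-789", "COL2", "BBB", date(2000, 1, 1), END_OF_TIME),
    ("9750-C04-789", "COL3", "XXXX", date(2000, 1, 1), END_OF_TIME),
    ("9750-C04-789", "COL4", "YYYY", date(2000, 1, 1), date(2000, 1, 2)),
    ("9750-C04-789", "COL4", "YYYYY", date(2000, 1, 2), END_OF_TIME),
]


def snapshots(rows):
    """Return (NK, start, end, {column: value}) for each window between change times."""
    # Collect every start and end date per natural key (the ChangeTimes CTE).
    change_times = {}
    for nk, _column, _value, start, end in rows:
        change_times.setdefault(nk, set()).update((start, end))

    result = []
    for nk, times in sorted(change_times.items()):
        ordered = sorted(times)
        # Consecutive change times form the windows (the Snapshots CTE).
        for snap_start, snap_end in zip(ordered, ordered[1:]):
            state = {}
            for row_nk, column, value, start, end in rows:
                # Same test as NOT (End <= snap_start OR Start >= snap_end).
                if row_nk == nk and start < snap_end and end > snap_start:
                    state[column] = value
            result.append((nk, snap_start, snap_end, state))
    return result


snaps = snapshots(base)
for snap in snaps:
    print(snap)
```

With the sample rows this yields two windows, and the value that changed on 2000-01-02 lands in exactly one of them.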
