Loading Dimension Tables - Methodologies (SQL Server)

I've recently been working on a project where I need to populate dimension (Dim) tables from EDW tables.
The EDW tables are Type II, so they maintain historical data. When it comes to loading a Dim table, the source may be multiple EDW tables, or a single table that needs multi-level pivoting (on attributes).
Meaning: there could be 10 records, one per attribute, which need to be pivoted on domain_code to make a single row in the Dim. Among those 10 records, some attributes share the same domain_code but have different sub_domain_code values, which need further pivoting on the sub-domain code.
Ex:
If I have domain codes 01, 02, 03, those are a straight pivot on domain code.
I might also have domain code 10 with sub-domain code / version values of 2006, 2007, 2008, 2009.
That means I need to split my source table into two sets: one pivoted on domain code, the other on domain_code + version.
So far so good.
When it comes to loading the Dim table:
As per the design specs for the dimensions (originally written by a third party), what they want is:
For every single attribute change in the EDW, assemble all the related records for that natural key (NK), i.e. the new value plus the current values of the other attributes, process them to create a new dim record, and insert it.
That means if a single extract contains 100 updated records (one per NK), the load has to assemble 100 + (100 * 9) records to insert / update the dim table. How good is this approach?
The other way I tried is to do a lookup into the dim table for that NK, get the values of the most recent record (the attributes that did not change), insert a new record, and expire the current one.
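Roughly what I mean by that second approach, as a sketch (the dimension and column names below are just placeholders for illustration):
DECLARE @NK varchar(20)      = '9750-C04-789';   -- natural key from the extract
DECLARE @NewCol1 varchar(10) = 'AAA-NEW';        -- the one attribute that changed
DECLARE @Effective date      = CAST(GETDATE() AS date);

DECLARE @old TABLE (Col2 varchar(10), Col3 varchar(10), Col4 varchar(10));

-- Expire the current row and capture the attributes that did not change
UPDATE DimEntity
SET EndDate = @Effective, IsCurrent = 0
OUTPUT deleted.Col2, deleted.Col3, deleted.Col4 INTO @old
WHERE NK = @NK AND IsCurrent = 1;

-- Insert the new version, carrying the unchanged attributes forward
INSERT INTO DimEntity (NK, Col1, Col2, Col3, Col4, StartDate, EndDate, IsCurrent)
SELECT @NK, @NewCol1, Col2, Col3, Col4, @Effective, '9999-12-31', 1
FROM @old;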
Which is the better approach: assembling records on the source side for each attribute change, or looking up the dim table's most recent record and processing from that?
If this doesn't make sense, I'm happy to elaborate further.
Thanks
Here is the model of the tables
Table model: http://img96.imageshack.us/img96/1203/modelzp.jpg

Have a look at this example.
It should be relatively straightforward.
It pivots the base data according to your rules.
It determines the change times for the denormalized "row".
It creates a triangular join to determine the start and end of each period (what I'm calling a snapshot)
Then it joins those windows to the base data to determine what the state of the data was at that time (the pivot is actually completed at this time)
I think you may need to look at the windowing mechanism - it's returning the right data, but I don't like the way the window overlap logic looks - it doesn't quite smell right - I'm worried about the boundary conditions.
-- SO3014289
CREATE TABLE #src (
key1 varchar(4) NOT NULL
,key2 varchar(3) NOT NULL
,key3 varchar(3) NOT NULL
,AttribCode int NOT NULL
,AttribSubCode varchar(2)
,Value varchar(10) NOT NULL
,[Start] date NOT NULL
,[End] date NOT NULL
)
INSERT INTO #src VALUES
('9750', 'C04', '789', 1, NULL, 'AAA', '1/1/2000', '12/31/9999')
,('9750', 'C04', '789', 2, NULL, 'BBB', '1/1/2000', '12/31/9999')
,('9750', 'C04', '789', 3, 'V1', 'XXXX', '1/1/2000', '12/31/9999')
,('9750', 'C04', '789', 3, 'V2', 'YYYY', '1/1/2000', '1/2/2000')
,('9750', 'C04', '789', 3, 'V2', 'YYYYY', '1/2/2000', '12/31/9999')
;WITH basedata AS (
SELECT key1 + '-' + key2 + '-' + key3 AS NK
,CASE WHEN AttribCode = 1 THEN Value ELSE NULL END AS COL1
,CASE WHEN AttribCode = 2 THEN Value ELSE NULL END AS COL2
,CASE WHEN AttribCode = 3 AND AttribSubCode = 'V1' THEN Value ELSE NULL END AS COL3
,CASE WHEN AttribCode = 3 AND AttribSubCode = 'V2' THEN Value ELSE NULL END AS COL4
,[Start]
,[End]
FROM #src
)
,ChangeTimes AS (
SELECT NK, [Start] AS Dt
FROM basedata
UNION
SELECT NK, [End] AS Dt
FROM basedata
)
,Snapshots as (
SELECT s.NK, s.Dt AS [Start], MIN(e.Dt) AS [End]
FROM ChangeTimes AS s
INNER JOIN ChangeTimes AS e
ON e.NK = s.NK
AND e.Dt > s.Dt
GROUP BY s.NK, s.Dt
)
SELECT Snapshots.NK
,MAX(COL1) AS COL1
,MAX(COL2) AS COL2
,MAX(COL3) AS COL3
,MAX(COL4) AS COL4
,Snapshots.[Start]
,Snapshots.[End]
FROM Snapshots
INNER JOIN basedata
ON basedata.NK = Snapshots.NK
AND NOT (basedata.[End] <= Snapshots.[Start] OR basedata.[Start] >= Snapshots.[End])
GROUP BY Snapshots.NK
,Snapshots.[Start]
,Snapshots.[End]

Related

Can I pull Max and Min values in SQL without using group by for non aggregate values?

I have a table of user data for when they enroll in a program. The fields include a user ID, start date, end date, entry reason, exit reason and program type. For each year the user is enrolled in a specific program they will have an entry and exit date for that year along with an entry reason. They only get an exit reason when they are exited from the program completely. Here is an example of the data in the table.
Data Table
Desired Result
I need to pull one line for each user that has their original start date in the program, most recent start date, and most recent end date. I also need to pull the exit reason if one exists, and the entry reason associated with the most recent start date, which is what has me hung up. I'm assuming the problem is related to having to group by the entry reason. Is there any way around using an aggregate function to get the min/max dates?
My query is:
Select
    Table1.userID,
    CAST(Min(table2.startdate) as date) as Originalstartdate,
    CAST(Max(table2.startdate) as date) as Maxstartdate,
    CAST(Max(table2.enddate) as date) as ExitDate,
    CASE
        WHEN table2.exitreason = NULL then ''
        ELSE table2.exitreason
    END as Exitcode,
    Table2.entryreason
From
    Table1 left outer join
    Table2 on Table1.userID = Table2.userID
Where
    Table1.status = 'active' and Table2.programID = 'Program1' and (Table2.exitreason <> 'NULL' or Table2.entryreason <> 'NULL')
Group By
    Table1.userID, Table2.exitreason, Table2.entryreason
I used the below sample code in order to generate this.
The idea here is to utilize the userID as the anchor (you want one row per user, right?), aggregating the rest of the information to fit what you requested.
CREATE TABLE SCRIPT:
CREATE TABLE table1
(
userID INT IDENTITY(1, 1) PRIMARY KEY,
name VARCHAR(200),
stat CHAR(1) NOT NULL
DEFAULT 'A');
CREATE TABLE table2
(
t2ID INT IDENTITY(1, 1),
StartDate DATE,
UserID INT FOREIGN KEY REFERENCES table1(userID),
ProgramID VARCHAR(150) DEFAULT 'Program1',
EndDate DATE,
EntryReason VARCHAR(2000),
ExitReason VARCHAR(2000));
INSERT INTO Table1 (name)
SELECT *
FROM (VALUES
    ('First name'),
    ('Second name'),
    ('Third name')) x("name");
INSERT INTO Table2
SELECT *
FROM (VALUES
    ('20180101', 1, 'Program1', '20181231', 11, NULL),
    ('20190101', 1, 'Program1', '20191231', 12, NULL),
    ('20200101', 1, 'Program1', NULL, 11, NULL),
    ('20170101', 2, 'Program1', '20171231', 11, NULL),
    ('20180101', 2, 'Program1', '20171231', 14, 2),
    ('20200101', 3, 'Program1', NULL, 11, NULL)
    ) x(StartDate, UserID, ProgramID, EndDate, EntryReason, ExitReason);
QUERY:
SELECT t1.userID,
CAST(MIN(t2.StartDate) AS DATE)
AS OriginalStartDate, -- This uses your logic to grab the earliest date
CAST(MAX(t2.StartDate) AS DATE)
AS RecentStartDate, -- This utilizes your logic to grab the last start date
CAST(MAX(t2.enddate) AS DATE)
AS ExitDate,
-- This works because we know an ExitDate must be populated due to the where
-- criteria (which prevents people who haven't exited yet from showing up)
ISNULL(MAX(t2.exitreason), '')
AS ExitCode, -- This is just a cleaner way to handle nulls.
STUFF(
(
SELECT CONCAT(',', EntryReason)
FROM Table2
WHERE Table2.UserID=t1.UserID FOR XML PATH('')
), 1, 1, '')
AS EntryReasonList
-- this solution creates a list of entry reasons; we could pick a best winner
-- (e.g. first entry code, last entry code..) but I created a list because
-- I didn't understand your intent.
FROM Table1
AS t1
LEFT JOIN
Table2
AS t2
ON T1.userID=T2.userID
WHERE t1.stat='A' -- you would use status= 'active'
AND t2.programID='Program1' -- same as before
AND NOT EXISTS
-- a not exists clause will do what you want to filter graduates out
(
SELECT 1
FROM Table2
AS t2self
WHERE t2.userID=t2self.userID
AND t2self.exitreason IS NOT NULL
)
GROUP BY t1.userID;
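If what you really want for the entry reason is the single value tied to the most recent start date (rather than a list), one option is to rank the rows per user and join the top row back in. This is only a sketch against the same sample tables (I've left out the NOT EXISTS graduate filter for brevity):
WITH ranked AS
(
    SELECT t2.UserID,
           t2.EntryReason,
           ROW_NUMBER() OVER (PARTITION BY t2.UserID
                              ORDER BY t2.StartDate DESC) AS rn
    FROM Table2 AS t2
    WHERE t2.ProgramID = 'Program1'
)
SELECT t1.userID,
       CAST(MIN(t2.StartDate) AS DATE) AS OriginalStartDate,
       CAST(MAX(t2.StartDate) AS DATE) AS RecentStartDate,
       CAST(MAX(t2.EndDate)   AS DATE) AS ExitDate,
       ISNULL(MAX(t2.ExitReason), '') AS ExitCode,
       MAX(r.EntryReason)             AS RecentEntryReason -- reason from the latest start date
FROM Table1 AS t1
LEFT JOIN Table2 AS t2 ON t2.UserID = t1.userID
LEFT JOIN ranked AS r  ON r.UserID  = t1.userID AND r.rn = 1
WHERE t1.stat = 'A'
  AND t2.ProgramID = 'Program1'
GROUP BY t1.userID;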

I have two tables. I need a query that combines these two tables and sorts them by Key A, Key B, Key C, Date, Time

I have two tables in a SQL DB. They both contain 3 columns that match, and additional columns that have different info in each one. I want to write a query that interleaves them according to date / timestamp. Table A is for a machine that runs and takes a sample every 10 minutes. Table B is the log file that has entries logged when the operator makes adjustments, turns the machine on / off, etc.
I have used the following query but it is giving me duplicates on table A.
I added the where (BatchTable.Batch = 'HB20419' and EventLogTable.Batch = 'HB20419') just to cut down on the amount of data being returned until I get the query figured out. One complication is that each table has its own date / time columns, and they are named differently and are completely independent of each other.
SELECT BatchTable.Asset_Number,BatchTable.Recipe,BatchTable.Batch,BatchTable.Group_No, BatchTable.Sample_No, BatchTable.SampleDate, BatchTable.SampleTime, BatchTable.Weight, EventLogTable.EvtTime, EventLogTable.EvtValueBefore, EventLogTable.EvtValueAfter, EventLogTable.EvtComment
FROM BatchTable,EventLogTable
where(BatchTable.Batch = 'HB20419' and EventLogTable.Batch = 'HB20419')
order by Asset_Number, Recipe, Batch, Group_No, Sample_No ASC
Here is how that query would look using aliases, formatting and ANSI-92 style joins.
SELECT bt.Asset_Number
, bt.Recipe
, bt.Batch
, bt.Group_No
, bt.Sample_No
, bt.SampleDate
, bt.SampleTime
, bt.Weight
, elt.EvtTime
, elt.EvtValueBefore
, elt.EvtValueAfter
, elt.EvtComment
FROM BatchTable bt
join EventLogTable elt on elt.Batch = bt.Batch
WHERE bt.Batch = 'HB20419'
ORDER BY Asset_Number
, Recipe
, Batch
, Group_No
, Sample_No ASC
I had to make up some sample data, but it sounds like you want to union the two tables together to "interleave" them. You can do this by aliasing the column names to match and selecting null values for the final values from the opposite table. I acknowledge that I'm guessing at your desired outcome to some extent.
Make some sample data:
DECLARE @batch TABLE (SampleDate VARCHAR(MAX), SampleTime VARCHAR(MAX), Recipe VARCHAR(MAX))
DECLARE @event TABLE (EvtTime DATETIME, EvtComment VARCHAR(MAX))
INSERT INTO @batch (SampleDate, SampleTime, Recipe) VALUES ('2018-08-09', '11:56:25 AM', 'Peanut Butter'), ('2018-08-09', '12:11:25 PM', 'Chocolate')
INSERT INTO @event (EvtTime, EvtComment) VALUES ('2018-08-09 11:58:22 AM', 'Turned up speed'), ('2018-08-09 11:59:22 AM', 'Turned down temperature')
Then select and union to interleave:
SELECT CONVERT(DATETIME, CAST(SampleDate + ' ' + SampleTime AS datetime)) AS [Date],
Recipe, NULL as EvtComment FROM @batch
UNION
SELECT EvtTime AS [Date], NULL AS Recipe, EvtComment FROM @event
ORDER BY [Date]
Which yields:
Date Recipe EvtComment
----------------------- ------------------------- -------------------------
2018-08-09 11:56:25.000 Peanut Butter NULL
2018-08-09 11:58:22.000 NULL Turned up speed
2018-08-09 11:59:22.000 NULL Turned down temperature
2018-08-09 12:11:25.000 Chocolate NULL
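Applied to your actual tables, the same pattern would look roughly like this. This is only a sketch: it assumes SampleDate / SampleTime are stored as text (as in the sample above) and that Batch is the shared key, so adjust the conversion to whatever types you really have:
SELECT bt.Asset_Number, bt.Recipe, bt.Batch, bt.Group_No, bt.Sample_No,
       CAST(bt.SampleDate + ' ' + bt.SampleTime AS datetime) AS EventDateTime,
       bt.Weight,
       NULL AS EvtValueBefore, NULL AS EvtValueAfter, NULL AS EvtComment
FROM BatchTable bt
WHERE bt.Batch = 'HB20419'
UNION ALL
SELECT NULL, NULL, elt.Batch, NULL, NULL,
       elt.EvtTime,
       NULL,
       elt.EvtValueBefore, elt.EvtValueAfter, elt.EvtComment
FROM EventLogTable elt
WHERE elt.Batch = 'HB20419'
ORDER BY Batch, EventDateTime;
UNION ALL is used here rather than UNION because the two sides can never produce duplicate rows of each other, and it avoids an unnecessary sort.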

Count(*) for View returning different results on SQL Server

I am working on an ETL optimization problem that requires creating a temp table that can be merged with the final table. Currently I have a couple of views that are used to load the final table, and that is taking a lot of time. I took the SQL logic from the view, created a temp table, and noticed that the values in the temp table do not match the values in the final table. To look deeper I ran count(*) on the view a couple of times and noticed that the total row count is different for every run, by about 10-15 rows give or take. The view has 16 columns from 9 tables, which load only once a day. So at the time I run the count(*) the underlying data does not change, but the result of the count from the view does change.
This is on a SQL Server 2016 server. I have tried looking into the view logic and nothing stands out as odd. I have tried doing a count(*) on the tables that load this view and the counts for those tables do not change. I have also tried creating a 2-column table from the view logic to simplify the problem and running an EXCEPT, and that still yields about 20 rows of inconsistent values between two copies of the 2-column table created from the same exact view logic.
Here is a reproduction of the VIEW definition that has the row count inconsistency
USE [PROD]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE VIEW Base_View
AS
select
concat(x, y, z)feild1
,*
,ROW_NUMBER() OVER(PARTITION BY a,b ORDER BY some_Date) AS rec_num
,count(a) OVER(PARTITION BY a) AS rec_total
from (
SELECT
case when RESULT='stored value' and e.code is not null then 'x' else '' end x
,case when RESULT='stored value 2' and r.l_id is not null then 'y' else '' end y
,case when RESULT in ('stored value 3','stored value 4') and t.amount is not null then 'z' else '' end z
,case when
CASE WHEN
(m.status = 'stored value 4' OR m.status = 'stored value 5')
AND m.bal < 0
THEN
CASE WHEN DATEDIFF(day,m.due,m.SNAP_DATE) < 0
THEN 0
ELSE DATEDIFF(day,m.due,m.SNAP_DATE)
END
ELSE 0
END=0 AND w.W_ID is null AND m.status<>'stored value 5'
then case
when RESULT in ('stored value 5','stored value 4')
then case when isnull(AMOUNT,0)<>0
then 'abc'
else 'def' end
else 'abc' end
else 'def'
end imp_feild
,result
,es.emp_id
,concat(es.fname,' ',es.lname)task_emp
,concat(e.fname,' ',e.lname)ext_emp
,case when RESULT ='stored value' then t.P_STATUS else null end p_status
,t.CREATE_DATE
,t.l_key
,t.l_id
,m.status
,cast(w.wodate as date)wo_date
,rm.balance refi_balance,rnl.LOAN_key refi_loan,r.effective refi_effective
,case trancode when 'ext' then m.payment else null end ext_amount,e.entered ext_entered,e.effective ext_effective
FROM
(
select t0.*,ROW_NUMBER() OVER(PARTITION BY t0.some_KEY,cast(t0.CREATE_DATE as date),t0.output
ORDER BY t0.some_KEY,cast(t0.CREATE_DATE as date),t0.output ) AS SEQ_NUM
from base_table_1 t0
left join base_table_2 e0
on t0.c_e_key=e0.e_key
where t0.active_rec_ind='Y'
and t0.output in (d,e,f,g)
and (t0.output2 in (j,k)
or ISNULL(e0.some_KEY,'h') in ('u','w'))
) t
join
base_table_3 l
on t.loan_sf_id=l.loan_sf_id
and t.active_rec_ind='Y'
join base_table_4 m
on
t.SOME_DATE=m.SNAP_DATE
and t.L_ID=m.L_ID
left
join base_table_5 es
on t.c_emp_key=es.emp_key
left
join base_table_6 r
on l.l_id=r.l_old_id
and r.entered between dateadd(day,0,cast(t.CREATE_DATE as date)) and dateadd(day,0,t.SOME_DATE)
left
join base_table_7 w
on l.l_id=w.l_id
and w.wodate between cast(t.CREATE_DATE_ETZ as date) and dateadd(day,0,t.SOME_DATE)
left
join base_table_8 wl
on w.l_id=wl.l_id
left
join base_table_8 rnl
on r.l_new_id=rnl.l_id
left
join base_table_8 rol
on r.l_old_id=rol.l_id
left
join base_table_4 rm
on
dateadd(day,-1,r.effective)=rm.SNAP_DATE
and rol.L_ID=rm.L_ID
left
join
(select e0.*,ew.value_1,ew.new_key,ROW_NUMBER() OVER(PARTITION BY e0.L_ID,e0.ENT ORDER BY e0.L_ID,e0.ENT) AS SEQ_NUM
from base_table_9 e0
join base_table_5 ew
on e0.EMP_ID=ew.EMP_ID
where e0.code='a'
) e
on l.sid=e.sid
and e.code='a' and RESULT='stored value 5'
and e.entered between cast(t.CREATE_DATE as date) and dateadd(day,0,t.HOLD_DATE)
AND e.SEQ_NUM=t.SEQ_NUM
and ((isnumeric(e.roll_key)=1 and isnumeric(es.roll_key)=1 and e.roll_key=es.roll_key)
or ((isnumeric(e.roll_key)=0 or isnumeric(es.roll_key)=0) and e.FNAME+e.LNAME=es.FNAME+es.LNAME))
where t.RESULT in ('abc','def')
and cast(t.CREATE_DATE as date) between cast(dateadd(month,-12,getdate()) as date) and cast(getdate() as date)
and (AGENT in ('lmn', 'pqr')
or ISNULL(es.VKEY,'stored value 8') in ('xx','yy','zz'))
)x
where imp_feild='abc'
and concat(x, y, z)<>''
or imp_feild='def'
GO
The expected result is that it returns a consistent number for the row count, which hopefully will also solve the inconsistent values problem in the temp table.
Your query has between cast(dateadd(month,-12,getdate()) as date) and cast(getdate() as date) near the bottom. Of course the result of getdate() will be different with each execution and each call to getdate(). That will affect the result.
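A quick way to make the window stable while you compare the temp table with the final table is to capture the date once and reuse it in the logic you lifted out of the view (a sketch; @AsOf would replace every getdate() call):
DECLARE @AsOf date = CAST(GETDATE() AS date);

-- instead of:
--   between cast(dateadd(month,-12,getdate()) as date) and cast(getdate() as date)
-- use:
--   between DATEADD(month, -12, @AsOf) and @AsOf
SELECT DATEADD(month, -12, @AsOf) AS WindowStart, @AsOf AS WindowEnd;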
BTW, having * in your SELECT list is not a good idea. You should only return the columns needed. It makes the view results vulnerable to changes in the underlying tables.
There are a few other things that wouldn't pass code review where I work but that's kinda OT, I think.
This is too long for a comment. Using * in a view is a very bad idea. Not only does the view NOT update (unless you execute sp_refreshview) when you change the base table, you can actually get some very interesting things happening.
Check this out as an example of just how bad this can be.
create table ViewExample (Col1 int, Col2 int)
go
create view ViewExampleView as select * from ViewExample
go
insert ViewExample select 1, 2
go
select * from ViewExampleView --obviously we get just a single row
alter table ViewExample add Col3 int --add a new column to the table, surely the view will pick this up?
go
insert ViewExample select 3, 4, 5 --insert a new row with data in all three columns
go
select * from ViewExampleView --what??? The view says select * but we only get Col1 and Col2?
alter table ViewExample drop column Col2 --Oops we decide to drop this column because we don't need it anymore
select * from ViewExampleView --What in the world? Col2 doesn't exist in the table, why is it in the view? And what the heck is going on here. The data from Col3 is now moved to Col2
drop view ViewExampleView
drop table ViewExample
Notice how, in the last select from the view, the data from Col3 is displayed in Col2. If this doesn't convince you to stop using * in views (and pretty much everywhere) I don't know what will.
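For completeness: if you re-run the example and call sp_refreshview right after the ALTER TABLE, the view rebinds to the table's current columns and the shift goes away, though the better fix is still to list the columns explicitly.
alter table ViewExample add Col3 int
go
exec sp_refreshview 'ViewExampleView' --rebind the view metadata to the current table definition
go
select * from ViewExampleView --Col3 now shows up in the right place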

Count rows that follow other rows in a single table, both restricted with a where clause

I'm using SQL Server 2014.
I have a table that contains several millions of events. The primary key is composed of three columns:
Time (datetime)
user (bigint)
context (varchar(50))
I have another column, value (nvarchar(max)).
I need to count rows restricted on
context = 'somecontext' and value = 'value2'
that follow in time rows restricted on
context = 'somecontext' and value = 'value1'
for the same user.
For Example with the following records:
Time user context value
2019-02-22 14:56:57.710 359586015014836 somecontext value1
2019-02-22 15:13:42.887 359586015014836 somecontext value2 <------ Need to count only rows like this one.
It is "recorded" 15 min after the first one and the user and context are the same.
I have seen other similar questions like this one or that one.
Should I make a JOIN on the same table? Use subqueries? Maybe a CTE? I'm concerned about performance, which should be optimal.
The idea would be to use query features available in this version of the DB engine.
If the example that I made in the comment is what you want, then you can use the following code,
assuming that you want to select all the rows where context = 'c1', current value = 'v1', next value = 'v3' if ordered by time:
declare @t table
(
Time_ DateTime,
user_ bigint,
context varchar(50),
value_ varchar(50)
);
insert into @t values
('20000101', 1, 'c1', 'v1'),
('20000102', 1, 'c2', 'v3'),
('20000103', 1, 'c1', 'v3'),
('20000104', 2, 'c1', 'v1'),
('20000105', 2, 'c1', 'v4'),
('20000106', 2, 'c1', 'v2');
with cte as
(
select *,
lead(value_) over(partition by user_ order by time_) as next_value
from @t
where context = 'c1'
)
select *
from cte
where next_value = 'v3';
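If what you need is the count itself (value2 rows whose immediately preceding row for the same user is value1, mirroring the LEAD example above), the same CTE can be flipped around with LAG and wrapped in a COUNT. A sketch against the same sample table:
with cte as
(
select *,
lag(value_) over(partition by user_ order by time_) as prev_value
from @t
where context = 'c1'
)
select count(*) as cnt
from cte
where value_ = 'v2'
and prev_value = 'v1';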

SQL While Loop insert

*Updated - please see below (past the picture)
I am really stuck on this particular problem. I have two tables, Projects and Project Allocations, joined by the Project ID.
My goal is to populate a modified projects table's columns using the rows of the project allocations table. I've included an image below to illustrate what I'm trying to achieve.
A project can have up to 6 Project Allocations. Each Project Allocation has an auto-increment ID (Allocation ID), but I can't use this ID in sub-selects because it isn't in a range of 1-6 that would let me distinguish who is PA1, PA2, or PA3.
Example:
(SELECT pa1.name FROM table where project.projectid = project_allocations.projectid and JVID = '1') as [PA1 Name],
(SELECT pa2.name FROM table where project.projectid = project_allocations.projectid and JVID = '1') as [PA2 Name],
The modified Projects table has columns for PA1, PA2, PA3. I need to populate these columns based on the project allocations table. So the first record in the database FOR EACH project will be PA1.
I've put together a SQL Agent job that drops and re-creates this table with the added columns, so this is more about writing the project allocation rows into the modified projects table by row number.
Any advice?
--Update
What I need to do now is get a row number added as a column for EACH project, in descending order.
So the first row for each project ID will be 1, and each row after that will be 2, 3, 4, 5, 6.
I've found the following code on this website:
use db_name
with cte as
(
select *
, new_row_id=ROW_NUMBER() OVER (ORDER BY eraprojectid desc)
from era_project_allocations_m
where era_project_allocations_m.eraprojectid = era_project_allocations_m.eraprojectid
)
update cte
set row_id = new_row_id
I've added row_id as a column in the previous SQL Agent step; this code runs, but it doesn't produce a row_number FOR EACH projectid.
As you can see from the above image, I need 1-2 FOR EACH project ID - effectively giving me thousands of 1s, 2s, 3s, 4s.
That way I can sort them into columns :)
From what I can tell, a query using ROW_NUMBER is what you are after. (Also, it might be a job for a pivot.)
Example:
create table Something (
someId int,
someValue varchar(255)
);
insert into Something values (1, 'one'), (1, 'two'), (1, 'three'), (1, 'four'), (2, 'ein'), (2, 'swei'), (3, 'un');
with cte as (
select someId,
someValue,
row_number() over(partition by someId order by someId) as rn
from Something
)
select distinct someId,
(select someValue from cte where ct.someId = someId and rn = 1) as value1,
(select someValue from cte where ct.someId = someId and rn = 2) as value2,
(select someValue from cte where ct.someId = someId and rn = 3) as value3,
(select someValue from cte where ct.someId = someId and rn = 4) as value4
into somethingElse
from cte ct;
select * from somethingElse;
Result:
someId value1 value2 value3 value4
1 one two three four
2 ein swei NULL NULL
3 un NULL NULL NULL
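The same shaping can also be written with PIVOT instead of the correlated sub-selects; here is a sketch using the same sample table:
select someId,
       [1] as value1, [2] as value2, [3] as value3, [4] as value4
from
(
    select someId,
           someValue,
           row_number() over(partition by someId order by someId) as rn
    from Something
) src
pivot (max(someValue) for rn in ([1], [2], [3], [4])) p;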
