About Surrogate key in Loading Process in DataWarehouse - sql-server

When you run the loading process from the stage table to the fact and dimension tables, do you also load the surrogate key from stage into the dimension table for new rows?
Or do you create a new surrogate key in the dimension table by using the IDENTITY property on the table (https://learn.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql-identity-property?view=sql-server-2017)?
Which approach is correct?
Other information:
* I'm a newbie in ETL and Business Intelligence.
* I'm using only T-SQL, no SSIS.
Thank you!

The question is not very clear. I'll attempt to answer based on what I think you are asking, but it would be better to make the question crystal clear to people unfamiliar with the data, and to provide sample data.
I think you are asking whether you need to load entries into a dimension table, for records that are being loaded into a fact table, at the same time the fact table is being loaded.
Generally the dimension members are loaded into the dimension table before loading data into the fact table. It's just easier to do it this way if at all possible.
The steps I would use, in order, are:
Load the dimension with any new members in its own stored procedure. This ensures you now have a surrogate key for any new members. Do this for all dimensions. (A sketch of this step follows the sample code below.)
Create a second stored procedure to load the fact table. Join the staging table to the dimension tables to get the surrogate keys; the code below shows an example for one dimension, but just add more joins to more dimensions as needed.
The code below populates a sample dimension and fact staging table with contrived data, to show how to get the surrogate key and data to be inserted into the fact table.
-- staging table holding the incoming fact rows
create table #factstaging
(
    dimension1Value nvarchar(20),
    factmeasure1 int,
    factmeasure2 int
)
-- dimension table; ID is the surrogate key
create table #dimension1
(
    ID int identity(1,1),
    dimension1Value nvarchar(20)
)
insert into #dimension1
values
    ('d1 value 1'),
    ('d1 value 2'),
    ('d1 value 3')
insert into #factstaging
values
    ('d1 value 1',22,44),
    ('d1 value 1',22,44),
    ('d1 value 2',22,44),
    ('d1 value 3',22,44)
--contents of stored procedure to insert fact rows
select d1.ID as Dimension1SurrogateKey, s.factmeasure1, s.factmeasure2
from #factstaging s
join #dimension1 d1 on s.dimension1Value = d1.dimension1Value
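For step 1 (loading any new dimension members before the fact load), a minimal sketch against the same temporary tables might look like the following; it simply inserts staged values that do not yet exist in the dimension and lets the IDENTITY column generate the surrogate keys:
insert into #dimension1 (dimension1Value)
select distinct s.dimension1Value
from #factstaging s
where not exists
(
    select 1
    from #dimension1 d
    where d.dimension1Value = s.dimension1Value
)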
Note:
Your data needs to be clean.
If facts are arriving before the dimension data, the pattern will be different: you will need something like a late-arriving dimension pattern, which is a lot more complex.

Related

Primary Key Constraint, migration of table from local db to snowflake, recommended data types for json column?

What order can I copy data into two different tables to comply with the table constraints I created locally?
I created an example from the documentation, but was hoping to get recommendations on how to optimize the data stored by selecting the right types.
I created two tables: one is a list of names and the second is a list of names with a date they did something.
create or replace table name_key (
    id integer not null,
    id_sub integer not null,
    constraint pkey_1 primary key (id, id_sub) not enforced,
    name varchar
);
create or replace table recipts (
    col_a integer not null,
    col_b integer not null,
    constraint fkey_1 foreign key (col_a, col_b) references name_key (id, id_sub) not enforced,
    recipt_date datetime,
    did_stuff variant
);
Insert into name_key values (0, 0, 'Geinie'), (1, 1, 'Greg'), (2,2, 'Alex'), (3,3, 'Willow');
Insert into recipts (col_a, col_b, recipt_date) values (0, 0, Current_date()), (1, 1, Current_date()), (2, 2, Current_date()), (3, 3, Current_date());
Select * from name_key;
Select * from recipts;
Select * from name_key
join recipts on name_key.id = recipts.col_a
where id = 0 or col_b = 2;
I read https://docs.snowflake.net/manuals/user-guide/table-considerations.html#storing-semi-structured-data-in-a-variant-column-vs-flattening-the-nested-structure where it recommends changing timestamps from strings to a variant. I did not include the fourth column in the inserts; I left it blank for future use. Essentially it captures data in JSON format, so I made it a variant. Would it be better to rethink this table structure and flatten the variant column?
Also, I would like to change the key to AUTO_INCREMENT. Is there something like this in Snowflake?
What order can I copy data into two different tables to comply with the table constraints I created locally?
You need to give more context about your constraints, but you can control the order of your COPY statements. For foreign keys, you generally want to load the referenced table before the table that references it.
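For the two example tables above, that simply means loading name_key before recipts. A rough sketch (the stage and file names are made up for illustration):
copy into name_key from @my_stage/name_key.csv file_format = (type = csv);  -- parent / referenced table first
copy into recipts from @my_stage/recipts.csv file_format = (type = csv);    -- child / referencing table second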
where it recommends to change timestamps from strings to a variant.
I think you misread that documentation. It recommends extracting values from a variant column into their own separate columns (in this case a timestamp column), ESPECIALLY if those columns are dates and times, arrays, and numbers within strings.
Converting a timestamp column to a variant is exactly what it is recommending against.
Would it be better to rethink this table structure to flatten the variant column?
It's definitely good to think carefully about, and do performance tests on, situations where you are using semi-structured data, but without more information on your specific situation and data, it's hard to say.
Also I would like to change the key to AUTO_INCREMENT, is there something like this in Snowflake?
Yes, Snowflake has an AUTOINCREMENT (a.k.a. IDENTITY) column property, although I've heard it has some issues when used with COPY INTO statements.
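As a sketch, the name_key table above could declare its id column with AUTOINCREMENT (the start/increment values shown are just the defaults, for illustration):
create or replace table name_key (
    id integer autoincrement start 1 increment 1,
    id_sub integer not null,
    constraint pkey_1 primary key (id, id_sub) not enforced,
    name varchar
);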

How to emulate a BEFORE INSERT trigger in T-SQL / SQL Server for super/subtype (Inheritance) entities? [duplicate]

This question already has answers here:
How can I do a BEFORE UPDATED trigger with sql server?
(9 answers)
Closed 2 years ago.
This is on Azure.
I have a supertype entity and several subtype entities, the latter of which need to obtain their foreign keys from the primary key of the supertype entity on each insert. In Oracle, I use a BEFORE INSERT trigger to accomplish this. How would one accomplish this in SQL Server / T-SQL?
DDL
CREATE TABLE super (
    super_id int IDENTITY(1,1)
    ,subtype_discriminator char(4) CHECK (subtype_discriminator IN ('SUB1', 'SUB2'))
    ,CONSTRAINT super_id_pk PRIMARY KEY (super_id)
);
CREATE TABLE sub1 (
    sub_id int IDENTITY(1,1)
    ,super_id int NOT NULL
    ,CONSTRAINT sub_id_pk PRIMARY KEY (sub_id)
    ,CONSTRAINT sub_super_id_fk FOREIGN KEY (super_id) REFERENCES super (super_id)
);
I wish for an insert into sub1 to fire a trigger that actually inserts a value into super and uses the super_id generated to put into sub1.
In Oracle, this would be accomplished by the following:
CREATE TRIGGER sub_trg
BEFORE INSERT ON sub1
FOR EACH ROW
DECLARE
    v_super_id int; -- ignore the fact that I could have used super_id_seq.CURRVAL
BEGIN
    INSERT INTO super (super_id, subtype_discriminator)
    VALUES (super_id_seq.NEXTVAL, 'SUB1')
    RETURNING super_id INTO v_super_id;
    :NEW.super_id := v_super_id;
END;
Please advise on how I would simulate this in T-SQL, given that T-SQL lacks the BEFORE INSERT capability.
Sometimes a BEFORE trigger can be replaced with an AFTER one, but this doesn't appear to be the case in your situation, because you clearly need to provide a value before the insert takes place. So, for that purpose, the closest functionality would seem to be the INSTEAD OF trigger, as @marc_s suggested in his comment.
Note, however, that, as the names of these two trigger types suggest, there's a fundamental difference between a BEFORE trigger and an INSTEAD OF one. In both cases the trigger runs before the action requested by the invoking statement has taken place, but with an INSTEAD OF trigger that action is never supposed to take place at all: whatever really needs to be done must be done by the trigger itself. This is very unlike the BEFORE trigger functionality, where the statement is always due to execute, unless, of course, you explicitly roll it back.
But there's one other issue to address. As your Oracle script reveals, the trigger you need to convert uses another feature unsupported by SQL Server: FOR EACH ROW. There are no per-row triggers in SQL Server, only per-statement ones. That means you always need to keep in mind that the inserted data are a row set, not just a single row. That adds more complexity, although it probably concludes the list of things you need to account for.
So, it's really two things to solve then:
replace the BEFORE functionality;
replace the FOR EACH ROW functionality.
My attempt at solving these is below:
CREATE TRIGGER sub_trg
ON sub1
INSTEAD OF INSERT
AS
BEGIN
    DECLARE @new_super TABLE (
        super_id int
    );
    INSERT INTO super (subtype_discriminator)
    OUTPUT INSERTED.super_id INTO @new_super (super_id)
    SELECT 'SUB1' FROM INSERTED;
    INSERT INTO sub1 (super_id)
    SELECT super_id FROM @new_super;
END;
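A quick way to exercise this trigger, assuming the DDL from the question has been created, might be the following (the dummy value 0 only satisfies the NOT NULL column; the INSTEAD OF trigger discards it):
INSERT INTO sub1 (super_id) VALUES (0);
SELECT * FROM super; -- one new row with discriminator 'SUB1'
SELECT * FROM sub1;  -- super_id holds the newly generated super.super_id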
This is how the above works:
The same number of rows as are being inserted into sub1 is first added to super. The generated super_id values are stored in temporary storage (a table variable called @new_super).
The newly generated super_ids are then inserted into sub1.
Nothing too difficult really, but the above will only work if you have no other columns in sub1 than those you've specified in your question. If there are other columns, the trigger will need to be a bit more complex.
The problem is to assign the new super_ids to every inserted row individually. One way to implement the mapping could be like below:
CREATE TRIGGER sub_trg
ON sub1
INSTEAD OF INSERT
AS
BEGIN
    DECLARE @new_super TABLE (
        rownum int IDENTITY (1, 1),
        super_id int
    );
    INSERT INTO super (subtype_discriminator)
    OUTPUT INSERTED.super_id INTO @new_super (super_id)
    SELECT 'SUB1' FROM INSERTED;
    WITH enumerated AS (
        SELECT *, ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS rownum
        FROM inserted
    )
    INSERT INTO sub1 (super_id, other columns)   -- replace "other columns" with the actual column list
    SELECT n.super_id, i.other columns           -- likewise here
    FROM enumerated AS i
    INNER JOIN @new_super AS n
        ON i.rownum = n.rownum;
END;
As you can see, an IDENTITY(1,1) column is added to @new_super, so the temporarily stored super_id values are additionally enumerated starting from 1. To provide the mapping between the new super_ids and the new data rows, the ROW_NUMBER function is used to enumerate the INSERTED rows as well. As a result, every row in the INSERTED set can now be linked to a single super_id and thus completed into a full data row to be inserted into sub1.
Note that the order in which the new super_ids are inserted may not match the order in which they are assigned. I considered that a non-issue: all the new super rows generated are identical save for the IDs, so all you need here is to take one new super_id per new sub1 row.
If, however, the logic of inserting into super is more complex and for some reason you need to remember precisely which new super_id has been generated for which new sub row, you'll probably want to consider the mapping method discussed in this Stack Overflow question:
Using merge..output to get mapping between source.id and target.id
While Andriy's proposal will work well for INSERTs of a small number of records, full table scans will be done on the final join, as both 'enumerated' and '@new_super' are unindexed, resulting in poor performance for large inserts.
This can be resolved by specifying a primary key on the @new_super table variable, as follows:
DECLARE @new_super TABLE (
    rownum int IDENTITY(1,1) PRIMARY KEY CLUSTERED,
    super_id int
);
This will result in the SQL optimizer scanning through the 'enumerated' table but doing an indexed join on @new_super to get the new key.

Database Performance and maintenance with one thousand columns.

I need to create a table with one thousand fields (columns) and I don't know how to handle the performance or how to maintain it. Please help me with suggestions.
If, most of the time, most values are NULL, then you should upgrade to SQL Server 2008 and use sparse columns; see Using Sparse Columns and Using Column Sets.
If your column values are not mostly NULL then I question the soundness of your data model.
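For reference, a minimal sketch of the sparse-column approach mentioned above, with a column set (the table and column names are made up for illustration):
create table dbo.wide_doc
(
    id int identity(1,1) primary key,
    attr1 int sparse null,
    attr2 varchar(50) sparse null,
    all_attrs xml column_set for all_sparse_columns  -- exposes the non-null sparse values as XML
);
insert into dbo.wide_doc (attr1) values (42);
select id, all_attrs from dbo.wide_doc;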
First things you will have to do:
Normalize.
Define the entities and separate them out into different tables. Draw an ER diagram and you will get more ideas.
Don't go much beyond about 15 columns per table if the columns are varchar or text, because then SQL Server will have to store the data in different pages. If the columns are boolean, it can be around 30.
Define the clustered index properly based on your data, as this will optimize querying.
Since the question doesn't give many details, the answers are also generic and from a 100-foot view.
Again, please don't do this.
Check out the Entity Attribute Value model with respect to databases. It will help you store a large amount of sparse attributes on an entity and doesn't make databases cry.
The basic concept is shown below
-- catalogue of the possible attributes (one row per would-be column)
create table #attributes
(
    id int identity(1,1),
    attribute varchar(20),
    attribute_description varchar(max),
    attribute_type varchar(20)
)
insert into #attributes values ('Column 1','what you want to put in column 1 of 1000','string')
insert into #attributes values ('Column 2','what you want to put in column 2 of 1000','int')
-- the entities themselves, holding only what every entity always has
create table #entity
(
    id int identity(1,1),
    whatever varchar(max)
)
insert into #entity values ('Entity1')
insert into #entity values ('Entity2')
-- one row per (entity, attribute) pair that actually has a value
create table #entity_attribute
(
    id int identity(1,1),
    entity_id int,
    attribute_id int,
    attribute_value varchar(max)
)
insert into #entity_attribute values (1,1,'e1value1')
insert into #entity_attribute values (1,2,'e1value2')
insert into #entity_attribute values (2,2,'e2value2')
select *
from #entity e
join #entity_attribute ea on e.id = ea.entity_id
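If you also want the attribute names and types alongside the values, a query like the following over the same temporary tables does the trick:
select e.whatever as entity,
       a.attribute,
       a.attribute_type,
       ea.attribute_value
from #entity e
join #entity_attribute ea on e.id = ea.entity_id
join #attributes a on a.id = ea.attribute_id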
The difference between what goes in the #entity table and what goes in the attribute tables is somewhat dependent on the application, but a general rule would be: anything that is never null and is accessed every time you need the entity stays on #entity, and I would limit that to 10 or so items.
Let me guess: is this a medical application?

Best way to move data between tables and generate mapping of old to new identity values

I need to merge data from 2 tables into a third (all having the same schema) and generate a mapping of old identity values to new ones. The obvious approach is to loop through the source tables using a cursor, inserting the old and new identity values along the way. Is there a better (possibly set-oriented) way to do this?
UPDATE: One additional bit of info: the destination table already has data.
Create your mapping table with an IDENTITY column for the new ID. Insert from your source tables into this table, creating your mapping.
SET IDENTITY_INSERT ON for your target table.
Insert into the target table from your source tables joined to the mapping table, then SET IDENTITY_INSERT OFF.
I created a mapping table based on the OUTPUT clause of the MERGE statement. No IDENTITY_INSERT required.
In the example below, there is RecordImportQueue and RecordDataImportQueue, and RecordDataImportQueue.RecordID is a FK to RecordImportQueue.RecordID. The data in these staging tables needs to go to Record and RecordData, and FK must be preserved.
RecordImportQueue to Record is done using a MERGE statement, producing a mapping table from its OUTPUT, and RecordDataImportQueue goes to RecordData using an INSERT from a SELECT of the source table joined to the mapping table.
DECLARE @MappingTable table ([NewRecordID] [bigint], [OldRecordID] [bigint])
MERGE [dbo].[Record] AS target
USING (SELECT [InstanceID]
             ,RecordID AS RecordID_Original
             ,[Status]
       FROM [RecordImportQueue]
      ) AS source
ON (target.RecordID = NULL) -- can never match, as RecordID is IDENTITY NOT NULL
WHEN NOT MATCHED THEN
    INSERT ([InstanceID],[Status])
    VALUES (source.[InstanceID], source.[Status])
OUTPUT inserted.RecordID, source.RecordID_Original INTO @MappingTable;
After that, you can insert the records into a referencing table as follows:
INSERT INTO [dbo].[RecordData]
           ([InstanceID]
           ,[RecordID]
           ,[Status])
SELECT [InstanceID]
      ,mt.NewRecordID -- the new RecordID from the mapping table
      ,[Status]
FROM [dbo].[RecordDataImportQueue] AS rdiq
JOIN @MappingTable AS mt
    ON rdiq.RecordID = mt.OldRecordID
Although long after the original post, I hope this can help other people, and I'm curious for any feedback.
I think I would temporarily add an extra column to the new table to hold the old ID. Once your inserts are complete, you can extract the mapping into another table and drop the column.
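A rough sketch of that extra-column approach, using hypothetical tables dbo.Source1(ID, Payload), dbo.Target(ID int identity, Payload) and a hypothetical dbo.SourceMapping table for the extracted mapping:
alter table dbo.Target add OldID int null;
GO
insert into dbo.Target (Payload, OldID)
select Payload, ID
from dbo.Source1;
-- capture the mapping of old to new identity values, then drop the helper column
select ID as NewID, OldID
into dbo.SourceMapping
from dbo.Target
where OldID is not null;
alter table dbo.Target drop column OldID;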

Defining a one-to-one relationship in SQL Server

I need to define a one-to-one relationship, and can't seem to find the proper way of doing it in SQL Server.
Why a one-to-one relationship you ask?
I am using WCF as a DAL (Linq) and I have a table containing a BLOB column. The BLOB hardly ever changes and it would be a waste of bandwidth to transfer it across every time a query is made.
I had a look at this solution, and though it seems like a great idea, I can just see Linq having a little hissy fit when trying to implement this approach.
Any ideas?
One-to-one is actually frequently used in super-type/subtype relationships. In the child table, the primary key also serves as the foreign key to the parent table. Here is an example:
CREATE TABLE Organization
(
    ID int PRIMARY KEY,
    Name varchar(200),
    Address varchar(200),
    Phone varchar(12)
)
GO
CREATE TABLE Customer
(
    ID int PRIMARY KEY,
    AccountManager varchar(100)
)
GO
ALTER TABLE Customer
ADD FOREIGN KEY (ID) REFERENCES Organization(ID)
ON DELETE CASCADE
ON UPDATE CASCADE
GO
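To illustrate the one-to-one usage of the tables above (the values are made up), every Customer row shares its ID with exactly one Organization row:
INSERT INTO Organization (ID, Name, Address, Phone)
VALUES (1, 'Contoso Ltd', '1 Example Way', '555-555-1234');
INSERT INTO Customer (ID, AccountManager)
VALUES (1, 'Jane Doe'); -- same ID as the Organization row: this organization is also a customer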
Why not make the foreign key of each table unique?
There is no such thing as an explicit one-to-one relationship.
But, by the fact that tbl1.id and tbl2.id are primary keys and tbl2.id is a foreign key referencing tbl1.id, you have created an implicit 1:0..1 relationship.
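A minimal sketch of that pattern, using the tbl1/tbl2 names from the answer:
CREATE TABLE tbl1 (id int PRIMARY KEY);
CREATE TABLE tbl2 (id int PRIMARY KEY REFERENCES tbl1 (id)); -- the PK doubles as the FK: at most one tbl2 row per tbl1 row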
Put 1:1 related items into the same row in the same table. That's where "relation" in "relational database" comes from - related things go into the same row.
If you want to reduce the size of data traveling over the wire, consider either projecting only the needed columns:
SELECT c1, c2, c3 FROM t1
or creating a view that projects only the relevant columns and using that view when needed:
CREATE VIEW V1 AS SELECT c1, c2, c3 FROM t1
SELECT * FROM V1
UPDATE V1 SET c1=5 WHERE c2=7
Note that BLOBs are stored off-row in SQL Server, so you are not saving much disk IO by vertically partitioning your data. If these were non-BLOB columns, you might benefit from vertical partitioning as you described, because you would do less disk IO to scan the base table.
How about this: link the primary key in the first table to the primary key in the second table.
Tab1.ID (PK) <-> Tab2.ID (PK)
My problem was that I have a two-stage process with mandatory fields in both stages. The whole process could be classed as one episode (put in the same table), but there is an initial stage and a final stage.
In my opinion, a better solution for not reading the BLOB with the LINQ query would be to create a view on the table that contains all the columns except the BLOB ones.
You can then create an EF entity based on the view.
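For instance, a sketch of such a view (the table and column names here are hypothetical):
CREATE VIEW dbo.DocumentHeader
AS
SELECT ID, Title, CreatedDate -- every column except the BLOB column
FROM dbo.Document;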
