SSIS flat file with joins - sql-server

I have a flat file with the following columns:
Device Name
Device Type
Device Location
Device Zone
which I need to insert into a SQL Server table called Devices.
The Devices table has the following structure:
DeviceName
DeviceTypeId (foreign key from DeviceType table)
DeviceLocationId (foreign key from DeviceLocation table)
DeviceZoneId (foreign key from DeviceZone table)
The DeviceType, DeviceLocation and DeviceZone tables are already prepopulated.
Now I need to write an ETL package that reads the flat file and, for each row, looks up DeviceTypeId, DeviceLocationId and DeviceZoneId in the corresponding tables and inserts into the Devices table.
I am sure this is nothing new, but it's been a while since I worked on such SSIS packages, and any help would be appreciated.

Load the flat content into a staging table and write a stored procedure to handle the inserts and updates in T-SQL.
Having FK relationships between the destination tables can cause a lot of trouble with a single data flow and a multicast.
The problem is that you have no control over the order of the inserts, so a child record could be inserted before its parent.
Also, for identity columns on the tables, you cannot retrieve the identity value from one stream and use it in another without using subsequent merge joins.
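A minimal sketch of that staging approach, assuming a staging table named dbo.Devices_Staging and that DeviceType, DeviceLocation and DeviceZone each expose their display value in a Name column (adjust the object and column names to your schema):

-- Hypothetical staging table filled by the flat file source in the data flow.
CREATE TABLE dbo.Devices_Staging
(
    DeviceName     varchar(100) NOT NULL,
    DeviceType     varchar(100) NOT NULL,
    DeviceLocation varchar(100) NOT NULL,
    DeviceZone     varchar(100) NOT NULL
);
GO

-- Called from an Execute SQL Task once the data flow has finished.
CREATE PROCEDURE dbo.usp_LoadDevices
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.Devices (DeviceName, DeviceTypeId, DeviceLocationId, DeviceZoneId)
    SELECT s.DeviceName,
           dt.DeviceTypeId,
           dl.DeviceLocationId,
           dz.DeviceZoneId
    FROM dbo.Devices_Staging AS s
    JOIN dbo.DeviceType      AS dt ON dt.Name = s.DeviceType      -- Name columns are assumed
    JOIN dbo.DeviceLocation  AS dl ON dl.Name = s.DeviceLocation
    JOIN dbo.DeviceZone      AS dz ON dz.Name = s.DeviceZone;
END;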

The simplest way to do that is to use a Lookup Transformation to get the ID for each value. Be aware that duplicates can cause problems: you have to make sure each value does not appear multiple times in the foreign tables.
Also, make sure to redirect rows that have no match into a staging table to check them later.
You can refer to the following article for a step-by-step guide to the Lookup Transformation:
An Overview of the LOOKUP TRANSFORMATION in SSIS
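As a quick sanity check before wiring up the lookups, a query along these lines (a sketch, assuming the lookup value is stored in a Name column) confirms that each value resolves to exactly one ID:

-- Any rows returned here would make the Lookup Transformation match ambiguously.
SELECT Name, COUNT(*) AS Duplicates
FROM dbo.DeviceType            -- repeat for DeviceLocation and DeviceZone
GROUP BY Name
HAVING COUNT(*) > 1;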

Related

SSIS only extract Delta changes

I'm after some advice. I'm using SSIS / SQL Server 2014. I have a nightly SSIS package that pulls data from non-SQL Server databases into a single table (the SQL table is truncated beforehand each time), and I then extract from this table to create a daily CSV file.
Going forward, I only want to extract to CSV on a daily basis the records that have changed, i.e. the deltas.
What is the best approach? I was thinking of using CDC in SSIS, but as I'm truncating the SQL table before the initial load each time, will this be the best method? Or will I need a master table in SQL with an initial load, then import into another table and just extract where there are differences? For info, the table in SQL contains a primary key.
I just want to double-check, as CDC assumes the tables are all in SQL Server, whereas my data is coming from outside SQL Server first.
Thanks for any help.
The primary key on that table is your saving grace here. Obviously enough, the SQL Server database that you're pulling the disparate data into won't know from one table flush to the next which records have changed, but if you add two additional tables, and modify the existing table with an additional column, it should be able to figure it out by leveraging HASHBYTES.
For this example, I'll call the new table SentRows, but you can use a more meaningful name in practice. We'll call the new column in the old table HashValue.
Add the column HashValue to your table as a varbinary data type. NOT NULL as well.
Create your SentRows table with columns for all the columns in the main table's primary key, plus the HashValue column.
Create a RowsToSend table that's structurally identical to your main table, including the HashValue.
Modify your queries to create the HashValue by applying HASHBYTES to all of the non-key columns in the table. (This will be horribly tedious. Sorry about that.)
Send out your full data set.
Now move all of the key values and HashValues to the SentRows table. Truncate your main table.
On the next pull, compare the key values and HashValues from SentRows to the new data in the main table.
Primary key match + hash match = Unchanged row
Primary key match + hash mismatch = Updated row
Primary key in incoming data but missing from existing data set = New row
Primary key not in incoming data but in existing data set = Deleted row
Pull out any changes you need to send to the RowsToSend table.
Send the changes from RowsToSend.
Move the key values and HashValues to your SentRows table. Update hashes for changed key values, insert new rows, and decide how you're going to handle deletes, if you have to deal with deletes.
Truncate the RowsToSend table to get ready for tomorrow.
If you'd like (and you'll thank yourself later if you do), add a column to the SentRows table with a default of GETDATE(), which will tell you when the row was added.
And away you go. Nothing but deltas from now on.
Edit 2019-10-31:
Step by step (or TL;DR):
1) Flush and Fill MainTable.
2) Compare keys and hashes on MainTable to keys and hashes on SentRows to identify new/changed rows.
3) Move new/changed rows to RowsToSend.
4) Send the rows that are in RowsToSend.
5) Move all the rows from RowsToSend to SentRows.
6) Truncate RowsToSend.
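As a rough sketch of steps 2 and 3, assuming a main table with key column Id, non-key columns Col1 and Col2, and a HashValue computed during the load (note that HASHBYTES input is capped at 8000 bytes before SQL Server 2016):

-- Hash computed while filling MainTable, e.g.:
--   HASHBYTES('SHA2_256', CONCAT(Col1, '|', Col2))

-- Steps 2 and 3: classify incoming rows against SentRows and stage the deltas.
INSERT INTO dbo.RowsToSend (Id, Col1, Col2, HashValue)
SELECT m.Id, m.Col1, m.Col2, m.HashValue
FROM dbo.MainTable AS m
LEFT JOIN dbo.SentRows AS s ON s.Id = m.Id
WHERE s.Id IS NULL                    -- new row
   OR s.HashValue <> m.HashValue;     -- updated row

-- Deleted rows: keys sent previously but absent from today's load.
SELECT s.Id
FROM dbo.SentRows AS s
LEFT JOIN dbo.MainTable AS m ON m.Id = s.Id
WHERE m.Id IS NULL;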

Speed up table scans using a middle-man data dump table

I'm a dev (not a DBA) working on a project where we have been managing "updates" to data via bulk insert. However, we first insert into a non-indexed "pre-staging" table. This is because we need to normalize a lot of denormalized data and make sure it's properly split up across our schema.
Naturally this makes the update and insert processes slow, since we have to check whether the information already exists in each target table using non-indexed codes or identifiers.
Since the "pre-staging" table is truncated each time, we didn't include auto-generated IDs either.
I'm looking into ways to speed up table scans on that particular table in our stored procedures. What would be the best approach to do so? Indexes? Auto-generated IDs as clustered indexes? This last one is tricky because we cannot establish relationships with our "staging" data since it is truncated per data dump.

Postgres: How can we preserve the data of a foreign table created using a foreign data wrapper

I am trying to migrate an Oracle database to Postgres by creating foreign tables through a foreign data wrapper.
But since a foreign table acts like a view over Oracle, every query fetches the data on the fly from the original source, which increases processing time.
For now, in order to keep a physical copy of the data on the Postgres side, I am creating a table and inserting the data into it,
eg: create table employee_details as select * from emp_det;
where employee_details is a physical table and emp_det is a foreign table.
But this process feels redundant, and from time to time we need to maintain this table (new inserts, updates or deletions).
So if anyone could share another way to preserve this data, please do.
Regards,
See the identical Github issue.
oracle_fdw does not store the Oracle data on the PostgreSQL side.
Each access to a foreign table directly accesses the Oracle database.
If you want a copy of the data physically located in the PostgreSQL database, you can either do it like you described, or you could use a materialized view:
CREATE MATERIALIZED VIEW emp_det_mv AS SELECT * FROM emp_det;
That will do the same thing, but simpler. To refresh the data, you can run
REFRESH MATERIALIZED VIEW emp_det_mv;
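If the copy needs to stay readable while it refreshes, one option is a concurrent refresh, which requires a unique index on the materialized view first (emp_id below is a hypothetical key column):

CREATE UNIQUE INDEX emp_det_mv_key ON emp_det_mv (emp_id);  -- emp_id is an assumed key column
REFRESH MATERIALIZED VIEW CONCURRENTLY emp_det_mv;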

SQL Server : split records pointing to a unique varbinary value

I have an interesting problem for the smart people out there.
I have an external application I cannot modify writing pictures into a SQL Server table. The pictures are often non-unique, but linked to unique rows in other tables.
The table MyPictures looks like this (simplified):
Unique (ID) FileName (Varchar) Picture (Varbinary)
----------------------------------------------------------
xxx-xx-xxx1 MyPicture 0x66666666
xxx-xx-xxx2 MyPicture 0x66666666
xxx-xx-xxx3 MyPicture 0x66666666
This causes the same data to be stored over and over again, blowing up my database (85% of my DB is just this table).
Is there something on a SQL level I can do to only store the data once if filename & picture already exists in my table?
The only thing I can think of is to treat the current destination table as a 'staging' table: allow all the rows the upstream process wants to write to it, but have a second process that copies only the distinct rows to the table(s) you're actually using on the SQL side and then deletes the rows from the staging table to reclaim your space.
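A minimal sketch of that second process, assuming a deduplicated table named dbo.UniquePictures and that (FileName, Picture) defines a duplicate:

-- Copy each distinct (FileName, Picture) pair that is not already stored.
INSERT INTO dbo.UniquePictures (FileName, Picture)
SELECT DISTINCT s.FileName, s.Picture
FROM dbo.MyPictures AS s
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.UniquePictures AS u
                  WHERE u.FileName = s.FileName
                    AND u.Picture  = s.Picture);

-- Reclaim the space taken up by the duplicates in the staging table.
DELETE s
FROM dbo.MyPictures AS s
WHERE EXISTS (SELECT 1
              FROM dbo.UniquePictures AS u
              WHERE u.FileName = s.FileName
                AND u.Picture  = s.Picture);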

SSIS - lookup surrogate key for parent

I have a table in my source DB that is self-referencing:
|BusinessID|...|ParentID|
This table is modeled in the DW as
|SurrogateID|BusinessID|ParentID|
The first question is: should the ParentID in the DW reference the surrogate ID or the business ID? My idea is that it should reference the surrogate ID.
Then my problem occurs: in my SSIS data flow task, how can I look up the surrogate key of the parent?
If I insert all rows where ParentID is null first, and then the ones that are not null, I solve part of the problem.
But I still have to look up the rows that may reference a parent that is also a child.
I.e. I do have to make sure that the parents are loaded first into the DB to be able to use the lookup transformation.
Do I have to resort to a for-each loop with sorted input?
One trick I've used in this situation is to load the rows without the ParentID, then use another data flow to create an update script based on the source data and the loaded data, and finally run the generated script with an Execute SQL Task. It won't win prizes for elegance, but it does work.
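If the parent's business key is kept on each loaded row, the same fix can also be done with one set-based update after the load; a sketch, assuming the DW table is named dbo.DimEntity and carries a ParentSurrogateID column to be filled in (both names are hypothetical):

-- Resolve each child's parent surrogate key by self-joining on the business key.
UPDATE child
SET    child.ParentSurrogateID = parent.SurrogateID
FROM   dbo.DimEntity AS child
JOIN   dbo.DimEntity AS parent
       ON parent.BusinessID = child.ParentID      -- ParentID here still holds the parent's business key
WHERE  child.ParentID IS NOT NULL;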
