I have inherited a collection of data that is divided into two parts: A master record file and many incremental update files. These records are saved in a single table.
I have two primary requirements:
I would like to properly update my database model to have an active table tracked as a temporal table.
I would like to start with my master record data and then apply each incremental update file from the raw table in some defined order (a date received column).
Is there a method for doing this without a cursor, while loop, or SSIS job to load each file again? If such a way exists, I would basically configure my new temporal table and then merge each incremental file in my specified order. Any approaches to consider?
Edit for additional clarification:
Let's say I have the following raw table:
Key Value | Update Flag | Sort Flag
X1        | Add         | 01/01/2000
X1        | Change      | 02/01/2000
X1        | Change      | 03/01/2000
X1        | Change      | 04/01/2000
My intent is to convert this to a temporal table, and then apply the updates in the order of my sort column.
So my Active table will be:
Key Value | Update Flag | Sort Flag
X1        | Change      | 04/01/2000
And my HIST table would be:
Key Value | Update Flag | Sort Flag  | SysStrtTime | SysEndTime
X1        | Add         | 01/01/2000 | ST1         | ST2
X1        | Change      | 02/01/2000 | ST2         | ST3
X1        | Change      | 03/01/2000 | ST4         | ST5
Where ST1-STN are just the system time at which the record is updated.
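For reference, a minimal sketch of how the target system-versioned table might be declared in T-SQL (the table name, column types, and history table name below are assumptions, not part of the original data):

-- Hypothetical target table; names and types are illustrative only.
CREATE TABLE dbo.ActiveRecords
(
    [Key Value]   varchar(10) NOT NULL PRIMARY KEY,
    [Update Flag] varchar(10) NOT NULL,
    [Sort Flag]   date        NOT NULL,
    SysStrtTime   datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    SysEndTime    datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (SysStrtTime, SysEndTime)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ActiveRecords_HIST));

Once system versioning is on, every UPDATE against the active table automatically writes the prior row version into the HIST table with the system start/end times.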
My problem is understanding the relation of the primary keys to the fact table.
This is the structure I'm working in. The transfer works, but it says the values I set as primary keys cannot be NULL.
I'm using SSIS to transfer data from a CSV file to an OLE DB destination (SQL Server 2019, via SSMS).
The actual problem is where/how I can get the key values in the same task. I tried to do it in two different tasks, but then the rows end up in the table one after another (this only worked when I allowed NULLs for the primary keys, which can't be the solution, I think).
Maybe the problem is that I have three transfers from the source: first to one dimension table, then to the second dimension table, then to the fact table. I think the primary keys are generated when I transfer the data to the DB, so I don't think I can get them in the same task.
[Screenshots in the original post: dataflow 1, dataflow 2, input data, output data]
I added the column salesid to the input to use it for the saleskey. Is there a better solution, maybe with the third lookup you've mentioned?
You are attempting to load the fischspezi fact table as well as the product (produkt) and location (standort) dimensions. The problem is that you don't have the keys from the dimensions.
I assume the "key" columns in your dimension are autogenerated/identity values? If that's the case, then you need to break your single data flow into two data flows. Both will keep the Flat File source and the multicast.
Data Flow Dimensions
This is the existing data flow, minus the path that leads to the Fact table.
Data Flow Fact
This data flow will populate the Fact table. Remove the two branches to the dimension tables. What we need to do here is find the translated key values given our inputs. I assume produkt_ID and steuer_id should have been defined as NOT NULL and unique in the dimensions, but the concept here is that we need to be able to take a value that comes in from our file, e.g. product id 3892, and find the same row in the dimension table, which has a key value of 1.
The tool for this is the Lookup Transformation. You're going to want 2-3 of those in your data flow, right before the destination. The first one will look up produktkey based on produkt_ID. The second will find standortkey based on steuer_id.
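As a rough illustration, the lookup queries could be as simple as the following (the dimension table names dbo.produkt and dbo.standort are assumptions based on the post):

-- Hypothetical lookup queries; table names are assumed.
-- Lookup 1: translate the incoming produkt_ID to its surrogate key.
SELECT produktkey, produkt_ID
FROM   dbo.produkt;

-- Lookup 2: translate the incoming steuer_id to the location surrogate key.
SELECT standortkey, steuer_id
FROM   dbo.standort;

In each Lookup Transformation you join on the business key (produkt_ID / steuer_id) and add the surrogate key column to the data flow.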
The third lookup you'd want here (and add back into the dimension load) would look up the current row in the destination table. If you ran the existing package 10 times, you'd have 10x the data (unless you have unique constraints defined). Guessing here, but I assume sales_id is a value in the source data, so I'd have a lookup here to ensure I don't double-load a row. If sales_id is a generated value, then for consistency I'd rename the suffix to key, to be in line with the rest of your data model.
I also encourage everyone to read Andy Leonard's Stairway to Integration Services series. Levels 3 & 4 address using lookups and identifying how to update existing rows, which I assume will be some of the next steps in your journey.
Addressing comments
"I would place them just over the fact destination and then join with a union all to fact table"
No. There is no need to have either a join or a union all in your fact data flow. Flat File Source (Get our candidate data) -> Data Conversion(s) (Change data types to match the expected) -> Derived Columns (Manipulate the data as needed, add things like insert date, etc.) -> Lookups (Translate source values to destination values) -> Destination (Store new data).
Assume Source looks like
produkt_ID | steuer_id | sales_id | umsatz
1234       | 1357      | 2468     | 12
2345       | 3579      | 4680     | 44
After dimension load, you'd have (simplified)
Product
produktkey | produkt_ID
1          | 1234
2          | 2345
Location
standortkey | steuer_id
7           | 1357
9           | 3579
The goal is to use that original data + lookups to have a set like
produkt_ID | steuer_id | sales_id | umsatz | produktkey | standortkey
1234       | 1357      | 2468     | 12     | 1          | 7
2345       | 3579      | 4680     | 44     | 2          | 9
The third lookup I propose (skip it for now) is to check whether sales_id exists in the destination. If it does, then you would want to see whether that existing record is the same as what we have in the file. If it's the same, then we do nothing. Otherwise, we likely want to update the existing row because we have new information - someone miskeyed the quantity and our sales should be 120 and not 12. The update is beyond the scope of this question, but it's covered nicely in the Stairway to Integration Services.
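If you do add that third lookup, its query might be as simple as the following (the fact table name is assumed from the post); rows that find a match already exist and can be ignored or routed to an update path, while no-match rows flow on to the insert:

-- Hypothetical existence check against the fact table; name is assumed.
SELECT sales_id
FROM   dbo.fischspezi;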
I am kind of stuck:
Scenario:
I have an SSIS package which loads data incrementally. In the first step, I load rows from a source which have (a) been inserted or (b) been updated into a staging table. I do this by using the last timestamp of the source table.
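For context, a minimal sketch of that incremental extract (table, column, and parameter names are placeholders):

-- Hypothetical incremental load into staging; names are illustrative only.
SELECT  src.*
FROM    dbo.SourceTable AS src
WHERE   src.LastModified > @LastLoadedTimestamp;  -- high-water mark from the previous run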
In the next step, I am trying to use a MERGE statement to update the data in another database (similar to a data warehouse). I have no control over this other database; otherwise my task would be quite easy.
Problem:
The data warehouse table includes an ID column ([cId], BIGINT), which it does not populate by itself. I have tried to create a sequence from which I pull a value whenever I insert a new row into the data warehouse (not when I update a row, since that row will already have an ID). However, as specified here, SQL Server will not let me use the next value from my sequence for the target of a MERGE statement. Since I have no control over the data warehouse, I cannot change this.
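For illustration, here is a rough sketch of the pattern that gets rejected (all table, column, and sequence names are placeholders, not from the original post):

-- Hypothetical; SQL Server disallows NEXT VALUE FOR inside a MERGE,
-- except indirectly via a DEFAULT constraint on the target column.
MERGE dbo.DwhTable AS tgt
USING dbo.Staging  AS src
    ON tgt.BusinessKey = src.BusinessKey
WHEN MATCHED THEN
    UPDATE SET tgt.SomeColumn = src.SomeColumn
WHEN NOT MATCHED BY TARGET THEN
    INSERT (cId, BusinessKey, SomeColumn)
    VALUES (NEXT VALUE FOR dbo.cIdSequence,   -- this is the part the engine rejects
            src.BusinessKey, src.SomeColumn);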
Another solution would be to get the next value from my sequence when I load the data into the staging table. This, however, will result in holes in my ID sequence, because when I update a row in the data warehouse from my staging table, the [cId] column is not updated, since that row already has an ID.
Does anyone have an idea how to solve this? I am basically trying to pull a new, unique BIGINT whenever I do an insert inside my MERGE statement.
Thanks!
In SSIS, if an incoming dataset has multiple records for the same Business Key, how do I load it into the dimension table with SCD Type 2 without using the SCD Wizard?
Sample dataset
Customer ID | Name   | Segment   | Postal Code
1           | James  | Corporate | 50026
2           | Andrew | Consumer  | 33311
3           | Steven | Consumer  | 90025
2           | Andrew | Consumer  | 33306
3           | Steven | Consumer  | 90032
1           | James  | Corporate | 50087
3           | Steven | Consumer  | 90000
In my case, if I try loading the dimension table with other SSIS components (Lookup/Conditional Split), all the records show up as new rows in the table because they are all coming in at the same time.
I have ‘CurrentFlag’ as the indicator of the current record.
In SSIS, if I have an incoming dataset that has multiple records for the same Business Key, how do I recognize these and set the CurrentFlag as necessary, whether or not a record in the target table already has that Business Key?
Thanks.
OK, this is a massive simplification, because SCDs are very challenging to implement correctly. You will need to sit down and think critically about this. My answer below only handles ongoing daily processing - it does not explain how to handle historical files being re-processed, which could potentially result in duplicate records with different EffectiveStartDate and EffectiveEndDate values.
By definition, you will have an existing record source component (i.e., query from the database table) and an incoming data source component (i.e., a *.csv flatfile). You will need to perform a merge join to identify new records versus existing records. For existing records, you will need to determine if any of the columns have changed (do this in a Derived Column transformation).
You will need to also include two columns for EffectiveStartDate and EffectiveEndDate.
IncomingEffectiveStartDate = FileDate
IncomingEffectiveEndDate = 12-31-9999
ExistingEffectiveEndDate = FileDate - 1
Note on 12-31-9999: This is effectively the Y10K bug. But, it allows users to query the database between date ranges without having to consciously add ISNULL(GETDATE()) in the WHERE clause of a query in the event that they are querying between date ranges.
This will prevent the dates on the columns from overlapping, which could potentially result in multiple records being returned for a given date.
To determine if a record has changed, create a new column called RecordChangedInd of type Bit. In the SSIS expression language the null-replacement function is REPLACENULL, so the Derived Column expression looks something like:
(REPLACENULL(ExistingColumn1, 0) != REPLACENULL(IncomingColumn1, 0) ||
REPLACENULL(ExistingColumn2, 0) != REPLACENULL(IncomingColumn2, 0) ||
....
REPLACENULL(ExistingColumn_N, 0) != REPLACENULL(IncomingColumn_N, 0) ? 1 : 0)
Then, in your split condition you can create two outputs: one for records that are new (these only need an INSERT) and one for records that have changed (these need an UPDATE to deactivate the existing record, plus an INSERT of the new version); records that have not changed need no further action.
You can conceivably route both outputs to the same INSERT destination. But you will need to be careful to keep the changed path's ExistingEffectiveEndDate value, which deactivates the existing record, out of the insert and apply it in the update instead.
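For the deactivation step itself, here is a rough T-SQL sketch of the kind of UPDATE that ends the old version (all table and column names are assumptions; in a package this would typically run from an Execute SQL Task or an OLE DB Command fed by a staging table):

-- Hypothetical SCD2 deactivation; names are illustrative only.
UPDATE dim
SET    dim.EffectiveEndDate = DATEADD(DAY, -1, stg.FileDate),
       dim.CurrentFlag      = 0
FROM   dbo.DimCustomer     AS dim
JOIN   dbo.StagingCustomer AS stg
       ON stg.CustomerID = dim.CustomerID
WHERE  dim.CurrentFlag = 1;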
My system collects a lot of data from different resources (each resource has a text ID) and sends it to clients bundled together into predefined groups. There are some hundreds of different resources, and each might write a record anywhere from every second up to every few hours. There are fewer than a hundred "view groups".
The data collector is single-threaded.
What is the best method to organize the data?
a. Make a different table for each source, where the name of the table is based on the source ID?
b. Make a single table and add the source ID as a text field (a key if possible)?
c. Make a table for each predefined display group, with the source ID as a text field?
Each record has a value (float) and a date (date). The query will be something like SELECT * FROM ... WHERE date < d1 AND date > d2. In the case of a single table, it will also include AND sourceId IN (...).
The database type is not decided yet; it might be lightsql, Postgres, MySQL, MSSQL, ...
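To make option (b) concrete, here is a minimal sketch of a single-table layout in generic SQL (all names are placeholders, and the exact types will depend on the database you end up choosing):

-- Hypothetical single-table layout for option (b); names are illustrative.
CREATE TABLE measurements (
    source_id    varchar(64) NOT NULL,  -- the resource's text ID
    recorded_at  datetime    NOT NULL,
    value        float       NOT NULL
);

-- Supports: WHERE recorded_at > d2 AND recorded_at < d1 AND source_id IN (...)
CREATE INDEX ix_measurements_source_date
    ON measurements (source_id, recorded_at);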
I've got table A and table B in a database. I'm trying to move a column from table A to table B without losing any data. I'm not sure if it's best to move the column somehow, or to make a whole new column in table B and then copy the data over.
The problem is that the database is already in production, and I don't want the clients to lose the data that is currently stored in column X in table A. I thought of making a migration to create the same column X in table B and then somehow copying the data there. I am not sure how to do that, and I couldn't find a similar problem here.
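In SQL terms, the approach described above boils down to something like this sketch (the shared key column id and the INT type are assumptions; use whatever actually relates the two tables and matches column X's type):

-- Hypothetical; assumes table A and table B share a key column named id.
ALTER TABLE tableB ADD columnX INT NULL;

UPDATE tableB
SET    columnX = (SELECT a.columnX FROM tableA AS a WHERE a.id = tableB.id);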
If you have phpMyAdmin you can do this pretty easily. This command should work:
INSERT INTO tabletwo (columnb) SELECT columna FROM tableone;
Always back up the DBs, load locally and try this, never on live, lol. I'm sure you're aware. :)
Note: columna and columnb are placeholders for your actual column names.
I think you can create a migration to add the column to table B.
Afterwards, you can use Tinker ("php artisan tinker" in the terminal) to move the desired data from table A to table B.