I have a table in my source DB that is self referencing
|BusinessID|...|ParentID|
This table is modeled in the DW as
|SurrogateID|BusinessID|ParentID|
First question: should the ParentID in the DW reference the surrogate id or the business id? My idea is that it should reference the surrogate id.
Then comes my problem: in my SSIS data flow task, how can I look up the surrogate key of the parent?
If I insert all rows where ParentID is null first and then the ones that are not null, I solve part of the problem.
But I still have to look up the rows that may reference a parent that is also a child.
I.e. I have to make sure that the parents are loaded into the DB first to be able to use the Lookup transformation.
Do I have to resort to a for-each with sorted input?
One trick I've used in this situation is to load the rows without the ParentID. I then used another data flow to create an update script based on the source data and the loaded data, then used a SQL task to run the created update script. It won't win prizes for elegance, but it does work.
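A minimal sketch of that two-pass idea, written in Python against an in-memory SQLite database rather than SSIS and SQL Server. The table name DimNode and the extra ParentBusinessID column are assumptions made for the illustration. The answer above built an update script in a second data flow; this sketch collapses that into one set-based UPDATE.

```python
import sqlite3


def load_self_referencing(conn, source_rows):
    """Load (BusinessID, parent BusinessID) pairs in two passes."""
    conn.execute(
        """
        CREATE TABLE DimNode (
            SurrogateID      INTEGER PRIMARY KEY AUTOINCREMENT,
            BusinessID       INTEGER NOT NULL UNIQUE,
            ParentBusinessID INTEGER,  -- hypothetical helper column, kept for pass 2
            ParentID         INTEGER REFERENCES DimNode (SurrogateID)
        )
        """
    )
    # Pass 1: insert every row with ParentID left NULL.
    conn.executemany(
        "INSERT INTO DimNode (BusinessID, ParentBusinessID) VALUES (?, ?)",
        source_rows,
    )
    # Pass 2: every parent now has a surrogate, so resolve it by business id.
    conn.execute(
        """
        UPDATE DimNode
        SET ParentID = (
            SELECT p.SurrogateID
            FROM DimNode AS p
            WHERE p.BusinessID = DimNode.ParentBusinessID
        )
        WHERE ParentBusinessID IS NOT NULL
        """
    )
    conn.commit()
```

Because the parent key is resolved only after every row has a surrogate, the order of the source rows does not matter, so a child can arrive before its parent without needing a sorted for-each loop.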
I have a flat file which has the following columns:
Device Name
Device Type
Device Location
Device Zone
which I need to insert into a SQL Server table called Devices.
The Devices table has the following structure:
DeviceName
DeviceTypeId (foreign key from DeviceType table)
DeviceLocationId (foreign key from DeviceLocation table)
DeviceZoneId (foreign key from DeviceZone table)
The DeviceType, DeviceLocation and DeviceZone tables are already prepopulated.
Now I need to write an ETL package which reads the flat file and, for each row, gets the DeviceTypeId, DeviceLocationId and DeviceZoneId from the corresponding tables and inserts into the Devices table.
I am sure this is not new, but it's been a while since I worked on such SSIS packages and help would be appreciated.
Load the flat content into a staging table and write a stored procedure to handle the inserts and updates in T-SQL.
Having FK relationships between the destination tables can cause a lot of trouble with a single data flow and a multicast.
The problem is that you have no control over the order of the inserts so the child record could be inserted before the parent.
Also, for identity columns on the tables, you cannot retrieve the identity value from one stream and use it in another without using subsequent merge joins.
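As a rough sketch of the staging approach: the snippet below uses Python with an in-memory SQLite database as a stand-in for SQL Server, and a plain INSERT ... SELECT in place of the stored procedure. The DeviceStaging table and the Name column on the three lookup tables are assumptions, since the question does not show those schemas.

```python
import sqlite3

DEVICE_SCHEMA = """
CREATE TABLE DeviceType     (DeviceTypeId     INTEGER PRIMARY KEY, Name TEXT UNIQUE);
CREATE TABLE DeviceLocation (DeviceLocationId INTEGER PRIMARY KEY, Name TEXT UNIQUE);
CREATE TABLE DeviceZone     (DeviceZoneId     INTEGER PRIMARY KEY, Name TEXT UNIQUE);
CREATE TABLE Devices (
    DeviceName       TEXT PRIMARY KEY,
    DeviceTypeId     INTEGER NOT NULL REFERENCES DeviceType (DeviceTypeId),
    DeviceLocationId INTEGER NOT NULL REFERENCES DeviceLocation (DeviceLocationId),
    DeviceZoneId     INTEGER NOT NULL REFERENCES DeviceZone (DeviceZoneId)
);
-- Hypothetical staging table: one text column per flat-file column.
CREATE TABLE DeviceStaging (
    DeviceName TEXT, DeviceType TEXT, DeviceLocation TEXT, DeviceZone TEXT
);
"""


def load_devices(conn, flat_rows):
    """Stage the raw rows, then resolve all three keys in one set-based insert."""
    conn.executemany("INSERT INTO DeviceStaging VALUES (?, ?, ?, ?)", flat_rows)
    # Inner joins mean rows with an unknown type, location or zone are skipped
    # here and stay behind in DeviceStaging for later inspection.
    conn.execute(
        """
        INSERT INTO Devices (DeviceName, DeviceTypeId, DeviceLocationId, DeviceZoneId)
        SELECT s.DeviceName, t.DeviceTypeId, l.DeviceLocationId, z.DeviceZoneId
        FROM DeviceStaging AS s
        JOIN DeviceType     AS t ON t.Name = s.DeviceType
        JOIN DeviceLocation AS l ON l.Name = s.DeviceLocation
        JOIN DeviceZone     AS z ON z.Name = s.DeviceZone
        """
    )
    conn.commit()
```

Doing the key resolution in the database sidesteps the insert-ordering issue, because the lookup tables are only read and Devices is written in a single statement.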
The simplest way to do that is to use a Lookup Transformation to get the ID for each value. Be aware that duplicates may cause problems: you have to make sure that each value is not found multiple times in the foreign tables.
Also, make sure to redirect rows that have no match into a staging table to check them later.
You can refer to the following article for a step by step guide to Lookup Transformation:
An Overview of the LOOKUP TRANSFORMATION in SSIS
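To make the Lookup behaviour concrete, here is a small Python sketch of what such a lookup does conceptually when the reference table is held fully in memory: each reference table becomes a dictionary, duplicate reference values are rejected up front, and rows with no match go to a separate output. The function names and the (name, id) pair layout are made up for the illustration; this is not SSIS code.

```python
def build_lookup(pairs):
    """Map name -> id, refusing names that appear more than once."""
    lookup = {}
    for name, id_ in pairs:
        if name in lookup:
            raise ValueError(f"duplicate lookup value: {name!r}")
        lookup[name] = id_
    return lookup


def resolve_rows(rows, type_ids, location_ids, zone_ids):
    """Split flat-file rows into matched rows (with ids) and no-match rows."""
    matched, unmatched = [], []
    for name, dev_type, location, zone in rows:
        ids = (
            type_ids.get(dev_type),
            location_ids.get(location),
            zone_ids.get(zone),
        )
        if None in ids:
            # Equivalent of the no-match output: keep the raw row for review.
            unmatched.append((name, dev_type, location, zone))
        else:
            matched.append((name, *ids))
    return matched, unmatched
```

The matched list corresponds to what would be inserted into Devices, and the unmatched list to what would be redirected to the staging table.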
I'm managing and updating a web app whose DB is on MySQL Server 5.x, and all tables are InnoDB. My problem comes when I need to create a new table with a foreign key that references an existing table that holds live data. When I try to execute the create command it throws the famous errno 1005. The problem is solved if I delete all the data in the father table, create the son table and reload the data into the father table (this is because of the constraints, I think). This will be a pain if the father table has a grandfather table and that contains data too.
I was wondering if there is a way to do this task easily, maybe a command that includes ignoring constraints?
From what I can see from the MySQL documentation, you could try using
SET foreign_key_checks = 0
I've come across posts of cases where that doesn't always work.
Your next best option is to make the data conform to what the FK constraint will require, then apply the constraint. That is not necessarily uncommon when adding FK constraints to tables that already have data.
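For the "make the data conform" step, one way to find the offending rows is an anti-join from the child table to the parent. The sketch below runs the query through Python and SQLite purely for illustration (the same LEFT JOIN ... IS NULL pattern is standard SQL). The helper name is hypothetical, and the identifiers are interpolated into the SQL text, so they must come from trusted code rather than user input.

```python
import sqlite3


def find_orphans(conn, child, fk_col, parent, pk_col):
    """Return child FK values that have no matching parent row."""
    sql = (
        f"SELECT c.{fk_col} FROM {child} AS c "
        f"LEFT JOIN {parent} AS p ON p.{pk_col} = c.{fk_col} "
        f"WHERE c.{fk_col} IS NOT NULL AND p.{pk_col} IS NULL"
    )
    return [row[0] for row in conn.execute(sql)]
```

If this returns nothing for the child/parent pair in question, the existing data already satisfies the constraint; otherwise those rows need to be fixed or removed before the foreign key is applied.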
[SCENARIO]
The scenario is: there is a requirement to send multiple records through an XML file to the server, for insertion in the database. These records consist of data for multiple master and detail tables, linked together through primary and foreign keys.
The client cannot fill in the primary key and foreign key columns in those records beforehand; this must be done by the server when the XML data arrives there.
{Client} ---------------> {Our Server} --> {SQL Server}
[QUESTION]
What is the best way to temporarily link master and detail records together, so that the server can recognize the linked records and substitute the temporary primary/foreign keys with GUIDs or any auto-number/unique key as per the database schema?
Should I use simple sequential integer keys or GUIDs?
Are there any industry standards?
I have never heard of any standard way to do this, industry or otherwise. Whenever possible, you want to determine/generate the primary keys before adding rows to the database. Natural keys are ideal for this, and guids would work as well. A brief example:
-- Start with Parent row and several Children rows to insert
SET @GuidPK = NEWID()
INSERT parent row using @GuidPK
INSERT children rows using @GuidPK
If you cannot do this (which will happen if you're using an identity column as the parent's primary key), and are inserting a single parent (+ 0 or more children), it's still simple:
-- Start with Parent row and several Children rows to insert
INSERT Parent -- Presumes one at a time!
SET @NewPK = SCOPE_IDENTITY()
INSERT Children using @NewPK
However, if you are inserting multiple parents and their children all at once (and it sounds like this is what you're facing), it gets tricky. I have had reasonable success with variants of the following methodology.
First add every "new" parent to the parent table. Once this is done, query the parent table and extract the new Id assigned to each parent, and assign it to the appropriate child rows when they're loaded. Pseudocode:
INSERT ParentSet
SELECT NewIds of ParentSet just loaded
INSERT ChildSet using these NewIds
The trick is in identifying and extracting (only) the new parents we just entered. If you have a natural key (unique product name, OrderId, maybe something based on the datetime the data was entered), use that. If not, you'll need to fake one. I've done tricks where I initially generated a guid for each parent to be added, set an arbitrary column to that guid during the initial insert, pulled the new ids by reading only for those guids, and then replaced the guids with the proper column value. More pseudocode:
Add a column to the parent set, configure with a unique guid in each
INSERT parent ... column XYZ = new guid
SELECT NewId from Parent where XYZ in (list of guids generated)
UPDATE parent set XYZ = proper value where (filter or join based on NewId)
INSERT children (using the retrieved NewId)
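The pseudocode above can be sketched as runnable Python against SQLite (standing in for SQL Server). One deliberate difference, labeled as an assumption: instead of borrowing an arbitrary existing column and restoring it afterwards, this version uses a dedicated, hypothetical LoadTag column and clears it at the end. The Parent/Child schema is also invented for the example.

```python
import sqlite3
import uuid

PARENT_CHILD_SCHEMA = """
CREATE TABLE Parent (
    ParentId INTEGER PRIMARY KEY AUTOINCREMENT,
    Name     TEXT,
    LoadTag  TEXT   -- hypothetical column holding the temporary guid
);
CREATE TABLE Child (
    ChildId  INTEGER PRIMARY KEY AUTOINCREMENT,
    ParentId INTEGER NOT NULL REFERENCES Parent (ParentId),
    Name     TEXT
);
"""


def insert_parents_with_children(conn, batch):
    """batch is a list of (parent_name, [child_name, ...]) pairs."""
    if not batch:
        return
    # Give every incoming parent a throwaway guid so it can be found again.
    tagged = [(str(uuid.uuid4()), name, kids) for name, kids in batch]
    tags = [tag for tag, _, _ in tagged]
    conn.executemany(
        "INSERT INTO Parent (Name, LoadTag) VALUES (?, ?)",
        [(name, tag) for tag, name, _ in tagged],
    )
    # Read back only the rows just loaded, keyed by their guid.
    # (Very large batches would need chunking to stay under the
    # database's bound-parameter limit.)
    marks = ",".join("?" * len(tags))
    new_ids = dict(
        conn.execute(
            f"SELECT LoadTag, ParentId FROM Parent WHERE LoadTag IN ({marks})",
            tags,
        )
    )
    conn.executemany(
        "INSERT INTO Child (ParentId, Name) VALUES (?, ?)",
        [(new_ids[tag], kid) for tag, _, kids in tagged for kid in kids],
    )
    # The guids have done their job; remove them.
    conn.execute(f"UPDATE Parent SET LoadTag = NULL WHERE LoadTag IN ({marks})", tags)
    conn.commit()
```

Matching on the guid rather than on Name means a new parent is not confused with an existing row that happens to carry the same name.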
I hope this helps, it's hard to explain without specific structures and sample data.
I am a bit of an SSIS newbie and while the whole system seems straightforward, I don't conceptually understand the process I need to go through in this scenario:
Need to map Invoice and InvoiceLine tables from a source database to two equivalent tables in a destination database - with different identity values.
For each invoice inserted across, I need to get the identity it was assigned and then insert all its lines referencing that new identity
There is a surrogate key on the invoices (the invoice number), however these might also clash with invoice numbers in the target system, hence they would also have to be renumbered.
This must be a common scenario in integration - is there a common solution?
Chris KL - you are correct that this is harder than one would expect. I have three methods for this, which work in different situations:
1. If the data you are loading is small (hundreds or thousands of rows, but not hundreds of thousands), then you can do this: use an OLE DB Command that performs one insert for each parent row and returns the identity value; then, downstream from that, join its output to the child rows and insert them. Advantage: intuitive. Disadvantage: scales badly. This method is documented on the web and should be easy to find with a search.
2. If we are talking about a bigger system where you need bulk loading, then there are two other flavors:
a. If you have exclusive access to the table during the load (really exclusive, enforced in some way) then you can grab the max existing ID from the table, use an SSIS script task to number the rows starting above that max ID, then SET IDENTITY_INSERT ON, stuff them in, and SET IDENTITY_INSERT OFF. You then have those script-generated keys in SSIS to assign to the child rows. Advantage: fast and simple, one trip to the DB. Disadvantage: possible errors if some other process inserts into your table at the same time. Brittle.
b. If you don't have exclusive access, then the only way I know of is with a round trip to the DB, thus: insert all parent rows but keep track of a key for them that is not the identity column (a business key, for example). In a second data flow, process the child records by using a Lookup transform that uses the business key to fetch the parent ID. Make sure the lookup's caching is tuned appropriately, and that the business key is indexed.
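Option (a) boils down to renumbering in memory. A minimal Python sketch, assuming invoices arrive as (old_id, payload) pairs and lines as (old_invoice_id, payload) pairs (both layouts are invented for the example), and that max_existing_id was read from the target table while you hold exclusive access:

```python
from itertools import count


def renumber(invoices, lines, max_existing_id):
    """Assign new invoice ids above max_existing_id and re-point the lines."""
    next_id = count(max_existing_id + 1)
    # Old source id -> new target id, in arrival order.
    new_id = {old_id: next(next_id) for old_id, _ in invoices}
    new_invoices = [(new_id[old_id], payload) for old_id, payload in invoices]
    new_lines = [(new_id[old_inv], payload) for old_inv, payload in lines]
    return new_invoices, new_lines
```

The renumbered invoices would then be bulk inserted with identity insert switched on, followed by the lines, which already carry the new parent ids. As the answer says, this only holds up if nothing else inserts into the table between reading the max ID and finishing the load.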
OK, this is a good news / bad news situation I'm afraid. First the good news and a bit of background which you may know but I'll put it down in case you don't.
You generally can't insert anything into IDENTITY columns. Of course, like everything else in life there are times when you need to and that can be done with the IDENTITY_INSERT option.
SET IDENTITY_INSERT MyTable ON
INSERT INTO MyTable (
MyIdCol,
Etc…
)
SELECT SourceIdCol,
Etc…
FROM MySourceTable
SET IDENTITY_INSERT MyTable OFF
Now, you say that you have surrogate keys in the target but then you say that they may clash. So I'm a little confused… Are you using the keys from the source (e.g. IDENTITY columns) or are you generating new keys in the target? I would strongly advise against trying to merge the keyspaces in a single key column. If you need to retain the keys then I would suggest a multi-field key using something like SourceSystemId to keep them unique.
Finally the bad news: SSIS doesn't provide a simple means of using the IDENTITY_INSERT option. The only way I've been able to do it is by turning it on in a SQL task that executes before the insert task. You should be able to pass the table name into the script as a variable. Make sure to include another SQL task afterwards to turn it off, because you can only use it on one table at a time.
I've got a table structure that I'm not really certain how best to create.
Basically I have two tables, tblSystemItems and tblClientItems. I have a third table that has a column that references an 'Item'. The problem is, this column needs to reference either a system item or a client item - it does not matter which. System items have keys in the 1..2^31 range while client items have keys in the range -1..-2^31, thus there will never be any collisions.
Whenever I query the items, I'm doing it through a view that does a UNION ALL between the contents of the two tables.
Thus, optimally, I'd like to make a foreign key reference the result of the view, since the view will always be the union of the two tables - while still keeping IDs unique. But I can't do this as I can't reference a view.
Now, I can just drop the foreign key, and all is well. However, I'd really like to have some referential checking and cascading delete/set null functionality. Is there any way to do this, besides triggers?
Sorry for the late answer, I've been struck with a serious case of weekenditis.
As for utilizing a third table to include PKs from both client and system tables - I don't like that as that just overly complicates synchronization and still requires my app to know of the third table.
Another issue that has arisen is that I have a third table that needs to reference an item - either system or client, it doesn't matter. Having the tables separated basically means I need to have two columns, a ClientItemID and a SystemItemID, each having a constraint for each of their tables with nullability - rather ugly.
I ended up choosing a different solution. The whole issue was with easily synchronizing new system items into the tables without messing with client items, avoiding collisions and so forth.
I ended up creating just a single table, Items. Items has a bit column named "SystemItem" that defines, well, the obvious. In my development / system database, I've got the PK as an int identity(1,1). After the table has been created in the client database, the identity key is changed to (-1,-1). That means client items go in the negative while system items go in the positive.
For synchronizations I basically ignore anything with (SystemItem = 1) while synchronizing the rest using IDENTITY_INSERT ON. Thus I'm able to synchronize while completely ignoring client items and avoiding collisions. I'm also able to reference just one "Items" table which covers both client and system items. The only thing to keep in mind is to fix the standard clustered key so it's descending to avoid all kinds of page restructuring when the client inserts new items (client updates vs system updates is like 99%/1%).
You can create a unique id (db generated: sequence, autoinc, etc.) for the table that references items, and create two additional columns (tblSystemItemsFK and tblClientItemsFk) where you reference the system items and client items respectively. Some databases allow you to have a foreign key that is nullable.
If you're using an ORM you can even easily distinguish client items and system items (this way you don't need negative identifiers to prevent ID overlap) based on column information only.
With a little more background/context it would probably be easier to determine an optimal solution.
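A small sketch of the two-nullable-FK layout, using Python and SQLite for illustration. Everything beyond the two FK column names is an assumption: the tblReferences table name, the Id and Name columns, the ON DELETE CASCADE actions, and in particular the CHECK constraint, which is an addition (not part of the answer above) to insist that exactly one of the two columns is filled.

```python
import sqlite3


def open_refs_db():
    """Create an in-memory database with the two-FK-column layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.executescript(
        """
        CREATE TABLE tblSystemItems (Id INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE tblClientItems (Id INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE tblReferences (
            Id               INTEGER PRIMARY KEY AUTOINCREMENT,
            tblSystemItemsFK INTEGER REFERENCES tblSystemItems (Id) ON DELETE CASCADE,
            tblClientItemsFk INTEGER REFERENCES tblClientItems (Id) ON DELETE CASCADE,
            -- Assumed extra rule: exactly one of the two references is set.
            CHECK ((tblSystemItemsFK IS NULL) <> (tblClientItemsFk IS NULL))
        );
        """
    )
    return conn
```

With this in place the database rejects a reference row that points at neither table, at both, or at an id that does not exist, and deletes cascade from either item table without triggers.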
You probably need a table, say tblItems, that simply stores all the primary keys of the two tables. Inserting items would require two steps, to ensure that when an item is entered into the tblSystemItems table its PK is also entered into the tblItems table.
The third table then has a FK to tblItems. In a way tblItems is a parent of the other two items tables. To query for an Item it would be necessary to create a JOIN between tblItems, tblSystemItems and tblClientItems.
[EDIT-for comment below] If the tblSystemItems and tblClientItems control their own PK then you can still let them. You would probably insert into tblSystemItems first then insert into tblItems. When you implement an inheritance structure using a tool like Hibernate you end up with something like this.
Add a table called Items with a PK ItemId and a single column called ItemType = "System" or "Client". Then have the ClientItems table PK (named ClientItemId) and the SystemItems table PK (named SystemItemId) both also be FKs to Items.ItemId. (These are zero-or-one (0-1) relationships.)
Then in your third table that references an item, just have its FK constraint reference the ItemId in this extra (Items) table...
If you are using stored procedures to implement inserts, just have the stored proc that inserts items insert a new record into the Items table first, and then, using the auto-generated PK value in that table insert the actual data record into either SystemItems or ClientItems (depending on which it is) as part of the same stored proc call, using the auto-generated (identity) value that the system inserted into the Items table ItemId column.
This is called "SubClassing"
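A minimal sketch of this subclassing layout, in Python with SQLite standing in for SQL Server and a Python function standing in for the stored procedure. The ItemNotes table (playing the "third table"), the Name and Body columns, and the cascade actions are assumptions added for the example.

```python
import sqlite3

ITEMS_SCHEMA = """
CREATE TABLE Items (
    ItemId   INTEGER PRIMARY KEY AUTOINCREMENT,
    ItemType TEXT NOT NULL CHECK (ItemType IN ('System', 'Client'))
);
CREATE TABLE SystemItems (
    SystemItemId INTEGER PRIMARY KEY REFERENCES Items (ItemId) ON DELETE CASCADE,
    Name TEXT
);
CREATE TABLE ClientItems (
    ClientItemId INTEGER PRIMARY KEY REFERENCES Items (ItemId) ON DELETE CASCADE,
    Name TEXT
);
-- Hypothetical third table: it only needs to know about Items.
CREATE TABLE ItemNotes (
    NoteId INTEGER PRIMARY KEY AUTOINCREMENT,
    ItemId INTEGER NOT NULL REFERENCES Items (ItemId) ON DELETE CASCADE,
    Body   TEXT
);
"""

SUBTYPES = {
    "System": ("SystemItems", "SystemItemId"),
    "Client": ("ClientItems", "ClientItemId"),
}


def open_items_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.executescript(ITEMS_SCHEMA)
    return conn


def add_item(conn, item_type, name):
    """Insert the Items row first, then the subtype row with the same key."""
    table, key = SUBTYPES[item_type]
    with conn:  # both inserts commit together or not at all
        item_id = conn.execute(
            "INSERT INTO Items (ItemType) VALUES (?)", (item_type,)
        ).lastrowid
        conn.execute(
            f"INSERT INTO {table} ({key}, Name) VALUES (?, ?)", (item_id, name)
        )
    return item_id
```

Because system and client items draw their keys from the one Items table, ids cannot collide, and the referencing table gets ordinary referential checks and cascading deletes without triggers.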
I've been puzzling over your table design. I'm not certain that it is right. I realise that the third table may just be providing detail information, but I can't help thinking that the primary key is actually the one in your ITEM table and the FOREIGN keys are the ones in your system and client item tables. You'd then just need to do right outer joins from Item to the system and client item tables, and all constraints would work fine.
I have a similar situation in a database I'm using. I have a "candidate key" on each table that I call EntityID. Then, if there's a table that needs to refer to items in more than one of the other tables, I use EntityID to refer to that row. I do have an Entity table to cross-reference everything (so that EntityID is the primary key of the Entity table, and all other EntityIDs are FKs), but I don't find myself using the Entity table very often.