How do I manage identities with ETL?

I need help figuring out a workflow and I'm not sure how to go about it... Let's say I'm transforming (ETL?) data from Table A to Table B. Table A has a composite primary key A.a+A.b+A.c, while Table B has just an automatically populated identity column. How can I map the composite keys from A back to the identities created when inserting into B?
Preferably I would like to not have any columns in table B related to A's composite key because there are many other tables that need to undergo the same operation but don't have the same composite key structure.

If I understand you correctly, you can't relate records from table B back to the records of table A after the transformation unless you somehow capture a mapping between A's composite key and B's identifier during the transformation.
You could add a column to A and pre-compute the identifiers to be used when inserting into B. Then you would have a mapping. This could also be done using a separate mapping table, if you don't want to add a column to A.
If you don't want to override the default assignment of identifiers, then you will have to capture them during the load. Oracle provides the returning clause for insert in PL/SQL for this purpose. I'm not sure about SQL Server. It may also be possible to accomplish this by using a trigger on B to insert into a separate mapping table or update a column in A, though that's likely to slow down your load considerably.
If nothing else, you could create additional columns in B to hold the keys of A during the load, query out the mappings into a separate table afterwards, and then drop the extra columns.
I hope that helps.
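As a rough illustration of capturing the mapping during the load, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the real database, and the names table_a, table_b, a_to_b and load_with_mapping are hypothetical. It loads row by row and records each generated identity in a separate mapping table, so B itself carries no columns from A's key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (a INTEGER, b INTEGER, c INTEGER, payload TEXT,
                          PRIMARY KEY (a, b, c));
    CREATE TABLE table_b (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT);
    -- Hypothetical mapping table: composite key of A -> identity assigned in B.
    CREATE TABLE a_to_b (a INTEGER, b INTEGER, c INTEGER,
                         b_id INTEGER NOT NULL REFERENCES table_b (id),
                         PRIMARY KEY (a, b, c));
""")
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?, ?, ?)",
    [(1, 1, 1, "first"), (1, 1, 2, "second"), (2, 5, 9, "third")],
)


def load_with_mapping(conn):
    """Copy A into B and record which identity each composite key received."""
    rows = conn.execute(
        "SELECT a, b, c, payload FROM table_a ORDER BY a, b, c"
    ).fetchall()
    for a, b, c, payload in rows:
        cur = conn.execute("INSERT INTO table_b (payload) VALUES (?)", (payload,))
        # lastrowid is the identity SQLite just generated for this insert.
        conn.execute("INSERT INTO a_to_b VALUES (?, ?, ?, ?)",
                     (a, b, c, cur.lastrowid))
    conn.commit()


load_with_mapping(conn)
mapping = conn.execute(
    "SELECT a, b, c, b_id FROM a_to_b ORDER BY a, b, c"
).fetchall()
b_columns = [row[1] for row in conn.execute("PRAGMA table_info(table_b)")]
```

Row-by-row loading is slow for large tables. On SQL Server, the OUTPUT clause of INSERT or MERGE is the usual set-based way to capture generated identities, but the idea of a separate mapping table stays the same.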

Ask yourself exactly what you need the original keys for. The answer may vary depending on the source system. This may lead you to maintain a "source system" column and an "original source keys" column. The latter may need to be a comma-delimited list of the original keys.
Or, you may find that you never actually need to map back, so don't need to keep anything.

Related

PostgreSQL - table with foreign key column needs to be updated first

I have encountered a problem in my project using PostgreSQL.
Say there are two tables A and B, both A and B have a (unique) field named ID. The ID column of table A is declared as a primary key, while the ID column of table B is declared as a foreign key pointing back to table A.
My problem is that every time we have new data inputted into database, the values in table B tend to be updated prior to the ones in table A (this problem can not be avoided as the project is designed this way). So I have to modify the relationship between A and B.
My goal is to achieve a situation where I can insert data into A and B separately while having the ON DELETE CASCADE clause enabled. What's more, INSERT and DELETE queries may happen at the same time.
Any suggestions?

It sounds like you have a badly designed project, if you can't use deferred constraints. Your basic problem is that you can't guarantee internal consistency of the data because transactions may occur which do not move the data from one consistent state to another.
Here is what I would do to be honest:
Catalog affected keys.
Drop affected key constraints.
Write a periodic job that looks for orphaned rows. Use LEFT JOIN because antijoins do not perform as well in PostgreSQL.
The problem with a third table is it doesn't solve your basic problem, which is that writes are not atomically consistent. And once you sacrifice that a lot of your transactional controls go out the window.
Long term, the project needs to be rewritten.
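The periodic orphan check suggested above could look roughly like the sketch below. It uses Python's sqlite3 module instead of PostgreSQL so that it runs anywhere; the tables a and b follow the question, and find_orphans is a hypothetical name. What the job does with the orphans (delete them, report them, wait for A to catch up) is left open.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER NOT NULL);  -- no FK: the constraint was dropped
    INSERT INTO a (id) VALUES (1), (2);
    INSERT INTO b (id) VALUES (1), (2), (3), (4);
""")


def find_orphans(conn):
    """Return ids present in b that have no matching row in a."""
    rows = conn.execute("""
        SELECT b.id
        FROM b
        LEFT JOIN a ON a.id = b.id
        WHERE a.id IS NULL      -- no partner row was found in a
        ORDER BY b.id
    """).fetchall()
    return [row[0] for row in rows]


orphans = find_orphans(conn)
```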

SQL Server 2008 - Database Design Query

I have to load the data shown in the image below into my database.
For a particular row, either field PartID will be NULL or field GroupID will be NULL, and the other available columns refer to the non-NULL entity. I have the following three options:
1. Use one database table with one unified column, say ID, which holds both PartID and GroupID data. In this case I won't be able to apply a foreign key constraint, as this column will contain both entities' data.
2. Use one database table with separate columns for PartID and GroupID, each holding the respective data. For each row, one of them will be NULL, but in this case I will be able to apply foreign key constraints.
3. Use two database tables with similar structure, the only difference being the PartID or GroupID column. In this case I will be able to apply foreign key constraints.
One thing to note here is that the table(s) will be used in import processes to import about 30000 rows in one go and will also be heavily used in data retrieval operations. Also, the other columns will be used as pivot columns.
Can someone please suggest what the best approach would be?

I would use option 2 and add a constraint that only one can be non-null and the other must be null (just to be safe). I would not use option 1 because of the lack of a FK and the possibility of linking to the wrong table when not obeying the type identifier in the join.
There is a 4th option, which is to normalize them as "items" with another (surrogate) key and two link tables which link items to either parts or groups. This eliminates NULLs. There are further problems with that approach (items might be in both again or neither without any simple constraint), so unless that is necessary for other reasons, I wouldn't generally go down that path.
Option 3 could be fine - it really depends if these rows are a relation - i.e. data associated with a primary key. That's one huge problem I see with the data presented, the lack of a candidate key - I think you need to address that first.
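For option 2 with the "exactly one is non-null" constraint, a minimal sketch might look like this. It runs on Python's sqlite3 module rather than SQL Server 2008, and the table and column names (part, part_group, import_rows, try_insert) are made up for illustration; the CHECK expression would need adapting to T-SQL syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE part (part_id INTEGER PRIMARY KEY);
    CREATE TABLE part_group (group_id INTEGER PRIMARY KEY);
    CREATE TABLE import_rows (
        row_id   INTEGER PRIMARY KEY,
        part_id  INTEGER REFERENCES part (part_id),
        group_id INTEGER REFERENCES part_group (group_id),
        -- Exactly one of the two references must be set.
        CHECK ((part_id IS NULL) <> (group_id IS NULL))
    );
    INSERT INTO part VALUES (10);
    INSERT INTO part_group VALUES (20);
""")


def try_insert(part_id, group_id):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO import_rows (part_id, group_id) VALUES (?, ?)",
            (part_id, group_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "part only": try_insert(10, None),
    "group only": try_insert(None, 20),
    "neither": try_insert(None, None),
    "both": try_insert(10, 20),
    "unknown part": try_insert(99, None),
}
```

Only the first two inserts are accepted: the CHECK rejects "neither" and "both", and the foreign key rejects the unknown part.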

IMO option 2 is the best - it's not perfectly normalized but will be the easiest to work with. 30K rows is not a lot of rows to import.

I would modify the table so it has one ID column and then add an IDType that is either "G" for Group or "P" for Part.

Too many lookup tables

What are the adverse effects of having too many lookup tables in the database?
I have to incorporate many enumerations, based on the applications.
What would experts advise?

Initially you have to ask yourself "how many is too many?". If there is a logical relation between two tables, there has to be a FK.
If you don't need the related tables anywhere within the database, you could consider removing them and using a CHECK constraint with an "IN" clause to enforce data validity. However, this would mean altering the table for each new value in the enumeration.
My personal advice is to keep the FKs and the tables. It's a clear solution, and the database is much easier to maintain if there is a descriptive text available for all those numbers.

Let me tell you how awful it is to have too few lookup tables. The original designers at one place I worked decided to put all lookups into one table and define what the lookups were for using a typeid. This caused almost all queries to hit this table to get the lookup descriptive value, causing a performance jam.
Further, without separate lookups, the fields that took the typeid were not constrained to the values appropriate to that field, because a foreign key can only reference the whole table, not a chunk of it. So the field that stored the clientid might accidentally contain the value for a user group. This caused data integrity problems and made reporting much more difficult, as we had to interpret values that didn't make sense in context. There is no prize for using too few tables; in fact it is often an anti-pattern in database design.
Create 1000 lookup tables if that is what you need.

Like Florian, I much prefer having tons of foreign keys to having CHECK IN (..), for a simple reason: you can just insert new records into your lookup tables.
Maintaining CHECK IN () is a much bigger problem. Imagine this scenario:
CREATE TABLE street
(
    id serial not null,
    st_type varchar(20) not null,
    st_name varchar(100) not null,
    constraint street_pk primary key (id),
    constraint street_type_check check (st_type in ('STREET','AVENUE','SQUARE'))
);
You have 1000 rows with those types checked, correct? If you need to add another one, you will need to drop the constraint and recreate it.
If you take an item off that list, like SQUARE, what will happen to the rows already committed (and checked at the moment of insertion) that have that type? They will still keep an invalid type.
Tables and Foreign Keys are easier to maintain and keep track of.
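To make the comparison concrete, here is a minimal sketch of the same street table backed by a lookup table instead of a CHECK constraint. It uses Python's sqlite3 module as a stand-in database; street_type and add_street are hypothetical names introduced for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE street_type (st_type TEXT PRIMARY KEY);
    CREATE TABLE street (
        id      INTEGER PRIMARY KEY,
        st_type TEXT NOT NULL REFERENCES street_type (st_type),
        st_name TEXT NOT NULL
    );
    INSERT INTO street_type VALUES ('STREET'), ('AVENUE'), ('SQUARE');
""")


def add_street(st_type, st_name):
    """Return True if the row is accepted, False if the FK rejects it."""
    try:
        conn.execute(
            "INSERT INTO street (st_type, st_name) VALUES (?, ?)",
            (st_type, st_name),
        )
        return True
    except sqlite3.IntegrityError:
        return False


before = add_street("BOULEVARD", "Sunset")   # type not in the lookup yet
# Adding a new type is a plain INSERT; no constraint has to be recreated.
conn.execute("INSERT INTO street_type VALUES ('BOULEVARD')")
after = add_street("BOULEVARD", "Sunset")

add_street("SQUARE", "Main")
try:
    # Removing a type that rows still use is refused by the foreign key.
    conn.execute("DELETE FROM street_type WHERE st_type = 'SQUARE'")
    removed = True
except sqlite3.IntegrityError:
    removed = False
```

With the lookup table, a type that is still in use cannot silently disappear from the list of valid values.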

The whole point of lookup data is that there is a finite list of valid identifiers for a specific field. If those specific fields are used in procedures or WHERE clauses to determine the correct process path or to limit the select list, then there is no such thing as too many lookups.
If it is not a finite list of identifiers for a specific process or WHERE clause, then it should not be a lookup value.
Two types of fields come to mind that might be considered lookup values but don't necessarily need to be:
City and Province/State:
There is a finite list of these, but because there are so many, you might not want to make a lookup for them.

How can I have a unique column in many tables?

I have ten or more (I don't know exactly) tables that have a column named foo with the same datatype.
How can I tell SQL that the values across all the tables should be unique?
I mean, if I have the value "1" in table1, I should NOT be able to have the value "1" in table2.

Have a common IDs table, which these ten tables reference. That will work well in that it will ensure unique IDs, but it doesn't mean you couldn't duplicate the IDs in the table if someone really wants to.
What I mean is that a common IDs table prevents duplicates on insert (each insert also adds its ID to the common table). To guarantee that duplicates never happen, though, you have to build the business rules into the system or place check constraints that cross-reference the other tables (which would ensure uniqueness, but degrade performance).

The question is phrased vaguely; if you need to generate a column that's unique among several tables, use row GUIDs or a common ID generator table; if you need to enforce uniqueness (and the field values are already there), use triggers.
Generally, if you generate the values, you don't need to enforce anything. The generation logic, if done right, will take care of that. If you are inserting, say, user input, then you can and should enforce uniqueness during insertion. As a validation rule or something.

You can define the field as a GUID (or a UNIQUEIDENTIFIER in SQL Server). Generated values, for example from NEWID(), will then be unique across all the tables for all practical purposes.

How about setting a check constraint on each table, such that ID % 10 = N (where N is the table number, from 0-9). And use IDENTITY(N,10) each time.
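A small sketch of this modulo scheme, assuming ten tables numbered 0 to 9. It uses Python's sqlite3 module, which has no IDENTITY(seed, increment), so the hypothetical helper next_foo emulates IDENTITY(N, 10) by hand; the table names table0..table9 are also made up.

```python
import sqlite3

TABLE_COUNT = 10

conn = sqlite3.connect(":memory:")
for n in range(TABLE_COUNT):
    # Each table only accepts values whose remainder mod 10 is its own number.
    conn.execute(
        f"CREATE TABLE table{n} ("
        f"foo INTEGER PRIMARY KEY CHECK (foo % {TABLE_COUNT} = {n}))"
    )


def next_foo(n):
    """Emulate IDENTITY(n, 10): start at n, then step by 10."""
    (current,) = conn.execute(f"SELECT MAX(foo) FROM table{n}").fetchone()
    return n if current is None else current + TABLE_COUNT


def insert_row(n):
    foo = next_foo(n)
    conn.execute(f"INSERT INTO table{n} (foo) VALUES (?)", (foo,))
    return foo


generated = [insert_row(1), insert_row(1), insert_row(2), insert_row(0)]

try:
    # 11 belongs to table1, so table2 must reject it.
    conn.execute("INSERT INTO table2 (foo) VALUES (11)")
    cross_insert_allowed = True
except sqlite3.IntegrityError:
    cross_insert_allowed = False
```

Because every table owns a distinct remainder class, no value can be valid in two tables at once. The scheme does fix the number of tables up front.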

I would suggest that possibly your design is flawed. Why are these separate tables? It would be better to put them in one table with one id field and another field to identify whatever is making these separate tables (customer id, for instance). Then you can read about partitioning tables if you want them to be split by customer for performance reasons.

Foreign key referencing composite table

I've got a table structure I'm not really certain of how to create the best way.
Basically I have two tables, tblSystemItems and tblClientItems. I have a third table that has a column that references an 'Item'. The problem is, this column needs to reference either a system item or a client item - it does not matter which. System items have keys in the 1..2^31 range while client items have keys in the range -1..-2^31, thus there will never be any collisions.
Whenever I query the items, I'm doing it through a view that does a UNION ALL between the contents of the two tables.
Thus, optimally, I'd like to make a foreign key reference the result of the view, since the view will always be the union of the two tables - while still keeping IDs unique. But I can't do this as I can't reference a view.
Now, I can just drop the foreign key, and all is well. However, I'd really like to have some referential checking and cascading delete/set null functionality. Is there any way to do this, besides triggers?

Sorry for the late answer, I've been struck with a serious case of weekenditis.
As for utilizing a third table to include PKs from both client and system tables - I don't like that as that just overly complicates synchronization and still requires my app to know of the third table.
Another issue that has arisen is that I have a third table that needs to reference an item - either system or client, it doesn't matter. Having the tables separated basically means I need to have two columns, a ClientItemID and a SystemItemID, each having a constraint for each of their tables with nullability - rather ugly.
I ended up choosing a different solution. The whole issue was with easily synchronizing new system items into the tables without messing with client items, avoiding collisions and so forth.
I ended up creating just a single table, Items. Items has a bit column named "SystemItem" that defines, well, the obvious. In my development / system database, I've got the PK as an int identity(1,1). After the table has been created in the client database, the identity key is changed to (-1,-1). That means client items go in the negative while system items go in the positive.
For synchronizations I basically ignore anything with (SystemItem = 1) while synchronizing the rest using IDENTITY_INSERT ON. Thus I'm able to synchronize while completely ignoring client items and avoiding collisions. I'm also able to reference just one "Items" table which covers both client and system items. The only thing to keep in mind is to fix the standard clustered key so it's descending, to avoid all kinds of page restructuring when the client inserts new items (client updates vs system updates is like 99%/1%).

You can create a unique id (db generated: sequence, autoinc, etc.) for the table that references items, and create two additional columns (tblSystemItemsFK and tblClientItemsFK) where you reference the system items and client items respectively. Some databases allow you to have a foreign key that is nullable.
If you're using an ORM you can even easily distinguish client items and system items based on column information only (this way you don't need negative identifiers to prevent ID overlap).
With a little more background/context it would probably be easier to determine an optimal solution.

You probably need a table, say tblItems, that simply stores all the primary keys of the two tables. Inserting items would require two steps, to ensure that when an item is entered into the tblSystemItems table its PK is also entered into the tblItems table.
The third table then has a FK to tblItems. In a way tblItems is a parent of the other two items tables. To query for an Item it would be necessary to create a JOIN between tblItems, tblSystemItems and tblClientItems.
[EDIT-for comment below] If the tblSystemItems and tblClientItems control their own PK then you can still let them. You would probably insert into tblSystemItems first then insert into tblItems. When you implement an inheritance structure using a tool like Hibernate you end up with something like this.

Add a table called Items with a PK ItemId and a single column called ItemType = "System" or "Client". Then have the ClientItems table PK (named ClientItemId) and the SystemItems PK (named SystemItemId) both also be FKs to Items.ItemId. (These relationships are zero-to-one (0-1) relationships.)
Then in your third table that references an item, just have its FK constraint reference the ItemId in this extra (Items) table...
If you are using stored procedures to implement inserts, just have the stored proc that inserts items insert a new record into the Items table first, and then, using the auto-generated PK value in that table insert the actual data record into either SystemItems or ClientItems (depending on which it is) as part of the same stored proc call, using the auto-generated (identity) value that the system inserted into the Items table ItemId column.
This is called "SubClassing"
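A minimal sketch of this subclassing layout, using Python's sqlite3 module as a stand-in for the real database. The function insert_item plays the role of the stored procedure, and every table and column name here (items, system_items, client_items, item_refs) is an illustrative assumption rather than the poster's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE items (
        item_id   INTEGER PRIMARY KEY AUTOINCREMENT,
        item_type TEXT NOT NULL CHECK (item_type IN ('System', 'Client'))
    );
    -- Each subtype's PK is also an FK to the parent Items row (0-1 relation).
    CREATE TABLE system_items (
        system_item_id INTEGER PRIMARY KEY REFERENCES items (item_id),
        name TEXT NOT NULL
    );
    CREATE TABLE client_items (
        client_item_id INTEGER PRIMARY KEY REFERENCES items (item_id),
        name TEXT NOT NULL
    );
    -- The "third table" only needs one FK, to the parent table.
    CREATE TABLE item_refs (
        ref_id  INTEGER PRIMARY KEY,
        item_id INTEGER NOT NULL REFERENCES items (item_id)
    );
""")

SUBTYPES = {
    "System": ("system_items", "system_item_id"),
    "Client": ("client_items", "client_item_id"),
}


def insert_item(item_type, name):
    """Insert the parent row first, then the subtype row with the same key."""
    table, key = SUBTYPES[item_type]
    with conn:  # both inserts commit together or roll back together
        cur = conn.execute("INSERT INTO items (item_type) VALUES (?)",
                           (item_type,))
        item_id = cur.lastrowid
        conn.execute(f"INSERT INTO {table} ({key}, name) VALUES (?, ?)",
                     (item_id, name))
    return item_id


sys_id = insert_item("System", "built-in widget")
cli_id = insert_item("Client", "custom widget")
conn.execute("INSERT INTO item_refs (item_id) VALUES (?)", (sys_id,))
conn.execute("INSERT INTO item_refs (item_id) VALUES (?)", (cli_id,))

try:
    conn.execute("INSERT INTO item_refs (item_id) VALUES (?)", (999,))
    dangling_allowed = True
except sqlite3.IntegrityError:
    dangling_allowed = False
```

The referencing table gets ordinary referential checking against a single parent table, without needing a view or negative key ranges.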

I've been puzzling over your table design. I'm not certain that it is right. I realise that the third table may just be providing detail information, but I can't help thinking that the primary key is actually the one in your ITEM table and the FOREIGN keys are the ones in your system and client item tables. You'd then just need to do right outer joins from Item to the system and client item tables, and all constraints would work fine.

I have a similar situation in a database I'm using. I have a "candidate key" on each table that I call EntityID. Then, if there's a table that needs to refer to items in more than one of the other tables, I use EntityID to refer to that row. I do have an Entity table to cross reference everything (so that EntityID is the primary key of the Entity table, and all other EntityID's are FKs), but I don't find myself using the Entity table very often.
