We have a project for loading data from an external source into a Data Vault data warehouse. The data are salary statements between an employer and an employee.
When we started modelling this, we found two business keys: the company id of the employer and the social security number (SSN) of the employee. Based on these we get two hubs, one for the employer and one for the employee. When adding a link between these two hubs, we noticed that there may (will) be more than one salary statement for each combination of employer and employee. This means we can't model this relationship with two hubs and one link.
Logically this could be handled by adding a third salary statement hub. Then we could have a link for all these three hubs. Our problem is that we don't have any business key for the salary statement!
My only thought for a workaround is to generate an artificial business key for the salary statement using company id, SSN and period of the salary statement. It doesn't really feel right to generate a business key in the Data Warehouse, but do we have any other options? Could this maybe be modeled differently with Data Vault?
Any thoughts and ideas highly appreciated.
What you've noticed here is a situation where Data Vault gets really difficult.
You have a situation where a data object doesn't have a business key.
The Data Vault architecture needs business keys.
You generally have 4 options.
1. Having a business object (in this case, a salary statement) without a business key is an anti-pattern. Convince the developer of the salary system to deliver a business key or unique transaction number for each salary statement.
2. Create a composite key, like you mentioned. The biggest issue with this approach is: can you be sure that the composite key is always unique? Let's say you use company id, SSN and period. What if a mistake was made in the salary system and they had to make an extra payment in the same period? In that situation you would have 2 rows for the same composite key (company id, SSN and period).
3. Create your own business key. Write a small program that takes the data from the salary system and adds its own business key. This could be as simple as a database table with a primary key, which you then use as the business key.
4. Don't use Data Vault for this object. If an object doesn't fit in Data Vault, or if there is another structure that fits the data better, then use that.
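The uniqueness concern with the composite key can be checked before loading. The following is a minimal sketch, not part of the original answer: the function names `composite_key` and `find_collisions`, the `|` delimiter and the sample rows are all hypothetical. It builds the key from company id, SSN and period and reports any key that occurs more than once in a batch.

```python
from collections import Counter


def composite_key(company_id: str, ssn: str, period: str) -> str:
    """Join the three parts into one artificial business key (delimiter is an assumption)."""
    return "|".join((company_id.strip(), ssn.strip(), period.strip()))


def find_collisions(statements):
    """Return the composite keys that occur more than once in the batch."""
    counts = Counter(
        composite_key(s["company_id"], s["ssn"], s["period"]) for s in statements
    )
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical sample data: the last two rows are two payments in the same period.
statements = [
    {"company_id": "C100", "ssn": "111-22-3333", "period": "2023-01", "amount": 4000},
    {"company_id": "C100", "ssn": "111-22-3333", "period": "2023-02", "amount": 4000},
    {"company_id": "C100", "ssn": "111-22-3333", "period": "2023-02", "amount": 250},
]

collisions = find_collisions(statements)
print(collisions)
```

If a check like this ever reports a collision, the composite key is not a safe business key for the hub, and one of the other options is probably the better choice.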
I am very confused about how to identify the primary key in the first normal form
In the first example: I can understand why SR_ID and Cus_No form the primary key. But why is Mngr_ID not part of the primary key? Why does Mngr_ID depend on SR_ID and Cus_No?
In another example: Why is staff_No not part of the primary key?
In the lecture, the first step my professor takes is to find the PK, but I am not sure how to do this. In this example, all other attributes depend on property_no and IDate, so those are the PK. But I don't understand why staff_no depends on property_no and IDate.
Well, the primary key is the column or combination of columns that makes each row in the table unique.
In the first example, Mngr_ID is not part of the key because the manager is directly assigned to a sales rep, so there is only one manager per sales rep. If you know the sales rep, the manager is irrelevant to uniqueness, as it is derived from the sales rep. On the other hand, if a sales rep could have more than one manager, then you would pull Mngr_ID out of this table and create a manager/sales rep table to keep that relationship. Either way, the manager is not needed to make the sales rep/customer relationship unique.
The second example is more confusing, as it depends on the interpretation of the (not very good) problem description. If only one person can inspect the house per day, then yes, staff is not needed as part of the primary key. However, since the problem says that more than one person could share the car, it may be possible for two people to inspect the house together on one day. In that case the staff member should be part of the key.
In the end, like everything in software development, it all depends on what you are trying to accomplish with your application.
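The reasoning about the first example can be tried out on sample data. This is a minimal sketch with hypothetical rows (the original table is not shown in the question) and a hypothetical helper `determines`, which checks whether rows that agree on some columns always agree on another column. Note that sample data can only refute a dependency; whether it really holds is a business rule.

```python
def determines(rows, key_cols, target_col):
    """Return True if rows that agree on key_cols always agree on target_col."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        # Remember the first target value seen for this key and compare later rows to it.
        if seen.setdefault(key, row[target_col]) != row[target_col]:
            return False
    return True


# Hypothetical sales rep / customer assignments.
rows = [
    {"SR_ID": 1, "Cus_No": 10, "Mngr_ID": 7},
    {"SR_ID": 1, "Cus_No": 11, "Mngr_ID": 7},
    {"SR_ID": 2, "Cus_No": 10, "Mngr_ID": 8},
]

# The sales rep alone fixes the manager, so Mngr_ID adds nothing to the key.
manager_follows_rep = determines(rows, ["SR_ID"], "Mngr_ID")
# The sales rep alone does not fix the customer, so Cus_No must be part of the key.
customer_follows_rep = determines(rows, ["SR_ID"], "Cus_No")
print(manager_follows_rep, customer_follows_rep)
```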
What is the line that you should draw when normalising data, in terms of data duplication? I.e. would you say that 2 employees who share the same birthday, or who have the same timestamp for a shift, are data duplication and should therefore be placed into another table?
Birth date has a full and non-transitive dependency on the person, which means that it should be stored in the same table where you keep your employees, and that complies with third normal form (3NF).
Work shifts are not an attribute of an employee, which means that they are a separate entity that is related to the employee entity.
There is no particular 'limit' when normalising data, since the main restriction on every relational database table is to have a unique primary key. Hence, if all other columns contain the same data but the primary key is different, it is a different row of the table.
The actual restrictions can come in two forms. One is the programmatic or systematic approach, where the restriction on what kind of data is entered comes from a program that interacts with the database, or from a predefined script handed to the database admin.
The other, more database-oriented approach would be to create primary keys composed of multiple columns. That way a row is unique as long as the combination of values in those columns is unique, even if the individual values repeat. It should be noted that a primary key is not necessarily the same as a unique key, which should be different for every instance.
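As an illustration of a multi-column primary key, here is a minimal sketch using Python's built-in sqlite3 module. The `shift` table and its columns are hypothetical. Two employees may share the same shift timestamp, because only the combination of the key columns has to be unique; repeating the same combination is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE shift (
        employee_id INTEGER NOT NULL,
        starts_at   TEXT    NOT NULL,
        PRIMARY KEY (employee_id, starts_at)
    )
""")

conn.execute("INSERT INTO shift VALUES (1, '2024-01-01T08:00')")
# Same timestamp, different employee: a different key, so this is allowed.
conn.execute("INSERT INTO shift VALUES (2, '2024-01-01T08:00')")

try:
    # Same employee and same timestamp: the combination already exists.
    conn.execute("INSERT INTO shift VALUES (1, '2024-01-01T08:00')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM shift").fetchone()[0]
print(duplicate_rejected, row_count)
```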
You have misunderstood what normalization does.
Two attributes having the same value (i.e. two employees having the same birthday) is not redundancy.
Rather, having the same attribute in two tables (i.e. two tables each having a birthday column, and therefore repeating every employee's birthday information) is.
Normalization is a quality decision and denormalization is a performance decision. For my school projects, my teachers recommended that I normalize at least to 3NF, so that may be a good guideline.
Suppose there is a table keeping info about Vendors and Customers in one table named Partners (since one partner can be vendor at one point of time and customer at other).
The Partners table has the usual stuff: company name, short name, address, city, country. Now, for domestic partners there is DomesticVatNumber and for non-domestic ones there is InternationalVatNumber. Usually, the VAT number would be a perfect candidate for the primary key, but the problem here is that not all domestic partners have an InternationalVatNumber, and international ones don't have a DomesticVatNumber.
I am trying to see the best ways to design this in the database. Is a surrogate key the only option in this case, or should I maybe reconsider having domestic and international partners in the same table? Should I maybe split them into 2 tables, DomesticPartners (which always have DomesticVatNumber) and InternationalPartners (which always have InternationalVatNumber), and then put the primary key on the DomesticVat and InternationalVat columns respectively?
What are pros/cons of each approach?
Personally, I would never make a primary key out of something assigned by an external party, nor would I use a value that the user would ever see. I would always use a meaningless key (either an identity column or a unique identifier).
Given what you are saying, I wouldn't split them into separate tables, since any table that referenced your partners through a foreign key would then need either two nullable columns to do this, or one column but no foreign key relationship (shudder...).
The best option is to have one table, have the domestic and international VAT numbers as separate fields in the table but not a primary key. Since they will both be nullable, you would have limited options for a unique constraint on them.
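A minimal sketch of that single-table design, using Python's built-in sqlite3 module. The table and column names and the sample rows are hypothetical. One caveat is engine-specific: SQLite treats NULLs as distinct in a UNIQUE constraint, so many partners can leave a VAT column empty, whereas some other engines (SQL Server, for example) allow only one NULL per unique constraint, which may be the limitation the answer has in mind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE partners (
        partner_id               INTEGER PRIMARY KEY,  -- meaningless surrogate key
        company_name             TEXT NOT NULL,
        domestic_vat_number      TEXT UNIQUE,          -- NULL for international partners
        international_vat_number TEXT UNIQUE           -- NULL for domestic partners
    )
""")

conn.execute(
    "INSERT INTO partners (company_name, domestic_vat_number) VALUES ('Acme', 'D-001')"
)
conn.execute(
    "INSERT INTO partners (company_name, domestic_vat_number) VALUES ('Beta', 'D-002')"
)
conn.execute(
    "INSERT INTO partners (company_name, international_vat_number) VALUES ('Gamma', 'INT-900')"
)

try:
    # Reusing an existing domestic VAT number violates the UNIQUE constraint.
    conn.execute(
        "INSERT INTO partners (company_name, domestic_vat_number) VALUES ('Copycat', 'D-001')"
    )
    duplicate_vat_rejected = False
except sqlite3.IntegrityError:
    duplicate_vat_rejected = True

partner_count = conn.execute("SELECT COUNT(*) FROM partners").fetchone()[0]
print(duplicate_vat_rejected, partner_count)
```

Other tables would reference `partner_id` only, so a single non-nullable foreign key column is enough regardless of which VAT number a partner has.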
Just my 2 cents
As your business grows, your systems get more complex, and it makes more sense to have one table. An example can be an ENTITIES table which stores everyone and everything, including vendors and customers. This can include individuals, groups and businesses, clients and staff, etc. Later on you will be glad you did it this way, because it reduces the number of complex joins you are going to have with multiple tables. You can use ENTITY_NO as a surrogate key and ENTITY_TYPE to differentiate entities. VAT number fields can be indexed separately and made nullable.
I have two types of accounts (customer and provider), and I chose the single-table strategy for persistence. A customer creates orders (one-to-many) and a provider bids on the orders auction-style (a many-to-many relationship, because a provider can bid on many orders, and so can other providers). My question is: is it possible to have these relationships simultaneously? Logically it can work, but MDA code generators don't like it. If it is possible, what drawbacks could I come across with this data model?
Thanks in advance.
The disadvantage is that you can't enforce referential integrity in the database between the accountID in the accounts table and the accountID in the bids table (which I assume represents the accountID of the provider bidding on the order) because not all accountID values are allowable.
But don't give up on the single-table solution for accounts, which may well be the correct one for your problem (I can't say for sure, as I don't completely understand the relation between providers and customers). Here's what you need to do to both use the single-table solution and allow referential integrity:
Remove isProvider and isCustomer from Accounts.
Add two new tables Providers and Customers. Each table will have an accountID column which is both the primary key in that table and a foreign key back to the original account table.
Migrate any additional columns from Accounts that are unique to either Providers or Customers into the appropriate table.
Now, the accountID in the Orders table should be a foreign key into Customers, not Accounts. Similarly, the accountID column in Bids becomes a foreign key into Providers rather than Accounts.
Referential integrity and single-table storage for accounts are now both provided for.
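The steps above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the column lists, the sample rows and the composite primary key on Bids (one bid per provider per order) are hypothetical and not taken from the question. The final insert shows the database refusing a bid from an account that is a customer but not a provider.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

conn.executescript("""
    CREATE TABLE Accounts  (accountID INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Subtype tables: the primary key is also a foreign key back to Accounts.
    CREATE TABLE Customers (accountID INTEGER PRIMARY KEY REFERENCES Accounts(accountID));
    CREATE TABLE Providers (accountID INTEGER PRIMARY KEY REFERENCES Accounts(accountID));
    CREATE TABLE Orders (
        orderID   INTEGER PRIMARY KEY,
        accountID INTEGER NOT NULL REFERENCES Customers(accountID)
    );
    CREATE TABLE Bids (
        orderID   INTEGER NOT NULL REFERENCES Orders(orderID),
        accountID INTEGER NOT NULL REFERENCES Providers(accountID),
        amount    REAL NOT NULL,
        PRIMARY KEY (orderID, accountID)  -- assumption: one bid per provider per order
    );
    INSERT INTO Accounts VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO Customers VALUES (1);
    INSERT INTO Providers VALUES (2);
    INSERT INTO Orders VALUES (100, 1);
    INSERT INTO Bids VALUES (100, 2, 49.5);
""")

try:
    # Account 1 is a customer, not a provider, so this bid violates the foreign key.
    conn.execute("INSERT INTO Bids VALUES (100, 1, 10.0)")
    customer_bid_rejected = False
except sqlite3.IntegrityError:
    customer_bid_rejected = True

bid_count = conn.execute("SELECT COUNT(*) FROM Bids").fetchone()[0]
print(customer_bid_rejected, bid_count)
```

An account that is both a customer and a provider simply gets a row in both subtype tables.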
"I chose the single-table strategy for persistence" - that's actually not that good a reason for combining them, in my opinion. Customers and providers are fundamentally different beasts.
The fact that you're having troubles is a clear indication that you're most likely doing it the wrong way - that's true of most things in the IT industry (and probably life itself but you don't need me proselytising on that).
I would separate them out into different tables to resolve this particular problem.
If you really want part of the data to be shared, you could put the common things in yet another table and reference it from the customers and providers tables.
You may want this if a single entity can be both a customer and provider - in that case, you would want the two different table entries to share the same information (such as balance, reputation and so on).
I have a few database tables that really only require a unique id that references another table e.g.
Customer          Holiday
********          *******
ID (PK)    --->   CustomerID (PK)
Forename          From
Surname           To
....
These tables such as Holiday, only really exist to hold information regarding a Customer. Therefore, do I need to specify a separate field to hold the ID for the holiday? i.e.
Holiday
*******
ID (PK)
CustomerID (FK)
...
Or would I be ok, in this instance, to just set the CustomerID as the primary key in the table?
Regards,
James.
This really depends on what you are doing.
If each customer can have only 1 holiday, then yes, you could make the CustomerID the primary key.
If each customer can have multiple holidays, then no, you would want to add a new id column, make it the primary. This allows you to select holidays by each customer AND to select individual records by their unique id.
Additionally, if each customer can only have 1 holiday, I'd just add the holiday information to the Customer table, as a one-to-one relationship is typically unnecessary.
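For the multiple-holidays case, a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the question, while the sample rows are hypothetical. Holiday gets its own ID as primary key and CustomerID as a foreign key, so rows can be selected per customer or individually by ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

conn.executescript("""
    CREATE TABLE Customer (ID INTEGER PRIMARY KEY, Forename TEXT, Surname TEXT);
    CREATE TABLE Holiday (
        ID         INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL REFERENCES Customer(ID),
        "From"     TEXT NOT NULL,   -- quoted because FROM and TO are SQL keywords
        "To"       TEXT NOT NULL
    );
    INSERT INTO Customer VALUES (1, 'Ada', 'Example');
    INSERT INTO Holiday (CustomerID, "From", "To") VALUES
        (1, '2024-07-01', '2024-07-14'),
        (1, '2024-12-20', '2025-01-02');
""")

# All holidays for one customer.
holidays_for_customer = conn.execute(
    "SELECT ID FROM Holiday WHERE CustomerID = ? ORDER BY ID", (1,)
).fetchall()

# One specific holiday by its own ID.
single_holiday = conn.execute(
    'SELECT "From", "To" FROM Holiday WHERE ID = ?', (2,)
).fetchone()

print(holidays_for_customer, single_holiday)
```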
If I understand your question correctly, you could only use the Customer id as the primary key in Holiday if there will never be any other holiday for that customer in the table. In other words, two holidays for one customer break the use of the Customer id as a primary key.
If there will ever be an object-oriented program associated with this database, each entity (each row) must have a unique key.
Your second design assures that each instance of Holiday can be uniquely identified and processed by an OO application using a simple Object-Relational Mapping.
Generally, it's best to assure that every entity in the database has a unique, immutable, system-assigned ("surrogate") key. Other "natural" keys can have unique indexes, constraints, etc., to fit the business logic.
The previous answer is correct, but also remember that you could have separate primary keys in each table, and the "holiday" table would have the foreign key to CustomerId.
Then you could manage the assignment of holidays to customers in your code, to make sure that only one holiday can be assigned to a customer. But this brings in the problem of concurrency: 2 people adding a holiday to a customer at the same time will most probably result in a customer having 2 holidays.
You could even place the holiday fields in the customer table if a customer can only be created with a holiday, but this design is messy and not really advised.
So once again, option 2 in your question is still the best way to go; I'm just giving you your options.
In practice I've found that every table should have a unique primary key identifying the records in those tables. All relationships with other tables should be explicitly declared.
This helps others understand the relationships better, especially if they use a tool to reverse-engineer the schema into a visual representation.
In addition, it gives you more flexibility to expand your solution in the future. You may only have one holiday per customer now, but this is much more difficult to change if you make customer ID the primary key.
If you want to mandate the uniqueness of customer in the holiday table, create a unique index on that foreign key. In fact, this could improve performance when querying on customer ID (although I'm guessing you won't see enough records to notice this improvement).
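A minimal sketch of that approach with Python's built-in sqlite3 module. The index name `ux_holiday_customer`, the `StartDate` column and the sample rows are hypothetical. The unique index enforces one holiday per customer today, and dropping it later lifts the rule without touching the primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Holiday (
        ID         INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL,
        StartDate  TEXT NOT NULL
    );
    CREATE UNIQUE INDEX ux_holiday_customer ON Holiday (CustomerID);
    INSERT INTO Holiday (CustomerID, StartDate) VALUES (1, '2024-07-01');
""")

try:
    # A second holiday for customer 1 violates the unique index.
    conn.execute("INSERT INTO Holiday (CustomerID, StartDate) VALUES (1, '2024-12-20')")
    second_holiday_rejected = False
except sqlite3.IntegrityError:
    second_holiday_rejected = True

# If the business rule changes, only the index has to go.
conn.execute("DROP INDEX ux_holiday_customer")
conn.execute("INSERT INTO Holiday (CustomerID, StartDate) VALUES (1, '2024-12-20')")

holiday_count = conn.execute("SELECT COUNT(*) FROM Holiday").fetchone()[0]
print(second_holiday_rejected, holiday_count)
```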