Is it mandatory that Hub keys uniquely identify a record in a Satellite? - data-modeling

I am a newbie to Data Vault modelling and could not find a satisfying answer to the query below; please help.
In Data Vault modelling, must the statement below be met? If yes, must it hold 100% of the time?
"The BKs in the Hub or Hubs should be such that they are sufficient to uniquely identify a record in the Satellite"
Thanks,

The primary key of a satellite should be the key of the hub or link it attaches to, plus the load date on which the row is inserted.
So you can have multiple entries in the satellite for one business key, but you shouldn't have multiple entries for one point in time.
The above would not hold for a many-to-many relationship, where an additional field is needed in the satellite's primary key to make the rows unique (the classic example is the line number of multiple items within the same sale/invoice).
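As a sketch of that rule, here is a minimal SQLite example via Python's sqlite3 (the hub and satellite names, columns, and values are illustrative, not part of any standard): the satellite's primary key is the hub key plus the load date, with a line number added for the multi-item invoice case described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_invoice (
    invoice_hk TEXT PRIMARY KEY,   -- hash of the business key
    invoice_bk TEXT NOT NULL,      -- the business key itself
    load_date  TEXT NOT NULL
);

-- Satellite PK = hub key + load date + line number (multi-active case)
CREATE TABLE sat_invoice_line (
    invoice_hk TEXT NOT NULL REFERENCES hub_invoice(invoice_hk),
    load_date  TEXT NOT NULL,
    line_no    INTEGER NOT NULL,
    product    TEXT,
    amount     REAL,
    PRIMARY KEY (invoice_hk, load_date, line_no)
);
""")

conn.execute("INSERT INTO hub_invoice VALUES ('h1', 'INV-001', '2024-01-01')")
# Two lines of the same invoice, loaded at the same point in time:
conn.execute("INSERT INTO sat_invoice_line VALUES ('h1', '2024-01-01', 1, 'widget', 9.99)")
conn.execute("INSERT INTO sat_invoice_line VALUES ('h1', '2024-01-01', 2, 'gadget', 4.50)")
# The same hub key + load date + line number again violates the PK:
try:
    conn.execute("INSERT INTO sat_invoice_line VALUES ('h1', '2024-01-01', 1, 'dup', 0)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
line_count = conn.execute("SELECT COUNT(*) FROM sat_invoice_line").fetchone()[0]
print(duplicate_allowed, line_count)  # False 2
```

One business key can have many satellite rows over time, but the composite key rejects a second row for the same point in time (and same line number).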

Related

Normalization (3NF) on a simple table of two columns

I have a table with only two attributes (DeliveryPerson and DeliveryTime). Each person can deliver a “product” at a specific Delivery Time. As you can see below John for example has delivered three products at different delivery times.
According to my task, I must put this table in 3NF, but I am confused because I cannot set “deliveryPerson” as a primary key because there are repeated values in this column. Is there any way of setting up this table to satisfy 3NF? If that is not possible, is it correct to have a table like this in a DB without a Primary key?
Thank you very much!
Normalisation is not about adding Primary Keys to a table where you've already decided the columns, it's about deciding what tables and columns you need in the first place. The inability to define a Primary Key on this table is the problem you've been asked to solve; the solution will involve creating new tables.
Rather than looking at the table, look at the data you're trying to model:
There are four (and probably any number of) delivery people
Each delivery person can have one or more (maybe even zero) delivery times
A normalised database will represent each of those separately. I'll leave the details for you to work out, rather than feeding you the full answer.
There are plenty of tutorials available that will probably explain it better than I can.

What are the specific steps to identify the primary key in first normal form

I am very confused about how to identify the primary key in first normal form.
In the first example, I can understand why SR_ID and Cus_No form the primary key. But why is Mngr_ID not part of the primary key? Why does Mngr_ID depend on SR_ID and Cus_No?
In another example: why is staff_No not part of the primary key?
In the lecture, the first step my professor takes is to find the PK, but I am not sure how to do this. In this example, all other attributes depend on property_no and IDate, so they are the PK. But I don't understand why staff_no depends on property_no and IDate.
Well, the primary key is the column or combination of columns that makes a row in the table unique.
In the first example, Mngr_ID is not part of the key because the manager is directly assigned to a user, so there is only one manager per user. If you know the user, the manager is redundant, since it is derived from the user. On the other hand, if a user could have more than one manager, then you would pull the manager ID out of this table and create a manager-sales rep table to keep that relationship. Either way, the manager is not needed to make the sales rep - customer relationship unique.
The second example is more confusing, as it depends on interpretation of the (not very good) problem description. If only one person can inspect a house per day, then staff_No is not needed as part of the primary key. However, since the problem says that more than one person could share the car, it may be possible for two people to inspect a house together on the same day. In that case, staff_No should be part of the key.
In the end, like everything in software development, it all depends on what you are trying to accomplish with your application.
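To illustrate the second example, here is a minimal sketch in Python's sqlite3, using the question's hypothetical column names (property_no, IDate, staff_No): with the staff number included in the composite primary key, two staff members can record an inspection of the same property on the same day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE inspection (
    property_no TEXT NOT NULL,
    idate       TEXT NOT NULL,   -- inspection date
    staff_no    TEXT NOT NULL,
    comments    TEXT,
    PRIMARY KEY (property_no, idate, staff_no)
)""")
# Two staff members inspecting the same property on the same day
# are two distinct rows only because staff_no is part of the key:
conn.execute("INSERT INTO inspection VALUES ('P1', '2024-05-01', 'S1', 'ok')")
conn.execute("INSERT INTO inspection VALUES ('P1', '2024-05-01', 'S2', 'ok')")
rows = conn.execute("SELECT COUNT(*) FROM inspection").fetchone()[0]
print(rows)  # 2
```

With a key of only (property_no, idate), the second insert would have failed, which is exactly the interpretation question the answer raises.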

Where is the limit to what would be considered data duplication in a database?

Where do you draw the line when normalising data, in terms of data duplication? E.g. would you say that two employees who share the same birthday, or have the same timestamp for a shift, is data duplication that should therefore be moved into another table?
Birth date has a full and non-transitive dependency on a person, which means it should be stored in the same table where you keep your employees, and that complies with third normal form (3NF).
Work shifts are not an attribute of an employee, which means they are a different entity and stand in a relationship with the employee entity.
There is no particular 'limit' when applying normalisation to data, since the main restriction for every relational database table is that it have a unique primary key. Hence, if all other columns contain the same data but the primary key differs, it is a different row of the table.
The actual restrictions can come in two forms. One is the programmatic or systematic approach, where the restriction on what kind of data may be entered comes from a program that interacts with the database, or from an already-defined script handed to the database administrator.
The other, more database-oriented approach is to create primary keys composed of multiple columns. That way a row is unique only if the combination of values across those columns is unique. Note that a primary key is not necessarily the same as a unique key, which should be different for every instance.
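As a small illustration of the point about primary keys (a sketch with made-up table and column names): two employees sharing a birth date are still two distinct rows, because the primary key differs; the repeated value in a non-key column is not duplication in the normalisation sense.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE employee (
    emp_id     INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    birth_date TEXT NOT NULL
)""")
# Same birth_date in both rows, but distinct primary keys,
# so these are simply two different employees:
conn.execute("INSERT INTO employee VALUES (1, 'Alice', '1990-03-14')")
conn.execute("INSERT INTO employee VALUES (2, 'Bob',   '1990-03-14')")
shared = conn.execute(
    "SELECT COUNT(*) FROM employee WHERE birth_date = '1990-03-14'"
).fetchone()[0]
print(shared)  # 2
```

Redundancy would only appear if birth_date were repeated across several tables for the same employee; within one table it is just an attribute value.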
You have misunderstood what normalization does.
Two attributes having the same value (e.g. two employees having the same birthday) is not redundancy.
Rather, having the same attribute in two tables (e.g. two tables each having a birthday column, thereby repeating every employee's birthday) is.
Normalization is a quality decision and denormalization is a performance decision. For my school projects, my teachers recommended normalizing at least to 3NF, so that may be a good guideline.

What's the best practice for a table that refers n tables with a match in one of them?

I'm working on a database design, and I face a situation where notifications will be sent according to logs in three tables, each log contains different data. NOTIFICATIONS table should then refer these three tables, and I thought of three possible designs, each seems to have flaws in it:
Each log table will have a unique incremented id, and NOTIFICATIONS table will have three different columns as FK's. The main flaw in this design is that I can't create real FK's since two of the three fields will be NULL for each row, and the query will have to "figure out" what kind of data is actually logged in this row.
The log tables will share one unique incremented id across all of them. Then I can make three OUTER JOINs with these tables when I query NOTIFICATIONS, and each row will have exactly one match. This seems at first like a better design, but I will have gaps in each log table, and the flaws of option 1 still exist.
Option 1/2 + creating three notifications tables instead of one. This option will require the app to query notifications using UNION ALL.
Which option makes a better practice? Is there another way I didn't think of? Any advice will be appreciated.
I have one solution that sacrifices referential integrity to help you achieve what you want.
You can use a GUID data type as the primary key in all three log tables. In the NOTIFICATIONS table you then add one foreign-key column which doesn't point to any particular table. Only you know it is a foreign key; SQL Server doesn't, and it doesn't enforce referential integrity. In this column you store the GUID of the row the notification came from. That row can be in any of the three logs, but since the primary key of all three logs is a GUID, you can store the key in your NOTIFICATIONS table.
You also add another column in the NOTIFICATIONS table to tell which of the three logs this GUID belongs to. Now you know exactly which row of which log table to go to in order to find the notification's info.
The root problem is that you have three separate log tables. Instead, you could have had only one log table with an extra column specifying what kind of log entry it is. That way you'd have only one table, referential integrity would be preserved, and the design would be simple.
Use one table holding notification ids. Each of the three original tables holds a subtype of notification, with an FK on its own id to that table. Search re subtyping/subtables in databases. This is a standard design pattern/idiom.
(There are entities. We group them conceptually. We call the groups kinds or types. We say of a particular entity that it is a whatever kind or type of entity, or even that it "is a" whatever. We can have groups that contain all the entities of another group. Since the larger is a superset of the smaller we say that the larger type is a supertype of the smaller type, and the smaller is a subtype of the larger.)
There are idioms you can use to help constrain your tables declaratively. The main one is to have a subtype tag in the supertype table, and optionally also in the subtype tables (where each table has only one tag value).
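A minimal sketch of that supertype/subtype idiom in SQLite, via Python's sqlite3 (all table and column names here are made up for illustration): each subtype table's primary key is also a foreign key to the supertype table, and the supertype carries the subtype tag.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
-- Supertype: one row per notification, with a subtype tag.
CREATE TABLE notification (
    notif_id INTEGER PRIMARY KEY,
    log_kind TEXT NOT NULL CHECK (log_kind IN ('error', 'audit', 'access'))
);
-- Subtypes: PK doubles as FK to the supertype.
CREATE TABLE error_log (
    notif_id INTEGER PRIMARY KEY REFERENCES notification(notif_id),
    message  TEXT
);
CREATE TABLE audit_log (
    notif_id INTEGER PRIMARY KEY REFERENCES notification(notif_id),
    actor    TEXT
);
CREATE TABLE access_log (
    notif_id INTEGER PRIMARY KEY REFERENCES notification(notif_id),
    ip       TEXT
);
""")
conn.execute("INSERT INTO notification VALUES (1, 'error')")
conn.execute("INSERT INTO error_log VALUES (1, 'disk full')")
# A subtype row without a supertype row is rejected declaratively:
try:
    conn.execute("INSERT INTO audit_log VALUES (99, 'alice')")
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False
print(orphan_allowed)  # False
```

Unlike the fake-FK approach, every subtype row is guaranteed by the database to have a supertype row, and queries can start from the single notification table.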
I eventually faced two main options:
Following the last suggestion in this answer.
Choosing a less normalized structure for the database, i.e. fake/no FK's. To be precise, in my case it would be my second option above with fake FK's.
I chose option #2, as a DBA I consulted enlightened me on the idea that database normalization should be driven by where the structure could actually break. In my case, although notifications are created based on logs, these FK's are not necessary for querying the notifications or the logs, and the app does not have to enforce this relationship to function properly. Thus, following option #1 may be "over-normalization".
Thanks all for your answers and comments.

data distribution based on primary key

Currently, in one of my projects we're supporting 32k entities, but this is reaching its performance limits, so we're thinking of distributing them across different databases based on their integer primary keys. E.g. the first 35k will go to one DB, the next 35k to the next DB, and so on (based on (primary key % #db) logic).
However, this presents a problem when inserting an entity into a DB: since we don't know its primary key value beforehand, how do we figure out which DB to insert it into?
One possibility is maintaining a global id table in just one DB: we insert into it first, get the primary key value, and then use it to choose a DB for the detailed insert. But this solution is not uniform and hence difficult to maintain and extend. Any other thoughts on how to go about it?
Found this nice article that talks about how Flickr solved this problem:
http://code.flickr.com/blog/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
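The core idea can be sketched in a few lines of Python (the names TicketServer and shard_for are made up for illustration): a central counter hands out ids first, and the target database is then derived from the id, so the shard never needs to be known before the id is issued.

```python
import itertools

class TicketServer:
    """A stand-in for a central id-issuing database: it only hands out
    monotonically increasing ids, nothing else."""
    def __init__(self, start=1):
        self._counter = itertools.count(start)

    def next_id(self):
        return next(self._counter)

NUM_SHARDS = 4

def shard_for(entity_id, num_shards=NUM_SHARDS):
    # The shard is a pure function of the id, so any client
    # can route reads and writes without a lookup table.
    return entity_id % num_shards

tickets = TicketServer()
new_id = tickets.next_id()          # 1: obtained before touching any shard
target = shard_for(new_id)          # 1 % 4 == 1
print(new_id, target)               # 1 1
```

In the Flickr article this counter is a tiny dedicated MySQL instance (two of them, issuing odd and even ids, for availability); only the id issuance is centralized, while the entity data itself lives on the shards.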
