How should missing information in relational databases be modeled properly? - database

Some argue that NULLs in relational databases are evil and propose alternative approaches to missing information, usually leading to relations in 6NF. Then, essentially, missing information is being expressed by missing tuples in relations. However, in my understanding, that violates the Closed World Assumption, because if a tuple can possibly appear in some relation but doesn't, that means that the proposition corresponding to that tuple is false, which is different from being unknown.
For example, consider following relation:
|---------------|
| NAME | AGE |
|--------+------|
| andrey | 30 |
|---------------|
Here, the predicate is "the person called NAME is AGE years old", so we know that Andrey is 30 years old. If we do not know his age, then we could express that with a NULL:
|---------------|
| NAME | AGE |
|--------+------|
| andrey | NULL |
|---------------|
However, if NULLs are prohibited, we could not do that. Instead, we could remove the tuple in the relation with NAME andrey. But under the Closed World Assumption, that would imply that andrey does not have any age at all (maybe he is eternal), which is different from saying that andrey does have some age, which we simply do not know.
So, my question is: is it really possible to model missing information correctly without NULLs and not violating the Closed World Assumption? Or do I miss something?

Related

A primary key with more than 3 attributes

Consider a relation that depicts a tutorial room booking in a university. Each faculty assigns a person to handle the booking for all tutorial classes for that faculty. The person’s email address is given to the university’s booking system as a contact person.
BOOKING(b_date, b_starttime, b_endtime, unit_code, contact_person, room_no, tutor_id).
Exercise: Identify candidate key(s) and primary key for the relation if the following business rules are applicable:
1) Tutorial classes can be either 1 hour or 2 hours long.
2) A tutor can only teach one tutorial class in a given unit.
3) There are no parallel sessions of tutorial classes.
My answer: 1 of candidate key is (tutor_id, d_date). My suggestion key can make each tuple unique BUT there is no constraint that prevents users input incorrectly as below:
tutor_id | d_date | b_starttime | b_endtime | unit_code |
AAAA | Mon | 6 PM | 8PM | FIT1111 |
BBBB | Mon | 6 PM | 7PM | FIT1111 |
OR
tutor_id | d_date | b_starttime | b_endtime | unit_code |
AAAA | Mon | 6 PM | 8PM | FIT1111 |
BBBB | Mon | 7 PM | 8PM | FIT1111 |
Consequently, my key does not meet the business rule no.3. Then, I add 2 more attributes to candidate key (tutor_id, d_date, b_starttime, b_endtime).
My questions is when we choose candidate key, apart from guaranteeing the uniqueness of each tuple, do we need to prevent users from possibly wrong input which may break the business rules? If yes, when we set, for example, 4 attributes (A, B, C, D) as the primary key, whether the DBMS blocks users from wrong input action as in the table above?
Thanks.
My questions is when we choose candidate key, apart from guaranteeing the uniqueness of each tuple, do we need to prevent users from possibly wrong input which may break the business rules?
Yes, we should declare enough constraints to disallow all invalid states. And yes, declaring one or even all CKs (candidate keys) is not necessarily enough. (Why would it be? Do you think you have a reason why it should?)
A CK doesn't just state the uniqueness of a set of attributes. It also states the non-uniqueness of any proper subset of that set.
CKs are used in normalization to higher NFs (normal forms), which gets rid of certain problematic constraints that a table must always be the join of other tables that you could have instead. It drops your current table & introduces those tables & their CK constraints & sometimes FK (foreign key) constraints, that subrows must appear elsewhere as CKs. But still other constraints can hold that should be declared.
If yes, when we set, for example, 4 attributes (A, B, C, D) as the primary key, whether the DBMS blocks users from wrong input action as in the table above?
You mention that a particular CK constraint wouldn't prevent certain invalid states. But since you should always declare all CKs, you should find them all before you start looking for invalid states that aren't prevented. Also "Then, I add 2 more attributes to candidate key" makes no sense because 1. that is not how we find CKs and 2. a CK plus other attributes cannot be a CK.
(Read & act on hits googling 'stackexchange homework' & post a new question showing your work following your textbook finding the CKs.)

Functional Dependencies of a User Message Relation

UPDATE: Relational Model may not work in the way I want it to, see:Database normalization for facebook-like messaging system
Time for NoSQL!
I am having trouble putting a database into 2nf. For that, you must determine all functional dependencies before you can decide if an attribute is prime or non prime.
Have a look here:
--------------------------------------------
to | from | msg | time
--------|--------|----------------|---------
joe | jim | hello | 1
jim | joe | hey | 2
jim | joe | how are you | 3
victor | bryce | i love carrots | 4
joe | jim | im doin great | 5
bryce | jim | hello | 6
NOTE: Time will be unique. It will be transacted.
Does time->message despite
time1->"hello"
time6->"hello"
Because I have heard as long as there are unique instances of message, its fine. However, I am confused by this.
Also, I want to add a message id column. Is that good practice?
A functional dependency asks, "If I know one value for 'X', do I know one and only one value for 'Y'?", where 'X' and 'Y' are attributes of a relation. ('X' and 'Y' refer to sets of attributes.)
If the values of the attribute "time" are actually unique, then knowing one value for "time" means you know one and only one value for "msg". That means the functional dependency time->msg holds for this relation.
In contrast, the functional dependency "to"->"msg" does not hold in this relation, because knowing the value "joe" means I know two values for "msg": 'hello', and 'im doin great'. It doesn't hold for this relation, so we say "to"->"msg" is not a functional dependency in this relation.
For exactly the same reason, "to, from"->"msg" does not hold in this relation. So "to, from"->"msg" is not a functional dependency in this relation.
Also, I want to add a message id column. Is that good practice?
Adding attributes that are not in the original relation has to do with data compression, not with normalization. Normalization never introduces new attributes or new dependencies. Adding "msg_id" as an attribute introduces two new functional dependencies (depending on what "msg_id" means): "msg_id"->"msg", and "time"->"msg_id".
So adding a "msg_id" attribute might be a good idea sometimes (less often than you think), but it has nothing to do with normalization. Assuming you intend to project "msg_id, msg" as a new table and remove "msg" from the original relation, you need to declare "msg" as unique, too.

How to store different types of address data in db?

I have to create a database combined with 4 types of xls files, for example A, B, C and D. Every year new file is created, starting from 2004. A have 7 sheets with 800-1000 rows, B - D have one sheet with max 200 rows.
Everyone knows that people are lazy, so in excel files, address data are stored differently in each sheet. One of them, from 2008, have address data stored separately, but every other sheets have this data combined into one column.
Sooo, here is a question - how should I design a datatable? Something like this?
+---------+----------+----------+-------------+--------------------------------+
| Street | House Nr | City | Postal Code | Combined Address |
+---------+----------+----------+-------------+--------------------------------+
| Street1 | 20 | Somwhere | 00-000 | null |
| Street2 | 98 | Elswhere | 99-999 | null |
| null | null | null | null | Somwhere 00-000, street3 29 |
| null | null | null | null | st. Street2 65 12-345 Elswhere |
+---------+----------+----------+-------------+--------------------------------+
There will be a lot of nulls, so maybe best solution would be 2 different tables?
Most important thing is that users will search by using this data, and in the future add data into that database without excel files.
There are at least two different angles of view here: Normalization and efficiency, leading to different results.
Normalization
If this is the most important criterion you would make even three tables. Obviously Combined Address needs a place of it's own, but also Postal Code and City have to be stored into another table, because there is a dependency between them. Just one of the two, most probably Postal Code will stay. (Yes, there even is sth. about Street and Postal Code too, but I'm clearly not going to be pedantic.)
Efficiency
Normalization as an end in itself doesn't necessarily make the best result. If you permit yourself to be a bit sloppy on that and leave it the way it is in the model you posted, things might become easier in coding. You could use a trigger to make sure Combined Address is never null or use a (materialized) view that pretends it is and just search in Combined Address for the time being.
Imagine the effort if you use different tables and there is a need for referencing these addresses in other tables (Which table to use when? How to provide a unique id? Clearly a problem.).
So, decide on what's more important.
If I'm not mistaken we are taking about some 2000 rows or some 8000 rows if it is '7 sheets with 800-1000 rows each' actually. Even if the latter applies this isn't a number that makes data correction impracticable. If the number of different input pattern in the combined column is low, you might be able to do this (partly) automatically and just have some-one prove reading.
So you might want to think about a future redesign as well and choose what's more convenient in this case.

Normalizing a Table 6

I'm putting together a database that I need to normalize and I've run into an issue that I don't really know how to handle.
I've put together a simplified example of my problem to illustrate it:
Item ID___Mass___Procurement__Currency__________Amount
0__________2kg___inherited____null________________null
1_________13kg___bought_______US dollars_________47.20
2__________5kg___bought_______British Pounds______3.10
3_________11kg___inherited____null________________null
4__________9kg___bought_______US dollars__________1.32
(My apologies for the awkward table; new users aren't allowed to paste images)
In the table above I have a property (Amount) which is functionally dependent on the Item ID (I think), but which does not exist for every Item ID (since inherited items have no monetary cost). I'm relatively new to databases, but I can't find a similar issue to this addressed in any beginner tutorials or literature. Any help would be appreciated.
I would just create two new tables ItemProcurement and Currencies.
If I'm not wrong, as per the data presented, the amount is part of the procurement of the item itself (when the item has not been inherited), for that reason I would group the Amount and CurrencyID fields in the new entity ItemProcurement.
As you can see, an inherited item wouldn't have an entry in the ItemProcurement table.
Concerning the main Item table, if you expect just two different values for the kind of procurement, then I would use a char(1) column (varying from B => bougth, I => inherited).
I would looks like this:
The data would then look like this:
TABLE Items
+-------+-------+--------------------+
| ID | Mass | ProcurementMethod |
|-------+-------+--------------------+
| 0 | 2 | I |
+-------+-------+--------------------+
| 1 | 13 | B |
+-------+-------+--------------------+
| 2 | 5 | B |
+-------+-------+--------------------+
TABLE ItemProcurement
+--------+-------------+------------+
| ItemID | CurrencyID | Amount |
|--------+-------------+------------+
| 1 | 840 | 47.20 |
+--------+-------------+------------+
| 2 | 826 | 3.10 |
+--------+-------------+------------+
TABLE Currencies
+------------+---------+-----------------+
| CurrencyID | ISOCode | Description |
|------------+---------+-----------------+
| 840 | USD | US dollars |
+------------+---------+-----------------+
| 826 | GBP | British Pounds |
+------------+---------+-----------------+
Not only Amount, everything is dependent on ItemID, as this seems to be a candidate key.
The dependence you have is that Currency and Amount are NULL (I guess this means Unknown/Invalid) when the Procurement is 'inherited' (or 0 cost as pointed by #XIVsolutions and as you mention "inherited items have no monetary cost")
In other words, iems are divided into two types (of procurements) and items of one of the two types do not have all attributes.
This can be solved with a supertype/subtype split. You have a supertype table (Item) and two subtype tables (ItemBought and ItemInherited), where each one of them has a 1::0..1 relationship with the supertype table. The attributes common to all items will be in the supertype table and every other attribute in the respecting subtype table:
Item
----------------------------
ItemID Mass Procurement
0 2kg inherited
1 13kg bought
2 5kg bought
3 11kg inherited
4 9kg bought
ItemBought
---------------------------------
ItemID Currency Amount
1 US dollars 47.20
2 British Pounds 3.10
4 US dollars 1.32
ItemInherited
-------------
ItemID
0
3
If there is no attribute that only inherited items have, you even skip the ItemInherited table altogether.
For other questions relating to this pattern, look up the tag: Class-Table-Inheritance. While you're at it, look up Shared-Primary-Key as well. For a more concpetual treatment, google on "ER Specialization".
Here is my off-the-cuff suggestion:
UPDATE: Mass would be a Float/Decimal/Double depending upon your Db, Cost would be whatever the optimal type is for handling money (in SQL Server 2008, it is "Money" but these things vary).
ANOTHER UPDATE: The cost of an inherited item should be zero, not null (and in fact, there sometime IS an indirect cost, in the form of taxes, but I digress . . .). Therefore, your Item Table should require a value for cost, even if that cost is zero. It should not be null.
Let me know if you have questions . . .
Why do you need to normalise it?
I can see some data integrity challenges, but no obvious structural problems.
The implicit dependency between "procurement" and the presence or not of the value/currency is tricky, but has nothing to do with the keys and so is not a big deal, practically.
If we are to be purists (e.g. this is for homework purposes), then we are dealing with two types of item, inherited items and bought items. Since they are not the same type of thing, they should be modelled as two separate entities i.e. InheritedItem and BoughtItem, with only the columns they need.
In order to get a combined view of all items (e.g. to get a total weight), you would use a view, or a UNION sql query.
If we are looking to object model in the database, then we can factor out the common supertype (Item), and model the subtypes (InheritedItem, BoughtItem) with foreign-keys to the supertype table (ypercube explanation below is very good), but this is very complicated and less future-proof than only modelling the subtypes.
This last point is the subject of much argument, but practically, in my experience, modelling concrete supertypes in the database leads to more pain later than leaving them abstract. Okay, that's probably waaay beyond what you wanted :).

Which normal form does this table violate?

Consider this table:
+-------+-------+-------+-------+
| name |hobby1 |hobby2 |hobby3 |
+-------+-------+-------+-------+
| kris | ball | swim | dance |
| james | eat | sing | sleep |
| amy | swim | eat | watch |
+-------+-------+-------+-------+
There is no priority on the types of hobbies, thus all the hobbies belong to the same domain. That is, the hobbies in the table can be moved on any hobby# column. It doesn't matter on which column, a particular hobby can be in any column.
Which database normalization rule does this table violate?
Edit
Q. Is "the list of hobbies [...] in an arbitrary order"?
A. Yes.
Q. Does the table have a primary key?
A. Yes, suppose the key is an AUTO_INCREMENT column type named user_id.
The question is if the columns hobby# are repeating groups or not.
Sidenote: This is not a homework. It's kind of a debate, which started in the comments of the question SQL - match records from one table to another table based on several columns. I believe this question is a clear example of the 1NF violation.
However, the other guy believes that I "have fallen fowl of one of the fallacies of 1NF." That argument is based on the section "The ambiguity of Repeating Groups" of the article Facts and Fallacies about First Normal Form.
I am not writing this to humiliate him, me, or whomever. I am writing this, because I might be wrong, and there is something I am clearly missing and maybe this guy is not explaining it good enough to me.
You say that the hobbies belong to the same domain and that they can move around in the columns. If by this you mean that for any specific name the list of hobbies is in an arbitrary order and kriss could just as easily have dance, ball, swim as ball, swim, dance, then I would say you have a repeating group and the table violates 1NF.
If, on the other hand, there is some fundamental semantic difference between a particular person's first and second hobbies, then there may be an argument for saying that the hobbies are not repeating groups and the table may be in 3NF (assuming that hobby columns are FK to a hobby table). I would suggest that this argument, if it exists, is weak.
One other factor to consider is why there are precisely 3 hobbies and whether more or fewer hobbies are a potential concern. This factor is important not so much for normalization as for flexibility of design. This is one reason I would split the hobbies into rows, even if they are semantically different from one-another.
Your three-hobby table design probably violates what I usually call the spirit of the original 1NF (probably for the reasons given by dportas and others).
It turns out however, that it is extremely difficult to find [a set of] formal and precise "measurable" criteria that accurately express that original "spirit". That's what your other guy was trying to explain talking about "the ambiguity of repeating groups".
Stress "formal", "precise" and "measurable" here. Definitions for all other normal forms exist that satisfy "formal", "precise" and "measurable" (i.e. objectively observable). For 1NF it's just hard (/impossible ???) to do. If you want to see why, try this :
You stated that the question was "whether those three hobby columns constitute a repeating group". Answer this question with "yes", and then provide a rigorous formal underpinning for your answer.
You cannot just say "the column names are the same, except for the numbered suffix". Making a violation of such a rule objectively observable/measurable would require to enumerate all the possible ways of suffixing.
You cannot just say "swim, tennis" could equally well be "tennis, swim", because getting to know that for sure requires inspecting the external predicate of the table. If that is just "person <name> has hobby <hobby1> and also has <hobby2>" , then indeed both are equally valid (aside : and due to the closed world assumption it would in fact require all possible permutations of the hobbies to be present in the table !!!). However, if that external predicate is "person <name> spends the most time on <hobby1> and the least on <hobby2>", then "swim, tennis" could NOT equally well be "tennis,swim". But how do you make such interpretations of the external predicate of the table objective (for ALL POSSIBLE PREDICATES) ???
etc. etc.
This clearly "looks" like a design error.
It's not not a design error when this data is simply stored and retrieved. You need only 3 of the hobbies and you don't intend to use this data in any other way than retrieve.
Let's consider this relationship:
Hobby1 is the main hobby at some point in a person's life (before 18 years of age for example)
Hobby2 is the hobby at another point (19-30)
Hobby3 is her hobby at a another one.
Then this table seems definitely well designed and while the 1NF convention is respected the naming arguably "sucks".
In the case of an indiscriminate storage of hobbies this is clearly wrong in most if not all cases I can think of right now. Your table has duplicate rows which goes against the 1NF principles.
Let's not consider the reduced efficiency of SQL requests to access data from this table when you need to sort the results for paging or any other practical reason.
Let's take into consideration the effort required to work with your data when your database will be used by another developer or team:
The data here is "scattered". You have to look in multiple columns to aggregate related data.
You are limited to only 3 of the hobbies.
You can't use simple rules to establish unicity (same hobby only once per user).
You basically create frustration, anger and hatred and the Force is disturbed.
Well,
The point is that, as long as all hobby1, hobby2 and hobby3 values are not null, AND names are unique, this table could be considered more or less as abbiding by 1NF rules (see here for example ...)
But does everybody has 3 hobbies? Of course not! Do not forget that databases are basically supposed to hold data as a representation of reality! So, away of all theories, one cannot say that everybody has 3 hobbies, except if ... our table is done to hold data related to people that have three hobbies without any preference between them!
This said, and supposing that we are in the general case, the correct model could be
+------------+-------+
| id_person |name |
+------------+-------+
for the persons (do not forget to unique key. I do not think 'name' is a good one
+------------+-------+
| id_hobby |name |
+------------+-------+
for the hobbies. id_hobby key is theoretically not mandatory, as hobby name can be the key ...
+------------+-----------+
| id_person |id_hobby |
+------------+-----------+
for the link between persons and hobbies, as the physical representation of the many-to-many link that exists between persons and their hobbies.
My proposal is basic, and satisfies theory. It can be improved in many ways ...
Without knowing what keys exist and what dependencies the table is supposed to satisfy it's impossible to determine for sure what Normal Form it satisfies. All we can do is make guesses based on your attribute names.
Does the table have a key? Suppose for the sake of example that Name is a candidate key. If there is exactly one value permitted for each of the other attributes for each tuple (which means that no attribute can be null) then the table is in at least First Normal Form.
If any of the columns in the table accept nulls then then the table violates first normal form. Assuming no nulls, #dportas has already provided the correct answer.
The table does not violate first normal form.
First normal form does not have any prohibition against multiple columns of the same type. As long as they have distinct column names, it is fine.
The prohibition against "Repeating Groups" concerns nested records - a structure which is common in hierarchical databases, but typically not possible in relational databases.
The table using repeating groups would look something like this:
+-------+--------+
| name |hobbies |
+-------+--------+
| kris |+-----+ |
| ||ball | |
| |+-----+ |
| ||swim | |
| |+-----+ |
| ||dance| |
| |+-----+ |
+-------+--------+
| james |+-----+ |
| ||eat | |
| |+-----+ |
| ||sing | |
| |+-----+ |
| ||sleep| |
| |+-----+ |
+-------+--------+
| amy |+-----+ |
| ||swim | |
| |+-----+ |
| ||eat | |
| |+-----+ |
| ||watch| |
| |+-----+ |
+-------+--------+
In a table conforming to 1NF all values can be located though table name, primary key, and column name. But this is not possible with repeated groups, which require further navigation.

Resources