Determining Super Key - database

According to Wikipedia
Today's Court Bookings
Each row in the table represents a court booking at a tennis club that has one hard court (Court 1) and one grass court (Court 2).
A booking is defined by its Court and the period for which the Court is reserved.
Additionally, each booking has a Rate Type associated with it. There are four distinct rate types:
SAVER, for Court 1 bookings made by members
STANDARD, for Court 1 bookings made by non-members
PREMIUM-A, for Court 2 bookings made by members
PREMIUM-B, for Court 2 bookings made by non-members
The table's superkeys are:
S1 = {Court, Start Time}
S2 = {Court, End Time}
S3 = {Rate Type, Start Time}
S4 = {Rate Type, End Time}
S5 = {Court, Start Time, End Time}
S6 = {Rate Type, Start Time, End Time}
S7 = {Court, Rate Type, Start Time}
S8 = {Court, Rate Type, End Time}
ST = {Court, Rate Type, Start Time, End Time}, the trivial superkey
Note that even though in the above table Start Time and End Time
attributes have no duplicate values for each of them, we still have to
admit that in some other days two different bookings on court 1 and
court 2 could start at the same time or end at the same time. This is
the reason why {Start Time} and {End Time} cannot be considered as the
table's superkeys.
How is S1 = {Court, Start Time} a superkey?
Say on day 1, a member books court 1 from 11:00 to 12:00, and on day 2, a non member books court 1 from 11:00 to 12:00.
the records in the table would be
{1,11:00,12:00, SAVER} and {1,11:00,12:00, STANDARD}
Clearly S1 = {Court, Start Time} is not a superkey. Or am I wrong?

This example is a poor choice because understanding what the table is supposed to hold involves unstated, although common-sense, assumptions. It expects you to see that the table is only for one day--"Today"--and to infer that on any day there will be no overlapping bookings, i.e. no start-end time period for a court overlaps another one for the same court. (The text mentions different days when it means different table values; but it doesn't matter to the example whether different values have to be on different days.)
It is also a poor choice for 3NF vs BCNF in particular. Of course it is subject to certain FDs (functional dependencies) and their associated JDs (join dependencies) relevant to 3NF vs BCNF. But the non-overlap of bookings is a separate constraint irrelevant to 3NF vs BCNF.
Say on day 1, a member books court 1 from 11:00 to 12:00, and on day 2, a non member books court 1 from 11:00 to 12:00.
When we say that a table value "satisfies" a constraint (eg FD) or "is subject to" a constraint or "has" a constraint or that a constraint "holds in" a table value we mean that the value makes the constraint true. When we say this about a table variable (base table) we mean that it is so for the variable's value in every database state. For this table, describing the current booking situation for "Today", any particular booking situation will be about one day--Today. So the kind of overlapping involving different days in your quote is not relevant to the constraints. Similarly each table value from different times in the same day will satisfy the constraints itself regardless of how the bookings have changed.
Under those circumstances, for any state of the table the four specified sets of columns are CKs (candidate keys):
S1 = {Court, Start Time}
S2 = {Court, End Time}
S3 = {Rate Type, Start Time}
S4 = {Rate Type, End Time}
Because bookings don't overlap, a subrow value for each of these column sets is unique under those columns. So they are superkeys. Since that's true for no smaller subset of each, they are CKs. Since it's true for no other column sets, there are no other CKs. Since every superset of a superkey is a superkey, the other listed sets are the other (non-CK) superkeys.
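For illustration, here is a minimal SQL sketch of how those keys could be declared; the table name and column types are assumptions, not part of the original example, and S1 is arbitrarily chosen as the primary key while the other CKs become UNIQUE constraints:
-- Hypothetical DDL; the superkeys S5..ST need no declaration because
-- any superset of a declared key is automatically unique.
CREATE TABLE todays_court_bookings (
    court      INT         NOT NULL,  -- 1 = hard court, 2 = grass court
    start_time TIME        NOT NULL,
    end_time   TIME        NOT NULL,
    rate_type  VARCHAR(10) NOT NULL,  -- SAVER, STANDARD, PREMIUM-A, PREMIUM-B
    PRIMARY KEY (court, start_time),      -- S1
    UNIQUE      (court, end_time),        -- S2
    UNIQUE      (rate_type, start_time),  -- S3
    UNIQUE      (rate_type, end_time)     -- S4
);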
PS There are a few sections on that entry's talk page about the Tennis/Booking example and confusions on the page. The page has other poor examples. Eg it restructures the non-BCNF 3NF design to a BCNF design, but not by standard lossless decomposition to projections of the original (that join back to it). (It introduces a new column.) Eg it then also talks about preserving dependencies but that only makes sense when decomposing to projections of the original.

Related

How to normalize the schema to BCNF

I am having some issues with normalization. I have a schema REPAYMENT which looks like this:
Now, from what I've gathered, the functional dependencies that hold in the schema are:
{borrower_id} --> {name, address, request_date, loan_amount}
{request_date} --> {repayment_date, loan_amount}
{loan_amount} --> {repayment_amount}
(correct me if I'm wrong?)
I'm supposed to normalise the schema to BCNF, but I'm a bit confused. Is the candidate key request_date and borrower_id?
It can be used to register information on the repayments on micro loans. A borrower, his name and address, are identified with a unique borrower_id. Borrowers can have multiple loans at the same time, but each of those loans (specified by loan_amount, repayment_date and repayment_amount) has a different request date. Thus a loan can be identified by the borrower ID and the request date of the loan. The borrower can repay multiple (different) loans on the same date, but each loan can only be repaid once (on one date with one amount). There is a system which for each request date and amount of a loan determines the repayment date and amount to be repaid. The loan amount requested and the repaid amount are not the same since there is an interest rate that applies.
From the definition of candidate key:
In the relational model of databases, a candidate key of a relation is
a minimal superkey for that relation; that is, a set of attributes
such that:
1. The relation does not have two distinct tuples (i.e. rows or records in common database language) with the same values for these attributes (which means that the set of attributes is a superkey).
2. There is no proper subset of these attributes for which (1) holds (which means that the set is minimal).
Now your question :
Is the candidate key request_date and borrower_id?
It is a superkey, but not a minimal one. Here's how we compute the candidate key.
Which attribute occurs only on the left side, considering all the F.D.'s?
It's borrower_id. This means that it must be a part of every key of this given schema. Now let us compute its closure.
Because of {borrower_id} --> {name, address, request_date, loan_amount}:
closure(borrower_id) = borrower_id, name, address, request_date, loan_amount.
Because of {request_date} --> {repayment_date, loan_amount} and closure(borrower_id) has request_date, this means
closure(borrower_id) = borrower_id, name, address, request_date, loan_amount, repayment_date
And finally, because of {loan_amount} --> {repayment_amount} and closure(borrower_id) has loan_amount, this means
closure(borrower_id) = borrower_id, name, address, request_date, loan_amount, repayment_date, repayment_amount
Because closure of borrower_id contains all the attributes, borrower_id is a key and since it is minimal, it is indeed the candidate key and the only one.
Now let us decompose the schema into BCNF. The algorithm is:
Given a schema R.
Compute keys for R.
Repeat until all relations are in BCNF.
Pick any R' having a F.D A --> B that violates BCNF.
Decompose R' into R1(A,B) and R2(A,Rest of attributes).
Compute F.D's for R1 and R2.
Compute keys for R1 and R2.
Since {request_date} --> {repayment_date, loan_amount} and request_date is not a key, it violates BCNF, so we split the schema into two relations:
R1(request_date,repayment_date,loan_amount)
R2(borrower_id,name,address,request_date,repayment_amount)
Clearly R1 is in BCNF. But R2 is NOT in BCNF, because we missed the following F.D.:
address --> name
and we know address is not a key, so we split R2 further as:
R3(borrower_id,address,request_date,repayment_amount)
R4(address,name)
Now, clearly both R3 and R4 are in BCNF. Had we not split R2 further, we would end up storing the same combination of address and name for every loan a person takes, which is redundancy.
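If it helps to see the end result, here is a minimal DDL sketch of the final decomposition; the table names and column types are assumptions, only the attribute sets come from the derivation above:
-- R1: loan terms determined by the request date
CREATE TABLE R1_loan_terms (
    request_date    DATE          PRIMARY KEY,
    repayment_date  DATE          NOT NULL,
    loan_amount     DECIMAL(12,2) NOT NULL
);

-- R4: name determined by address
CREATE TABLE R4_person (
    address  VARCHAR(200) PRIMARY KEY,
    name     VARCHAR(100) NOT NULL
);

-- R3: the remaining attributes, keyed by borrower_id (the only candidate key found above)
CREATE TABLE R3_repayment (
    borrower_id       INT           PRIMARY KEY,
    address           VARCHAR(200)  NOT NULL REFERENCES R4_person (address),
    request_date      DATE          NOT NULL REFERENCES R1_loan_terms (request_date),
    repayment_amount  DECIMAL(12,2) NOT NULL
);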

DB Schema: Versioned price model vs invoice-related data

I am creating some db model for rental invoice generation.
The invoice consists of N booking time ranges.
Each booking belongs to a price model. A price model is a set of rules which determine a final price (base price + season price + quantity discount + ...).
That means the final price for the N bookings within an invoice can be a complex calculation, and of course I want to keep track of every aspect of the final price calculation for later review of an invoice.
The problem is, that a price model can change in the future. So upon invoice generation, there are two possibilities:
(a) Never change a price model. Just make it immutable by versioning it and refer to a concrete version from an invoice.
(b) Put all the price information, discounts and extras into the invoice. That would mean a lot of data, as an invoice contains N bookings which may be partly in the range of a season price.
Basically, I would break down each booking into its days and for each day I would have N rows calculating the base price, discounts and extra fees.
Possible table model:
Invoice
id: int
InvoiceBooking # Each booking. One invoice has N bookings
id: int
invoiceId: int
(other data, e.g. guest information)
InvoiceBookingDay # Days of a booking. Each booking has N days
id: int
invoiceBookingId: id
date: date
InvoiceBookingDayPriceItem # Concrete discounts, etc. One day has many items
id: int
invoiceBookingDayId: int
price: decimal
title: string
My question is: which way should I prefer, and why?
My considerations:
With solution (a), the invoice would be re-calculated using the price model information each time the data is viewed. I don't like this, as algorithms can change. It does not feel natural for the "read-only" nature of an invoice.
Also the version handling of price models is not a trivial task and the user needs to know about the version concept, which adds application complexity.
With solution (b), I generate a bunch of nested data and it adds a lot of complexity to the schema.
Which way would you prefer? Am I missing something?
Thank you
There is a third option which I recommend. I call it temporal (time) versioning and the layout of the table is really quite simple. You don't describe your pricing data so I'll just show a simple example.
Table: DailyPricing
ID EffDate Price ...
A 01/01/2015 17.50 ...
B 01/01/2015 20.00 ...
C 01/01/2015 22.50 ...
B 01/01/2016 19.50 ...
C 07/01/2016 24.00 ...
This shows that all three price schedules (A, B and C just represent whatever method you use to distinguish between price levels) were given a price on Jan 1, 2015. On Jan 1, 2016, the price of plan B was reduced. In July, the price of plan C was increased.
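If it helps, a minimal DDL sketch of that table could look like this; the column types are assumptions:
CREATE TABLE DailyPricing (
    ID      CHAR(1)       NOT NULL,  -- price schedule (A, B, C, ...)
    EffDate DATE          NOT NULL,  -- date this price takes effect
    Price   DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (ID, EffDate)        -- one price per schedule per effective date
);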
To get the current price of a plan, the query is this:
select dp.Price
from DailyPricing dp
where dp.ID = 'A'
and dp.EffDate = (
select Max( dp2.EffDate )
from DailyPricing dp2
where dp2.ID = dp.ID
and dp2.EffDate <= :DateOfInterest);
The DateOfInterest variable would be loaded with the current date/time. This query returns the one price that is currently in effect. In this case it is the price set on Jan 1, 2015, as that price has never changed since taking effect. If the search had been for plan B, the price set on Jan 1, 2016 would have been returned, and for plan C, the price set on July 1, 2016. These are the latest prices set for each plan; that is, the current prices.
Such a query would more likely be in a join with probably the invoice table so you could perform the price calculation.
select ...
from Invoices i
join DailyPricing dp
on dp.ID = i.ID
and dp.EffDate = (
select Max( dp2.EffDate )
from DailyPricing dp2
where dp2.ID = dp.ID
and dp2.EffDate <= i.InvoiceDate )
where i.ID = 1234;
This is a little more complex than a simple query but you are asking for more complex data (or, rather, a more complex view of the data). However, this calculation is probably only executed once and the final price stored back into the invoice data or elsewhere.
It would be calculated again only if the customer made some changes or you were going through an audit, rechecking the calculation for accuracy.
Notice something, however, that is subtle but very important. If the query above were being executed for an invoice that had just been created, the InvoiceDate would be the current date and the price returned would be the current price. If, however, the query was being run as a verification on an invoice that was two years old, the InvoiceDate would be two years ago and the price returned would be the price that was in effect two years ago.
In other words, the query to return current data and the query to return past data is the same query.
That is because current data and past data remain in the same table, differentiated only by the date the data takes effect. This, I think, is about the simplest solution to what you want to do.
How about A and B?
It's not best practice to re-calculate any component of an invoice, especially if the component was printed. An invoice and invoice details should be immutable, and you should be able to reproduce it without re-calculating.
If you ever have a problem with figuring out how you got to a certain amount, or if there is a bug in your program, you'll be glad you have the details, especially if the calculations are complex.
Also, it's a good idea to keep a history of your pricing models so you can validate how you got to a certain price. You can make this simple for your users. They don't have to see the history -- but you should record their changes in the history log.

SSAS- MDX Assign fact row to dimension member base on calculation

I am looking to calculate something in the calc script, so I can allocate a row from a fact table to a dimension member.
The business scenario is the following. I have a fact table that records customer credits and debits (a customer can take out a lot of little loans) and a Customer dimension. I want to classify each customer based on his history of credits and debits over a given period. The classification of a customer changes over time.
Example
The rule is: if a customer's balance (for a given period) is over 50 000, the classification is "large"; if he has more than one record and has made a payment in the last 3 months, he is "P&P"; if he doesn't owe any money and has made a payment in the last 3 months, he is "regular".
My question is more about direction than a specific code,which way is the best to implement this kind of rule ?
Best Regards
Vincent Diallo-Nort
I'd create a fact table with a balance status that is auto-updated every day:
check the rolling balance: yesterday's balance plus today's records.
when the balance = 0, remove the record.
Plus add a flow fact table with payments only.
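As a rough sketch, the relational source for those two fact tables might look like this; the names and types are assumptions, not part of your model:
-- Daily balance snapshot: one row per customer per day,
-- aggregated in the cube with LastChild.
CREATE TABLE FactCustomerBalance (
    CustomerKey INT           NOT NULL,
    DateKey     INT           NOT NULL,
    Balance     DECIMAL(14,2) NOT NULL,
    PRIMARY KEY (CustomerKey, DateKey)
);

-- Payment flow: one row per payment, aggregated with Sum.
CREATE TABLE FactPayment (
    CustomerKey INT           NOT NULL,
    DateKey     INT           NOT NULL,
    Payment     DECIMAL(14,2) NOT NULL
);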
Add measures:
LastChild aggregation for the first fact table.
Sum aggregation for the second fact table.
When it's done, you may apply an MDX calculation:
case
when [Measures].[Balance] > 50000
then "Large"
when [Measures].[Payments] + ([Date].[Calendar].CurrentMember.Lag(1),[Measures].[Payments]) + ([Date].[Calendar].CurrentMember.Lag(2),[Measures].[Payments]) > 0
then "P&P"
else "Regular"
end
In order to give you a more detailed answer, you would have to provide more information about your data structure.

3nf functional dependency in wiki's example

I read the wiki page about 3NF:
https://en.wikipedia.org/wiki/Third_normal_form
This is the example the wiki gives:
Tournament Winners
Tournament            Year  Winner          Winner Date of Birth
Indiana Invitational  1998  Al Fredrickson  21 July 1975
Cleveland Open        1999  Bob Albertson   28 September 1968
Des Moines Masters    1999  Al Fredrickson  21 July 1975
Indiana Invitational  1999  Chip Masterson  14 March 1977
It says that the non-prime attribute Winner Date of Birth is transitively dependent on the candidate key {Tournament, Year} via the non-prime attribute Winner.
I think a functional dependency means that
for two rows X1, X2, if X1.col1 = X2.col1 and X1.col2 = X2.col2, then col1 -> col2
I cannot understand why Winner Date of Birth -> Winner (there may be someone who has the same birthday and the same name?),
and I cannot understand how Winner can -> the candidate key {Tournament, Year} (given the winner name Al Fredrickson, it may be Indiana Invitational 1998 or Des Moines Masters 1999).
So, how does it jump to the conclusion?
Informally, a functional dependency means one value on the left side cannot produce multiple values on the right, even when the left side exists in more than one row. [1]
So, in the Wikipedia example, there is a functional dependency Winner -> Winner Date of Birth, simply because the same winner cannot have different dates of birth even when he/she exists in multiple rows (because he/she won multiple tournaments).
Since...
{Tournament, Year} -> Winner (since one tournament cannot have multiple winners)
and Winner -> Winner Date of Birth (as explained above)
and not Winner -> {Tournament, Year} (since one person can win multiple tournaments)
...then by definition there is a transitive dependency.
I cannot understand that Winner Date of Birth->Winner(there may be someone who have same birthday and same name?)
You flipped the direction. The functional dependency is not "from" the single value, it's "toward" it. Therefore Winner -> Winner Date of Birth, but not Winner Date of Birth -> Winner.
BTW, there cannot be two different people with the same name in this model. A better (more realistic) model would probably use a surrogate key to identify people, allowing for duplicated names.
[1] Which is compliant with the mathematical concept of a "function". No matter how many times you "call" a function (i.e. how many rows contain the f.d. left side), it always produces the same "result" (the f.d. right side). If it could produce multiple results, it would not be a function, it would be a "relation".
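If you want to check the Winner -> Winner Date of Birth dependency directly against the sample rows, a query along these lines does it, assuming the data sits in a hypothetical table tournament_winners with underscored column names; no rows returned means the dependency holds in that data:
-- Any winner mapped to more than one date of birth violates the FD.
select Winner
from tournament_winners
group by Winner
having count(distinct Winner_Date_of_Birth) > 1;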
From what I understand:
For any {Tournament, Year} you have only one winner. Each winner has only one date of birth. The wiki claims that this can lead to a vulnerability:
Assume you have entered a new row: {"Stupid tournament", "2013", "Al Fredrickson", "21 July 2012"} - you've entered an incorrect date of birth!
If you keep another table {WinnerID, WinnerBirthday}, you'll prevent that.
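A minimal SQL sketch of that split; the table and column names are assumptions for illustration:
-- Each person is stored once, so a single date of birth per winner is enforced.
CREATE TABLE Winner (
    WinnerID    INT          PRIMARY KEY,
    Name        VARCHAR(100) NOT NULL,
    DateOfBirth DATE         NOT NULL
);

CREATE TABLE TournamentWinner (
    Tournament VARCHAR(100) NOT NULL,
    Year       INT          NOT NULL,
    WinnerID   INT          NOT NULL REFERENCES Winner (WinnerID),
    PRIMARY KEY (Tournament, Year)
);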
What if an entry comes in for the same winner with a different date of birth? That is possible, so how do we prevent it?
From the source:
Because each row in the table needs to tell us who won a particular
Tournament in a particular Year, the composite key {Tournament, Year}
is a minimal set of attributes guaranteed to uniquely identify a row. That is, {Tournament, Year} is a candidate key for the table.
If relation R adds the same winner name with a different date of birth, it will create another unique record in the table, but that should not be allowed. We want unique records, yet this shows that the same winner can exist in the table with two different dates of birth.
Even if we think of duplication of dates of birth (for winners), we can split that table into another table and store {winner, winner date of birth} to prevent duplication, like the wiki has shown.
From the referenced article:
as there is nothing to stop the same person from being shown with
different dates of birth on different records.
That's why we need to create another table, to prevent this duplication.

Determining the functional dependencies of a relationship and their normal forms

I'm studying for a database test, and in the study guide there are some (many) exercises on DB normalization and functional dependencies, but the teacher did not work through any similar exercise, so I would like someone to help me understand this one so I can attack the other 16 problems.
1) Given the following logical schema:
Relationship product_sales
POS    Zone    Agent  Product_Code  Qualification  Quantity_Sold
123-A  Zone-1  A-1    P1            8              80
123-A  Zone-1  A-1    P1            3              30
123-A  Zone-1  A-2    P2            3              30
456-B  Zone-1  A-3    P1            2              20
456-B  Zone-1  A-3    P3            5              50
789-C  Zone-2  A-4    P4            2              20
Assuming that:
• Points of Sale are grouped into Zone.
• At each Point of Sale there are agents.
• Each agent operates in a single POS.
• Two agents of the same point of sale cannot market the same product.
• For each product sold by an agent, it is assigned a Qualification depending on the product and
the quantity sold.
a) Indicate 4 functional dependencies present.
b) What is the normal form of this structure?
To get you started finding the 4 functional dependencies, think about which attributes depend on another attribute:
eg: does the Zone depend on the POS? (if so, POS -> Zone) or does the POS depend on the Zone? (in which case Zone -> POS).
Four of your five statements tell you something about the dependencies between attributes (or combinations of several attributes).
As for normalisation, there's a (relatively) clear tutorial here. The phrase "the key, the whole key, and nothing but the key" is also a good way to remember the 1st, 2nd and 3rd normal forms.
In your comment, you said
Well, according to the theory I've read I think it may be, but I have many doubts: POS → Zone, {POS, Agent} → Zone, Agent → POS, {Agent, Product_code, Quantity_Sold} → Qualification
I think that's a good effort.
I think POS->Zone is right.
I don't think {POS, Agent} → Zone is quite right. If you look at the sample data, and you think about it a bit, I think you'll find that Agent->POS, and that Agent->Zone.
I don't think {Agent, Product_code, Quantity_Sold} → Qualification is quite right. The requirement states "For each product sold by an agent, it is assigned a Qualification depending on the product and the quantity sold." The important part of that is "a Qualification depending on the product and the quantity sold". Qualification depends on product and quantity, so {Product_code, Quantity}->Qualification. (Nothing in the requirement suggests to me that the qualification might be different for identical orders from two different agents.)
So based on your comment, I think you have these functional dependencies so far.
POS->Zone
Agent->POS
Agent->Zone
Product_code, Quantity->Qualification
But you're missing at least one that has a significant effect on determining keys. Here's the requirement.
Two agents of the same point of sale cannot market the same product.
How do you express the functional dependency implied in that requirement?
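As a general sanity check (not the answer to that last question), you can test any proposed FD against the sample rows with a query of this shape, assuming the data is loaded into a hypothetical table named product_sales with the columns shown above; an empty result means the sample data does not contradict the FD:
-- Does POS -> Zone hold in the sample data?
select POS
from product_sales
group by POS
having count(distinct Zone) > 1;  -- any row returned is a violation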

Resources