Non trivial functional dependency... Not certain here - database

Question:
Consider the following relation:
Order (Product_ID, Product_Name, Customer_ID, Customer_Name, Date,
Item_Price, Amount, Total)
OBS: If the same customer sends multiple orders on the same day they are combined. Therefore, we only have one order per customer per day.
Determine all non-trivial functional dependencies of this relation and provide your reasoning.
I do understand what non trivial FD's are. Basically if X -> Y and Y is not a subset of X, then we have an FD. So when it in this questions asks me to write down all of non trivial FD's, i'm uncertain in regards to how many there really are. So i'd like to mention some that i think are non trivial FD's and then i would appreciate some corrections or if someone can add the ones im missing with explanation on why that's a non trivial FD that would be nice too.
product_id -> product_name..
if we know product id we know the product name
product_id -> item_price...
if we know product id we also know item price(?)
product_id -> item_price, product_name...
if we know product id we know both item price and product name..
product_id, item_price -> product_name...
in this one it's unnecessary to have item price, but it's still a non trivial fd i would assume?
Would these be considered valid non trivial FD's?
there seems to be endless (almost) combinations.. do need to mention them all? also is product name a primary key? if it is, what does that mean then?

Related

How to relate a product dimension with a sales fact

I have been studying datawarehouse in the last couple days, particularly, i have been reading The Data Wharehouse Toolkit - The Definitive Guide to Dimensional Modeling by Kimball and Ross.
Uppon that reading, i came to the 1st exapmle where there is a sales fact and it related to a product dimension, as you can see in the bellow image:
I think i can grasp the gist of how this relationship allows us to rotate the "cube" slicing and dicing data, however this is where i get lost:
In this example and many others on the web product is a one-to-one relationship with sales, which is fine i guess for most cases. But this generates a sales registtry for at least each kind of product that was in one sale.
So supposing i bought 1 banana, 2 apples and 1 orange, this would yield at least 3 sales registry. Again, which is fine i guess as it is storing the sale's ticket ID in the sales fact, we still can relate all itens in a given sale.
However if this was an use case: relate products on sales say i want to get every sale that had a banana and get stuff like: how many items each of these sales had, their price cost, their profit, stuff like that...
Wouldn't be better if the fact-product relation were Fact-one_to_many-Product relationship? Where fact would hold the sale's ticket ID and products would have its foreign key referencing where they are from or something?
I reckon these metrics should be in the fact table, and not in the product table as i think i would want. So, is this me not fighting my urge to normalize it or does it make sense in the way i would want to do that kind of filtering -> [given all sales with X product, get data from other products in the same sale].
If i were to follow the guidelines, product dimension would have one registry for every exclusive kind of product the store would have correct? And all the measurements i want i would store it on the fact itself, like price cost, sales price, profit, etc...
On the other hand, if i were to one-to-many product dimension would have many copies of each product. Which is bad, i think. However, i think it would give me better queries in that regard.
As you can see, i'm a beginer and really in the early stages of this path, so if you would endulge me in a Explain Like I'm Five kind of answer I would appreciate.
EDITED:
Sorry #Nick.McDermaid, you are right. I meant from the perspective of the sales fact where for every sale fact i will have only one product, but are correct that for one product it can have N sales related. And so, we have one record of product in the database for every different product on our store. This is the right way to do it, how to rightfully model it. Also, the many indicator is the "sales quantity" i'm guessing.
Anyhow, while this allows for slicing and dicing when/if we have sales as the point of view, but what if i want to for example:
Get all sales that had a banana in it, with all the other items in those sales. We can still do it with this structure but its harder than if the products were repeated and we had the sale id as a foreign key in the product table.
Cuz ultimetly i want to get all the sales(and products within that sale) that had a banana. And then take metrics out of them.
What you are somewhat hinting at would be a degenerate dimension, consisting of the sales id/invoice #/purchase order # of the transaction that took place. The whole purpose of a degenerate dimension is to group items that are related by a meaningless piece of data. For example, a PO # of A1234 is meaningless on its own, it doesn't tell you anything about the purchase. However, it can be used to identify other meaningful data, such as the date of purchase of the products for the customer. In that context, the PO # is defined by the collection of the entities it brings together to describe an event.
Another critical concept in data-warehousing is the abstraction of the schema in the database from the model in the cube. You don't join and group data in a cube model. You slice and filter. There are no foreign keys in a cube model. Those are used in the underlying data schema, but all of that work is handled behind the scenes of the cube model.

database work flow analysis

I have analyzed my as below and I have two point I am confused about
1- currently when I insert an items within orders I give order_ID as a PK
so each item within the order as its own PK... and same customer_id (FK) for all Item within the order so ... in that case the invoice number is same as the customer_ID
Is that what should think goes on or there is something wrong with this work flow ?
2-in some cases I don't need to record the customer information I just want to
insert the orders without their customer info.. I don't have clear idea about how think should happen :S
3- IF I want to apply a discount on some customer orders where the discount should I allow user to able to apply on the item per orders level ?
or on the whole order ? and where the discount column should be stored
1 - This design seems fine for the problem. You are stating that for every product ordered, this is the customer line. You can pick the ID, name, tax/fiscal number, address, etc.
2 - If you don't need any kind of customer information, make the customer_id on the Orders table accept NULL. It is the cleanest way to do it.
If for some reason NULLs are not an option or you want to keep on the database some basic anonymous customer data, you can create a line on customers for anonymous users (Ex.: ID: 1 / Name: ANONYMOUS ...) and place that ID on the order line.
3 - Placing the discount per product ordered might be the better idea.
If you want to apply a discount for the complete order, you just need to place that discount on every single order line.
If you want to apply a discount for a single product, you just need to place that discount on that product line.
If you want to apply a discount for a single product with a limited ammount of quantity (Ex.: Discount of 50% but limited to 1 purchase), and the customers buys more than that limit, you just need to place 2 orders for the same product. One with the discount and max quantity and the other without discount and the rest of the quantity.
Placing this on the order level wouldn't work for single product discounts.
This were answers to your questions. I would also question your design in points like:
If the customer changes address, is your order supposed to change too? If not, you should use the normal approach for orders, which is to have a order header table, with fixed information, like the address and the foreign key to the customer, and a order line table, like your orders line, but with a foreign key to the order header. That also avoids repetition.
Are you expecting up to 2 phone numbers? If you do, the design is ok. If you don't know how many phone numbers to expect, maybe a table with phone numbers, with foreign keys to the customer might be a better approach.

A product has one price, a price has one product - hasone nhibernate relationship?

I was a little miffed about the one-to-one relationship explanation on the 'I Think You Mean A Many To One' article.
In this instance for example, a product has one price because the business in question is small, niche, localized and supports only a single currency. Multiple prices per product make no sense in this case? I'm doubtful I'm grasping the concept correctly though, because everywhere I read says it will probably be a many-to-one even if you think it isn't?
Can somebody enlighten me please? :)
In an attempt to gain more reputation so that I can help in comments instead of an "answer" The one-to-many vs one-to-one is this
View a one-to-one as an extension of the table you are looking at.
Table B extends Table A. Meaning the information wasn't necessarily relevant enough to include in the table directly, but has a bidirectional relationship with each other. Basically meaning that As Table A, I am not dependent on the information in Table B, but Table B's information is very dependent on me. For the price example it means that Table A has a row related to a row in table B. So if you entering unique information in your Price table around every item to match in Table A, then this would be useful. As in say you had a description column about the item in your price table. Otherwise the price table in this case may just be irrelevant to have in the schema.
in a one-to-many relationship Table B usually has no reference back to Table A. So in the case of price, the items you are looking at do have a price, but prices aren't exclusive to items. So to better define, A number of things may have the price 9.99, but 9.99 only needs to exist in your pricing table once.
I am not familiar with the article you refer to. However, price is a classic example of a slowly changing dimension. Price may be constant at any point in time, but over time, the price changes.
Such dimensions are typically implemented by having effective and end dates for the period in question.
Now, at a given point in time, a product probably does have only one price. Things that affect the price -- coupons, discounts for the purchaser, volume discounts, for example -- are not properties of the product. These are properties of the transaction.
That said, there may be circumstances where a fixed volume discount does not make sense. So, the "price" for a product might include volume, as well as time.
In any case, I would agree with you that price is not a good example of a 1-1 relationship. There are other factors such as time and volume that affect it.

Database design for a product voting system

I'm am making a system where a user can vote up or down on a product, I need to be able to explicitly work out the amount of ups and downs a product has, as well as a total score for a recent period.
Each vote can optionally have a comment with it, and users need the ability to echo/boost other peoples comments (kinda like a retweet), and this will also add/subtract the total score of the product depending on the parent vote being retweeted.
Here are my current proposed tables:
Product
ID, name, category_id
Vote
ID, user_id, product_id, parent_id, comment, score, datetime
User
ID, username etc.
I am thinking I will possibly need a comments table to do this effectively? The votes' score field is either 1 or -1 as per some advice I read on StackOverflow which would allow me to gather the SUM() of that column to calculate total votes, another possibility would be to have separate vote_up and vote_down tables...but I am just not sure.
Depending on what you want to do, this can be an incredibly sophisticated problem, but here's my take on the simplest way (eg. what i can throw together in the 10 min before I leave work ;-P)
I would try the StackOverflow/HotOrNot style approach, and Store their ranking as an unsigned integer.
PRODUCTS(
id,
category_id,
name,
rating INTEGER UNSIGNED NOT NULL DEFAULT 0
);
Then in your 'VOTES' table, you store the Vote (up/down). I think the table you have for your 'VOTES' table looks fine( although I would use either an enumeration as the SCORE datatype, or some strategy to ensure that a vote can't be manipulated via XSS. eg. someone modifies the vote so that their vote up is +10,000 instead of +1, then that would not be cool )
For a small fun app, you can probably get by with incrementing or decrementing the count when the user clicks, but if you are doing anything with aspirations of scaling out, then you would do the vote calculation and ranking via some batch process that runs every 10-15 minutes.
Also at this level, you would start using an algorithm to weight the vote values. For example, if the same user votes (up or down) the same product more than once a day(or once every) then the votes after the first should not count towards calculating the rank of the product.
For Example, here is how Quora's Ranking Algorithm works
If the user is a "Power User" or has an account that is more active, maybe their vote is more important than a new users vote. I think on Yelp, if you don't have more than one or two reviews, your rating and reviews don't get counted until you meet some minimum number of reviews. Really, the skies the limit.
PS. I would also recommend checking out this o'Reilly book on some of the strategies for solving these kinds of problems
If you expect a large number of users simultaneously voting, you really need to consider performance....
A naive approach might look something like this (my apologies if i oversimplify your example, and if T-SQL isn't your poison):
create table Products(
ProductId BIGINT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
Score INT NOT NULL...
ProductDetails...
where you will be performing updates to Products by summing up/down vote tables. BAD!
If you have a large number of users voting, deadlocks are sure to occur by constantly inserting/updating/selecting against the same table.
A better approach would be to drop the Score column altogether, only insert into the up/down vote tables, and select as needed. There's no reason you can't calculate the sum in code (i.e. PHP, C#, or whatever), and it avoids ever having to update the Products table (at least for calculating the Score). In other words, storing the Score on the Products table buys you nothing, and is just unnecessary overhead.
I speak from experience when I say updates "can" be bad in high volume systems. Updating is expensive when compared to inserting at the end or selects (assuming your table is properly indexed), and it's very easy to unknowingly take out a substantial lock in situations like this.
Book Beginning CakePHP has an Ajax tutorial on implementing a working voting (up/down) system on comments. I did the tutorial a few years ago. I am not sure how secure it is or if it would be a good foundation for your project, but it would probably be worth having a look at for some ideas.

What is the convention for designating the primary relationship in a one-to-many relation between tables?

I understand how to design a database schema that has simple one-to-many relationships between its tables. I would like to know what the convention or best practice is for designating one particular relationship in that set as the primary one. For instance, one Person has many CreditCards. I know how to model that. How would I designate one of those cards as the primary for that person? Solutions I have come up with seem inelegant at best.
I'll try to clarify my actual situation. (Unfortunately, the actual domain would just confuse things.) I have 2 tables each with a lot of columns, let's say Person and Task. I also have Project which has only a couple of properties. One Person has many Projects, but has a primary Project. One Project has many Tasks, but sometimes has one primary Task with alternates, and other times has no primary task and instead a sequence of Tasks. There are no Tasks that are not part of a Project, but it isn't strictly forbidden.
PERSON (PERSON_ID, NAME, ...)
TASK (TASK_ID, NAME, DESC, EST, ...)
PROJECT (NAME, DESC)
I can't seem to figure a way to model the primary Project, primary Task, and the Task sequence all at the same time without introducing either overcomplexity or pure evil.
This is the best I've come up with so far:
PERSON (PERSON_ID, NAME, ...)
TASK (TASK_ID, NAME, DESC, EST, ...)
PROJECT (PROJECT_ID, PERSON_FK, TASK_FK, INDEX, NAME, DESC)
PERSON_PRIMARY_PROJECT (PERSON_FK, PROJECT_FK)
PROJECT_PRIMARY_TASK (PROJECT_FK, TASK_FK)
It just seems like too many tables for a simple concept.
Here's a question I've found that deals with a very similar situation: Database Design: Circular dependency.
Unfortunately, there didn't seem to be a consensus about how to handle the situation, and the "correct" answer was to disable the database consistency checking mechanism. Not cool.
Well, it seems to me that a Person has two relationships with a CreditCard. One is that the person owns it, and the other is that they consider it their primary CreditCard. That tells me you have a one-to-one and a one-to-many relationship. The return relationship for the one-to-one is already in the CreditCard because of the one-to-many relationship its in.
This means I'd add primary_cc_id as a field in Person and leave CreditCard alone.
Two strategies:
Use a bit column to indicate the preffered card.
Use a PrefferedCardTable associating each Person with the ID of its preffered card.
One person can have many credit cards; Then you'd need an identifier on each credit card to actually link that specific credit card to one individual - which I assume you've already made in your model (some kind of ID that links the person to that credit card).
Primary credit card (I assume you mean e.g. as a default credit card?) That would have to be some sort of manual operation (e.g. that you have a third table, that links them together and a column that specifies which one would be the default).
Person (SSN, Name)
CreditCard (CCID, AccountNumber)
P_CC (SSN, CCID, NoID)
So that would mean that if you connect a person to a credit card, you'd have to specify the NoID, as say '1', then design your query to per default find the credit card that belongs to this individual with NoID '1'.
This is of course just one way of doing it, maybe you'd want to limit by 0, 1 - and then sort them by the date the credit card was added to that person.
Maybe if you'd elaborate and give more information about your columns and ideas it'd make it easier.
So here what I tried out with Northwind and C# Windows App ,and I had just one query executed.
My Code:
DataClasses1DataContext context = new DataClasses1DataContext();
DataLoadOptions dlo = new DataLoadOptions();
dlo.LoadWith<Product>(b => b.Category);
context.LoadOptions = dlo;
context.DeferredLoadingEnabled = false;
context.Log = Console.Out;
List<Product> test = context.Products.ToList();
MessageBox.Show(test[0].Category.CategoryName);
Result:
SELECT [t0].[ProductID], [t0].[ProductName], [t0].[SupplierID], [t0].[CategoryID], [t0].[QuantityPerUnit], [t0].[UnitPrice], [t0].[UnitsInStock], [t0].[UnitsOnOrder], [t0].[ReorderLevel], [t0].[Discontinued], [t2].[test], [t2].[CategoryID] AS [CategoryID2], [t2].[CategoryName], [t2].[Description], [t2].[Picture]
FROM [dbo].[Products] AS [t0]
LEFT OUTER JOIN (
SELECT 1 AS [test], [t1].[CategoryID], [t1].[CategoryName], [t1].[Description], [t1].[Picture]
FROM [dbo].[Categories] AS [t1]
) AS [t2] ON [t2].[CategoryID] = [t0].[CategoryID]

Resources