When to use columns and when a separate table?

Most people don't recommend EAV and I know some of the reasons.
However, what is the difference between an EAV approach and the following approach?
Table computer:
id, price, description
Table connections:
id, name (possible values: LAN, USB, HDMI, ...; about 10 in all)
Table connections_computer
comp_id, conn_id
Or is that EAV, too? If yes, what would be a normalized alternative?
Consider that I want to do searches like this:
All computers that have BOTH a LAN and an HDMI connection. With the junction table I would need one join per filtered attribute; if each connection type were its own column, searching would be easy, but I would have many NULL values.
Any recommendations on how to do this?

Your example is a plain many-to-many relationship. In EAV, an attribute name is a value in a column instead of being the name of a column. For example,
computer: {computer_id, attr_name, attr_value}
insert into computer (computer_id, attr_name, attr_value)
values
(1, 'connection', 'HDMI')
, (1, 'connection', 'USB')
, (1, 'memory', '2 GB')
;
would be an EAV approach.
Here is an example of the price you pay for the EAV approach (flexibility) in an RDBMS.
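For the search from the question (computers that have BOTH a LAN and an HDMI connection), the many-to-many schema needs just the two joins plus a HAVING count; a sketch using the table names above:
SELECT c.id, c.price, c.description
FROM computer c
JOIN connections_computer cc ON cc.comp_id = c.id
JOIN connections co ON co.id = cc.conn_id
WHERE co.name IN ('LAN', 'HDMI')
GROUP BY c.id, c.price, c.description
HAVING COUNT(DISTINCT co.name) = 2;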

Related

How to implement many-to-many-to-many database relationship?

I am building a SQLite database and am not sure how to proceed with this scenario.
I'll use a real-world example to explain what I need:
I have a list of products that are sold by many stores in various states. Not every store sells a particular product at all, and those that do may only sell it in one state or another. Most stores sell a product in most states, but not all.
For example, let's say I am trying to buy a vacuum cleaner in Hawaii. Joe's Hardware sells vacuums in 18 states, but not in Hawaii. Walmart sells vacuums in Hawaii, but not microwaves. Burger King does not sell vacuums at all, but will give me a Whopper anywhere in the US.
So if I am in Hawaii and search for a vacuum, I should only get Walmart as a result. While other stores may sell vacuums, and may sell in Hawaii, they don't do both but Walmart does.
How do I efficiently create this type of relationship in a relational database (specifically, I am currently using SQLite, but need to be able to convert to MySQL in the future).
Obviously, I would need tables for Product, Store, and State, but I am at a loss on how to create and query the appropriate join tables...
If I, for example, query a certain Product, how would I determine which Store would sell it in a particular State, keeping in mind that Walmart may not sell vacuums in Hawaii, but they do sell tea there?
I understand the basics of 1:1, 1:n, and M:n relationships in RD, but I am not sure how to handle this complexity where there is a many-to-many-to-many situation.
If you could show some SQL statements (or DDL) that demonstrates this, I would be very grateful. Thank you!
An accepted and common way is to use a table that has a column referencing the product and another referencing the store. There are many names for such a table: reference table, associative table, mapping table, to name some.
You want these references to be efficient, so reference by a number that uniquely identifies what it is referencing. In SQLite, by default, every table has a special column, normally hidden, that is exactly such a unique number. It is the rowid, and it is typically the most efficient way of accessing rows, as SQLite has been designed with this common usage in mind.
SQLite also allows you to create a column that is an alias of the rowid: you simply declare the column as INTEGER PRIMARY KEY, and typically you'd name that column id.
So, utilising these, the reference table would have a column for the product's id and another for the store's id, catering for every combination of product/store.
As an example, three tables are created (stores, products and a reference/mapping table), the first two being populated using :-
CREATE TABLE IF NOT EXISTS _products(id INTEGER PRIMARY KEY, productname TEXT, productcost REAL);
CREATE TABLE IF NOT EXISTS _stores (id INTEGER PRIMARY KEY, storename TEXT);
CREATE TABLE IF NOT EXISTS _product_store_relationships (storereference INTEGER, productreference INTEGER);
INSERT INTO _products (productname,productcost) VALUES
('thingummy',25.30),
('Sky Hook',56.90),
('Tartan Paint',100.34),
('Spirit Level Bubbles - Large', 10.43),
('Spirit Level bubbles - Small',7.77)
;
INSERT INTO _stores (storename) VALUES
('Acme'),
('Shops-R-Them'),
('Harrods'),
('X-Mart')
;
The resultant tables at this point: _products and _stores contain the rows inserted above, while _product_store_relationships is still empty.
Placing products into stores (for example) could be done using :-
-- Build some relationships/references/mappings
INSERT INTO _product_store_relationships VALUES
(2,2), -- Sky Hooks are in Shops-R-Them
(2,4), -- Sky Hooks in X-Mart
(1,3), -- thingummys in Harrods
(1,1), -- and Acme
(1,2), -- and Shops-R-Them
(4,4), -- Spirit Level Bubbles - Large in X-Mart
(5,4), -- Spirit Level bubbles - Small in X-Mart
(3,3) -- Tartan Paint in Harrods
;
The _product_store_relationships table then contains the eight mappings above.
A query such as the following would list the products in stores sorted by store and then product :-
SELECT storename, productname, productcost FROM _stores
JOIN _product_store_relationships ON _stores.id = storereference
JOIN _products ON _product_store_relationships.productreference = _products.id
ORDER BY storename, productname
;
The resultant output lists each store alongside every product it stocks, one row per store/product pairing.
The following query will only list rows where the product name contains an s or S (LIKE is case-insensitive for ASCII characters by default in SQLite), the output being sorted according to productcost in ascending order, then storename, then productname :-
SELECT storename, productname, productcost FROM _stores
JOIN _product_store_relationships ON _stores.id = storereference
JOIN _products ON _product_store_relationships.productreference = _products.id
WHERE productname LIKE '%s%'
ORDER BY productcost,storename, productname
;
The output then contains only the Spirit Level Bubbles and Sky Hook rows, cheapest first.
Expanding the above to consider states, two new tables are added: _states and _store_state_references.
(Strictly there is no real need for a reference table here, since a store would normally be in only one state; but if you consider a chain of stores to be a single store, this design copes with that too.)
The SQL could be :-
CREATE TABLE IF NOT EXISTS _states (id INTEGER PRIMARY KEY, statename TEXT);
INSERT INTO _states (statename) VALUES
('Texas'),
('Ohio'),
('Alabama'),
('Queensland'),
('New South Wales')
;
CREATE TABLE IF NOT EXISTS _store_state_references (storereference INTEGER, statereference INTEGER);
INSERT INTO _store_state_references VALUES
(1,1),
(2,5),
(3,1),
(4,3)
;
If the following query were run :-
SELECT storename,productname,productcost,statename
FROM _stores
JOIN _store_state_references ON _stores.id = _store_state_references.storereference
JOIN _states ON _store_state_references.statereference =_states.id
JOIN _product_store_relationships ON _stores.id = _product_store_relationships.storereference
JOIN _products ON _product_store_relationships.productreference = _products.id
WHERE statename = 'Texas' AND productname = 'Sky Hook'
;
The output would be empty, as no store in Texas stocks a Sky Hook.
Without the WHERE clause, every store/product/state combination would be listed.
The following would give Shops-R-Them a presence in all of the states :-
INSERT INTO _store_state_references VALUES
(2,1),(2,2),(2,3),(2,4)
;
Now the Sky Hook in Texas query returns a single row: Shops-R-Them, Sky Hook, 56.9, Texas.
Note: this just covers the basics of the topic.
You will need to create a combined mapping table of products, states and stores, e.g. tbl_product_states_stores, which will store the mapping of product, state and store. The columns will be id, product_id, state_id, stores_id.
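A minimal sketch of that combined table and a query against it (SQLite syntax; the product/state values in the WHERE clause are just illustrative):
CREATE TABLE IF NOT EXISTS tbl_product_states_stores (
    id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES _products(id),
    state_id INTEGER REFERENCES _states(id),
    stores_id INTEGER REFERENCES _stores(id)
);
-- which stores sell a given product in a given state?
SELECT storename
FROM tbl_product_states_stores
JOIN _stores ON _stores.id = tbl_product_states_stores.stores_id
JOIN _products ON _products.id = tbl_product_states_stores.product_id
JOIN _states ON _states.id = tbl_product_states_stores.state_id
WHERE productname = 'Sky Hook' AND statename = 'Texas';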

Best way to use compound Index to query with multiple combination of query parameters?

I am building functionality to estimate inventory for my ad-serving platform. The fields I am trying to estimate on, with their cardinalities, are as below:
FIELD: CARDINALITY
location: 10000 (bengaluru, chennai etc..)
n/w speed : 6 (w, 4G, 3G, 2G, G, NA)
priceRange : 10 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
users: contains number of users falling under any of the above combination.
Ex. {'location':'bengaluru', 'n/w':'4G', priceRange:8, users: 1000}
means 1000 users are from bengaluru having 4G and priceRange = 8
So the total number of combinations can be 10000 * 6 * 10 = 600000, and in the future more fields can be added, up to around 29 (currently it is 3: location, n/w, priceRange), so the total number of combinations can reach the order of 10 mn. I want to estimate how many users fall under any given combination of these fields.
The queries I will need are as follows:
1) find all users who are from location:bengaluru , n/w:3G, priceRange: 6
2) find all users from bengaluru
3) Find all users falling under n/w: 3G and priceRange: 8
What is the best possible way to approach this?
Which database would be best suited for this requirement? What indexes do I need to build? Will a compound index help? If yes, then how? Any help is appreciated.
Here's my final answer:
Create table Attribute(
ID int,
Name varchar(50));
Create table AttributeValue(
ID int,
AttributeID int,
Value varchar(50));
Create table UserAttributeValue(
UserID int,
AttributeID int, -- references Attribute.ID
AttributeValue int); -- holds the ID of the chosen AttributeValue row
Create table User(
ID int);
Insert into user (ID) values (1),(2),(3),(4),(5);
Insert into Attribute (ID,Name) Values (1,'Location'),(2,'nwSpeed'),(3,'PriceRange');
Insert into AttributeValue values
(1,1,'bengaluru'),(2,1,'chennai'),
(3,2, 'w'), (4, 2,'4G'), (5,2,'3G'), (6,2,'2G'), (7,2,'G'), (8,2,'NA'),
(9,3,'1'), (10,3,'2'), (11,3,'3'), (12,3,'4'), (13,3,'5'), (14,3,'6'), (15,3,'7'), (16,3,'8'), (17,3,'9'), (18,3,'10');
Insert into UserAttributeValue (userID, AttributeID, AttributeValue) values
(1,1,1),
(1,2,5),
(1,3,9),
(2,1,1),
(2,2,4),
(3,2,6),
(2,3,13),
(4,1,1),
(4,2,4),
(4,3,13),
(5,1,1),
(5,2,5),
(5,3,13);
Select USERID
from UserAttributeValue
where (AttributeID,AttributeValue) in ((1,1),(2,4))
GROUP BY USERID
having count(distinct concat(AttributeID,AttributeValue))=2
Now if you need a count rather than the list of IDs, wrap the query above as a derived table and count the userIDs it returns. (Equivalently, if you count the matching UserAttributeValue rows without grouping, you'd need to divide by the number of attributes passed in, since each user has one record per attribute.)
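For example, a sketch of the wrapped count (MySQL-style syntax, same filter as above):
Select count(*) as matching_users
from (
    Select UserID
    from UserAttributeValue
    where (AttributeID, AttributeValue) in ((1,1),(2,4))
    group by UserID
    having count(distinct concat(AttributeID, AttributeValue)) = 2
) as matched;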
This allows for N growth of Attributes and the AttributeValues per user without changes to UI or database if UI is designed correctly.
By treating each datapoint as an attribute and storing them in one place, we can enforce database integrity.
The Attribute and AttributeValue tables become lookups for UserAttributeValue, so you can translate the IDs back to the attribute name and its value.
This also means we only have 4 tables: User, Attribute, AttributeValue, and UserAttributeValue.
Technically you don't have to store attributeID on the userAttributeValue, but for performance reasons on later joins/reporting I think you'll find it beneficial.
You need to add proper primary keys, foreign keys, and indexes to the tables. They should be fairly self-explanatory. On UserAttributeValue I would have a few composite indexes, each with a different column order for the unique key. It just depends on the type of reporting/analysis you'll be doing, but adding keys as performance tuning becomes necessary is commonplace.
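A sketch of what those keys and indexes could look like (MySQL-style syntax; the index choices are illustrative, not prescriptive):
Alter table Attribute add primary key (ID);
Alter table AttributeValue add primary key (ID);
Alter table User add primary key (ID);
Alter table AttributeValue add foreign key (AttributeID) references Attribute(ID);
Alter table UserAttributeValue add foreign key (UserID) references User(ID);
Alter table UserAttributeValue add foreign key (AttributeID) references Attribute(ID);
Alter table UserAttributeValue add foreign key (AttributeValue) references AttributeValue(ID);
-- composite indexes covering the common filter patterns
create index ix_uav_attr_value_user on UserAttributeValue (AttributeID, AttributeValue, UserID);
create index ix_uav_user_attr_value on UserAttributeValue (UserID, AttributeID, AttributeValue);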
Assumptions:
You're ok with all datavalues being varchar data in all cases.
If needed, you could add a datatype, precision, and scale on the Attribute table and allow the UI to cast the attribute value as needed; but since the values all live in the same field in the database, they all have to be the same datatype and of the same precision/scale.
Pivot queries to display the data across columns will likely be needed, and you know how to handle those (and your engine supports them!).
Gotta say I loved the mental exercise, but I'd still appreciate feedback from others on SO. I've used this approach in one system I've developed, and it's been in two I've supported. There are some challenges, but it does follow 3rd normal form db design (except for the replicated AttributeID in UserAttributeValue, but that's there for performance gains in reporting/filtering).

Designing Organizational structure

This question was asked in an interview. Design an organizational structure where an employee can have direct reports and indirect reports (that is, reportees of reportees). The design should be such that a single query can retrieve either direct or indirect reportees, or both.
I suggested,
Employee
----------
id
name
Reportee
------
emp_id FK
reportee_id FK
isDirect
The interviewer said the optimal solution is
Employee
-------
id
name
reporting_path like (a>b>c)
Adding an additional table takes more space, but the query will execute faster. I said that, due to the string matching, the path-based approach is bad and yields poor performance.
So which approach is optimal?
The interviewer's approach is dumb because it does not use referential integrity.
For a purely hierarchical model (an employee cannot report to more than one boss), this is the best approach:
create table employees (
employee_id int primary key,
name varchar(whatever) not null,
supervisor_id int null references employees(employee_id)
);
insert into employees (employee_id, name, supervisor_id) values
(1, 'Big Boss Bill', null),
(2, 'Vice President Victor', 1),
(3, 'Underling Ulysses', 2),
(4, 'Subordinate Sam', 2);
You can then use Recursive Common Table Expressions to query reports.
Some example queries here:
http://blog.databasepatterns.com/2014/02/trees-paths-recursive-cte-postgresql.html
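As a sketch of such a query (PostgreSQL/standard SQL syntax; it lists all direct and indirect reports of employee 1):
with recursive reports as (
    -- direct reports of the starting employee
    select employee_id, name, supervisor_id, 1 as depth
    from employees
    where supervisor_id = 1
    union all
    -- reports of the employees already found, i.e. indirect reports
    select e.employee_id, e.name, e.supervisor_id, r.depth + 1
    from employees e
    join reports r on e.supervisor_id = r.employee_id
)
select * from reports;
-- add "where depth = 1" for direct reports only, or "where depth > 1" for indirect only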

Custom Fields with SQL Server 2008

I have several entities that I require users be able to add custom fields to.
If I had an entity called customer with base variables like {Name, DateOfBirth, StoreId}
and another one called Store with {Name}
Then I would want it so that the owner of that store could login and add a new variable for all their customers called favourite colour which is a dropdown with red, green or blue as options.
Now I have had a look at EAV and come up with a solution that looks like this
Attribute {StoreId, Name, DataType},
Value {AttributeId, EntityName, EntityId, Value}
I'm wondering whether there is some solution that will work best for SQL Server 2008, especially given that I'll want to be able to view and query this information easily.
I've heard that you can query within the xml datatype. Is that a better way to go?
I will also probably want users to be able to add custom fields that are foreign keys at some point too.
Will be looking at this all day so will ask questions quickly.
EAV in general is an anti-pattern that results in dismal performance and chokes scalability. If you do decide to go with EAV, the SQL Server Customer Advisory Team has published a white paper covering common pitfalls and problems and how to avoid them: Best Practices for Semantic Data Modeling for Performance and Scalability.
Querying an XML data type is possible in SQL Server, but if your XML has no schema then querying it will be slow. If it has a schema and the schema is EAV, then it will have all the problems of relational EAV plus some XML-specific performance ones of its own. Again, the good folks of the CAT team have published a couple of white papers on the topic: XML Best Practices for Microsoft SQL Server 2005 and Performance Optimizations for the XML Data Type in SQL Server 2005. They are valid for SQL 2008 too.
I've been using the XML features of SQL 2005 / 2008 for a while. I've come to rely on XML columns quite a bit.
What you want to do sounds like the perfect candidate for XML.
For instance, the following snippet defines your 2 entities (@customers and @stores), with a column called "attrs" that can be expanded to include more attributes.
I hope this helps!
declare @customers as table ( id int, attrs xml);
INSERT INTO @customers VALUES
(1,'<Attrs Name="Peter" DateOfBirth="1996-01-25" StoreId="10" />'),
(2,'<Attrs Name="Smith" DateOfBirth="1993-05-02" StoreId="20" />')
;
declare @stores as table ( id int, attrs xml);
insert into @stores VALUES
(10, '<Attrs Name="Store1" />'),
(20, '<Attrs Name="Store2" />')
;
With c as (
select id as CustomerID,
attrs.value('(/Attrs[1])/@Name', 'nvarchar(100)') as Name,
attrs.value('(/Attrs[1])/@DateOfBirth', 'date') as DateOfBirth,
attrs.value('(/Attrs[1])/@StoreId', 'int') as StoreId
from @customers
), s as (
select id as StoreID,
attrs.value('(/Attrs[1])/@Name', 'nvarchar(100)') as Name
from @stores
)
select *
from c left outer join s on (c.StoreId=s.StoreID);
Excellent answers already. I'll only add the suggestion that you maintain metadata for the custom fields as well. This would make a UI for entering the custom fields easier - you'd be able to limit the set of custom fields for a Customer, for instance, and to specify that DateOfBirth is to be a date, and that StoreID is meant to match the ID of an actual store.
Some of this metadata could be maintained as XML schemas. I've seen that done, with the schemas stored in the database, and used to validate custom fields being input. I do not know if those schemas can also be used to strongly-type the XML data.
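As it happens, SQL Server can bind an XML schema collection to an xml column so that values are validated (and typed) on insert; a minimal sketch, with illustrative names:
CREATE XML SCHEMA COLLECTION dbo.CustomerAttrsSchema AS
N'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Attrs">
    <xs:complexType>
      <xs:attribute name="Name" type="xs:string" />
      <xs:attribute name="DateOfBirth" type="xs:date" />
      <xs:attribute name="StoreId" type="xs:int" />
    </xs:complexType>
  </xs:element>
</xs:schema>';
CREATE TABLE dbo.Customer (
    id int PRIMARY KEY,
    attrs xml(dbo.CustomerAttrsSchema) -- typed XML: inserts are validated against the schema
);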

SQL2005: Linking a table to multiple tables and retaining Ref Integrity?

Here is a simplification of my database:
Table: Property
Fields: ID, Address
Table: Quote
Fields: ID, PropertyID, BespokeQuoteFields...
Table: Job
Fields: ID, PropertyID, BespokeJobFields...
Then we have other tables that relate to the Quote and Job tables individually.
I now need to add a Message table where users can record telephone messages left by customers regarding Jobs and Quotes.
I could create two identical tables (QuoteMessage and JobMessage), but this violates the DRY principle and seems messy.
I could create one Message table:
Table: Message
Fields: ID, RelationID, RelationType, OtherFields...
But this stops me from using constraints to enforce my referential integrity. I can also foresee it creating problems on the development side when using Linq to SQL later on.
Is there an elegant solution to this problem, or am I ultimately going to have to hack something together?
Burns
Create one Message table, containing a unique MessageId and the various properties you need to store for a message.
Table: Message
Fields: Id, TimeReceived, MessageDetails, WhateverElse...
Create two link tables - QuoteMessage and JobMessage. These will just contain two fields each, foreign keys to the Quote/Job and the Message.
Table: QuoteMessage
Fields: QuoteId, MessageId
Table: JobMessage
Fields: JobId, MessageId
In this way you have defined the data properties of a Message in one place only (making it easy to extend, and to query across all messages), but you also have the referential integrity linking Quotes and Jobs to any number of messages. Indeed, both a Quote and Job could be linked to the same message (I'm not sure if that is appropriate to your business model, but at least the data model gives you the option).
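In DDL terms, a sketch of the above (SQL Server syntax; column types are illustrative):
CREATE TABLE Message (
    Id int IDENTITY PRIMARY KEY,
    TimeReceived datetime NOT NULL,
    MessageDetails nvarchar(max) NOT NULL
);
CREATE TABLE QuoteMessage (
    QuoteId int NOT NULL REFERENCES Quote(ID),
    MessageId int NOT NULL REFERENCES Message(Id),
    PRIMARY KEY (QuoteId, MessageId)
);
CREATE TABLE JobMessage (
    JobId int NOT NULL REFERENCES Job(ID),
    MessageId int NOT NULL REFERENCES Message(Id),
    PRIMARY KEY (JobId, MessageId)
);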
About the only other way I can think of is to have a base Message table, with both an Id and a TypeId. Your subtables (QuoteMessage and JobMessage) then reference the base table on both MessageId and TypeId - but also have CHECK CONSTRAINTS on them to enforce only the appropriate MessageTypeId.
Table: Message
Fields: Id, MessageTypeId, Text, ...
Primary Key: Id, MessageTypeId
Unique: Id
Table: MessageType
Fields: Id, Name
Values: 1, "Quote" : 2, "Job"
Table: QuoteMessage
Fields: Id, MessageId, MessageTypeId, QuoteId
Constraints: MessageTypeId = 1
References: (MessageId, MessageTypeId) = (Message.Id, Message.MessageTypeId)
QuoteId = Quote.QuoteId
Table: JobMessage
Fields: Id, MessageId, MessageTypeId, JobId
Constraints: MessageTypeId = 2
References: (MessageId, MessageTypeId) = (Message.Id, Message.MessageTypeId)
JobId = Job.JobId
What does this buy you, as compared to just a JobMessage and QuoteMessage table? It elevates a Message to a first-class citizen, so that you can read all Messages from a single table. In exchange, your query path from a Message to its relevant Quote or Job is one more join away. It kind of depends on your app flow whether that's a good tradeoff or not.
As for 2 identical tables violating DRY - I wouldn't get hung up on that. In DB design, it's less about DRY, and more about normalization. If the 2 things you're modeling have the same attributes (columns), but are actually different things (tables) - then it's reasonable to have multiple tables with similar schemas. Much better than the reverse of munging different things together.
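For completeness, a DDL sketch of this supertype layout (SQL Server syntax, following the field lists above):
CREATE TABLE MessageType (
    Id int PRIMARY KEY,
    Name varchar(20) NOT NULL
);
CREATE TABLE Message (
    Id int NOT NULL,
    MessageTypeId int NOT NULL REFERENCES MessageType(Id),
    [Text] nvarchar(max) NOT NULL,
    PRIMARY KEY (Id, MessageTypeId),
    UNIQUE (Id)
);
CREATE TABLE QuoteMessage (
    Id int IDENTITY PRIMARY KEY,
    MessageId int NOT NULL,
    MessageTypeId int NOT NULL CHECK (MessageTypeId = 1), -- 1 = Quote
    QuoteId int NOT NULL REFERENCES Quote(ID),
    FOREIGN KEY (MessageId, MessageTypeId) REFERENCES Message(Id, MessageTypeId)
);
CREATE TABLE JobMessage (
    Id int IDENTITY PRIMARY KEY,
    MessageId int NOT NULL,
    MessageTypeId int NOT NULL CHECK (MessageTypeId = 2), -- 2 = Job
    JobId int NOT NULL REFERENCES Job(ID),
    FOREIGN KEY (MessageId, MessageTypeId) REFERENCES Message(Id, MessageTypeId)
);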
#burns
Ian's answer (+1) is correct [see note]. Using a many-to-many table QUOTEMESSAGE to join QUOTE to MESSAGE is the most correct model, but it will leave orphaned MESSAGE records.
This is one of those rare cases where a trigger can be used. However, caution needs to be applied to ensure that a single MESSAGE record cannot be associated with both a QUOTE and a JOB.
create trigger quotemessage_trg
on QuoteMessage
for delete
as
begin
-- remove the underlying Message rows for the deleted QuoteMessage links
delete
from [Message]
where [Message].[Id] in
(select [MessageId] from Deleted);
end
Note to Ian, I think there is a typo in the table definition for JobMessage, where the columns should be JobId, MessageId (?). I would edit your quote but it might take me a few years to gain that level of reputation!
Why not just have both QuoteId and JobId fields in the message table? Or does a message have to be regarding either a quote or a job and not both?
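If a message must belong to exactly one of the two, that alternative can still keep referential integrity by using two nullable foreign keys plus a CHECK constraint; a sketch (names are illustrative):
CREATE TABLE Message (
    Id int IDENTITY PRIMARY KEY,
    QuoteId int NULL REFERENCES Quote(ID),
    JobId int NULL REFERENCES Job(ID),
    MessageDetails nvarchar(max) NOT NULL,
    -- exactly one parent must be set
    CHECK ((QuoteId IS NOT NULL AND JobId IS NULL) OR (QuoteId IS NULL AND JobId IS NOT NULL))
);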

Resources