Best way to extend information on a relational database - database

Let's say that we have to store information about different types of products in a database. However, these products have different specifications. For example:
Phone: cpu, ram, storage...
TV: size, resolution...
We want to store each specification in a column of a table, and all the products (whatever the type) must have a different ID.
To comply with that, now I have one general table named Products (with an auto increment ID) and one subordinate table for each type of product (ProductsPhones, ProductsTV...) with the specifications and linked with the principal with a Foreign Key.
I find this solution inefficient since the table Products has only one column (the auto incremented ID).
I would like to know if there is a better approach to solve this problem using relational databases.

The short answer is no. The relational model is a first-order logical model, meaning predicates can vary over entities but not over other predicates. That means dependent types and EAV models aren't supported.
EAV models are possible in SQL databases, but they don't qualify as relational since the domain of the value field in an EAV row depends on the value of the attribute field (and sometimes on the value of the entity field as well). Practically, EAV models tend to be inefficient to query and maintain.
PostgreSQL supports shared sequences, which allow you to ensure unique auto-incremented IDs without a common supertype table. However, the supertype table may still be a good idea for FK constraints.
You may find some use for your Products table later to hold common attributes like Type, Serial number, Cost, Warranty duration, Number in stock, Warehouse, Supplier, etc.
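As a rough illustration of the supertype table carrying common attributes, here is a minimal sketch using Python's built-in sqlite3 module. The column choices (Type, Name, Cost and the phone/TV specification columns) and the helper names add_phone and add_tv are assumptions made for the example, not something the question defines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
conn.executescript("""
CREATE TABLE Products (
    Id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- one ID space for every product
    Type TEXT NOT NULL,                      -- assumed common attribute
    Name TEXT NOT NULL,                      -- assumed common attribute
    Cost NUMERIC                             -- assumed common attribute
);
CREATE TABLE ProductsPhones (
    ProductId INTEGER PRIMARY KEY REFERENCES Products(Id),
    Cpu TEXT, Ram INTEGER, Storage INTEGER
);
CREATE TABLE ProductsTV (
    ProductId INTEGER PRIMARY KEY REFERENCES Products(Id),
    Size INTEGER, Resolution TEXT
);
""")

def add_phone(conn, name, cost, cpu, ram, storage):
    """Insert the supertype row first, then the phone-specific row."""
    product_id = conn.execute(
        "INSERT INTO Products (Type, Name, Cost) VALUES ('Phone', ?, ?)",
        (name, cost),
    ).lastrowid
    conn.execute(
        "INSERT INTO ProductsPhones (ProductId, Cpu, Ram, Storage) VALUES (?, ?, ?, ?)",
        (product_id, cpu, ram, storage),
    )
    return product_id

def add_tv(conn, name, cost, size, resolution):
    """Insert the supertype row first, then the TV-specific row."""
    product_id = conn.execute(
        "INSERT INTO Products (Type, Name, Cost) VALUES ('TV', ?, ?)",
        (name, cost),
    ).lastrowid
    conn.execute(
        "INSERT INTO ProductsTV (ProductId, Size, Resolution) VALUES (?, ?, ?)",
        (product_id, size, resolution),
    )
    return product_id

phone_id = add_phone(conn, "CoolPhone", 300, "octa-core", 8, 128)
tv_id = add_tv(conn, "CoolTV", 900, 70, "4K")
```

Because both subtype tables take their key from Products, a phone and a TV can never share an ID, and the foreign key rejects a subtype row that has no supertype row.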

Having a Products table is fine. You can put all the columns common across all types there, such as product name, description, cost and price, to name a few. So it's not just an auto-increment ID. Having an internal ID of type int or long int as the primary key is recommended. You may also add another field, "code" or whatever you want to call it, for a user-entered or user-friendly identifier, which is common in product management systems. Make sure you index it if it is used in searches or query criteria.
HTH

While this can't be done completely relationally, you can still normalize your tables some and make it a little easier to code around.
You can have these tables:
-- what are the products?
Products (Id, ProductTypeId, Name)
-- what kind of product is it?
ProductTypes (Id, Name)
-- what attributes can a product have?
Attributes (Id, Name, ValueType)
-- what are the attributes that come with a specific product type?
ProductTypeAttributes (Id, ProductTypeId, AttributeId)
-- what are the values of the attributes for each product?
ProductAttributes (ProductId, ProductTypeAttributeId, Value)
So for a Phone and TV:
ProductTypes (1, Phone) -- a phone type of product
ProductTypes (2, TV) -- a tv type of product
Attributes (1, ScreenSize, integer) -- how big is the screen
Attributes (2, Has4G, boolean) -- does it get 4g?
Attributes (3, HasCoaxInput, boolean) -- does it have an input for coaxial cable?
ProductTypeAttributes (1, 1, 1) -- a phone has a screen size
ProductTypeAttributes (2, 1, 2) -- a phone can have 4g
-- a phone does not have coaxial input
ProductTypeAttributes (3, 2, 1) -- a tv has a screen size
ProductTypeAttributes (4, 2, 3) -- a tv can have coaxial input
-- a tv does not have 4g (simple example)
Products (1, 1, CoolPhone) -- product 1 is a phone called coolphone
Products (2, 1, AwesomePhone) -- prod 2 is a phone called awesomephone
Products (3, 2, CoolTV) -- prod 3 is a tv called cooltv
Products (4, 2, AwesomeTV) -- prod 4 is a tv called awesometv
ProductAttributes (1, 1, 6) -- coolphone has a 6 inch screen
ProductAttributes (1, 2, True) -- coolphone has 4g
ProductAttributes (2, 1, 4) -- awesomephone has a 4 inch screen
ProductAttributes (2, 2, False) -- awesomephone has NO 4g
ProductAttributes (3, 3, 70) -- cooltv has a 70 inch screen
ProductAttributes (3, 4, True) -- cooltv has coax input
ProductAttributes (4, 3, 19) -- awesometv has a 19 inch screen
ProductAttributes (4, 4, False) -- awesometv has NO coax input
The reason this is not fully relational is that you'll still need to evaluate the value type (bool, int, etc) of the attribute before you can use it in a meaningful way in your code.
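To make that last point concrete, here is a small sketch in Python with sqlite3 that loads the example rows above and converts each stored value according to its ValueType. Storing Value as text and the helper names cast_value and product_attributes are assumptions of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ProductTypes (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Products (Id INTEGER PRIMARY KEY, ProductTypeId INTEGER, Name TEXT);
CREATE TABLE Attributes (Id INTEGER PRIMARY KEY, Name TEXT, ValueType TEXT);
CREATE TABLE ProductTypeAttributes (Id INTEGER PRIMARY KEY, ProductTypeId INTEGER, AttributeId INTEGER);
-- Value is stored as text (an assumption); the real type lives in Attributes.ValueType
CREATE TABLE ProductAttributes (ProductId INTEGER, ProductTypeAttributeId INTEGER, Value TEXT);

INSERT INTO ProductTypes VALUES (1, 'Phone'), (2, 'TV');
INSERT INTO Attributes VALUES (1, 'ScreenSize', 'integer'), (2, 'Has4G', 'boolean'),
                              (3, 'HasCoaxInput', 'boolean');
INSERT INTO ProductTypeAttributes VALUES (1, 1, 1), (2, 1, 2), (3, 2, 1), (4, 2, 3);
INSERT INTO Products VALUES (1, 1, 'CoolPhone'), (2, 1, 'AwesomePhone'),
                            (3, 2, 'CoolTV'), (4, 2, 'AwesomeTV');
INSERT INTO ProductAttributes VALUES (1, 1, '6'), (1, 2, 'True'), (2, 1, '4'), (2, 2, 'False'),
                                     (3, 3, '70'), (3, 4, 'True'), (4, 3, '19'), (4, 4, 'False');
""")

def cast_value(raw, value_type):
    """Application-side conversion: the database cannot do this for us."""
    if value_type == "integer":
        return int(raw)
    if value_type == "boolean":
        return raw == "True"
    return raw  # unknown types are passed through as text

def product_attributes(conn, product_id):
    """Return {attribute name: typed value} for one product."""
    rows = conn.execute("""
        SELECT a.Name, a.ValueType, pa.Value
        FROM ProductAttributes pa
        JOIN ProductTypeAttributes pta ON pta.Id = pa.ProductTypeAttributeId
        JOIN Attributes a ON a.Id = pta.AttributeId
        WHERE pa.ProductId = ?
    """, (product_id,)).fetchall()
    return {name: cast_value(raw, value_type) for name, value_type, raw in rows}
```

With the sample data, product_attributes(conn, 1) gives {'ScreenSize': 6, 'Has4G': True}. The cast step is exactly the part that the schema itself cannot express.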

Related

Preventing aggregation along dimensions' attributes

Say I have this schema (sorry for the slightly convoluted example):
CREATE TABLE Sales
(
ID INT PRIMARY KEY,
Shop NVARCHAR(MAX),
ShopLocationLeft NVARCHAR(MAX),
ShopLocationRight NVARCHAR(MAX),
Amount DECIMAL
)
INSERT INTO Sales VALUES
(1, 'Shop #1', 'New', 'York', 10000),
(2, 'Shop #2', 'New', 'Delhi', 1000),
(3, 'Shop #3', 'North', 'York', 5000)
Then I create a cube with a Shop dimension with 3 attributes:
Name (column Shop)
Location Left (column ShopLocationLeft)
Location Right (column ShopLocationRight)
I can explore the cube along this dimension:
SELECT
[Amount] ON COLUMNS,
[Shop].[Name].Children ON ROWS
FROM
[Sales]
To get:
Amount
Shop #1 10000
Shop #2 1000
Shop #3 5000
So far so good.
But using other attributes like Location Left:
SELECT
[Amount] ON COLUMNS,
[Shop].[Location Left].Children ON ROWS
FROM
[Sales]
We get:
Amount
New 11000
North 5000
So the cube is allowing exploration and aggregation 1 level deeper than the dimension, along the attributes, making them some kind of sub-dimensions.
Which in this case has no business meaning.
I was expecting that, like an SQL SELECT, this would display the Location Left column instead:
Amount
New 10000
New 1000
North 5000
Because for me this dimension has 3 points:
('Shop #1', 'New', 'York')
('Shop #2', 'New', 'Delhi')
('Shop #3', 'North', 'York')
Which should be considered atomic entities that can't be broken down further.
I understand that this behavior can be useful (e.g. for first and last name) but in this case it does not make any sense.
Or if I had defined an n-levels hierarchy for an attribute (e.g. country -> city -> location) it would be logical too as I would have explicitly asked for a deeper exploration and aggregation.
How can I prevent this behavior when it would lead to irrelevant results?
If you have an attribute Location Left in your Shop dimension, you can choose ID as the Key column and Location Left as the Name column of this attribute (in the Dimension Structure tab, right-click on the Location Left attribute and select Properties, then look for the KeyColumn and NameColumn properties). If you do this, you will see 'New' being displayed multiple times in the results.
If you have an attribute say Location Left and choose the same Location Left both as the Key column and as the Name column, you will see only one entry per Location Left Name.
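A loose SQL analogy may help show why the key column matters. This is only a sketch in Python with sqlite3, not Analysis Services itself: grouping by the displayed name merges members, while grouping by the row ID and merely displaying the name keeps them apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sales (
    ID INTEGER PRIMARY KEY,
    Shop TEXT,
    ShopLocationLeft TEXT,
    ShopLocationRight TEXT,
    Amount NUMERIC
);
INSERT INTO Sales VALUES
    (1, 'Shop #1', 'New', 'York', 10000),
    (2, 'Shop #2', 'New', 'Delhi', 1000),
    (3, 'Shop #3', 'North', 'York', 5000);
""")

# Key = Location Left, Name = Location Left: members with the same name collapse.
by_name = conn.execute("""
    SELECT ShopLocationLeft, SUM(Amount)
    FROM Sales
    GROUP BY ShopLocationLeft
    ORDER BY ShopLocationLeft
""").fetchall()

# Key = ID, Name = Location Left: one member per row, the name is only a label.
by_key = conn.execute("""
    SELECT ShopLocationLeft, SUM(Amount)
    FROM Sales
    GROUP BY ID, ShopLocationLeft
    ORDER BY ID
""").fetchall()

print(by_name)  # [('New', 11000), ('North', 5000)]
print(by_key)   # [('New', 10000), ('New', 1000), ('North', 5000)]
```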

Best way to use compound Index to query with multiple combination of query parameters?

I am building a feature to estimate inventory for my ad-serving platform. The fields I am trying to estimate on, with their cardinalities, are as below:
FIELD: CARDINALITY
location: 10000 (bengaluru, chennai etc..)
n/w speed : 6 (w, 4G, 3G, 2G, G, NA)
priceRange : 10 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
users: contains number of users falling under any of the above combination.
Ex. {'location':'bengaluru', 'n/w':'4G', priceRange:8, users: 1000}
means 1000 users are from bengaluru having 4G and priceRange = 8
So the total number of combinations can be 10000 * 6 * 10 = 600000. In the future more fields can be added, up to around 29 (currently there are 3: location, n/w, priceRange), and the total number of combinations can reach the order of 10mn. Now I want to estimate how many users fall under a given combination.
Now queries I will need are as follows:
1) find all users who are from location:bengaluru , n/w:3G, priceRange: 6
2) find all users from bengaluru
3) Find all users falling under n/w: 3G and priceRange: 8
What is the best possible way to approach this?
Which database is best suited for this requirement? What indexes do I need to build? Will a compound index help? If yes, then how? Any help is appreciated.
Here's my final answer:
Create table Attribute(
ID int,
Name varchar(50));
Create table AttributeValue(
ID int,
AttributeID int,
Value varchar(50));
Create table userAttributeValue(
userID int,
AttributeID int,      -- references Attribute.ID
AttributeValue int);  -- references AttributeValue.ID
Create table User(
ID int);
Insert into user (ID) values (1),(2),(3),(4),(5);
Insert into Attribute (ID,Name) Values (1,'Location'),(2,'nwSpeed'),(3,'PriceRange');
Insert into AttributeValue values
(1,1,'bengaluru'),(2,1,'chennai'),
(3,2, 'w'), (4, 2,'4G'), (5,2,'3G'), (6,2,'2G'), (7,2,'G'), (8,2,'NA'),
(9,3,'1'), (10,3,'2'), (11,3,'3'), (12,3,'4'), (13,3,'5'), (14,3,'6'), (15,3,'7'), (16,3,'8'), (17,3,'9'), (18,3,'10');
Insert into UserAttributeValue (userID, AttributeID, AttributeValue) values
(1,1,1),
(1,2,5),
(1,3,9),
(2,1,1),
(2,2,4),
(3,2,6),
(2,3,13),
(4,1,1),
(4,2,4),
(4,3,13),
(5,1,1),
(5,2,5),
(5,3,13);
Select USERID
from UserAttributeValue
where (AttributeID,AttributeValue) in ((1,1),(2,4))
GROUP BY USERID
having count(distinct concat(AttributeID,AttributeValue))=2
Now if you need a count, wrap userID in count and divide by the number of attributes passed in: each user will have 1 record per attribute, so to get the "count of users" you'd need to divide by the number of attributes.
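Another way to get the user count, sketched here in Python with sqlite3, is to wrap the grouped query in a subquery and count its rows. The helper name count_users is hypothetical, and the sketch assumes the filter is non-empty and names each attribute at most once, so it counts distinct AttributeID values rather than concatenating the pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UserAttributeValue (userID INTEGER, AttributeID INTEGER, AttributeValue INTEGER);
INSERT INTO UserAttributeValue (userID, AttributeID, AttributeValue) VALUES
    (1,1,1), (1,2,5), (1,3,9),
    (2,1,1), (2,2,4), (3,2,6), (2,3,13),
    (4,1,1), (4,2,4), (4,3,13),
    (5,1,1), (5,2,5), (5,3,13);
""")

def count_users(conn, pairs):
    """Count users that match ALL (AttributeID, AttributeValue) pairs.

    Assumes pairs is non-empty and lists each attribute at most once.
    """
    # Only placeholders are interpolated into the SQL; values stay parameterized.
    where = " OR ".join("(AttributeID = ? AND AttributeValue = ?)" for _ in pairs)
    params = [value for pair in pairs for value in pair]
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT userID
            FROM UserAttributeValue
            WHERE {where}
            GROUP BY userID
            HAVING COUNT(DISTINCT AttributeID) = ?
        )
    """
    return conn.execute(sql, params + [len(pairs)]).fetchone()[0]

print(count_users(conn, [(1, 1), (2, 4)]))  # bengaluru AND 4G -> 2 (users 2 and 4)
```

The OR list is used instead of the row-value IN syntax only so the sketch runs on older SQLite builds as well.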
This allows for N growth of Attributes and the AttributeValues per user without changes to UI or database if UI is designed correctly.
By treating each datapoint as an attribute and storing them in one place we can enforce database integrity.
Attribute and AttributeValue tables becomes lookups for UserAttributevalue so you can translate the IDs back to attribute name and the value.
This also means we only have 4 tables user, attribute, attributeValue, and UserAttributeValue.
Technically you don't have to store attributeID on the userAttributeValue, but for performance reasons on later joins/reporting I think you'll find it beneficial.
You need to add proper primary keys, foreign keys, and indexes to the tables. They should be fairly self-explanatory. On UserAttributeValue I would have a few composite indexes, each with a different column order of the unique key. It just depends on the type of reporting/analysis you'll be doing, but adding indexes as performance tuning is needed is commonplace.
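One possible shape for those keys and indexes, as a sketch in Python with sqlite3. The composite primary key on (userID, AttributeID) assumes a user has at most one value per attribute, and the index name idx_uav_attr_value_user is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
conn.executescript("""
CREATE TABLE User (ID INTEGER PRIMARY KEY);
CREATE TABLE Attribute (
    ID INTEGER PRIMARY KEY,
    Name TEXT NOT NULL UNIQUE
);
CREATE TABLE AttributeValue (
    ID INTEGER PRIMARY KEY,
    AttributeID INTEGER NOT NULL REFERENCES Attribute(ID),
    Value TEXT NOT NULL,
    UNIQUE (AttributeID, Value)
);
CREATE TABLE UserAttributeValue (
    userID INTEGER NOT NULL REFERENCES User(ID),
    AttributeID INTEGER NOT NULL REFERENCES Attribute(ID),
    AttributeValue INTEGER NOT NULL REFERENCES AttributeValue(ID),
    -- assumption: one value per attribute per user; serves lookups by user
    PRIMARY KEY (userID, AttributeID)
);
-- Same columns in a different order: serves "which users have attribute X = value Y"
CREATE INDEX idx_uav_attr_value_user
    ON UserAttributeValue (AttributeID, AttributeValue, userID);
""")

# List the indexes SQLite created on UserAttributeValue (explicit and automatic).
index_names = {
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'UserAttributeValue'"
    )
}
```

Leading the second index with AttributeID and AttributeValue suits filters like location = bengaluru. Whether more column orders are worth their write cost depends on the actual query mix.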
Assumptions:
You're ok with all datavalues being varchar data in all cases.
If needed you could add a datatype, precision, and scale on the attribute table and allow the UI to cast the attribute value as needed. But since they are all in the same field in the database, they all have to be the same datatype, with the same precision/scale.
Pivot tables to display the data across will likely be needed and you know how to handle those (and engine supports them!)
Gotta say I loved the mental exercise, but I would still appreciate feedback from others on SO. I've used this approach in one system I've developed, and it's been in two I've supported. There are some challenges, but it does follow 3rd normal form db design (except for the replicated attributeID in userAttributeValue, but that's there for a performance gain in reporting/filtering).

Designing Organizational structure

This question was asked in an interview. Design an organizational structure where an employee can have direct reports and indirect reports (that is, reportees of a reportee). The design should be such that a single query can retrieve either direct or indirect reportees, or both.
I suggested,
Employee
----------
id
name
Reportee
------
emp_id FK
reportee_id FK
isDirect
The interviewer said the optimal solution is
Employee
-------
id
name
reporting_path like (a>b>c)
Adding an additional table takes more space, but the query will execute faster. I said that, due to string matching, the path-based approach is bad and yields poor performance.
So which approach is optimal?
The interviewer's approach is dumb because it does not use referential integrity.
For a purely hierarchical model (an employee cannot report to more than one boss), this is the best approach:
create table employees (
employee_id int primary key,
name varchar(100) not null, -- the length is arbitrary
supervisor_id int null references employees(employee_id)
);
insert into employees (employee_id, name, supervisor_id) values
(1, 'Big Boss Bill', null),
(2, 'Vice President Victor', 1),
(3, 'Underling Ulysses', 2),
(4, 'Subordinate Sam', 2);
You can then use Recursive Common Table Expressions to query reports.
Some example queries here:
http://blog.databasepatterns.com/2014/02/trees-paths-recursive-cte-postgresql.html
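As a rough sketch of such a query, here is the same table in SQLite (via Python's sqlite3) with a recursive CTE that returns direct and indirect reports in one statement. The helper name reports and the depth column are additions for illustration; depth 1 means a direct report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    supervisor_id INTEGER REFERENCES employees(employee_id)
);
INSERT INTO employees (employee_id, name, supervisor_id) VALUES
    (1, 'Big Boss Bill', NULL),
    (2, 'Vice President Victor', 1),
    (3, 'Underling Ulysses', 2),
    (4, 'Subordinate Sam', 2);
""")

def reports(conn, boss_id):
    """Return (employee_id, name, depth) for everyone below boss_id."""
    return conn.execute("""
        WITH RECURSIVE chain(employee_id, name, depth) AS (
            -- anchor: direct reports
            SELECT employee_id, name, 1
            FROM employees
            WHERE supervisor_id = ?
            UNION ALL
            -- recursive step: reports of people already in the chain
            SELECT e.employee_id, e.name, c.depth + 1
            FROM employees e
            JOIN chain c ON e.supervisor_id = c.employee_id
        )
        SELECT employee_id, name, depth
        FROM chain
        ORDER BY depth, employee_id
    """, (boss_id,)).fetchall()

for row in reports(conn, 1):
    print(row)
# (2, 'Vice President Victor', 1)
# (3, 'Underling Ulysses', 2)
# (4, 'Subordinate Sam', 2)
```

Filtering the result on depth = 1 gives only direct reports and depth > 1 only indirect ones, so both cases come out of the same query.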

When to use columns and when a separate table? - Database Design

Most people don't recommend EAV and I know some of the reasons.
However what is the difference between an EAV-approach and such an approach?
Table computer:
id, price, description
Table connections:
id, name (possible values: LAN, USB, HDMI, ..., about 10 in all)
Table connections_computer
comp_id, conn_id
Or is that EAV, too? If yes, what would be a normalized alternative?
Consider that I want to do searches like this:
All computers that have BOTH a LAN and an HDMI connection. With the separate tables I would need 1 join/filter per attribute. With 1 column per attribute, searching would be easy, but I would have many NULL values.
Any recommendation on how to do this?
Your example is a plain many-to-many. In the EAV an attribute name is a value of a column, instead of being a name of the column. For example,
computer: {computer_id, attr_name, attr_value}
insert into computer (computer_id, attr_name, attr_value)
values
(1, 'connection', 'HDMI')
, (1, 'connection', 'USB')
, (1, 'memory', '2 GB')
;
would be an EAV approach.
Here is an example of the price you pay for the EAV approach (flexibility) in a RDBMS.
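For the search in the question (computers with both LAN and HDMI), the many-to-many tables can be queried with a single join plus a GROUP BY/HAVING. Here is a sketch in Python with sqlite3. The sample rows and the helper name computers_with_all are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE computer (id INTEGER PRIMARY KEY, price NUMERIC, description TEXT);
CREATE TABLE connections (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE connections_computer (
    comp_id INTEGER NOT NULL REFERENCES computer(id),
    conn_id INTEGER NOT NULL REFERENCES connections(id),
    PRIMARY KEY (comp_id, conn_id)
);
-- made-up sample data
INSERT INTO computer VALUES (1, 900, 'desktop'), (2, 500, 'netbook'), (3, 700, 'laptop');
INSERT INTO connections VALUES (1, 'LAN'), (2, 'USB'), (3, 'HDMI');
INSERT INTO connections_computer VALUES (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 3), (3, 1);
""")

def computers_with_all(conn, names):
    """Return ids of computers that have every connection listed in names."""
    wanted = sorted(set(names))
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT cc.comp_id
        FROM connections_computer cc
        JOIN connections c ON c.id = cc.conn_id
        WHERE c.name IN ({placeholders})
        GROUP BY cc.comp_id
        HAVING COUNT(DISTINCT c.name) = ?
        ORDER BY cc.comp_id
    """
    return [row[0] for row in conn.execute(sql, wanted + [len(wanted)])]

print(computers_with_all(conn, ["LAN", "HDMI"]))  # [1, 3]
```

The join count stays at one no matter how many connection types are required, and no NULL-filled columns are needed.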

How to use Sphinx to search across large, JOINed tables?

I have several different tables in my database and I'm trying to use Sphinx to do fast full-text searches. For ease of discussion, let's say the main records of interest are packing slips, one of which is included when an order ships. How do I use Sphinx to execute complex queries across all of these tables without completely denormalizing the database?
Each packing slip lists the order number, shipper, recipient, and the tracking number of each box included with the shipment. A separate table contains information about the order items. An additional table contains the customer address information. So, orders contain boxes and boxes contain items. (Example schema listed at the bottom of this question).
I would like to be able to query Sphinx to get answers to questions like:
How many people who live on a street named "Maple" ordered an item with "large" in the description?
Which orders include the word "blue" in either the box description or the order items' description?
To answer these types of questions, I need to refer to several tables. Since Sphinx doesn't have JOINs, one option is to denormalize the database. Denormalizing using a view, so that each row represents an order item plus all of the data of its parent box and order, would result in billions of very wide rows. So I've been creating a separate index for each table instead. But that doesn't allow me to query across tables as a SQL JOIN would. Is there another solution?
Example database
CREATE TABLE orders (
id integer PRIMARY KEY,
date_ordered date,
customer_po varchar
);
INSERT INTO orders VALUES (1, '2012-12-13', NULL);
INSERT INTO orders VALUES (2, '2012-12-14', 'DF312442');
CREATE TABLE parties (
id integer PRIMARY KEY,
order_id integer NOT NULL REFERENCES orders(id),
party_type varchar,
company varchar,
city varchar,
state char(2)
);
INSERT INTO parties VALUES (1, 1, 'shipper', 'ACME, Inc.', 'New York', 'NY');
INSERT INTO parties VALUES (2, 1, 'recipient', 'Wylie Coyote Corp.', 'Flagstaff', 'AZ');
INSERT INTO parties VALUES (3, 2, 'shipper', 'Cyberdyne', 'Las Vegas', 'NV');
-- Please disregard the fact that this design permits multiple shippers and multiple recipients
-- per order. This is a vastly simplified version of the system I'm working on.
CREATE TABLE boxes (
id integer PRIMARY KEY,
order_id integer NOT NULL REFERENCES orders(id),
tracking_num varchar NOT NULL,
description varchar NOT NULL
);
INSERT INTO boxes VALUES (1, 1, '1234567890', 'household goods');
INSERT INTO boxes VALUES (2, 1, '0987654321', 'kitchen appliances');
INSERT INTO boxes VALUES (3, 2, 'ABCDE12345', 'audio equipment');
CREATE TABLE box_contents (
id integer PRIMARY KEY,
order_id integer NOT NULL REFERENCES orders(id),
box integer NOT NULL REFERENCES boxes(id),
qty_units integer,
description varchar
);
INSERT INTO box_contents VALUES (1, 1, 1, 4, 'cookbook');
INSERT INTO box_contents VALUES (2, 1, 1, 2, 'baby bottle');
INSERT INTO box_contents VALUES (3, 1, 2, 1, 'television');
INSERT INTO box_contents VALUES (4, 2, 3, 2, 'lamp');
You put the JOIN in the sql_query that builds the index. The tables remain normalized, but you denormalize when building the index.
It's only a basic example, but your query would be something like:
sql_query = SELECT o.id,customer_po,UNIX_TIMESTAMP(date_ordered) AS date_ordered, \
GROUP_CONCAT(DISTINCT party_type) AS party_type, \
GROUP_CONCAT(DISTINCT company) AS company, \
GROUP_CONCAT(DISTINCT city) AS city, \
GROUP_CONCAT(DISTINCT description) AS description \
FROM orders o \
INNER JOIN parties p ON (o.id = p.order_id) \
INNER JOIN box_contents b ON (o.id = b.order_id) \
GROUP BY o.id \
ORDER BY NULL
Update: alternatively, you can use sql_joined_field to do the same thing while avoiding actual joins in sql_query. Sphinx then does the join process for you.
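To see what one flattened row per order might look like before it reaches the indexer, here is a sketch that runs a similar query on a trimmed copy of the sample data. SQLite (through Python's sqlite3) stands in for the real source database, so the Sphinx config itself and UNIX_TIMESTAMP are not exercised. The helper name flattened_orders is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, date_ordered TEXT, customer_po TEXT);
CREATE TABLE parties (id INTEGER PRIMARY KEY, order_id INTEGER, party_type TEXT,
                      company TEXT, city TEXT, state TEXT);
CREATE TABLE box_contents (id INTEGER PRIMARY KEY, order_id INTEGER, box INTEGER,
                           qty_units INTEGER, description TEXT);
INSERT INTO orders VALUES (1, '2012-12-13', NULL), (2, '2012-12-14', 'DF312442');
INSERT INTO parties VALUES
    (1, 1, 'shipper', 'ACME, Inc.', 'New York', 'NY'),
    (2, 1, 'recipient', 'Wylie Coyote Corp.', 'Flagstaff', 'AZ'),
    (3, 2, 'shipper', 'Cyberdyne', 'Las Vegas', 'NV');
INSERT INTO box_contents VALUES
    (1, 1, 1, 4, 'cookbook'), (2, 1, 1, 2, 'baby bottle'),
    (3, 1, 2, 1, 'television'), (4, 2, 3, 2, 'lamp');
""")

def flattened_orders(conn):
    """One row per order, with child-table text folded into comma-separated fields."""
    return conn.execute("""
        SELECT o.id, o.customer_po,
               GROUP_CONCAT(DISTINCT p.party_type) AS party_type,
               GROUP_CONCAT(DISTINCT p.company)    AS company,
               GROUP_CONCAT(DISTINCT p.city)       AS city,
               GROUP_CONCAT(DISTINCT b.description) AS description
        FROM orders o
        INNER JOIN parties p ON o.id = p.order_id
        INNER JOIN box_contents b ON o.id = b.order_id
        GROUP BY o.id
        ORDER BY o.id
    """).fetchall()

for row in flattened_orders(conn):
    print(row)
```

Each order becomes a single document whose text fields hold every party and item description, which is what lets a full-text match on "blue" or "Maple" hit the parent order. The order of values inside each concatenated field is not guaranteed.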
