I have a scenario where three types of entities share the same set of fields (except their primary keys).
Below is a sample. I would like to know whether it is a better idea to group the common fields into a single table. If we create a common table, how can we give the FK reference to the corresponding primary key table? What would be the better approach?
tblCountry         tblState           tblCity
----------         --------           -------
CountryId          StateId            CityId
Name               CountryId          StateId
officialLanguage   officialLanguage   officialLanguage
officialFlag       officialFlag       officialFlag
officialFlower     officialFlower     officialFlower
officialAnimal     officialAnimal     officialAnimal
officialBird       officialBird       officialBird
...                ...                ...
Your intended third normal form (3NF) is good as it is.
From a simplicity point of view (affecting joins) it is as good as it can be, and the foreign keys between country, state, and city are obvious and trivial.
Now, to save yourself from copying some column names, you could put all three elements (country, state, and city) into a single table, effectively reducing the design to second normal form. With this, the meaning of your column names starts to depend on context: officialLanguage could be country-, state-, or city-related, and that is no longer obvious from the stored table design, only from interpreting a multi-column key.
So, in short, by saving some typing/copying you complicate all further work, using one table with a convoluted meaning instead of three tables with clear meanings.
As for data selection, ambiguity is only a problem if you don't use aliases.
Consider selecting the officialLanguage of a city in a country:
SELECT
    name,
    officialLanguage,
    name,
    officialLanguage
FROM city
INNER JOIN state
    ON state.stateid = city.stateid
INNER JOIN country
    ON country.countryid = state.countryid
;
This will fail because the chosen columns are ambiguous.
Now consider this query (the aliases are shortened just for demonstration; personally I tend to use aliases of up to ten letters):
SELECT
    cit.name AS city_name,
    cit.officialLanguage AS city_language,
    cou.name AS country_name,
    cou.officialLanguage AS country_language
FROM city AS cit
INNER JOIN state AS sta
    ON sta.stateid = cit.stateid
INNER JOIN country AS cou
    ON cou.countryid = sta.countryid
;
This is very clear and concise. I can use the country table in different queries without having to pre-select countries from a table with intermingled objects such as country, state, and city.
The only downside to this approach is the multi-table join, but with properly indexed tables that cost is small.
Also, since there are quite a few countries, states, and cities across the world, the single-table approach can become a performance issue down the line.
4NF (or at least BCNF, otherwise known as 3.5NF) performs best in joins when properly indexed, with the trade-off that the joins can become complex to write. For database engines, however, these are the easiest to read.
2NF (or "Excel tables", as I call them) is the easiest for humans to read, but it requires complicated joins and/or conditions (WHERE clauses) to properly identify just a subset.
For the database design, use at minimum 3NF or better, then prepare views to turn the data back into 2NF to make it human-readable.
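For illustration, such a view might look like this (a minimal sketch assuming the three-table design above; the view name and column list are made up):

-- Presents the normalized tables as one flat, human-readable row per city,
-- while the stored data stays in 3NF.
CREATE VIEW city_overview AS
SELECT
    cou.name             AS country_name,
    cou.officialLanguage AS country_language,
    sta.name             AS state_name,
    sta.officialLanguage AS state_language,
    cit.name             AS city_name,
    cit.officialLanguage AS city_language
FROM city AS cit
INNER JOIN state AS sta
    ON sta.stateid = cit.stateid
INNER JOIN country AS cou
    ON cou.countryid = sta.countryid;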
You're asking for advice around when you should consider further normalisation on a schema.
The answer from #KnutBoehnert is quite detailed; my answer is really a supplement to it.
The goal of normalizing the database structure is NOT to reduce duplicated field names to a single reference, but to reduce duplication of the data stored within those fields. Providing a definitive answer for a given schema would normally require a set of records, but your dataset corresponds closely enough to real-world norms that we can talk about some hypotheticals.
This is different from inheritance or composition in OO programming, where I would strongly encourage you to encapsulate these fields either as properties on a common base class shared by the objects that represent records from these tables, or as an interface that they all compose into their structure.
Only when Country/State/City commonly share the same values for the duplicated field names is there a strong argument to refactor this structure and introduce a separate table to hold the values of Language, Flag, Flower, Animal, and Bird.
If you did create a separate table for this information, how often would its records be re-used? For instance, how many different countries are going to reference the same record? Most likely none, as the flag is usually unique for each country, and certainly the combination of all those fields will be unique per country. For states the same statement is usually true. If there is no re-use of the records in this new table, i.e. if the relationship is always 1:1, then the database and the queries are optimized by leaving the fields in the tables the way you have them now.
1:1 relationships do have a place, especially if for a given conceptual record there are distinct use cases where one set of fields is updated or queried in isolation from another set, and the two sets have very different query rates. On its own, however, without further supporting reasons, a 1:1 relationship in a schema can be simplified by merging the two tables into one record.
If, for instance, all states within a country and all cities within a state were expected to have the same Language, then you would not need the field at the lower levels at all; you could use joins to access the Language field from the Country record. But in the physical world many countries have states with local languages or dialects that differ from the country as a whole, so I don't think this applies to this particular schema.
In fact, I see no reason for your current schema to change. The structure as it is now even allows you to use null values to indicate that a value should be coalesced from the parent level.
In Australia, for instance, all of the State and City records are more than likely to have a Language value of English. We can use null-coalescing expressions to avoid entering and maintaining this value in each State and City record: if all Australian states and cities left the officialLanguage field null, we could still coalesce it from the parent country:
SELECT
    country.Name AS country_name,
    country.officialLanguage AS country_language,
    state.name AS state_name,
    IsNull(state.officialLanguage, country.officialLanguage) AS state_language,
    city.name AS city_name,
    COALESCE(city.officialLanguage, state.officialLanguage, country.officialLanguage) AS city_language
FROM tblCity AS city
INNER JOIN tblState AS state
    ON state.StateId = city.StateId
INNER JOIN tblCountry AS country
    ON country.CountryId = state.CountryId
Querying the records of just the city table in this way could look like the following. (I'm not sure it makes sense to coalesce the other fields, but it could be done.)
SELECT
    city.CityId,
    city.StateId,
    state.CountryId,
    ...
    COALESCE(city.officialLanguage, state.officialLanguage, country.officialLanguage) AS officialLanguage,
    COALESCE(city.officialFlag, state.officialFlag, country.officialFlag) AS officialFlag,
    COALESCE(city.officialFlower, state.officialFlower, country.officialFlower) AS officialFlower,
    COALESCE(city.officialAnimal, state.officialAnimal, country.officialAnimal) AS officialAnimal,
    COALESCE(city.officialBird, state.officialBird, country.officialBird) AS officialBird
FROM tblCity AS city
INNER JOIN tblState AS state
    ON state.StateId = city.StateId
INNER JOIN tblCountry AS country
    ON country.CountryId = state.CountryId
In many real world applications of this null coalescing concept the application layer would handle the display of these fields differently, but the storage is what we are most concerned about for this question today.
I am new to data modeling and I'm having trouble coming up with a data model that can store logic.
The data model would be used to store location and marketing attributes.
When a customer visits one of the company's websites, they enter their zip code, and based on their location the attributes would be used to arrange the online catalog of items.
The catalog of items would be separate from the database, so the data model would only produce the output of attributes used to arrange the items. Each item in the catalog has attributes such as ItemNumber, Price, Condition, Manufacture, and marketing segments (Age:Adult, Education: College, Income:High, etc.).
**For example:**
**Input zip code**: 90210
**Output Attributes**: (ItemNumber:123456, Segment:HighIncome, Condition:New)
This example is saying for zip 90210, first show item #123456, followed by all of the items with the HighIncome segment, and then display all of the non-refurbished items.
So far I have 2 tables with a many-to-many relationship, and I would like to add an additional table (or tables) so I can incorporate logic (AND & OR).
The first table would have location information and other details about which of the company's sites the user is on.
Table Location (
    Location_Unique_Identifier  number,
    ZipCode                     varchar2,
    State                       varchar2,
    Site                        varchar2,
    ...
)
The second table would have the attribute types (Manufacture, Price, Condition, etc.) and the attribute values (IBM, 10.00, Refurbished, etc.).
Table Attributes (
    Attribute_Unique_Identifier  number,
    Attribute_Type               varchar2,
    Attribute_Value              varchar2,
    ...
)
In between these two tables, to break up the many-to-many relationship, I would add the logic table. This table should allow me to output
item#123456 AND (item#768900 OR Condition:New)
The problem I am having with the logic table is making it flexible enough to handle an unknown number of AND/ORs and to handle the grouping.
This is a typical scenario of JOINing two (or many) tables together to apply AND/OR/XOR or other logical conditions.
The best choice is to build a materialized view that denormalizes the attributes from multiple tables into one table (this table is called a view).
In your case, the view might be:
table location_join_attributes {
    number,
    zipcode,
    state,
    site,
    Manufacture,
    Price,
    Condition,
    ......
}
Then you can run your logical statements against this table/view, e.g. (modified from your example):
item#123456 OR (item#768900 AND Condition:New) AND (more condition)
Without this view, the operation would first fetch all the records that have item #768900 and then filter against the second table to find which of them have Condition:New. That takes a long time to finish, and if the condition is complex the performance is terrible.
For quick queries, you should build secondary indexes on the columns you operate on.
On the scalability side, if your business logic changes you can build a new view and discard the old one. The original tables do not change, which is another advantage of a materialized view.
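As a rough sketch (Oracle-style syntax to match the number/varchar2 columns in the question; the junction table location_attribute and all names here are assumptions, not from the original post), the materialized view and its index might look like:

CREATE MATERIALIZED VIEW location_join_attributes AS
SELECT
    loc.Location_Unique_Identifier,
    loc.ZipCode,
    loc.State,
    loc.Site,
    attr.Attribute_Type,
    attr.Attribute_Value
FROM Location loc
JOIN location_attribute la   -- assumed junction table resolving the M:N link
    ON la.Location_Unique_Identifier = loc.Location_Unique_Identifier
JOIN Attributes attr
    ON attr.Attribute_Unique_Identifier = la.Attribute_Unique_Identifier;

-- Secondary index on the columns used in the logical conditions:
CREATE INDEX idx_lja_attr
    ON location_join_attributes (Attribute_Type, Attribute_Value);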
I am having a lot of trouble finding the best design solution for this situation. I have two tables with a common base. Currently I have designed it like this: I have an order table (the common base):
[order_table]
order_id
order_type
company
created
I have another table with reference to the order table:
[product_order]
order_id fk
product_id
quantity
price
I have a second table with a reference to the order table:
[special_order]
order_id fk
description
price_estimate
color
size
Both tables share the same order_id, which I like. I often have to run large queries on order_table using only the information available in that table, say company = 200. But for each result I also need its data from product_order or special_order, depending on its type. So the only workable solution I see is to left-join both tables on order_id and filter the information afterwards. The only other option I see is to add the common columns to each table, but then I would have a lot of reorganizing to do afterwards to get them in the correct order.
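In SQL the left-join query would look roughly like this (a sketch; the type-specific columns simply come back NULL for the other type):

SELECT
    o.order_id, o.order_type, o.company, o.created,
    p.product_id, p.quantity, p.price,
    s.description, s.price_estimate, s.color, s.size
FROM order_table AS o
LEFT JOIN product_order AS p ON p.order_id = o.order_id
LEFT JOIN special_order AS s ON s.order_id = o.order_id
WHERE o.company = 200;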
Is there a better way to organize the data?
So those extra tables are extra attributes for a specific order_id (1:1)?
I'd consider adding all the fields to the common table, or at least the fields from the most-used sub-table.
If that's not appropriate, you may want to add a Type column to the common table and let a trigger manage insert/delete of the related records, to avoid the fuss with orphans etc.
Use views with your left joins (wouldn't inner joins be better?) to fetch the different types.
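For example (a sketch; the view names are made up):

CREATE VIEW product_orders AS
SELECT o.order_id, o.order_type, o.company, o.created,
       p.product_id, p.quantity, p.price
FROM order_table AS o
INNER JOIN product_order AS p ON p.order_id = o.order_id;

CREATE VIEW special_orders AS
SELECT o.order_id, o.order_type, o.company, o.created,
       s.description, s.price_estimate, s.color, s.size
FROM order_table AS o
INNER JOIN special_order AS s ON s.order_id = o.order_id;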
I want to save geographical data in a relational DB and be able to query for data based on location (country, state, or similar, not coordinates).
My current solution is to have 4 extra fields (all countries I'm interested in have 2 or 3 administrative divisions) in my table and filter on strings. But I realize that this is a bad solution and would like to normalize my table.
I will also use that data to determine which page my users wants to visit, so it must be simple to lookup a request like "/usa/california/san_fransisco/..."
The only other solution I can come up with is to store those 4 extra fields in another table and link them with a foreign key, but that would still mean some data duplication, as the country name would be duplicated in a lot of rows.
Is there any better way of doing this?
Normalizing is definitely the way to go. Databases are designed to function that way. Yes the query might look long but it's not that bad. It might look something like this:
select * --or whatever fields you need
from Customer
left outer join City on (Customer.CityID = City.CityID)
left outer join State on (City.StateID = State.StateID)
left outer join Country on (State.CountryID = Country.CountryID)
where CustomerID = 1234
You're on the right track with putting the info in tables. They're called lookup tables. If you want to go the full relational route, the entity links with a foreign key to the city lookup table, the city table links to the state table, and the state table links to the country table. You could also store a text version of the complete location in the entity's original table for display.
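A sketch of that chain (the data types are illustrative assumptions):

CREATE TABLE Country (
    CountryID INTEGER PRIMARY KEY,
    Name      VARCHAR(100) NOT NULL
);

CREATE TABLE State (
    StateID   INTEGER PRIMARY KEY,
    CountryID INTEGER NOT NULL REFERENCES Country(CountryID),
    Name      VARCHAR(100) NOT NULL
);

CREATE TABLE City (
    CityID  INTEGER PRIMARY KEY,
    StateID INTEGER NOT NULL REFERENCES State(StateID),
    Name    VARCHAR(100) NOT NULL
);

-- The entity's own table then only needs a CityID foreign key.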
My current solution is to have 4 extra fields (all countries I'm interested in have 2 or 3 administrative divisions) in my table and filter on strings. But I realize that this is a bad solution and would like to normalize my table.
I don't think that this is a bad solution. Storing simple geographical/address-based information per row and using WHERE to fetch all records that match is fairly standard procedure. Using a foreign key to link to a separate table is going to be additional work and won't be any faster.
The searching/request using a RESTful interface (as you suggested) is a good idea, however.
Go the normalized route. Joining tables is NOT slow or complicated. The PK of each table will be an integer with a clustered index, the foreign keys will have indexes, and the join is going to fly.
If you want to list cities in a drop-down list, you don't want duplicates; you may list all the cities under a state. De-normalized data will slow that query down with DISTINCT; I guarantee the de-normalized route is slower here. Ironic?
But there is a case for de-normalization. There are millions of addresses, and it will probably not be feasible to enter all of them into your application, so you are going to rely on free-text input from the user. In that case you don't care about exact correctness or duplicates; you are forced to accept whatever data is thrown at you, given the impossibility of having exhaustive data to validate against. And you would rather not bother inserting into lookup tables, since you don't trust the input to begin with.
You could go for a recursive model if you want ultra flexibility to handle different countries. Some countries may not have states, counties, etc.; each has its own hierarchy.
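A sketch of that recursive model (all names are illustrative):

CREATE TABLE Region (
    RegionID   INTEGER PRIMARY KEY,
    ParentID   INTEGER REFERENCES Region(RegionID), -- NULL at the top level
    RegionType VARCHAR(20) NOT NULL,  -- 'country', 'state', 'county', ...
    Name       VARCHAR(100) NOT NULL
);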
What is the best way to store settings for certain objects in my database?
Method one: Using a single table
Table: Company {CompanyID, CompanyName, AutoEmail, AutoEmailAddress, AutoPrint, AutoPrintPrinter}
Method two: Using two tables
Table Company {CompanyID, CompanyName}
Table CompanySettings {CompanyID, AutoEmail, AutoEmailAddress, AutoPrint, AutoPrintPrinter}
I would take things a step further...
Table 1 - Company
CompanyID (int)
CompanyName (string)
Example
CompanyID 1
CompanyName "Swift Point"
Table 2 - Contact Types
ContactTypeID (int)
ContactType (string)
Example
ContactTypeID 1
ContactType "AutoEmail"
Table 3 - Company Contact
CompanyID (int)
ContactTypeID (int)
Addressing (string)
Example
CompanyID 1
ContactTypeID 1
Addressing "name#address.blah"
This solution gives you extensibility as you won't need to add columns to cope with new contact types in the future.
SELECT
[company].CompanyID,
[company].CompanyName,
[contacttype].ContactTypeID,
[contacttype].ContactType,
[companycontact].Addressing
FROM
[company]
INNER JOIN
[companycontact] ON [companycontact].CompanyID = [company].CompanyID
INNER JOIN
[contacttype] ON [contacttype].ContactTypeID = [companycontact].ContactTypeID
This would give you multiple rows for each company. A row for "AutoEmail" a row for "AutoPrint" and maybe in the future a row for "ManualEmail", "AutoFax" or even "AutoTeleport".
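For reference, the three tables could be declared like this (a sketch; the data types are assumptions):

CREATE TABLE Company (
    CompanyID   INT PRIMARY KEY,
    CompanyName VARCHAR(100) NOT NULL
);

CREATE TABLE ContactType (
    ContactTypeID INT PRIMARY KEY,
    ContactType   VARCHAR(50) NOT NULL
);

CREATE TABLE CompanyContact (
    CompanyID     INT NOT NULL REFERENCES Company(CompanyID),
    ContactTypeID INT NOT NULL REFERENCES ContactType(ContactTypeID),
    Addressing    VARCHAR(255) NOT NULL,
    PRIMARY KEY (CompanyID, ContactTypeID)  -- one address per contact type per company
);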
Response to HLEM.
Yes, this is indeed the EAV model. It is useful where you want to have an extensible list of attributes with similar data. In this case, varying methods of contact with a string that represents the "address" of the contact.
If you didn't want to use the EAV model, you should next consider relational tables, rather than storing the data in flat tables. This is because this data will almost certainly extend.
Neither EAV model nor the relational model significantly slow queries. Joins are actually very fast, compared with (for example) a sort. Returning a record for a company with all of its associated contact types, or indeed a specific contact type would be very fast. I am working on a financial MS SQL database with millions of rows and similar data models and have no problem returning significant amounts of data in sub-second timings.
In terms of complexity, this isn't the most technical design in terms of database modelling and the concept of joining tables is most definitely below what I would consider to be "intermediate" level database development.
I would consider whether you need one or two tables based on the following criteria:
First, are you close to the record storage limit? Then two tables, definitely.
Second, will you usually be querying the information you plan to put in the second table most of the time you query the first table? Then one table might make more sense. If you usually do not need the extended information, a separate (and less wide) table should improve performance on the main data queries.
Third, how strong is the possibility that you will ever need multiple values? If it is one-to-one now, but something like an email address or phone number has a strong possibility of morphing into multiple rows, go ahead and make it a related table. If you know there is no chance, or only a small one, it is OK to keep one table, assuming the table isn't too wide.
EAV tables look nice and appear to save future work, but in reality they don't. Generally, if you need to add another type, you still have to do future work to adjust queries etc. Writing a script to add a column takes all of five minutes; the other work will be needed regardless of the structure. EAV tables are also very hard to query when you don't know how many records you will need to pull, because normally you want them on one line and will get the information by joining to the same table multiple times. This causes performance problems and locking, especially if this table is central to your design. Don't use this method.
It depends on whether you will ever need more information about a company. If you notice yourself adding fields like CompanyPhoneNumber1, CompanyPhoneNumber2, etc., then method two is better, as you would separate your entities and just reference a company ID. If you do not plan to make such changes and you feel this table will never change, then method one is fine.
Usually, if you don't have data duplication then a single table is fine.
In your case you don't so the first method is OK.
I use one table if I estimate the data from the "second" table will be used in more than 50% of my queries. I use two tables if I need multiple copies of the data (i.e., multiple phone numbers, email addresses, etc.).
This is a scenario I've seen in multiple places over the years; I'm wondering if anyone else has run across a better solution than I have...
My company sells a relatively small number of products, however the products we sell are highly specialized (i.e. in order to select a given product, a significant number of details must be provided about it). The problem is that while the amount of detail required to choose a given product is relatively constant, the kinds of details required vary greatly between products. For instance:
Product X might have identifying characteristics like (hypothetically)
'Color',
'Material',
'Mean Time to Failure'
but Product Y might have characteristics
'Thickness',
'Diameter',
'Power Source'
The problem (one of them, anyway) in creating an order system that utilizes both Product X and Product Y is that an Order Line has to refer, at some point, to what it is "selling". Since Product X and Product Y are defined in two different tables - and denormalization of products using a wide table scheme is not an option (the product definitions are quite deep) - it's difficult to see a clear way to define the Order Line in such a way that order entry, editing and reporting are practical.
Things I've Tried In the Past
Create a parent table called 'Product' with columns common to Product X and Product Y, then using 'Product' as the reference for the OrderLine table, and creating a FK relationship with 'Product' as the primary side between the tables for Product X and Product Y. This basically places the 'Product' table as the parent of both OrderLine and all the disparate product tables (e.g. Products X and Y). It works fine for order entry, but causes problems with order reporting or editing since the 'Product' record has to track what kind of product it is in order to determine how to join 'Product' to its more detailed child, Product X or Product Y. Advantages: key relationships are preserved. Disadvantages: reporting, editing at the order line/product level.
Create 'Product Type' and 'Product Key' columns at the Order Line level, then use some CASE logic or views to determine the customized product to which the line refers. This is similar to item (1), without the common 'Product' table. I consider it a more "quick and dirty" solution, since it completely does away with foreign keys between order lines and their product definitions. Advantages: quick solution. Disadvantages: same as item (1), plus lost RI.
Homogenize the product definitions by creating a common header table and using key/value pairs for the customized attributes (OrderLine [n] <- [1] Product [1] <- [n] ProductAttribute). Advantages: key relationships are preserved; no ambiguity about product definition. Disadvantages: reporting (retrieving a list of products with their attributes, for instance), data typing of attribute values, performance (fetching product attributes, inserting or updating product attributes etc.)
If anyone else has tried a different strategy with more success, I'd sure like to hear about it.
Thank you.
The first solution you describe is the best if you want to maintain data integrity, and if you have relatively few product types and seldom add new product types. This is the design I'd choose in your situation. Reporting is complex only if your reports need the product-specific attributes. If your reports need only the attributes in the common Products table, it's fine.
The second solution you describe is called "Polymorphic Associations" and it's no good. Your "foreign key" isn't a real foreign key, so you can't use a DRI constraint to ensure data integrity. OO polymorphism doesn't have an analog in the relational model.
The third solution you describe, involving storing an attribute name as a string, is a design called "Entity-Attribute-Value" and you can tell this is a painful and expensive solution. There's no way to ensure data integrity, no way to make one attribute NOT NULL, no way to make sure a given product has a certain set of attributes. No way to restrict one attribute against a lookup table. Many types of aggregate queries become impossible to do in SQL, so you have to write lots of application code to do reports. Use the EAV design only if you must, for instance if you have an unlimited number of product types, the list of attributes may be different on every row, and your schema must accommodate new product types frequently, without code or schema changes.
Another solution is "Single-Table Inheritance." This uses an extremely wide table with a column for every attribute of every product. Leave NULLs in columns that are irrelevant to the product on a given row. This effectively means you can't declare an attribute as NOT NULL (unless it's in the group common to all products). Also, most RDBMS products have a limit on the number of columns in a single table, or the overall width in bytes of a row. So you're limited in the number of product types you can represent this way.
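As a sketch (attribute names borrowed from the hypothetical Product X and Product Y above):

CREATE TABLE Product (
    product_id   INTEGER PRIMARY KEY,
    product_type CHAR(1) NOT NULL,        -- discriminator: 'X' or 'Y'
    name         VARCHAR(100) NOT NULL,   -- common to all products
    -- Product X attributes (left NULL on Product Y rows):
    color                VARCHAR(30),
    material             VARCHAR(30),
    mean_time_to_failure INTEGER,
    -- Product Y attributes (left NULL on Product X rows):
    thickness    DECIMAL(8,2),
    diameter     DECIMAL(8,2),
    power_source VARCHAR(30)
);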
Hybrid solutions exist. For instance, you can store common attributes normally, in columns, but product-specific attributes in an Entity-Attribute-Value table. Or you could store product-specific attributes in some other structured way, like XML or YAML, in a BLOB column of the Products table. But these hybrid solutions suffer because now some attributes must be fetched in a different way from the others.
The ultimate solution for situations like this is to use a semantic data model, using RDF instead of a relational database. This shares some characteristics with EAV but it's much more ambitious. All metadata is stored in the same way as data, so every object is self-describing and you can query the list of attributes for a given product just as you would query data. Special products exist, such as Jena or Sesame, implementing this data model and a special query language that is different from SQL.
There's no magic bullet that you've overlooked.
You have what are sometimes called "disjoint subclasses". There's the superclass (Product) with two subclasses (ProductX) and (ProductY). This is a problem that -- for relational databases -- is Really Hard. [Another hard problem is Bill of Materials. Another hard problem is Graphs of Nodes and Arcs.]
You really want polymorphism, where OrderLine is linked to a subclass of Product, but doesn't know (or care) which specific subclass.
You don't have too many choices for modeling. You've pretty much identified the bad features of each. This is pretty much the whole universe of choices.
Push everything up to the superclass. That's the uni-table approach where you have Product with a discriminator (type="X" and type="Y") and a million columns. The columns of Product are the union of columns in ProductX and ProductY. There will be nulls all over the place because of unused columns.
Push everything down into the subclasses. In this case, you'll need a view which is the union of ProductX and ProductY. That view is what's joined to create a complete order. This is like the first solution, except it's built dynamically and doesn't optimize well.
Join Superclass instance to subclass instance. In this case, the Product table is the intersection of ProductX and ProductY columns. Each Product has a reference to a key either in ProductX or ProductY.
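A sketch of that third choice (names are illustrative; this wires the keys the common way, with each subclass row borrowing the superclass key, though the reference can be held on either side):

CREATE TABLE Product (
    product_id   INTEGER PRIMARY KEY,
    product_type CHAR(1) NOT NULL,       -- discriminator: 'X' or 'Y'
    name         VARCHAR(100) NOT NULL   -- plus other columns common to all products
);

CREATE TABLE ProductX (
    product_id INTEGER PRIMARY KEY REFERENCES Product(product_id),
    color      VARCHAR(30),
    material   VARCHAR(30)
);

CREATE TABLE ProductY (
    product_id INTEGER PRIMARY KEY REFERENCES Product(product_id),
    thickness  DECIMAL(8,2),
    diameter   DECIMAL(8,2)
);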
There isn't really a bold new direction. In the relational database world-view, those are the choices.
If, however, you elect to change the way you build application software, you can get out of this trap. If the application is object-oriented, you can do everything with first-class, polymorphic objects. You have to map from the kind-of-clunky relational processing; this happens twice: once when you fetch stuff from the database to create objects and once when you persist objects back to the database.
The advantage is that you can describe your processing succinctly and correctly. As objects, with subclass relationships.
The disadvantage is that your SQL devolves to simplistic bulk fetches, updates and inserts.
This becomes an advantage when the SQL is isolated into an ORM layer and managed as a kind of trivial implementation detail. Java programmers use iBatis (or Hibernate or TopLink or Cocoon), Python programmers use SQLAlchemy or SQLObject. The ORM does the database fetches and saves; your application directly manipulates Orders, Lines and Products.
This might get you started; it will need some refinement.
Table Product ( id PK, name, price, units_per_package)
Table Product_Attribs (id FK ref Product, AttribName, AttribValue)
Which would allow you to attach a list of attributes to the products. -- This is essentially your option 3
If you know a max number of attributes, You could go
Table Product (id PK, name, price, units_per_package, attrName_1, attrValue_1 ...)
Which would of course de-normalize the database, but make queries easier.
I prefer the first option because
It supports an arbitrary number of attributes.
Attribute names can be stored in another table, and referential integrity enforced so that those damn Canadians don't stick a "colour" in there and break reporting.
Does your product line ever change?
If it does, then creating a table per product will cost you dearly, and the key/value pairs idea will serve you well. That's the kind of direction down which I am naturally drawn.
I would create tables like this:
Attribute(attribute_id, description, is_listed)
-- contains values like "colour", "width", "power source", etc.
-- "is_listed" tells us if we can get a list of valid values:
AttributeValue(attribute_id, value)
-- lists of valid values for different attributes.
Product (product_id, description)
ProductAttribute (product_id, attribute_id)
-- tells us which attributes apply to which products
Order (order_id, etc)
OrderLine (order_id, order_line_id, product_id)
OrderLineProductAttributeValue (order_line_id, attribute_id, value)
-- tells us things like: order line 999 has "colour" of "blue"
The SQL to pull this together is not trivial, but it's not too complex either... and most of it will be write once and keep (either in stored procedures or your data access layer).
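For instance, pulling each order line with its chosen attribute values might look like this (a sketch; it assumes order_line_id is unique on its own):

SELECT
    ol.order_id,
    ol.order_line_id,
    p.description  AS product,
    a.description  AS attribute,
    olav.value
FROM OrderLine ol
INNER JOIN Product p
    ON p.product_id = ol.product_id
INNER JOIN OrderLineProductAttributeValue olav
    ON olav.order_line_id = ol.order_line_id
INNER JOIN Attribute a
    ON a.attribute_id = olav.attribute_id
ORDER BY ol.order_id, ol.order_line_id, a.description;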
We do similar things with a number of types of entity.
Chris and AJ: Thanks for your responses. The product line may change, but I would not term it "volatile".
The reason I dislike the third option is that it comes at the cost of metadata for the product attribute values. It essentially turns columns into rows, losing most of the advantages of the database column in the process (data type, default value, constraints, foreign key relationships etc.)
I've actually been involved in a past project where the product definition was done in this way. We essentially created a full product/product attribute definition system (data types, min/max occurrences, default values, 'required' flags, usage scenarios etc.) The system worked, ultimately, but came with a significant cost in overhead and performance (e.g. materialized views to visualize products, custom "smart" components to represent and validate data entry UI for product definition, another "smart" component to represent the product instance's customizable attributes on the order line, blahblahblah).
Again, thanks for your replies!