I know having DEFAULT NULL columns is not considered good practice, but I have many optional lookup values which are FKs in the system. To work around this, here is what I am doing: I use NOT NULL for every FK / lookup column. In every lookup table, the first row (PK id = 1) is a dummy row with just "none" in all the columns. This way I can use NOT NULL throughout my schema and, where needed, point FKs that have no real lookup value at the "none" row (PK = 1).
Is this a good design, or are there other workarounds?
EDIT:
I have:
Neighborhood table
Postal table.
Every neighborhood has a city, so the FK can be NOT NULL.
But not every postal code belongs to a neighborhood; some do, some don't, depending on the country. So if I use NOT NULL for the FK between postal and neighborhood, I'm stuck, as some value has to be entered. So what I am doing, in essence, is keeping a dummy row in every lookup table just to give such FKs something to link to.
This way row one in neighborhood table will be:
n_id = 1
name = none
etc...
In postal table I can have:
postal_code = 3456A3
FK (city) = Moscow
FK (neighborhood_id) = 1, kept as NOT NULL.
If I don't have a dummy row in the neighborhood lookup table, then I have to declare FK (neighborhood_id) as a DEFAULT NULL column and store blanks in the table. This is just one example, but there is a huge number of values that would then be blank across many tables.
Is this a good design, or are there other workarounds?
ISNULL or COALESCE and LEFT JOIN
Often "None" is an option like any other in a list of options. It may be totally reasonable to have a special row for it; it simplifies things. It may be especially practical if you link other information to options, e.g. a human-readable name.
You can always use left joins to join to postal codes that may not exist.
select * from table_a
left join table_b
on table_a.postalcode_id = table_b.postalcode_id
will select rows whether or not postalcode_id is null. When you use magic numbers to stand in for nulls, queries become less readable.
Clear:
select count(*) from table_a where postalcode_id is null;
Not so clear:
select count(*) from table_a where postalcode_id = 1;
Using nulls forces your queries to handle the null cases explicitly, but it also self-documents your intention that nulls are being handled.
This seems like a simple case of premature optimization in a database:
If your schema is something like this, then I don't see a problem. Some postal codes are in a neighborhood, some aren't. That's a good case for a nullable column.
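As an illustration, here is a minimal sketch of the nullable-FK version of the postal/neighborhood schema from the question (the city table and all column types are my assumptions):

create table city (
city_id int primary key,
name varchar(100) not null
);

create table neighborhood (
n_id int primary key,
name varchar(100) not null
);

create table postal (
postal_code varchar(10) primary key,
city_id int not null references city (city_id), -- every postal code has a city
neighborhood_id int null references neighborhood (n_id) -- NULL simply means "no neighborhood"
);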
The advice about avoiding nulls is about avoiding information that does not belong in the table. For instance, if you had another five columns which only pertained to postalcodes which were in a neighborhood, then those columns would be null for postal codes which were not in a neighborhood. This would be a good reason to have a second, parallel table for postalcodes which were in a neighborhood, which could contain these other five columns.
More importantly, if performance is a concern, then the solution is to try it both ways, test the performance, and see which performs best. This performance consideration would then compete with the simplicity and readability of the design, and performance might win.
An example to illustrate the issue. I started with an Object-Role Modeling model, the same one I used to produce the earlier ER diagram. However, I created a subtype of PostalCode and added two more mandatory roles to the subtype:
This can produce an ER model very similar to the first:
But this model fails to show that there are columns which are mandatory whenever the PostalCode is a NeighborhoodPostalCode. The following model does show that:
I would say that if you have a set of optional columns which are mandatory under certain circumstances, then you should create a "subtype" which always has those columns NOT NULL. However, if you simply have random columns which may randomly be not null, then keep them as NULL columns in the main table.
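A minimal sketch of that subtype pattern, reusing the PostalCode / NeighborhoodPostalCode names from the model above (the column names and types are placeholders):

create table PostalCode (
postal_code varchar(10) primary key
-- columns common to all postal codes go here
);

create table NeighborhoodPostalCode (
postal_code varchar(10) primary key references PostalCode (postal_code), -- shares the supertype's key
neighborhood_id int not null, -- mandatory for this subtype, so never NULL
extra_attribute varchar(50) not null -- hypothetical column that is mandatory only here
);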
Related
I have a scenario where three types of functionality have the same set of fields (except their primary key).
Below is a sample. I would like to know whether it is a better idea to group the common fields in a single table. If we create a common table, how can we give the FK reference to the corresponding primary key table? What would be the better approach?
tblCountry: countryid, Name, officiallanguage, officialFlag, officialFlower, officialAnimal, officialBird, ..., etc.
tblState: StateId, CountryId, officiallanguage, officialFlag, officialFlower, officialAnimal, officialBird, ..., etc.
tblCity: CityId, StateId, officiallanguage, officialFlag, officialFlower, officialAnimal, officialBird, ..., etc.
Your intended third normal form (3NF) is good as it is.
From a simplicity point of view (affecting joins) it is as good as it can be, and the foreign keys between country, state and city are obvious and trivial.
Now, to save yourself from copying some column names, you could put all three elements (country, state and city) into a single table, effectively making it second normal form. With this, the meaning of your column names starts to depend on role: officialLanguage can be country-, state- or city-related, and from the stored table design this is no longer obvious; it can only be recovered by interpreting the multi-column key.
So, in short: by saving on some typing / copying, you complicate all further work, using a single table with convoluted meaning instead of three tables with clear meanings.
As for data selection, ambiguity is an issue only if you don't use aliases.
Consider selecting the officialLanguage of a city together with that of its country.
SELECT
name,
officialLanguage,
name,
officialLanguage
FROM city
INNER JOIN state
ON state.stateid = city.stateid
INNER JOIN country
ON country.countryid = state.countryid
;
This will fail, as the columns chosen are ambiguous.
Now consider this query (where the aliases are shortened just to demonstrate aliasing; personally I try to use aliases of up to 10 letters):
SELECT
cit.name AS city_name,
cit.officialLanguage AS city_language,
cou.name AS country_name,
cou.officialLanguage AS country_language
FROM city AS cit
INNER JOIN state AS sta
ON sta.stateid = cit.stateid
INNER JOIN country AS cou
ON cou.countryid = sta.countryid
;
It is very clear and concise. I can use the country table in different queries without having to pre-filter countries out of a table that intermingles objects like country, state and city.
The only downside to this approach is the multi-table join, though with properly indexed tables the cost is small.
Also, as there are quite a few countries, states and cities across the world, the single-table approach can become a performance issue down the line.
4NF (or at least BCNF, otherwise known as 3.5NF) gives the best join performance when the tables are properly indexed, with the trade-off that joins become more complex to write. For database engines, however, such schemas are the easiest to read.
2NF (or "Excel tables", as I call them) is easiest for humans to read, but requires complicated joins and/or conditions (WHERE clauses) to properly identify just a subset.
For the database design, use at minimum 3NF (or better), then prepare views to turn the data back into 2NF to make it human-readable.
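For example, a view like the following (the view name is mine) joins the three 3NF tables back into one flat, human-readable result:

CREATE VIEW city_overview AS
SELECT
cou.name AS country_name,
cou.officialLanguage AS country_language,
sta.officialLanguage AS state_language,
cit.name AS city_name,
cit.officialLanguage AS city_language
FROM city AS cit
INNER JOIN state AS sta
ON sta.stateid = cit.stateid
INNER JOIN country AS cou
ON cou.countryid = sta.countryid;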
You're asking for advice around when you should consider further normalisation on a schema.
The answer from #KnutBoehnert is quite detailed; my answer is really a supplement to it.
The goal of normalizing the database structure is NOT to reduce duplicated field names to a single reference, but to reduce duplication of the data stored within those fields. A definitive answer for a given schema would normally require a set of records, but your dataset correlates closely enough to real-world norms that we can talk about some hypotheticals.
This is different from inheritance or composition in OO programming, where I would strongly encourage you to encapsulate these fields either as properties on a common base class shared by the objects that represent records from these tables, or as an interface that they all compose into their structure.
Only when Country/State/City commonly have the same values for the duplicated fields is there a strong argument to refactor this structure and introduce a separate table to hold the values of Language, Flag, Flower, Animal and Bird.
If you did create a separate table for this information, how often would the records in it be re-used? For instance, how many different countries are going to reference the same record? Most likely none, as the flag is usually unique for each country; certainly the combination of all those fields will be unique per country. The same is usually true for states. If there is no re-use of the records in this new table, i.e. if the relationship is always 1:1, then the database and the queries are optimised by leaving the fields in the tables the way you have them now.
1:1 relationships do have a place, especially when, for a given conceptual record, there are distinct use cases where one set of fields is updated or queried in isolation from another, and the two sets have very different query rates. On its own, however, without further supporting reasons, a 1:1 relationship in a schema can be simplified by merging the two tables into one record.
If, for instance, all states within a country and all cities within a state were expected to have the same Language, then you would not need the field at the lower levels at all and could use joins to access the Language field from the Country record. But in the physical world there are many countries whose states have different local languages or dialects from the country as a whole, so I don't think this applies to this particular schema.
In fact, I see no reason for your current schema to change; the structure as it is now even allows you to use null values to indicate that the value should be coalesced from the parent level.
In Australia, for instance, all of the State and City records are more than likely to have a Language value of English. We can use null-coalescing statements to avoid having to enter and maintain this value in each State and City record: if all Australian states and cities left the officialLanguage field null, we could still coalesce it from the parent country:
SELECT
country.Name AS country_name,
country.officialLanguage AS country_language,
state.name AS state_name,
ISNULL(state.officialLanguage, country.officialLanguage) AS state_language,
city.name AS city_name,
COALESCE(city.officialLanguage, state.officialLanguage, country.officialLanguage) AS city_language
FROM tblCity AS city
INNER JOIN tblState AS state
ON state.Stateid = city.Stateid
INNER JOIN tblCountry AS country
ON country.Countryid = state.Countryid
Querying the records for just the city table in this way could look like the following. I'm not sure it makes sense to coalesce the other fields, but it could be done:
SELECT
CityId,
city.StateId,
state.CountryId,
...
COALESCE(city.officialLanguage, state.officialLanguage, country.officialLanguage) AS officialLanguage,
COALESCE(city.officialFlag, state.officialFlag, country.officialFlag) AS officialFlag,
COALESCE(city.officialFlower, state.officialFlower, country.officialFlower) AS officialFlower,
COALESCE(city.officialAnimal, state.officialAnimal, country.officialAnimal) AS officialAnimal,
COALESCE(city.officialBird, state.officialBird, country.officialBird) AS officialBird
FROM tblCity AS city
INNER JOIN tblState AS state
ON state.Stateid = city.Stateid
INNER JOIN tblCountry AS country
ON country.Countryid = state.Countryid
In many real-world applications of this null-coalescing concept, the application layer would handle the display of these fields differently, but storage is what we are most concerned with for this question.
The transactional fact table of one of the star schemas needs to answer questions like: is the first application the final application? This is associated with one of the business processes.
Is it a good idea to keep this as part of the fact table, with a column named IsFirstAppLastFlag?
There are not enough flags to justify a separate dimension. Also, this (calculated) flag is essential for report writing. In this context, should it be kept in a dimension or in the fact table?
I assume junk dimensions are meant for flags / low-cardinality columns that are not useful enough to justify their own dimension, so they can be kept together inside one dimension?
This will depend on your own needs, but if you want the purest view of the fact table, then the answer is no: these fields should not be included in your fact table.
The fact table should include dimension keys, degenerate dimension keys, and facts.
IsStatusOne, IsStatusTwo, etc. are attributes and, as you rightly suggest, would be well suited to a junk dimension in the absence of a more suitable dimension for them; e.g., IsWeekDay would be suited to the "Date" dimension table.
You may start off with only a few "Is" attributes in your fact table, but over time you may need more and more of them, and you will look back and possibly wish you had created a junk dimension.
Performance:
Interestingly, if you are using bit columns for your flags, there is little storage difference between eight bit flags in your fact table and one tinyint dimension key. However, when your flags are more verbose or have multiple status values, you should use the junk dimension to improve performance on the fact table: less storage, less memory, more rows per page, etc.
Personally, I would junk them
That seems fine, as long as it is an attribute of the fact, not of one of the dimensions. In some cases I think you might have a slowly changing dimension in which it would be more appropriately placed.
I would be concerned that this plan might require updates on the fact table, for example if you were intending to flag that a particular fact was the most recent for a customer. If that was the case it might be better to keep a transaction number in the fact table, and a "most recent transaction number" in the dimension table, and provide an indexing method to effectively retrieve the most recent per-customer.
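A sketch of that approach, with hypothetical table and column names (fact_sale, dim_customer, transaction_no, most_recent_transaction_no); an index on fact_sale (customer_key, transaction_no) then makes the per-customer lookup efficient:

SELECT f.*
FROM fact_sale AS f
INNER JOIN dim_customer AS c
ON c.customer_key = f.customer_key
AND f.transaction_no = c.most_recent_transaction_no; -- only the most recent fact row per customer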
You can use a junk dimension.
Instead of creating several dimensions with few rows, you can create one dimension with all possible combinations of values, then add just one foreign key to your fact table.
You can populate your junk dimension with a query like the one below.
WITH cteFlags AS
(
SELECT 'N' AS Value
UNION ALL
SELECT 'Y'
)
SELECT
Flag1.Value AS Flag1,
Flag2.Value AS Flag2,
Flag3.Value AS Flag3
FROM
cteFlags Flag1
CROSS JOIN cteFlags Flag2
CROSS JOIN cteFlags Flag3
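If the junk dimension table already exists (hypothetically DimFlags, with an identity surrogate key and columns Flag1, Flag2, Flag3), the same query can populate it directly:

WITH cteFlags AS
(
SELECT 'N' AS Value
UNION ALL
SELECT 'Y'
)
INSERT INTO DimFlags (Flag1, Flag2, Flag3)
SELECT
Flag1.Value,
Flag2.Value,
Flag3.Value
FROM
cteFlags Flag1
CROSS JOIN cteFlags Flag2
CROSS JOIN cteFlags Flag3;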
I have a lot of trouble finding the best design solution for this situation. I have two tables with a common base. Currently I have designed it like this, with an order table as the common base:
[order_table]
order_id
order_type
company
created
I have another table with reference to the order table:
[product_order]
order_id fk
product_id
quantity
price
I have second table with reference to the order table:
[special_order]
order_id fk
description
price_estimate
color
size
Both tables share the same order_id, which I like. I often have to run large queries on order_table using the information available in that table, let's say 'company = 200'. But for each result I also need its data from product_order or special_order, depending on which type it is. So the only workable solution I see is to left join the query with both tables on order_id and filter the information afterwards. The only other option I see is to add the common columns to each table, but then I would have a lot of reorganizing afterwards to get them in the correct order.
Is there a better way to organize the data?
So those extra tables are extra attributes to a specific order-id (1:1)?
I'd consider adding all the fields to the common table, or at least the fields from the most used sub-table.
If that's not appropriate, you may want to add a "Type" column to the common table and let a trigger manage insert/delete of related records, to avoid the fuss with orphans etc.
Use views with your left joins (wouldn't inner be better?) to fetch the different types.
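For example, one view per sub-type (the view names are mine; inner joins as per the parenthetical above):

CREATE VIEW v_product_order AS
SELECT o.order_id, o.order_type, o.company, o.created,
p.product_id, p.quantity, p.price
FROM order_table AS o
INNER JOIN product_order AS p ON p.order_id = o.order_id;

CREATE VIEW v_special_order AS
SELECT o.order_id, o.order_type, o.company, o.created,
s.description, s.price_estimate, s.color, s.size
FROM order_table AS o
INNER JOIN special_order AS s ON s.order_id = o.order_id;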
Are tables with lots of columns indicative of bad design? For example, say I have the following table that stores user information and user settings:
[Users table]
userId
name
address
somesetting1
...
somesetting50
As the site requires more settings, the table gets larger. In my mind this table is normalized: all the settings depend on the userId.
I have a thing against tables with lots of columns; it just seems wrong to me. But then I remembered that you can select which data to return from the table, so if the table is large I could still break it into several different objects in code. For example:
[User object]
[UserSetting object]
and return only the data to fill those objects.
Is the above common practice, or are there other, more suitable techniques for dealing with tables with lots of columns?
I think you should use multiple tables like this:
[Users table]
userId
name
address
[Settings table]
settingId
userId
settingKey
settingValue
The tables are related by the userId column, which you can use to retrieve the settings for whichever user you need.
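Retrieving all settings for one user is then a single query; a sketch, assuming the tables are actually named Users and Settings (without the spaces shown above) and 42 is some user's id:

SELECT u.name, s.settingKey, s.settingValue
FROM Users AS u
INNER JOIN Settings AS s
ON s.userId = u.userId
WHERE u.userId = 42;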
I would say that it is bad table design. If a user doesn't have an entry for 47 of those 50 settings, then you will have a large number of NULLs in the table, which isn't good practice and will also slow down performance (NULLs have to be handled in a special way).
Instead, have the following:
USER TABLE
Id,
FirstName
LastName
etc
SETTINGS
Id,
SettingName
USER SETTINGS
Id,
SettingId,
UserId,
SettingValue
You then have a many-to-many join and eliminate NULLs.
First, don't put spaces in table names! All those [brackets] will be a real pain!
If you have 50 columns, how meaningful will all that data be for each user? Will there be lots of nulls? Most of the data may not even apply to any given user. Think 1:1 tables, where you break the "settings" down into logical groups:
Users: --main table where most values will be stored
userId
name
address
somesetting1 ---please note that I'm using "somesetting1", don't
... --- name the columns like this, use meaningful names!!
somesetting5
UserWidgets --all widget settings for the user
userId
somesetting6
....
somesetting12
UserAccounting --all accounting settings for the user
userId
somesetting13
....
somesetting23
--etc..
You only need one Users row per user, and then a row in each sub-table where that data applies to the given user. If a user doesn't have any widget settings, there's no row for that user. You can LEFT JOIN each table as necessary to get all the settings you need, as shown below. Usually you only work with a subset of settings based on which part of the application is running, which means you won't need to join in all of the tables, just the one or two you need at the time.
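A sketch of such a query, using the table names above (the userId value and specific setting columns are just examples):

SELECT u.userId, u.name,
w.somesetting6, -- NULL when the user has no UserWidgets row
a.somesetting13 -- NULL when the user has no UserAccounting row
FROM Users AS u
LEFT JOIN UserWidgets AS w ON w.userId = u.userId
LEFT JOIN UserAccounting AS a ON a.userId = u.userId
WHERE u.userId = 42;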
You could consider an attributes table. As long as your indexes are good, you shouldn't have too much of a performance issue:
[AttributeDef]
AttributeDefId int (primary key)
GroupKey varchar(50)
ItemKey varchar(50)
...
[AttributeVal]
AttributeValId int (primary key)
AttributeDefId int (FK -> AttributeDef.AttributeDefId)
UserId int (probably FK to users table?)
Val varchar(255)
...
Basically you're "pivoting" your many-column table into two tables with fewer columns. You can write views and table functions around this structure to return the data for a group of related items or just a specific item, etc. You could also add other things to the attribute definition table, such as required data elements and restrictions on the data elements.
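For instance, a sketch of a view that pivots a couple of attributes back into columns (the view name and the GroupKey/ItemKey values are hypothetical):

CREATE VIEW UserDisplaySettings AS
SELECT
v.UserId,
MAX(CASE WHEN d.GroupKey = 'display' AND d.ItemKey = 'theme' THEN v.Val END) AS Theme,
MAX(CASE WHEN d.GroupKey = 'display' AND d.ItemKey = 'language' THEN v.Val END) AS Language
FROM AttributeVal AS v
INNER JOIN AttributeDef AS d
ON d.AttributeDefId = v.AttributeDefId
GROUP BY v.UserId;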
What's your thought on this type of design?
Use several tables with matching indexes to get the best SELECT speed. Use the indexes as a way to relate the information between tables using a JOIN.
I have an event calendar application with a SQL database behind it, and right now I have 3 tables to represent the events:
Table 1: Holiday
Columns: ID, Date, Name, Location, CalendarID
Table 2: Vacation
Columns: Id, Date, Name, PersonId, WorkflowStatus
Table 3: Event
Columns: Id, Date, Name, CalendarID
So I have "generic events", which go into the Event table, and special events like holidays and vacations, which go into their own separate tables. I am debating consolidating these into a single table and just leaving columns like Location and PersonId blank for the generic events.
Table 1: Event
Columns: Id, Date, Name, Location, PersonId, WorkflowStatus
Does anyone see any strong positives or negatives for either option? Obviously there will be records with columns that don't necessarily apply, but there is overlap between these three tables.
Either way you construct it, the application will have to cope with variant types. In such a situation I recommend that you use a single representation in the DBMS, because the alternative requires a multiplicity of queries.
So it becomes a question of where you put the complexity, and even in a huge organization it's really hard to generate enough events to worry about DBMS optimization. Application code is more flexible than hardwired schemata. This is a matter of preference.
If it were my decision, I'd condense them into one table. I'd add a column called "EventType" and set it as you import the data into the new table, to specify the type of event.
That way, you only need to index one table instead of three (if you feel indexes are required), the data is all in one table, and the queries to get the data out would be a little more concise, because you wouldn't need to union all three tables together to see what one person has done. I don't see any downside to having it all in one table (although there is probably one that someone will bring up that I haven't thought of).
How about sub-typing special events to an Event supertype? This way it is easy to later add any new special events.
Data integrity is the biggest downside of putting them in one table. Since these all appear to be fields that would be required, you lose the ability to require them all by default and would have to write a trigger to make sure that data integrity is maintained properly. (Yes, this must be maintained in the database and not, as some people believe, by the application, unless of course you want data integrity problems.)
Another issue is that these are the events you need now; there may be more and more specialized events in the future, and possibly breaking the code for one event type because you added a specialized field that only applies to something else is a big risk. When you make a change to add some required vacation information, will you be sure to check that it doesn't break the application where holidays are concerned? Or worse, that it doesn't silently show information you didn't want instead of erroring out? Are you going to look at the actual screen every time? Unit testing of code alone may not pick up this kind of thing, especially if someone was foolish enough to use SELECT * or fail to specify columns in an insert. And frankly, not every organization has a really thorough automated test process in place (the risk is lower if you do).
I personally would tend to go with Damir Sudarevic's solution: an Event table for all the common fields (making it easy to at least get a list of all events) and specialized tables for the fields not held in common, making it simpler to write code that affects only one event type and allowing the database to maintain its integrity.
Keep them in 3 separate tables and do a UNION ALL in a view if you need to merge the data into one resultset for consumption. How you store the data on disk need not be identical to how you need to consume the data so long as the performance is adequate.
As you have it now, there are no columns that do not apply for any of the presented entities. If you were to merge the 3 tables into one, you'd have to add at least a field to know which columns to expect to be populated, which reduces performance. As it stands, when you query for a holiday alone you go straight to that subset of the data, instead of having to sift through / index a merged storage table to get at the same rows.
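A minimal sketch of such a UNION ALL view over the three tables as defined above (the view name and the EventType discriminator column are my additions):

CREATE VIEW AllEvents AS
SELECT ID AS Id, Date, Name, Location, CalendarID, NULL AS PersonId, NULL AS WorkflowStatus, 'Holiday' AS EventType
FROM Holiday
UNION ALL
SELECT Id, Date, Name, NULL, NULL, PersonId, WorkflowStatus, 'Vacation'
FROM Vacation
UNION ALL
SELECT Id, Date, Name, NULL, CalendarID, NULL, NULL, 'Event'
FROM Event;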
If you did not already have these tables defined you could consider creating one table with the following signature...
create table EventBase (
Id int PRIMARY KEY,
Date date,
Name varchar(50)
)
...and, say, the holiday table with the following signature.
create table holiday (
Id int PRIMARY KEY,
EventId int REFERENCES EventBase (Id),
Location varchar(50),
CalendarId int
)
...and join the two when you need to do so. Choosing between this and the 3 separate tables you already have depends on how you plan to use the tables, and on volume, but I would definitely not throw everything into a single table as-is and make things less clear to someone looking at the table definition without any other introduction.
Or combine the common fields and separate out the unique ones:
Table 1: EventCommon
Columns: EventCommonID, Date, Name
Table 2: EventOrHoliday
Columns: EventCommonID, CalendarID, isHoliday
Table3: Vacation
Columns: EventCommonID, PersonId, WorkflowStatus
with 1->many relationships between EventCommon and the other 2.
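With this layout, fetching, e.g., a vacation together with its common fields is a single join (the key value 42 is just an example):

SELECT c.EventCommonID, c.Date, c.Name, v.PersonId, v.WorkflowStatus
FROM EventCommon AS c
INNER JOIN Vacation AS v
ON v.EventCommonID = c.EventCommonID
WHERE c.EventCommonID = 42;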