Normalization: What does "repeating groups" mean?

I have read different tutorials and seen different examples of normalization, especially the notion of "repeating groups" in the first normal form. From them I have gathered that repeating groups are "kind-of" multi-valued attributes (e.g. here and here).
But don't we already make separate tables for each multi-valued attribute, by including foreign keys from the parent table, during the process of mapping an ERM (Entity Relationship Model) to an RDM (Relational Data Model)? Reference: this
Secondly, are those "repeating groups" essentially laid out horizontally in the same row, or can the same value occurring repeatedly in the same column, i.e. the same value of an attribute appearing again and again, also be a repeating group that should be eliminated?
In this example, the value English is repeating again and again. Is this a repeating group?
If I eliminate it to make another table SUBJECT with Subject Name and Module_ID (foreign key), this is what I get. Sure it gets rid of the repeating value, but I am not sure if this is the right thing. Is it right?
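Roughly, the decomposition I am describing looks like this (a sketch only; table and column names are simplified guesses, not an actual schema):

    -- sketch: modules in one table, subject-to-module links in another
    CREATE TABLE subject_module (
        module_id   INT PRIMARY KEY,
        module_name VARCHAR(50) NOT NULL
    );

    CREATE TABLE subject (
        subject_name VARCHAR(50) NOT NULL,
        module_id    INT NOT NULL REFERENCES subject_module (module_id)
    );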

The term "repeating group" originally referred to the concept in CODASYL and COBOL-based languages whereby a single field could contain an array of repeating values. When E. F. Codd described his First Normal Form, that was what he meant by a repeating group. The concept does not exist in any modern relational or SQL-based DBMS.
The term "repeating group" has also come to be used informally and imprecisely by database designers to mean a repeating set of columns, i.e. a collection of columns containing similar kinds of values in a table. This is different from its original meaning in relation to 1NF. For instance, in the case of a table called Families with columns named Parent1, Parent2, Child1, Child2, Child3, ... etc., the collection of ChildN columns is sometimes referred to as a repeating group and assumed to be in violation of 1NF, even though it is not a repeating group in the sense that Codd intended.
This latter sense of a so-called repeating group is not technically a violation of 1NF if each attribute is only single-valued. The attributes themselves do not contain repeating values and therefore there is no violation of 1NF for that reason. Such a design is often considered an anti-pattern however because it constrains the table to a predetermined fixed number of values (maximum N children in a family) and because it forces queries and other business logic to be repeated for each of the columns. In other words it violates the "DRY" principle of design. Because it is generally considered poor design it suits database designers and sometimes even teachers to refer to repeated columns of this kind as a "repeating group" and a violation of the spirit of the First Normal Form.
This informal usage of terminology is slightly unfortunate because it can be a little arbitrary and confusing (when does a set of columns actually constitute a repetition?) and also because it is a distraction from a more fundamental issue, namely the Null problem. All of the Normal Forms are concerned with relations that don't permit the possibility of nulls. If a table permits a null in any column then it doesn't meet the requirements of a relation schema satisfying 1NF. In the case of our Families table, if the Child columns permit nulls (to represent families who have fewer than N children) then the Families table doesn't satisfy 1NF. The possibility of nulls is often forgotten or ignored in normalization exercises but the avoidance of unnecessary nullable columns is one very good reason for avoiding repeating sets of columns, whether or not you call them "repeating groups".
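To make the point about nulls concrete, a sketch comparing the two layouts (hypothetical names and types):

    -- fixed child columns: caps the family at three children and forces nulls
    CREATE TABLE families (
        family_id INT PRIMARY KEY,
        parent1   VARCHAR(50) NOT NULL,
        parent2   VARCHAR(50),
        child1    VARCHAR(50),  -- NULL when the family has fewer children
        child2    VARCHAR(50),
        child3    VARCHAR(50)
    );

    -- one row per child: no cap and no nullable child columns
    CREATE TABLE family_children (
        family_id INT NOT NULL REFERENCES families (family_id),
        child     VARCHAR(50) NOT NULL,
        PRIMARY KEY (family_id, child)
    );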
See also this article.

the value English is repeating again and again. Is this a repeating group?
No. The multiple appearances of English in SUBJECT_MODULE are not a repeating group, or even either of the two things that people mistakenly mean by a repeating group. They are also not evidence of redundancy or lack of normalization. Such multiple appearances might be connected to redundancy or normalization, but they appear all the time when there is no redundancy, at all sorts of levels of normalization.
If SUBJECT_MODULE is rows where "[SUBJECT_NAME] has [MODULE_NAME] identified by [MODULE_ID]" and a subject might have more than one module then somewhere you must have multiple mentions of that subject (perhaps via its name) with mentions of different modules (perhaps by name or id). That would not involve redundancy.
Student  Age  Subject
Adam     15   Biology
Adam     15   Maths
Alex     14   Maths
Stuart   17   Maths
The redundancy in this example, from your question's second "this" link, is not that Adam appears in two rows, or that Adam appears with 15 in two rows. It is that, if the table is rows where "[Student] is [Age] years old and takes [Subject]", then Student (e.g. Adam) can appear in multiple rows but always appears with the same Age (e.g. 15). But if the table were rows where "[Student] has a friend [Age] years old in [Subject]", then the table could be fully normalized already.
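If the first reading is the intended one, i.e. the FD Student -> Age really holds, the usual decomposition would look roughly like this (a sketch with hypothetical names and types):

    -- Student -> Age moves to its own table; enrolments reference it
    CREATE TABLE student (
        student VARCHAR(50) PRIMARY KEY,
        age     INT NOT NULL
    );

    CREATE TABLE enrolment (
        student VARCHAR(50) NOT NULL REFERENCES student (student),
        subject VARCHAR(50) NOT NULL,
        PRIMARY KEY (student, subject)
    );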
Sure it gets rid of the repeating value, but I am not sure if this is the right thing.
It does for your example data, but it might not for other example data. You haven't told us enough. (Anyway as I said above the multiple appearances might not even need normalizing.)
Whether there are any normalization-relevant redundancies in SUBJECT_MODULE or even whether there are any valid decompositions including the one you gave depends on the usual information necessary to normalize to above 1NF. Namely whether some of its columns are functions of others (functional dependencies) and whether its rows are also those where "..." AND "..." (join dependencies).
By giving a possible decomposition you have said that it is also rows where "...[Subject_Name]...[Module_ID]..." AND "...[Module_Name]...[Module_ID]..." And you have given some example decomposition data. But we only know that it could be so decomposed because you added the decomposition. And the decomposition plus data still isn't enough for us to know whether it should be so decomposed.
I have read different tutorials and seen different examples of normalization, especially the notion of "repeating groups" in the first normal form.
"Repeating groups" are something from pre-relational databases and cannot possibly appear in a relational table (relation). They are like a named set of values that is like a field of a record but is not quite. A relational table is always in 1NF. Each column of a row has a single value of the column's type. A non-relational database is "normalized" to tables ie 1NF (first sense of "normalized") which gets rid of repeating groups. Then those tables/relations are "normalized" to higher normal forms (second sense of "normalized").
A relational table having multiple similar columns, or having a column type with multiple similar parts, is merely reminiscent of having a repeating group in a non-relational database. The multiple columns and parts should become multiple rows in a separate table, just like the multiple members of a repeating group. But these problems have to do with the relational quality of a design, not with repeating groups or normalization (in either sense) or being relational (i.e. being in 1NF).
Note that a non-relational database might itself have similar problems with multiple similar fields and/or named sets or with multiple similar parts of values of fields. Normalization to tables does not get rid of these when it gets rid of repeating groups.
Regardless of how they got into a relational design, removing them gives a "better" design. It is just because these design problems are reminiscent of repeating groups that people get confused and imagine that a table could somehow contain a repeating group. So the multiple similar columns, and the values with multiple similar parts (or the parts themselves), get incorrectly called "repeating groups".
See this answer re "atomicity".

Related

Canonical cover is in which normal form?

I want to know which normal form a canonical cover is in. I know that before doing normalization we find the canonical cover, so I think it is in first normal form. But it may be in no normal form at all, since Wikipedia's definition of 1NF is that no row should have a duplicate:
First normal form enforces these criteria:
Eliminate repeating groups in individual tables.
Create a separate table for each set of related data.
Identify each set of related data with a primary key.
Normal forms apply to relations (values and variables), not canonical covers. A cover is a set of FDs (functional dependencies) from which all the FDs in a relation follow. A canonical cover is a cover in a certain form. If you have a canonical cover and a relation's attributes then you can find what normal form the relation is in.
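For a concrete, textbook-style illustration (my example, not from your question): take R(A, B, C) with F = {A -> BC, B -> C, A -> B, AB -> C}.

    A -> BC splits into A -> B and A -> C.
    A -> C is redundant: it follows from A -> B and B -> C.
    In AB -> C, the attribute A is extraneous (since B -> C already holds),
    and the remaining B -> C duplicates an existing FD.
    A canonical cover is therefore {A -> B, B -> C}.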
No row can have a duplicate by definition of "relation". A relation has a set of rows. (SQL tables, though, can have duplicate rows.) No column/attribute can be "multivalued" or a "repeating group". An attribute has a value.
Wikipedia has a lot of nonsense in its relational articles including the one you quote. But from this answer by me:
By definition, a relation's tuple's attribute has a value from a domain. Re: "repeating groups": It can't have any, that's something from pre-relational databases. Re "non-atomic": Codd defined relations as able to have relation-valued domains. He pointed out that the only way that a value could be considered (in the everyday sense) non-atomic in a relational context was to be relation-valued. Ie he defined "atomic" in a relational context to mean not a relation. He defined "normalized" to mean having no relation-valued (ie non-atomic) attributes. (All this in 1970.) Later he defined "1NF" as normalized. And developed "2NF" & "3NF". Then (after Kent & with Boyce) "BCNF". So his use of these terms assumed no relation-valued domains.
But normalization theory is presented independent of domains. Ie it's considered to just be decomposition per problematic JDs [join dependencies]. So "1NF" also gets used for just being a relation. And the other "NFs" get used without regard to domains. (Although if there are relation-valued domains then there can be constraints different from but similar to FDs & JDs that cause different but similar anomalies, and that cause constraints and anomalies in components even after decomposition per all problematic JDs.) Regardless of whether a relation has relation-valued domains and regardless of what one means by "1NF" or "normalized" or "normalization", the decomposition procedure you're following to remove problematic JDs from problematic FDs to what you're calling [a ]NF is independent of domains.
From another:
There is no such thing as a "multivalued attribute" in a relation. A tuple has an attribute value for each attribute name. [...] If you have an attribute that you consider to contain multiple parts, ie you want to generically query about the parts without using operators with parameters of their types, then it is usually good design to have a separate table with attributes for those parts. But that's not addressed by normalization. Any value can be considered to have multiple parts in multiple ways and it is your application/queries that determine when you stop making tables whose attributes are the values of parts of other values and just have an attribute for a value. Similarly, if you have a bunch of attributes that play a similar role (often with similar names) then it is usually good design to have a separate table with just one attribute for the role. But that's not addressed by normalization.
Candidate keys matter to FDs, MVDs, JDs and normalization. PKs [primary keys] don't. You can pick one CK as "the PK" but its primariness is irrelevant to the relational model. It might be relevant to some information modeling method or product.
If you have particular concerns about your relation(s), attributes, attribute types, canonical cover or normalization process then you should explain in your question.

One-to-many with parent database design

Below is a database design which represents my problem (it is not my actual database design). For each city I need to know which restaurants, bars and hotels are available. I think the two designs speak for themselves, but:
First design: create one-to-many relations between city and restaurants, bars and hotels.
Second design: only create a one-to-many relation between city and place.
Which design would be best practice? The second design has fewer relations, but would I be able to get all the restaurants, bars and hotels for a city along with their own data (property_x/y/z)?
Update: this question is going wrong, maybe my fault for not being clear.
the restaurant/bar/hotel classes are subclasses of "place" (in both designs)
the restaurant/bar/hotel classes must have the parent "place"
the restaurant/bar/hotel classes have their own specific data (property_X/Y/Z)
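To make the second design concrete, a rough sketch (names and types simplified, not my actual schema):

    CREATE TABLE city (
        id   INT PRIMARY KEY,
        name VARCHAR(50) NOT NULL
    );

    -- common supertype: every place belongs to a city
    CREATE TABLE place (
        id      INT PRIMARY KEY,
        city_id INT NOT NULL REFERENCES city (id),
        name    VARCHAR(50) NOT NULL
    );

    -- one subtype table per kind of place, keyed by its parent place
    CREATE TABLE hotel (
        place_id   INT PRIMARY KEY REFERENCES place (id),
        property_x VARCHAR(50)
    );
    -- restaurant (property_y) and bar (property_z) follow the same pattern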
Good design first
Your data, and the readability/understandability of your SQL and ERD, are the most important factors to consider. For the purpose of readability:
Put city_id into place. Why: Places are in cities. A hotel is not a place that just happens to be in a city by virtue of being a hotel.
Other design points to consider are how this structure will be extended in the future. Let's compare adding a new subtype:
In design one, you need to add a new table, a relationship to 'place', and a relationship to city.
In design two, you simply add a new table and a relationship to 'place'.
I'd again go with the second design.
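And to the question of whether the second design can still return each subtype with its own data: yes, one join per subtype does it. A sketch against the hypothetical tables above:

    -- all hotels in a given city, with their subtype-specific data
    SELECT p.name, h.property_x
    FROM place p
    INNER JOIN hotel h ON h.place_id = p.id
    WHERE p.city_id = ?;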
Performance second
Now, I'm guessing, but the reason for putting city_id in the subtype is probably that you anticipate that it's more efficient or faster in some specific use cases and this may be a very good reason to ignore readability/understandability. However, until you measure performance on the actual hardware you'll deploy on, you don't know:
Which design is faster
Whether the difference in performance would actually degrade the overall system
Whether other optimization approaches (tuning SQL or database parameters) are actually a better way to handle it.
I would argue that design one is an attempt to physically model the database on an ERD, which is a bad practice.
Premature optimization is the root of a lot of evil in SW Engineering.
Subtype approaches
There are two solutions to implementing subtypes on an ERD:
A common-properties table, and one table per subtype (this is your second model)
A single table with additional columns for subtype properties.
In the single-table approach, you would have:
A subtype column, TYPE INT NOT NULL. This specifies whether the row is a restaurant, bar or hotel
Extra columns property_X, property_Y and property_Z on place.
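A sketch of that single-table layout (hypothetical names; the TYPE codes are arbitrary):

    CREATE TABLE place (
        id         INT PRIMARY KEY,
        city_id    INT NOT NULL,
        name       VARCHAR(50) NOT NULL,
        type       INT NOT NULL,   -- e.g. 1 = restaurant, 2 = bar, 3 = hotel
        property_x VARCHAR(50),    -- nullable: only meaningful for some types
        property_y VARCHAR(50),
        property_z VARCHAR(50)
    );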
Here is a quick table of pros and cons:
Disadvantages of a single-table approach:
The extension columns (X, Y, Z) cannot be declared NOT NULL in a single-table approach. You can implement row-level constraints instead (see the sketch after this list), but you lose the simplicity and visibility of a plain NOT NULL
The single table is very wide and sparse, especially as you add additional subtypes. You may hit the max. number of columns on some databases. This can make this design quite wasteful.
To query a list of a specific subtype, you have to filter using a WHERE TYPE = ? clause, whereas the table-per-subtype approach gives a much more natural `FROM HOTEL INNER JOIN PLACE ON HOTEL.PLACE_ID = PLACE.ID`
IMHO, mapping into classes in an object-oriented language is harder and less obvious. Consider avoiding this if the DB is going to be mapped by Hibernate, Entity Beans or similar
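Regarding the first disadvantage, a row-level constraint of the kind mentioned might look like this (a sketch only, assuming the hypothetical single-table layout above):

    -- stand-in for NOT NULL: hotels (type 3 here) must supply property_x
    ALTER TABLE place
        ADD CONSTRAINT hotel_requires_property_x
        CHECK (type <> 3 OR property_x IS NOT NULL);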
Advantages of a single-table approach:
By consolidating into a single table, there are no joins, so queries and CRUD operations are more efficient (but is this small difference going to cause problems?)
Queries for different types are parameterized (WHERE TYPE = ?) and therefore more controllable in code rather than in the SQL itself (FROM PLACE INNER JOIN HOTEL ON PLACE.ID = HOTEL.PLACE_ID).
There is no best design, you have to pick based on the type of SQL and CRUD operations you are doing most frequently, and possibly on performance (but see above for a general warning).
Advice
All things being equal, I would advise defaulting to your second design. But if you have an overriding concern such as those I listed above, do choose another implementation. But don't optimize prematurely.
Both of them and none of them at all.
If I had to choose one, I would keep the second one, because of the number of foreign keys and indexes that would need to be created afterwards. But a better approach would be: create a table with all kinds of places (bars, restaurants, and so on) and assign to each row a column with a value for the type of the place (applying a COMPRESS clause with the types expected for the column). It would improve both performance and the readability of the structure, as well as being easier to maintain. Hope this helps. :-)
You do not show alternate columns in any of the sub-place tables. I think you should not split type data into table names like 'bar', 'restaurant', etc. - these should be types inside the place table.
I think, further, you should have an address table - one column of which is city. Then each place has an address and you can easily group by city when needed (or by state, zip code, country, etc.).
I think the best option is the second one. In the first design, there is a possibility of data errors as one place can be assigned to a particular restaurant (or any other type) in one city (e.g. A) and at the same time can be assigned to another restaurant in a different city (e.g. B). In the second design, a place is always bound to a particular city.
First:
Both designs can get you all the appropriate data.
Second:
If all extending classes are going to implement the location (which sounds obvious for your implementation), then it would be better practice to include it as part of the parent object. This would suggest option 2.
Thingy:
The thing is that even though you can find out the type of each particular PLACE, it is easier to just know that a type (CHILD) is always a place (PARENT). You can think of that while you visualize the result-set of option 2. With that in mind, I recommend the first approach.
NOTE:
The first one doesn't have more relations; it just splits them.
If bar, restaurant and hotel have different sets of attributes, then they are different entities and should be represented by 3 different tables. But why do you need the place table? My advice is to ditch it and have 3 tables for your 3 entities, and that's that.
In code, collecting common attributes into a parent class is more organised and efficient than repeating them in each child class - of course. But as spathirana comments above, database design is not like OOP. Sure, you'll save on the repetition of column names by sticking common attributes of places into a "place" table. But it will also add complication:
- you have to join on that table whenever you want to reference a bar, restaurant or hotel
- you have to insert into two tables whenever you want to add a new bar, restaurant or hotel
- you have to update two tables when ... etc.
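As a concrete illustration of the last two points (hypothetical names and values; transaction syntax varies by DBMS):

    -- adding one hotel touches two tables, ideally in one transaction
    BEGIN;
    INSERT INTO place (id, city_id, name) VALUES (42, 7, 'Grand Hotel');
    INSERT INTO hotel (place_id, property_x) VALUES (42, 'rooftop pool');
    COMMIT;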
Having 3 tables without the place table is ALSO, PROBABLY, the most performance-optimal design. But that's not where I'm coming from. I'm thinking of clean, simple database design where a single entity means a single row in a single table. There are no "is-a" relationships in a relational DB. Foreign key relationships are "has-a". OK, there are exceptions I'm sure, but your case is not exceptional.

Fact table with multiple facts

I have a dimension (SiteItem) that has two important facts:
perUserClicks
perBrowserClicks
However, within this dimension, I have groups of values based on an attribute column (let's call the groups AboveFoldItems, LeftNavItems, OnTheFlyItems, etc.); each group has more facts that are specific to it:
AboveFoldItems: eyeTime, loadTime
LeftNavItems: mouseOverTime
OnTheFlyItems: doesn't have any extra, but may in the future
Is the following fact table schema ok?
DateKey
SessionKey
SiteItemKey
perUserClicks
perBrowserClicks
eyeTime
loadTime
mouseOverTime
It seems a little wasteful since only some columns pertain to some dimension keys (the irrelevant facts are left NULL). But... this seems like it would be a common problem, so there should be a common solution for this, right?
I'm generally in agreement with Damir's answer on this, but because the fact table is very narrow in your particular case, there is still merit to Aaron's advocacy of keeping the NULLs.
We have several star schemas in particular subject areas with multiple fact tables that share most (if not all) of the dimensions (conformed and internal). The limited-scope dimensions are not considered "conformed" across the enterprise, but they are what we would call "shared internal" dimensions.
Now typically, if the data is loaded contemporaneously so that the dimension hasn't changed, you can join both fact tables on the keys, but in general, of course, you cannot join two different star schemas on the dimension keys if they are surrogates in traditional slowly changing dimensions. In general, you have to join separate stars on the natural keys or "business keys" within the dimension and not on surrogates (except usually in the special case of the date dimension where it is unchanging and only has a natural key).
Note that when you do join the two stars, you have to use a LEFT JOIN, in which case you WILL produce NULLs which you will still probably have to take account of - so you're actually getting back to the original model you had with NULLs! ;-)
The benefit of the extra fact table is more obvious when your tables are wide with a smaller set of keys and the vertical partitioning of the data produces space savings as well as a cleaner logical model - this is especially true when the keys are only really shared up to a point - having one dummy key or NULL key is definitely not a good idea - this usually points to a dimensional modeling problem.
However, as Aaron says, if you push it to extremes, you can have a single fact column in each fact table with shared keys, which means the key overhead dwarfs the fact cost and you really do end up in a disguised EAV model.
I would also look to see if you are in Kimball's situation of "too few dimensions". Seems like you must have good dimensional attributes lumped into the SessionKey and SiteItemKey - but without seeing your entire model and requirements, it's hard to say, but I would think you would have some user demographics in a low-cardinality or even snowflake dimension without the full Session or Site dimension.
There isn't an elegant solution, really: you either have nullable columns or you use an EAV solution. I posted about EAV before (and generated a lot of comments that might be worth reading):
What is so bad about EAV, anyway?
I am a fan of that model in some scenarios, but if your dimensions/attributes do not change frequently, it can be a lot of extra work for nothing. NULL values in a column do not really make waste as long as the surrounding code can deal with them appropriately.
You could have more than one fact table: factPerUserClicks, factPerBrowserClicks, factEyeTime, etc.
Each of these would have DateKey, SessionKey, SiteItemKey. This way only dimension keys that "make sense" appear with each fact.
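A sketch of one such narrow fact table (column types assumed):

    CREATE TABLE factEyeTime (
        DateKey     INT NOT NULL,
        SessionKey  INT NOT NULL,
        SiteItemKey INT NOT NULL,
        eyeTime     INT NOT NULL,  -- no nullable facts: a row exists only when measured
        PRIMARY KEY (DateKey, SessionKey, SiteItemKey)
    );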
Ideally, there should be no NULLs in the DW -- if you keep them in the same fact table, using zeros may be more appropriate.
As far as saving disk space, I do not see an ideal solution -- but, in a DW one is supposed to trade space for speed and (query) simplicity anyway.

What exactly does database normalization do?

Supposedly normalization reduces redundancy of data and increases performance. What is the reason for dividing the master table into other small tables, applying relationships between them, and retrieving the data using all possible unions, subqueries, joins, etc.? Why can't we have all the data in a single table and retrieve it as required?
The main reason is to eliminate repetition of data. For example, if you had a user with multiple addresses and you stored this information in a single table, the user information would be duplicated along with each address entry. Normalisation would separate the addresses into their own table and then link the two using keys. This way you wouldn't need to duplicate the user data, and your db structure becomes a little cleaner.
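A minimal sketch of that split (hypothetical names and types; "user" is a reserved word in some databases, hence app_user):

    -- user data stored once; each address row points back to its user
    CREATE TABLE app_user (
        id   INT PRIMARY KEY,
        name VARCHAR(100) NOT NULL
    );

    CREATE TABLE address (
        id      INT PRIMARY KEY,
        user_id INT NOT NULL REFERENCES app_user (id),
        line1   VARCHAR(100) NOT NULL,
        city    VARCHAR(50)  NOT NULL
    );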
Full normalisation will generally not improve performance; in fact, it can often make it worse, but it will keep your data duplicate-free. In fact, in some special cases I've denormalised some specific data in order to get a performance increase.
Normalization comes from the mathematical concept of being "normal." Another word would be "perpendicular." Imagine a regular two-axis coordinate system. Moving up just changes the y coordinate, moving to the side just changes the x coordinate. So every movement can be broken down into a sideways and an up-down movement. These two are independent of each other.
Normalization in databases essentially means the same thing: if you change a piece of data, this should change just one single piece of information in the database. Imagine a database of e-mails: if you store the ID and the name of the recipient in the Mails table, but the Users table also associates the name with the ID, then if you change a user name you don't only have to change it in the Users table, but also in every single message that this user is involved with. So, the axis "message" and the axis "user" are not "perpendicular" or "normal".
If, on the other hand, the Mails table only has the user ID, any change to the user name will automatically apply to all the messages, because on retrieval of a message all user information is gathered from the Users table (by means of a join).
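A sketch of that retrieval (assuming a hypothetical mails table holding only the recipient's ID):

    -- the recipient's current name is picked up from users at query time
    SELECT m.id, u.name AS recipient_name
    FROM mails m
    INNER JOIN users u ON u.id = m.recipient_id;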
Database normalisation is, at its simplest, a way to minimise data redundancy. To achieve that, certain forms of normalisation exist.
First normal form can be summarised as:
no repeating groups in single tables.
separate tables for related information.
all items in a table related to the primary key.
Second normal form adds another restriction, basically that every column not part of a candidate key must be dependent on every candidate key (a candidate key being defined as a minimal set of columns which cannot be duplicated in the table).
And third normal form goes a little further, in that every column not part of a candidate key must not be dependent on any other non-candidate-key column. In other words, it can depend only on the candidate keys. This leads to the saying that 3NF depends on the key, the whole key and nothing but the key, so help me Codd¹.
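As a quick illustration of a 3NF violation (a hypothetical table, not from your question):

    -- customer_city depends on customer_id, a non-key column, rather than
    -- on the key order_id alone: a transitive dependency
    CREATE TABLE orders (
        order_id      INT PRIMARY KEY,
        customer_id   INT NOT NULL,
        customer_city VARCHAR(50) NOT NULL
    );
    -- 3NF fix: move (customer_id, customer_city) into a customer table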
Note that the above explanations are tailored toward your question rather than database theorists, so the descriptions are necessarily simplified (and I've used phrases like "summarised as" and "basically").
The field of database theory is a complex one and, if you truly wish to understand it, you'll eventually have to get to the science behind it. But, in terms of your question, hopefully this will be adequate.
Normalization is a valuable tool in ensuring we don't have redundant data (which becomes a real problem if the two redundant areas get out of sync). It doesn't generally increase performance.
In fact, although all databases should start in 3NF, it's sometimes acceptable to drop to 2NF for performance gains, provided you're aware of, and mitigate, the potential problems.
And be aware that there are also "higher" levels of normalisation such as (obviously) fourth, fifth and sixth, but also Boyce-Codd and some others I can't remember off the top of my head. In the vast majority of cases, 3NF should be more than enough.
¹ If you don't know who Edgar Codd (or Christopher Date, for that matter) is, you should probably research them; they're the fathers of relational database theory.
We use normalization to reduce the chances of anomalies that may arise as a result of data insertion, deletion, or updates. Normalization doesn't necessarily increase performance.
There is much material on the internet, so I won't repeat the stuff here again. But you can have a look at:
Normalization rules
Anomalies
(others as well)
As well as all the above, it just makes a certain sense. Say you have a user and you want to record what kind of car they have.
Put that all in one table and then you're fine, until someone owns two cars... You're then going to need two rows for that person, and a way of making sure that you can link those two rows together...
And then what if you also want to record how many dogs they have? Same table with lots of confusing dups? Another table with your own custom logic to manage unique users?
Normalization keeps you away from a lot of these problems...

Can I have non-measure codes mixed with measures in my fact table?

We're doing a complex bit of data accumulation. Our customer sends us some stuff that includes two dimensions (time and a business unit). Time is mostly year-month. The business unit dimension has just a few attributes: a name, and a few categories to which BU's can belong for reporting and analysis purposes.
The stuff they send us includes some current state information (dates and codes). These seem fact-like. They also send some information that characterizes the relationship with the business unit (mostly additional codes). Again, these are unique to the business unit and time period.
Finally, they send us stuff that is clearly additive facts. It includes currency and counts that have proper units.
Should I commingle this qualitative information in a single fact table with the additive facts? Or should I separate the qualitative stuff (which can only be used with counts) from the quantitative stuff (which can be used with sum)?
Only put things in the fact table if they are degenerate (causing high-cardinality/uniqueness problems in your dimension, where they take the dimension to a 1-1 relationship with the fact table). Kimball recommends avoiding the temptation to put anything but degenerate dimensions in with the facts (a unique order number, for instance).
You can always put these in what Kimball calls a "junk" dimension. All those codes can simply be lumped into a junk dimension. Most dates would go in the fact table as keys into your date dimension in a particular role (usually with a natural int key of the form YYYYMMDD - one of the only times we don't use a non-identity meaningless surrogate key)
I like to naively view the star as all the facts and then which columns go into which dimensions is simply determined by convenience. One should not necessarily view them as corresponding to a particular business entity - remember, the star is not an ERD-style normalized OLTP database.
If the data is both directly related to the additive fact and is not something you want to be grouping/sorting/search on, then putting it in the fact table is okay.
Be aware, though, that non-additive data in the fact table will either prevent roll-ups or will become a lossy operation.
Brad Wilson accurately describes the risk of adding them to your fact table. In the past, I've added junk attributes to my fact table only to require refactoring later.
The stuff they send us includes some current state information (dates and codes). These seem fact-like. They also send some information that characterizes the relationship with the business unit (mostly additional codes). Again, these are unique to the business unit and time period.
What business purpose do the dates serve? Offhand, I'd recommend making these their own dimensions and describing them accurately.
How volatile are the extra codes that come in? If the grain of your fact table is date and BU, why can't they be included in the BU dimension and treated as slowly changing attributes?
Without more details I can't make a firm recommendation but these would be the first questions I'd ask myself.
