How to normalize composite attribute?

How to normalize composite attribute? - database

I need to normalize a relation so that it is in the 1st normal form. I know how to normalize multi-valued attributes, it is just composite attributes that are giving me issues. For example, one of the composite attributes is 'Employee-Address', and as expected, it contains sub-attributes like 'House-Number', 'Street-Name' etc.
How do I normalize this? These composite attributes are not multivalued/complex i.e a single employee may only have 1 address. I also believe the 'employee-id' attribute can be used to identify all of sub-attributes of the address. Is it as simple as breaking up the composite attribute and storing each sub-attribute as its own attribute in the relation? This way all the sub-attributes would become simple, single and stored values?
Before anyone complains; this question is related to a college assignment and I've looked through the entirety of the recommended textbook(and the internet) for the answer, which I have not found. Of course, I'd like a solution to my answer, but if you'd rather give your own example that is great; any advice or pointers are much appreciated!

The only requirement of 1NF is that each attribute contain only a single "atomic" value.
If the question suggests that the address is a composite value and that each part of the address is a separate sub-value, then you should create an attribute for each sub-value.
You probably want to store each part of the address in its own attribute anyway, so you can index them and efficiently run queries like "find everyone in New York City."

Related

DB Design for Choosing One of Multiple Possible Foreign Tables

Say if I have two or more vastly different objects that are each represented by a table in the DB. Call these Article, Book, and so on. Now say I want to add a commentening feature to each of these objects. The comments will behave exactly the same in each object, so ideally I would like to represent them in one table.
However, I don't know a good way to do this. The ways I know how to do this are:
Create a comment table per object. So have Article_comments, Book_comments, and so on. Each will have a foreign key column to the appropriate object.
Create one global comment table. Have a comment_type that references "Book" or "Article". Have a foreign key column per object that is nullable, and use the comment_type to determine which foreign key to use.
Either of the above ways will require a model/db update every time a new object is added. Is there a better way?

There is one other strategy: inherit1 different kinds of "commentable" objects from one common table then connect comments to that table:
All 3 strategies are valid and have their pros and cons:
Separate comment tables are clean but require repetition in DML and possibly client code. Also, it's impossible to enforce a common key on them, unless you employ some form of inheritance, which begs the question: why not go straight for (3) in the first place?
One comment table with multiple FKs will have a lot of NULLs (which may or may not be a problem storage and cache-wise) and requires adding a new column to the comments table whenever a new kind of "commentable" object is added to the database. BTW, you don't necessarily need the comment_type - it can be inferred from what field is non-NULL.
Inheritance is not directly supported by current relational DBMSes, which brings its own set of engineering tradeoffs. On the up side, it could enable easy addition of new kinds of commentable objects without changing the rest of the model.
1 Aka. category, subclassing, generalization hierarchy... For more on inheritance, take a look at "Subtype Relationships" section of ERwin Methods Guide.

I personally think your first option is best, but I'll throw this option in for style points:
Comments have a natural structure to them. You have a first comment, maybe comments about a comment. It's a tree of comments really.
What if you added one field to each object that points to the root of the comment tree. Then you can say, "Retrieve the comment tree for article 123.", and you could take the root and then construct the tree based off the one comment table.
Note: I still like option 1 best. =)

Multivalued dependency example tricky

Employé3: {noEmp, ability, country}
I have this little set of attributes and the following restrictions: Each employee may have some abilities in relation with a certain country. For instance, Alfred can cook Italian and Chinese food and can write in french.
My problem here is I cant decide what DM would be the best solution. I tried use
noEmp,country ->> aptitude, but it bogs me. It says that I can have two tuples with same (noEmp,country), but not necessarily same aptitude. OK!, but is it enough?
I thought about using noEmp->>country,ability, but it doesn't seems to express the relation between the ability and the country.
Of course, all of these DM's are trivial, because it complains all the attributes, so maybe its a silly question...
Just another question: What about the keys? Can I use the DM to determinate it? At first I thought no, because the key must be single. But in this case, I would be forced to use all attributes as keys, what its a little strange, how could I possibly have a 4FN relation if I can't use the DM's to determinate something?

The pair (ability, country) is compound attribute. Let's call it ethnic_ability for the lack of better term. Compound attributes are complex domains, which are flattened into multiple columns of primitive datatypes. Examples: (yyyy_mm_ss_date, hh_mm_daytime), (first_name,last_name), (integer_part_of_real_number,decimal). From DM perspective compound attribute can be considered atomic. Therefore, you have a table with two columns {noEmp, ethnic_ability}, and there is not much what dependency theory can say about binary predicates.

Custom Fields for a Form representing an object

I have an architectural question concerning custom fields in a view for an object. Let's say you have a User Object with some basic information like firstname, lastname, ... that can be used by all customers.
Now, often we get a question from a customer to add couple of custom fields typical for their domain. Our solution now is an xml data column where key value pairs are stored. This has been ok so far, but now we'll have to find a more architectural solution.
For instance, now, a customer wants a dropdown where it can select the value for its custom field. We could still store the selected value in the xml data column, but where do we store all those dropdown values...
I know that in sharepoint you can also add custom fields like dropdowns and I was wondering how to deal with this best. I want to avoid creating custom tables for customers, or having a table with 90 columns (10 basic and then 10 for each customer), ...
You get the idea, it should be generic and be able to deal with all sorts of problems in the future.
What I was thinking about is a Table UserConfiguration where each record has a Foreign Key to the Customer (Channel in our database), then a column FieldName, a column FieldType and a column Values. The column values should be an xml type column, because for a dropdown, we'll need to add multiple values. Also, each value can have extra data attached to it (not just a name). The other problem then is how to store the selected value. I don't like the idea of having foreign keys to xml in my database (read somewhere that Azure can't handle this all to well). Do you just store the name of the value (what if the value were to disappear out of the xml?)?
Any documentation, links on this kind of problems would also be great. I'm trying to find a design pattern that deals with this kind of problem in the database.

I want to answer your question in two parts:
1) Implementing custom fields in a database server
2) Restricting custom fields to an enumeration of values
Although common solutions to 1) are discussed in the question referenced by #Simon, maybe you are looking for a bit of discussion on what the problem is and why it hasn't been solved for us already.
databases are great for structured, typed data
custom fields are inherently less structured
therefore, custom fields are more difficult to work with in a database
some or many of the advantages of using a database are lost
some queries may be more difficult or impossible
type safety may be lost (in the database)
data integrity may no longer be enforced (by the database)
it's a lot more work for the implementers and maintainers
As discussed in the other question, there's no perfect solution.
But these benefits/features still need to be implemented somewhere, and so often the application becomes responsible for data integrity and type safety.
For situations like these, people have created Object-Relation Mapping tools, although, as Jeff Atwood says, even using an ORM could create more problems than it solved. However, you mentioned that it 'should be generic and be able to deal with all sorts of problems in the future' -- this makes me think an ORM might be your best bet.
So, to sum up my answer, this is a known problem with known solutions, none of which are completely satisfactory (because it's so hard). Pick your poison.
To answer the second part of (what I think is) your question:
As mentioned in the linked question, you could implement Entity-Attribute-Value in your database for custom fields, and then add an extra table to hold the legal values for each entity. Then, the attribute/value of the EAV table is a foreign key into the attribute-value table.
For example,
CREATE TABLE `attribute_value` ( -- enumerations go in this table
`attribute` varchar(30),
`value` varchar(30),
PRIMARY KEY (`attribute`, `value`)
);
CREATE TABLE `eav` ( -- now the values of attributes are restricted
`entityid` int,
`attribute` varchar(30),
`value` varchar(30),
PRIMARY KEY (`entityid`, `attribute`),
FOREIGN KEY (`attribute`, `value`) REFERENCES `attribute_value`(`attribute`, `value`)
);
Of course, this solution isn't perfect or complete -- it's only supposed to illustrate the idea. For instance, it uses varchars, and lacks a type column. Also, who gets to decide what the possible values for each attribute are? Can these be changed at any time by the user?

I'm doing something similar for a customer. I've create a JSON FieldType which holds the entire JSON stream of a complex object and a String containing the FQTN (FullQualifiedTypeName) of my C# model class.
By using custom New-, Edit- and Display-Forms we'd ensured that our custom objects are rendered the correct way for best user experience.
To promote fields from the complex C# model to the SharePoint list, we've build something like Microsoft did in InfoPath. Users are able to select Properties or MetaData from the Complex C# type, which will be automatically promoted to the hosting SharePoint list.
The big advantage of JSON is, that its smaller than XML and easier to work with in the web world. (JavaScript...)

When you let the users create the data models, I would recommend looking at an document database or 'NoSQL' since you want exactly that, to store schemaless data structures.
Also, sharePoint stores metadata the way you mentioned (10 columns for text, 5 for dates etc)
That said, in my current project (locked in SharePoint, so Framework 3.5 + SQL Server and all the constraints that follow) we use a somewhat similar structure as below:
Form
Id
Attribute (or Field)
Name
Type (enum) Text, List, Dates, Formulas etc
Hidden (bool)
Mandatory
DefaultValue
Options (for lists)
Readonly
Mask (for SSN etc)
Length (for text fields)
Order
Metadata
FormId
AttributeId
Text (the value for everything but dates)
Date (the value for dates)
Our formulas employ functions such as Increment: INC([attribute1][attribute2], 6) and this would produce something like 000999 for the 999th instance of the combined values for attribute 1 and attribute 2 for a form, this is stored as:
AttributeIncrementFormula
AtributeId
Counter
Token
Other 'formulas' (aka anything non-trivial) such as barcodes are stored as single metadata values. In the actual implementation, we would have something like this:
var form = formRepository.GetById(1);
form.Metadata["firstname"].Value
Value above is a readonly property that decides whether we should get the value from Text or Date and if some additional transform is required. Note that the database here is merely a storage, we hold all the domain complexity in the application.
We also let our customer decide which attribute is the form title for example, so if firstname is the form title, they'll set an in-memory param that spans the entire application to be something like Params.InMemory.TitleAttributeId = <user-defined-id>.
I hope this gives you some insight on a production impl of a similar scenario.

This is really more of a comment than an answer, but I need more space than SO will allow for comments, so here 'tis:
I think your UserConfiguration table approach is good, and would suggest only abstracting the "type" and "value" pieces of your design a bit more:
Since your application will need to validate user input, each notion of "type" will have an associated piece of evaluation logic. Obviously the more of this you can abstract into data the easier it will be to keep your code small. Enumerated lists are a good start, but if your "validator" logic can be extended to handle pattern matching for text strings and Boolean logical expressions (e.g. to describe/enforce constraints on input values), then you can express pretty much any "type" of input that your application may need to handle in terms of (relatively) simple "atoms" that you can map naturally to DB tables.
When storing a user-specified value, you can either store the "raw" data (e.g. in JSON) and a foreign key to the associated "type", or you can add an lookup/cache system that assigns an integer to each new value that is encountered by the system ("novelty" can be checked by checking a hash of the "raw" data, for example). The latter approach obviously scales better if you're expecting lots of data duplication (which of course you would in the case of a multiple-choice menu).

Table and column naming conventions when plural and singular forms are odd or the same

In my search I found mostly arguments for whether to use plurality in database naming conventions, and ways to handle it in either case. I have decided I prefer plural table names, so I don't want to argue that.
I need to represent an animal's species and genus and so on in a database. The plural and singular form for 'species' are the same, and the plural of 'genus' is 'genera'.
I'm using Microsoft's Entity Data Model, by the way. My concern is mainly about whether I'll have problems later on depending on my naming choices.
I think I can get by with:
Table: Genera | Column: Genus
But I'm unsure how I should handle:
Table: Species | Column: Species
If I really wanted to be lazy about this I'd just name them 'species > specie' and 'genuses > genus', but I would prefer to read them in their correct forms.
Any advice would be appreciated.

I would go for Genera/Genus and Species/Species. That's how you say it in English, so why using an incorrect form of the word?

I generally avoid have a column name that is the same as a table name because it can be confusing to human readers. The database engine knows whether it expects a table name or column name in any given context, I don't recall that ever being a problem. (Is there some context where either would be valid? I can't think of one.)
That said, if you run into this issue, it indicates to me that you have a poorly chosen name for one or the other. Species makes good sense as a table name: this table contains data about a species. So if a field in that table is called "species" ... what about the species? Presumably everything in the table is about a species. I'd guess it was probably some sort of identifier and not, say, the number of chromosomes or method of reproduction. But is it an ID number? An abbreviation? The common name? The binomial nomenclature name? Etc. If it's, say, the common name, I'd call it "common_name" and not "species".
By the way, another naming convention you should decide on is whether column names that could be ambiguous if taken out of context should have names that specify the context, or whether you use the table name to eliminate the ambiguity. For example, you could have many things that have a "name". You could call any such field simply "name", and if there's ambiguity, qualify it, like "species.name", "laboratory.name", etc. Or you could give each field a unique name, like "species_name", "laboratory_name", etc. That's one of those questions that I think has no definitively right answer, just pros and cons and make a decision and be consistent.

What is a good KISS description of Boyce-Codd normal form?

What is a KISS (Keep it Simple, Stupid) way to remember what Boyce-Codd normal form is and how to take a unnormalized table and BCNF it?
Wikipedia's info: not terribly helpful for me.

Chris Date's definition is actually quite good, so long as you understand what he means:
Each attribute
Your data must be broken into separate, distinct attributes/columns/values which do not depend on any other attributes. Your full name is an attribute. Your birthdate is an attribute. Your age is not an attribute, it depends on the current date which is not part of your birthdate.
must represent a fact
Each attribute is a single fact, not a collection of facts. Changing one bit in an attribute changes the whole meaning. Your birthdate is a fact. Is your full name a fact? Well, in some cases it is, because if you change your surname your full name is different, right? But to a genealogist you have a surname and a family name, and if you change your surname your family name does not change, so they are separate facts.
about the key,
One attribute is special, it's a key. The key is an attribute that must be unique for all information in your data and must never change. Your full name is not a key because it can change. Your Social Insurance Number is not a key because they get reused. Your SSN plus birthdate is not a key, even if the combination can never be reused, because an attribute cannot be a combination of two facts. A GUID is a key. A number you increment and never reuse is a key.
the whole key,
The key alone must be sufficient [and necessary!] to identify your values; you cannot have the same data represented by different keys, nor can a subset of the key columns be sufficient to identify the fact.
Suppose you had an address book with a GUID key, name and address values. It is OK to have the same name appearing twice with different keys if they represent different people and are not the "same data".
If Mary Jones in accounting changes her name to Mary Smith, Mary Jones in Sales does not change her name as well.
On the other hand, if Mary Smith and John Smith have the same street address and it really is the same place, this is not allowed. You have to create a new key/value pair with the street address and a new key.
You are also not allowed to use the key for this new single street address as a value in the address book since now the same street address key would be represented twice.
Instead, you have to make a third key/value pair with values of the address book key and the street address key; you find a person's street address by matching their book key and address key in this group of values.
and nothing but the key
There must be nothing other than the key that identifies your values. For example, if you are allowed an address of "The Taj Mahal" (assuming there is only one) you are not allowed a city value in the same record,
since if you know the address you would also know the city. This would also open up the possibility of there being more than one Taj Mahal in a different city.
Instead, you have to again create a secondary Location key with unique values like the Taj, the White House in DC, and so on, and their cities.
Or forbid "addresses" that are unique to a city.
So help me, Codd.

Here are some helpful excerpts from the Wikipedia page on Third Normal Form:
Bill Kent defines Third Normal Form this way:
Each non-key attribute "must provide
a fact about the key, the whole key,
and nothing but the key."
Requiring that non-key attributes be
dependent on "the whole key" ensures
that a table is in 2NF; further
requiring that non-key attributes be
dependent on "nothing but the key"
ensures that the table is in 3NF.
Chris Date adapts Kent's mnemonic to define Boyce-Codd Normal Form:
"Each attribute must represent a fact
about the key, the whole key, and
nothing but the key." Here the
requirement is concerned with every
attribute in the table, not just
non-key attributes.
This comes into play when a table has multiple compound candidate keys, and an attribute within one candidate keys has a dependency on a part of another candidate key. Third Normal Form wouldn't prohibit this, because it excludes key attributes. But BCNF applies the rule to key attributes as well.
As for how to make a table satisfy BCNF, you need to represent the extra dependency, with another attribute and possibly by splitting attributes into another table.

I googled "boyce codd normal form" and after wikipedia this is the second result. My textbook gives a very simple definition in terms of relational database management systems:
The left side of every nontrivial FD must be a superkey.
-"Database Systems The Complete Book" by Garcia-Molina, Ullman and Widom.

The best informal answer I've read is that, in BCNF, every "arrow" in every functional dependency is an "arrow" out of a candidate key. I don't recall the source, but it was probably something Chris Date wrote.

Basically Boyce-Codd is "fifth normal form". It is visually recognizable by the existance of "Attributive entities" in the data model, for things like Types (e.g. roles, status, process state, location-type, phone-type, etc).
The attributive entities (sub-subtypes) are lists of finite sets of values that further categorize a class level entity. So you may have a phone-type ('mobile', ' desk', 'VOIP') email account type ('business', 'personal', 'gaming'), role (project manager, data modeler, super model) etc.
Another morphological clue is the existance of super-types, (aka. master-classes, super-classes, meta-entities) such as Parties (subtypes being company, person, etc.).
It's basically Taxonomy gone wild (..no the video is not that exciting) to the atomic or leaf-level; see Bill Karwin's comment above for a more technical explanation.
Boyce-Codd level models are essentially highly detailed logical models, derived from more simplistic business-based conceptual models. **They are typically NOT implemented ver batim in the PHYSICAL model, because PDM optimization for performance (or functional simplicity) may result in the super-types and attributive entities being managed as drop-down lists in UIs, or in behind the scenes logic in the application, or in database constraints and methods to enforce referential integrity. (i.e. they may end up as look-up tables in the PDM schema, or they may be handled by code and not represented in the database).
So - why do them if they may not end up in the PDM? For the same reason you build a good 3NF model before you 'optimize', so that the database structure reflects the real world and is hence more stable than the typical kludges we inherit and have to do heroic acts to make work as our business/clients requirements change.

Often times it is easiest to listen to your gut and this will come naturally. Generally speaking, if you meet 3NF you have met BCNF. This doesn't cover detailed analysis of an ERD or have examples but there are thirteen rules according to Codd. I find it best to follow these rules but always remember there is no one correct way to do things so follow them loosely. So regarding the RDBMS, here are the rules:
http://www.87android.com/12-rules-of-relational-database-model-by-codd/
This may not answer the question directly, but if you are asking about how to get to BCNF or an easy way to remember it then you don't understand normalization well enough. That is of no concern though. Relational databases take many forms and very few are done well. The best thing you can do is know what it means to be relational, follow the rules above, and do not worry about the level of normalization. The process of normalization eliminates the duplication of data. Each level more so by moving into migration of functional dependencies. Keep that in mind and you will be fine, your gut and intellect will do the rest.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight