Specific Database Normalization - Author / Documents - database

I'm familiar with database normalization techniques, but I'm struggling with a certain scenario. I have nearly 100 documents, each having at least one author. I'd like to create functionality where users can ask for documents specific to authors, specific to years, or simply show all documents ever. The part I'm struggling with is the request for documents according to an author or authors.
Here's a breakdown of my relations: Author has many documents, document has many authors, document has one year, document has one title. That's what I have to work with.
Here are my relational tables thus far:
Author Table
+----+-------------+
| id | name        |
+----+-------------+
| 1  | David Noyce |
+----+-------------+
Document Table
+----+-----------------+-------+------+------+
| id | title           | BLOB  | Year | ???? |
+----+-----------------+-------+------+------+
| 1  | Car Crashes, WI | stuff | 2013 | ?    |
+----+-----------------+-------+------+------+
I'm assuming I want author id(s) to be where my ???? is in my document table, and if each document only had one author, that would probably work nicely, but that's not the case. What's the best way to relate a document to many authors?

You just need to add a junction table that allows there to be a one-to-many or many-to-many relationship between the documents and authors.
documentAuthor
--------------
documentId
authorId
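
A minimal sketch of that junction table, using Python's built-in sqlite3 module so it can be run as-is. The column types, the second author, the second document, and the helper name titles_by_author are assumptions made for illustration; the BLOB column is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE document (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    year  INTEGER NOT NULL
);
-- Junction table: one row per (document, author) pair.
CREATE TABLE documentAuthor (
    documentId INTEGER NOT NULL REFERENCES document(id),
    authorId   INTEGER NOT NULL REFERENCES author(id),
    PRIMARY KEY (documentId, authorId)
);
-- Sample rows; everything except the first author and document is made up.
INSERT INTO author VALUES (1, 'David Noyce'), (2, 'Second Author');
INSERT INTO document VALUES (1, 'Car Crashes, WI', 2013),
                            (2, 'Another Report', 2014);
INSERT INTO documentAuthor VALUES (1, 1), (1, 2), (2, 2);
""")


def titles_by_author(name):
    """Return the titles of all documents written by the named author."""
    rows = conn.execute(
        """SELECT d.title
           FROM document d
           JOIN documentAuthor da ON da.documentId = d.id
           JOIN author a ON a.id = da.authorId
           WHERE a.name = ?
           ORDER BY d.id""",
        (name,),
    ).fetchall()
    return [title for (title,) in rows]


print(titles_by_author("David Noyce"))
```

Filtering by year or listing every document only touches the document table, so the junction table is needed only for the author queries.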


EAV database design for model with multiple attributes (other models)

I have a model that possesses a lot of attributes with multiple values that are either representations of lists or other models. My research led me to consider an Entity-Attribute-Value design to represent such but I have seen more discouragement from more knowledgeable people than recommendations.
One that sticks to me is this comment:
In a nutshell, EAV is useful when your list of attributes is frequently growing, or when it's so large that most rows would be filled with mostly NULLs if you made every attribute a column. It becomes an anti-pattern when used outside of that context.
by Karl Bielefeldt.
Basically my model is student_report. It has the following attributes based on the actual form:
id
creator
revision history
department
references
funding (optional, variable/not fixed)
comments
objectives (paragraph)
scope (paragraph)
creator, revision history, department, references, funding and comments are other models that this form will rely on.
My initial plan is to create student_report with only the following:
id
id of creator
objectives
other paragraph-style content
while the others (revision history, department, references, funding and comments) will possess the foreign key student_report_id.
For the variable/not fixed models such as references and funding, I plan to use a mediator table to connect student_report to the "list" of those to normalize the DB:
student_report
| id | name |
|----|-----------------|
| 1 | Abraham Smith |
| 2 | Betty Gladstone |
| 3 | Chen Hong |
references
| id | name |
|----|--------------|
| 1 | Reference 1 |
| 2 | Reference 2 |
| 10 | Reference 10 |
report_references
| user_id | reference_id |
|---------|--------------|
| 1 | 2 |
| 1 | 3 |
| 2 | 10 |
Is my proposed solution enough? This will be a small-scale project and I doubt it will see hundreds of uses a day.
EAV helps you capture data when the data model is not well understood. It allows you to skip over data analysis and to come up with a single design that is so adaptable that it will handle a body of data no matter what the actual structure of the data is.
But there's a downside. Since you haven't analyzed the data at storage time, you have to analyze the data when you go to retrieve it and turn it into something useful, such as a report or an extract. Otherwise your results are meaningless. This downside can, in some circumstances, be much larger than the upside you experienced earlier.
In your case, you seem to have a good understanding of the attributes you want to store, and the semantics of those attributes. It also looks unlikely that the attribute list is going to have to expand based on surprises.
So I advise you to avoid EAV, and instead concentrate on how to compose relations out of attributes. Relations are simply collections of attributes, grouped together in some way that is meaningful and useful. You can read books about this topic, if you care to.
In SQL, tables represent relations. Tables have rows, which represent tuples. Tables have columns, which represent attributes. The intersection of a row and a column provides a location where a value can be stored. In 1NF, each location stores a "simple" value.
Your design looks pretty good to me. I think it will serve you better than an EAV model would. I don't know whether it's completely normalized, and I'm not sure it has to be.
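
To make the non-EAV route a bit more concrete, here is a rough sketch of how the optional funding list could live in a child table that carries student_report_id. The column names objectives, scope and source, and all of the sample rows, are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
CREATE TABLE student_report (
    id         INTEGER PRIMARY KEY,
    objectives TEXT,
    scope      TEXT
);
-- One-to-many: each funding row points back at its report.
-- A report without funding simply has no rows here, so no NULL columns.
CREATE TABLE funding (
    id                INTEGER PRIMARY KEY,
    student_report_id INTEGER NOT NULL REFERENCES student_report(id),
    source            TEXT NOT NULL
);
INSERT INTO student_report VALUES (1, 'Objectives A', 'Scope A'),
                                  (2, 'Objectives B', 'Scope B');
INSERT INTO funding (student_report_id, source)
VALUES (1, 'Grant X'), (1, 'Grant Y');
""")

# Count funding rows per report; LEFT JOIN keeps reports that have none.
funding_counts = conn.execute("""
    SELECT r.id, COUNT(f.id)
    FROM student_report r
    LEFT JOIN funding f ON f.student_report_id = r.id
    GROUP BY r.id
    ORDER BY r.id
""").fetchall()
print(funding_counts)
```

Every attribute stays an ordinary typed column, which is the main thing an EAV layout would give up.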

How to differentiate a many-to-many relationships while designing a database?

I have to design a database, and I am identifying entities and their relationships. But every relationship seems to be many-to-many. For instance, in my case:
1) A staff member manages clients
Here a staff member can manage zero or more clients. Similarly, a client is managed by one or more staff members.
2) A client orders to buy a stock
Here a client can order zero, one or more stock to buy and a stock can be ordered by zero, one or more client.
3) A client orders to sell a stock
Here a client can order zero, one or more stock to sell and a stock can be ordered by zero, one or more client to sell.
These are some examples from my situation, and I am confused about how to separate these relationships. There are numerous other cases like these in my scenario, and I am having difficulty conceptualizing the design.
So, please enlighten me regarding my situation.
It seems like there's quite a lot to the system you are developing and presumably there are requirements you haven't mentioned so it isn't really possible to come up with a complete answer. But hopefully some of the following will help you to "conceptualize the design" as you describe it.
1) This is a very common scenario and there's pretty much a standard way of dealing with these many-to-many relationships.
If there are 2 entities A and B with a many-to-many relationship then you would normally introduce an entity C that consists of 2 columns - one a foreign key to the unique id of A and the other a foreign key to the unique id of B. And you would remove the foreign key column in entity A pointing to B and vice versa.
i.e.
|-----|
|  A  |
|-----|
  \|/
   |
   |
  /|\
|-----|
|  B  |
|-----|
becomes:
|-----|       |-----|
|  A  |       |  B  |
|-----|       |-----|
   |             |
   |             |
   |             |
  /|\           /|\
|-------------------|
|         C         |
|-------------------|
The main challenge is often what to call these new entities! Sometimes they might just be something like a_b_relationship, but it's good if you can identify more meaningful names.
2) It looks like you need to do a bit more analysis to identify all the actual entities. One way of doing this is to go through your description of the system and identify the nouns - often if there's a noun in the description it's appropriate to have an entity in the entity-relationship diagram.
"Order" jumps out as a noun you overlooked.
Typically for order-processing you would have 2 entities - the order that contains the date, total value, customer etc, and a child orderline which identifies how many of which product have been ordered and individual prices. So in ecommerce a shopping cart would be the order and each item in the shopping cart would be an orderline record.
In your scenario we'd have:
|----------|           |-----------|
|  client  |           |  product  |
|----------|           |-----------|
      |                      |
      |                      |
      |                      |
     /|\                    /|\
|-------------|      /|-------------|
|    order    |-------|  orderline  |
|-------------|      \|-------------|
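
As a rough illustration of the order/orderline split, here is a small sqlite3 sketch. The table name client_order, the columns order_date, quantity and unit_price, and the sample rows are all assumptions, not part of the original requirements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE client  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- ORDER is a reserved word in SQL, so the table is called client_order here.
CREATE TABLE client_order (
    id         INTEGER PRIMARY KEY,
    client_id  INTEGER NOT NULL REFERENCES client(id),
    order_date TEXT NOT NULL
);
-- orderline resolves the many-to-many between orders and products.
CREATE TABLE orderline (
    order_id   INTEGER NOT NULL REFERENCES client_order(id),
    product_id INTEGER NOT NULL REFERENCES product(id),
    quantity   INTEGER NOT NULL,
    unit_price REAL NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
INSERT INTO client VALUES (1, 'Client A');
INSERT INTO product VALUES (1, 'Stock X'), (2, 'Stock Y');
INSERT INTO client_order VALUES (1, 1, '2013-01-15');
INSERT INTO orderline VALUES (1, 1, 10, 2.5), (1, 2, 4, 5.0);
""")

# The order total is derived from its lines rather than stored twice.
order_total = conn.execute(
    "SELECT SUM(quantity * unit_price) FROM orderline WHERE order_id = ?",
    (1,),
).fetchone()[0]
print(order_total)
```

The client never links to a product directly; it reaches products only through its orders and their lines.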
3) Client sells many products
Here you are identifying an additional role for a client and what I'd do here is question whether "client" is an appropriate entity at this stage. You may find it easier to think in terms of "buyer" and "seller" until the first-cut design is understood. If buyer and seller have a lot in common (especially if an individual can be both a buyer and a seller) then you may decide to use a single entity eventually. Your ERD tool may provide support for this - have a search for "subtype entities" or "entity subtypes".
The specifics will depend on your actual application but it could be that each orderline should have a relationship to the seller, and the order a relationship to the buyer. This will depend on whether it is possible for example for a buyer to order a number of items of a particular product, some of which are sourced from one seller and some from another. It could get complicated!
Also, it might be helpful to consider whether you need to record a seller's stock prior to it being sold. Here it might be useful to distinguish between "product" and "stock", e.g.
|----------|     |-----------|
|  seller  |     |  product  |
|----------|     |-----------|
      |                |
      |                |
      |                |
     /|\              /|\
|----------------------------|
|           stock            |
|----------------------------|
As a general comment I'd say it really can help to go through the design process step by step. So once you have got your initial model, assign all the data items you need to store to the appropriate entity, and methodically make sure that the design is in first normal form, then second normal form then 3rd normal form. Only once you have done this, and are confident that the design reflects the requirements, should you think about how to implement the design in a database. That's what I learned many years ago anyway!
This question is hard to answer, because everything in design is situational. If you really need to record which staff members manage a client, and a client can be managed by many staff members, then yes, the relationship is many-to-many. Keep in mind that there are many relationships between entities in the real world; you should store only the ones that are important and necessary.
As another example, if your stock holds the available count of a kind of goods, then the relationship between client and stock is many-to-many too.
Note: Don't use plural nouns for your table names; it makes the relationships confusing.
Edit: To implement a many-to-many relationship in your database tables, you need a mediator table. For example, for Customer and Product tables, you would create a table named CustomerProduct (or whatever you like). The CustomerProduct table contains two foreign keys, one to the Customer table and one to the Product table. Usually (not always), one many-to-many relationship breaks down into two many-to-one relationships.

Database schema for multiple checkboxes

I currently have a Users table and now want to add other user-related information for a particular user. The form that accepts this information has fields like languages and OS each with a list of options with checkboxes.
For Example:
Languages known:
checkbox PHP,
Java,
Ruby
OS knowledge:
Windows,
Linux,
Mac
Currently my database tables looks like this:
USER
--------------
| ID | Name  |
--------------
| 1  | John  |
--------------
| 2  | Alice |
--------------
LANGUAGES
--------------------------------
| ID | User_ID(FK) | lang-name |
--------------------------------
| 1  | 1           | PHP       |
--------------------------------
| 1  | 2           | Java      |
--------------------------------
OS
------------------------------
| ID | User_ID(FK) | os-name |
------------------------------
| 1  | 1           | Windows |
------------------------------
| 1  | 2           | Windows |
------------------------------
Does this seem like a good schema? There are many more such user-related fields that will each need to have their own table and there seems to be a lot of redundancy within a table since thousands of users will know PHP and hence there will be thousands of rows with PHP as the language for each of the different users.
Is there a better way to organize the schema?
Perhaps you could make Language and OS first-class entities in the database with their own tables, then use a joining table for the many-to-many relationship with User. Something like this:
User
---------
ID
Name
etc...
Language
---------
ID
Name
OS
---------
ID
Name
UserLanguage
---------
UserID
LanguageID
UserOS
---------
UserID
OSID
That way the actual entities (User, Language, OS) are self-contained with only the data that's meaningful to them, not polluted or duplicated with the concerns of their relationships with each other. And the relationships are contained within their own simple numeric-only tables, which themselves aren't entities but are just many-to-many links between entities.
No data is duplicated (in your sample data, Language and OS would each have only three records, at least for now), and it would be a lot friendlier to ORMs and other frameworks if you ever need to use one.
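
Here is a small sqlite3 sketch of how the checkbox form might be saved into the UserLanguage link table. The helper name save_checked_languages and the choice to replace all of a user's links on each save are assumptions; the OS tables would follow the same pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE User     (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Language (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
CREATE TABLE UserLanguage (
    UserID     INTEGER NOT NULL REFERENCES User(ID),
    LanguageID INTEGER NOT NULL REFERENCES Language(ID),
    PRIMARY KEY (UserID, LanguageID)
);
INSERT INTO User VALUES (1, 'John'), (2, 'Alice');
INSERT INTO Language VALUES (1, 'PHP'), (2, 'Java'), (3, 'Ruby');
""")


def save_checked_languages(user_id, checked_names):
    """Replace a user's language links with the boxes ticked on the form."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM UserLanguage WHERE UserID = ?", (user_id,))
        conn.executemany(
            """INSERT INTO UserLanguage (UserID, LanguageID)
               SELECT ?, ID FROM Language WHERE Name = ?""",
            [(user_id, name) for name in checked_names],
        )


save_checked_languages(1, ["PHP", "Ruby"])
save_checked_languages(2, ["Java", "PHP"])

# Everyone who ticked PHP; the string 'PHP' is stored exactly once.
php_users = [name for (name,) in conn.execute(
    """SELECT u.Name
       FROM User u
       JOIN UserLanguage ul ON ul.UserID = u.ID
       JOIN Language l ON l.ID = ul.LanguageID
       WHERE l.Name = 'PHP'
       ORDER BY u.ID""")]
print(php_users)
```

However many users tick PHP, the Language table keeps three rows and only the narrow numeric link table grows.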
Edit: Based on your comment, you might try something like this:
User
---------
ID
Name
etc...
Lookup
---------
ID
LookupTypeID
Value
LookupType
---------
ID
Value
UserLookup
---------
UserID
LookupID
This gives you a lot of flexibility. In your sample data, Language and OS would be records in LookupType. All of the languages and OSes would be values in Lookup which link back to their respective LookupType. So still no repeating of data. And the UserLookup table is the only many-to-many link table.
Be careful with this design, though. It is flexible, definitely. But when you use this table structure as your actual domain models you run into situations where terms like "Lookup" become business terms, and that's probably not the case. "Language" and "OS" are the actual models. I would recommend using Views or perhaps Stored Procedures to abstract this structure from the code. So the code would pull Languages from the Language view or procedure, not directly from the Lookup table.
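
The view idea could look roughly like this in sqlite3. The view names Language and OS and the Name column alias are assumptions chosen so that calling code reads in domain terms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE LookupType (ID INTEGER PRIMARY KEY, Value TEXT NOT NULL UNIQUE);
CREATE TABLE Lookup (
    ID           INTEGER PRIMARY KEY,
    LookupTypeID INTEGER NOT NULL REFERENCES LookupType(ID),
    Value        TEXT NOT NULL
);
INSERT INTO LookupType VALUES (1, 'Language'), (2, 'OS');
INSERT INTO Lookup VALUES (1, 1, 'PHP'), (2, 1, 'Java'), (3, 1, 'Ruby'),
                          (4, 2, 'Windows'), (5, 2, 'Linux'), (6, 2, 'Mac');
-- Views expose the domain terms, so application code never mentions Lookup.
CREATE VIEW Language AS
    SELECT l.ID, l.Value AS Name
    FROM Lookup l
    JOIN LookupType t ON t.ID = l.LookupTypeID
    WHERE t.Value = 'Language';
CREATE VIEW OS AS
    SELECT l.ID, l.Value AS Name
    FROM Lookup l
    JOIN LookupType t ON t.ID = l.LookupTypeID
    WHERE t.Value = 'OS';
""")

languages = [n for (n,) in conn.execute("SELECT Name FROM Language ORDER BY ID")]
operating_systems = [n for (n,) in conn.execute("SELECT Name FROM OS ORDER BY ID")]
print(languages, operating_systems)
```

If Language later gets its own real table, the view can be dropped and replaced without touching the queries that read from it.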

Unique id shared between multiple tables sql 2008

I have a problem with my site. The site has multiple entities: Articles, Posts, Reviews ... around 6 types. Now I am introducing the possibility for a user to rate an item (it can be any of these entities).
I created a table Votes (int Id primary key, int ItemId, nvarchar(30) Ip, datetime Timestamp, int VoteValue). Here I will store all votes and their ips.
My problem is that I must have ItemID unique ... but my database already have items of various types having the same id. All tables started the ids from 0. What options do you see for my design in order to be able to store all votes in a single table?
Your approach is trying to assign multiple meanings to the field "ItemId", which will lead to the issues you are encountering. If I were to see "9500" in that field, how would I know what that means?
I would suggest dropping the ItemId field and creating "crosswalk" tables between Votes and the other entities.
For example, your entities:
+-----------+
| Articles |
+-----------+
| ArticleId | PK
| ~ snip ~ |
+-----------+
+-----------+
| Posts |
+-----------+
| PostId | PK
| ~ snip ~ |
+-----------+
... etc ....
Your votes table:
+-----------+
| Votes |
+-----------+
| VoteId | PK
| ~ snip ~ |
+-----------+
Your "crosswalk" tables:
+--------------+
| ArticleVotes |
+--------------+
| ArticleId | PK, FK to Articles
| VoteId | PK, FK to Votes
+--------------+
+--------------+
| PostVotes |
+--------------+
| PostId | PK, FK to Posts
| VoteId | PK, FK to Votes
+--------------+
Note that in your crosswalk tables, you would create a composite primary key consisting of both the FK references to the appropriate entities, thereby ensuring uniqueness.
In my experience, this is an appropriate normalized approach to the domain you describe.
In querying, to get the votes for the Articles (for example) simply INNER JOIN Articles through ArticleVotes to Votes. To get all Votes, simply query Votes.
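
A runnable sketch of the crosswalk layout and the join just described, using sqlite3. The Title columns and all sample rows are assumptions; the Votes columns are taken from the question, with SQLite types substituted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Articles (ArticleId INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE Posts    (PostId    INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE Votes (
    VoteId    INTEGER PRIMARY KEY,
    Ip        TEXT NOT NULL,
    Timestamp TEXT NOT NULL,
    VoteValue INTEGER NOT NULL
);
-- Crosswalk tables: composite primary key over both foreign keys.
CREATE TABLE ArticleVotes (
    ArticleId INTEGER NOT NULL REFERENCES Articles(ArticleId),
    VoteId    INTEGER NOT NULL REFERENCES Votes(VoteId),
    PRIMARY KEY (ArticleId, VoteId)
);
CREATE TABLE PostVotes (
    PostId INTEGER NOT NULL REFERENCES Posts(PostId),
    VoteId INTEGER NOT NULL REFERENCES Votes(VoteId),
    PRIMARY KEY (PostId, VoteId)
);
-- Article 1 and Post 1 share the same id without any ambiguity.
INSERT INTO Articles VALUES (1, 'Some article');
INSERT INTO Posts    VALUES (1, 'Some post');
INSERT INTO Votes VALUES (1, '192.0.2.1', '2013-01-01 10:00', 5),
                         (2, '192.0.2.2', '2013-01-01 11:00', 3),
                         (3, '192.0.2.3', '2013-01-01 12:00', 4);
INSERT INTO ArticleVotes VALUES (1, 1), (1, 2);
INSERT INTO PostVotes    VALUES (1, 3);
""")

# Votes for one article: Articles -> ArticleVotes -> Votes.
article_votes = [value for (value,) in conn.execute("""
    SELECT v.VoteValue
    FROM Articles a
    INNER JOIN ArticleVotes av ON av.ArticleId = a.ArticleId
    INNER JOIN Votes v ON v.VoteId = av.VoteId
    WHERE a.ArticleId = 1
    ORDER BY v.VoteId
""")]

# All votes, regardless of what they were cast on.
total_votes = conn.execute("SELECT COUNT(*) FROM Votes").fetchone()[0]
print(article_votes, total_votes)
```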
Additionally, I would suggest creating an IPAddresses table and FKing to that in your Votes table to reduce redundancy.
An option that wasn't mentioned was GUIDs. If you use GUIDs in your Articles/Posts/Reviews/etc. instead of the int primary keys you could rely on these to be unique. I am not saying this is the route you should use as it can add additional overhead to store/search GUIDs rather than ints.
I would recommend adding the type field to the votes table and having it be part of the key. It sounds like you had already thought about this idea but are worried about performance. If you are worried about performance do some tests to ensure the table queries meet your needs before putting the changes into production.
One possibility is to convert your set of disparate tables into a type/subtype set. A discussion of this can be found here. The (perhaps considerable) downside is that you’d have to refactor all your tables… but then you’d have one Id (perhaps “ItemId”?) used to uniquely identify all your Item Types.

relational VS parametrized Data modeling when building semantic web applications?

Here is the summary of my question; then I'll describe it in more detail:
I read about using the parametrized data modeling method instead of standard relational data modeling when building a semantic web application. I think we would lose 90% of normalization if we used this method. If I want to design the database of my semantic web application, should I do it this way? What is the practical value?
In More Details :
I've read a lot of articles around this, in this book "Programming the semantic web - Toby Segaran, Colin Evans, and Jamie Taylor" at page 14 they tell us to use parametrized Data modeling to get Semantic Relationships instead of the standard relational database described by this example:
in the standard Relational Database :
Venue : [ ID(PK), Name, Address ]
Restaurant : [ ID(PK), VenueID(FK), CuisineID]
Bar : [ ID(PK), VenueID(FK), DJ?, Specialty ]
Hours : [ VenueID(FK), Day, Open, Close ]
For Semantic Relationships : One table only !!! Fully parameterized venues
Properties : [ VenueID,Field, Value ]
Example:
VenueID | Field              | Value
--------------------------------------------
1       | Cuisine            | Deli
1       | Price              | $
1       | Name               | Deli Llama
1       | Address            | Peachtree Rd
2       | Cuisine            | Chinese
2       | Price              | $$$
2       | Specialty Cocktail | Scorpion Bowl
2       | DJ?                | No
2       | Name               | Peking Inn
2       | Address            | Lake St
3       | Live Music?        | Yes
3       | Music Genre        | Jazz
3       | Name               | Thai Tanic
3       | Address            | Branch Dr
Then the authors say:
Now each datum is described alongside the property that defines it. In doing this, we’ve
taken the semantic relationships that previously were inferred from the table and column
and made them data in the table. This is the essence of semantic data modeling:
flexible schemas where the relationships are described by the data itself.
If I want to design the database of my semantic web application, should I do it this way? What is the practical value?
What you lose in immediate clarity, you gain in flexibility. Notice that with the more parametrized approach you gain the ability to easily add fields without altering any tables. This allows you to give different fields to different venues as it suits your application. By association, this also makes it easy for you or for future maintainers (if you intend to release) to extend your web application down the road.
Just be careful when it comes to performance. Don't adopt a fully parametrized design when it is easier to use a standard relational design. Let's say, for a moment, you have two different users tables, one relational and the other parametrized:
Table: users_relational
+---------+----------+------------------+----------+
| user_id | username | email | password |
+---------+----------+------------------+----------+
| 1 | Sam | sam#example.com | ******** |
| 2 | John | john#example.com | ******** |
| 3 | Jane | jane#example.com | ******** |
+---------+----------+------------------+----------+
Table: users_parametrized
+---------+----------+------------------+
| user_id | field | value |
+---------+----------+------------------+
| 1 | username | Sam |
| 1 | email | sam#example.com |
| 1 | password | ******** |
| 2 | username | John |
| 2 | email | john#example.com |
| 2 | password | ******** |
| 3 | username | Jane |
| 3 | email | jane#example.com |
| 3 | password | ******** |
+---------+----------+------------------+
Now you want to select a single user. With your relational table, you will only select one row, while your parametrized version will select the number of rows that there are fields associated with that user, in this case 3.
The next issue is searchability (at times). Say you have that same users table from the example above, but instead of knowing the user ID, you only know the username. You may be using two queries, one to find the user id and the other to get the data associated with the user.
Your last con stems from selecting only a few fields at a time. Taking the users tables example again, we can limit the number of fields easily with the relational one:
SELECT username, email FROM users_relational WHERE user_id = 2
We should get a single result with two columns.
Now, for the parametrized table:
SELECT field, value FROM users_parametrized WHERE user_id = 2 AND field IN('username','email')
It's a little more verbose and will become less readable than the first one, especially if you start taking on more and more fields to select.
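
For comparison, here is one way the username lookup and the "one row per user" shape might be recovered from the parametrized table in a single statement, sketched with sqlite3. The subquery plus conditional aggregation is just one possible technique, and the column types are assumptions; the sample values are copied from the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users_parametrized (user_id INTEGER, field TEXT, value TEXT);
INSERT INTO users_parametrized VALUES
    (1, 'username', 'Sam'),  (1, 'email', 'sam#example.com'),
    (2, 'username', 'John'), (2, 'email', 'john#example.com'),
    (3, 'username', 'Jane'), (3, 'email', 'jane#example.com');
""")

# The subquery finds the user id from the username; the CASE expressions
# then pivot that user's field/value rows back into one result row.
row = conn.execute("""
    SELECT p.user_id,
           MAX(CASE WHEN p.field = 'username' THEN p.value END) AS username,
           MAX(CASE WHEN p.field = 'email'    THEN p.value END) AS email
    FROM users_parametrized p
    WHERE p.user_id IN (SELECT user_id
                        FROM users_parametrized
                        WHERE field = 'username' AND value = ?)
    GROUP BY p.user_id
""", ("John",)).fetchone()
print(row)
```

It works, but every extra field needs another CASE line, which is the verbosity cost in practice.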
Additionally, the parametrized will be slower for a few reasons. It now has to do text comparisons from the varchar in the field column, instead of a single, numerically indexed user_id. With the first query, it knows when to stop looking for the record because you're selecting by a primary key. In the parametrized, you are not selecting by a primary key, so you will take a performance hit because your database must look through all the records.
This leads me into the final real difference (as far as your DBMS sees it). There is no primary key in the parametrized, which (as you saw above) can be a performance issue, especially if you already have a considerable number of records. For something like a users table where you can have thousands of records, your record count would be that number times 3 (as we have three non-user_id fields) in this case alone. That's a lot of data for the database to search through.
There are quite a few things to consider when designing your application. Don't be afraid to mix your database with parametrized and relational style - it just has to make sense practically. In the case you gave, it makes perfect sense to do so; in the case I displayed, it would be pointless.
It is possible to stay fully relational while pursuing the intent of storing data in a parameterized fashion. The following is a greatly oversimplified demonstration, but should suffice to show the main tricks that are needed -- in a nutshell, additional levels of abstraction, some surrogate primary keys, and some tables with composite primary keys. I will leave out exact description of foreign key constraints assuming the reader can grasp the obvious relations between tables below.
Your first table is only to establish the entities you want to store information about, and a key to look up what sorts of information will be stored:
entity_id | entity_type
---------------------------
1 | lawn mower
2 | toothbrush
3 | bicycle
4 | restaurant
5 | person
The next table relates entity type to the fields you wish to store for each entity type:
entity_type | attribute
------------------------
lawn mower | horsepower
lawn mower | retail price
lawn mower | gas_or_electric
lawn mower | ...etc
toothbrush | bristle stiffness
toothbrush | weight
toothbrush | head size
toothbrush | retail price
toothbrush | ...etc
person | name
person | email
person | birth date
person | ...etc
This is expandable to as many fields as you like for each entity type. It's still relational; this table does have a primary key, it's just a composite key composed of both columns.
This example is oversimplified for brevity; in actual practice you have to confront the namespacing issues with attributes and you probably want certain attribute names to be per-entity-type in case the same name means something different on an entirely different kind of entity. Use a surrogate primary key for the attributes in order to solve the namespacing issue, if you don't mind the decrease in readability when looking directly at the tables.
Meanwhile, and opposite of the preceding point, it's useful to make common and unambiguous attributes (such as "weight in grams" or "retail price in USD") available for reuse across multiple entity types. To handle this, add a level of abstraction between attributes and entity types. Make a table of "attribute sets", with each set linked to 1..n attributes. Then each entity type in the table above would be linked not directly to attributes, but to one or more attribute sets.
You'll need to either guarantee that attribute sets do not overlap in what attributes they point to, or create a means of resolving conflicts by hierarchy, composition, set union, or whatever fits your needs.
So at this point a lookup for a particular entity goes as follows. From the entity id we get the entity type. From entity type we get 1..n attribute sets, which yield a resulting attribute set that is held by the entity. Finally there is the big table with the actual data in it as follows:
entity_id | attribute_id | value
---------------------------------------
923 | 1049272 | green
923 | 1049273 | 206.55
924 | 1049274 | 843-219-2862
924 | 1049275 | Smith
929 | 1049276 | soft
929 | 1049277 | ...etc
As with all of these tables, this one has a primary key, in this case composed of the entity_id and attribute_id columns. The values are stored in a plain-text column without units. The units are stored in a separate table linking attributes to units. More tables can be established if you need to get more specific on that; you can set up additional levels of abstraction to establish an "attribute type" system similar to the entity type system described above.
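
A stripped-down sqlite3 sketch of the three core tables may help. It skips the attribute-set layer and the units table entirely; the table names entity, attribute and entity_value, the small attribute ids, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    entity_id   INTEGER PRIMARY KEY,
    entity_type TEXT NOT NULL
);
-- Surrogate key per attribute; the name is unique only within an entity type.
CREATE TABLE attribute (
    attribute_id INTEGER PRIMARY KEY,
    entity_type  TEXT NOT NULL,
    name         TEXT NOT NULL,
    UNIQUE (entity_type, name)
);
-- The big value table, keyed by the (entity, attribute) pair.
CREATE TABLE entity_value (
    entity_id    INTEGER NOT NULL REFERENCES entity(entity_id),
    attribute_id INTEGER NOT NULL REFERENCES attribute(attribute_id),
    value        TEXT NOT NULL,
    PRIMARY KEY (entity_id, attribute_id)
);
INSERT INTO entity VALUES (923, 'lawn mower'), (929, 'toothbrush');
INSERT INTO attribute VALUES (1, 'lawn mower', 'retail price'),
                             (2, 'lawn mower', 'gas_or_electric'),
                             (3, 'toothbrush', 'bristle stiffness');
INSERT INTO entity_value VALUES (923, 1, '206.55'),
                                (923, 2, 'gas'),
                                (929, 3, 'soft');
""")


def attributes_of(entity_id):
    """Return {attribute name: stored text value} for one entity."""
    rows = conn.execute("""
        SELECT a.name, v.value
        FROM entity_value v
        JOIN attribute a ON a.attribute_id = v.attribute_id
        WHERE v.entity_id = ?
    """, (entity_id,)).fetchall()
    return dict(rows)


print(attributes_of(923))
```

Because the primary key is the (entity_id, attribute_id) pair, a second value for the same attribute of the same entity is rejected by the database itself.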
If needed, you can go as far as storing relationships such as "attribute X is numerically convertible to attribute Y by the following formula", for numerical attributes. Or for non-numerical attributes you can establish equivalence tables to manage alternate spellings or formats for the allowed values of an attribute.
As you can imagine, the farther you go with your "attribute types and units" system, and the more you use that additional machinery in computation, the slower this all will be. In the worst case you're looking at many joins. But that problem can be addressed with caching and views, if your situation allows you to make tradeoffs such as slowing write speed to gain a great increase in read speed. Also, many of your queries to the database will be in situations where you already know what entity type you're working with at the moment and what its resulting attributes are and their types; so you only have to grab the literal values out of the entity/attribute/value table, and that is plenty fast.
In conclusion, hopefully I have shown how you can get as parameterized as you wish while remaining fully relational. It just requires more tables for more levels of abstraction than some of the simpler approaches do; yet it avoids the disadvantages of the "one-big-table" style. This style of entity>type>attribute>value storage is powerful, flexible, and can be extended as far as you need.
And thanks to a relational/normalized table setup, you can do all sorts of reorganizing along the way as your entity schema evolves, without losing data. The additional levels of abstraction allow you to re-parent attributes from one attribute set to another, change their names if needed, and change which sets of attributes an entity type makes use of, without losing stored values, as long as you write appropriate migrations. The other day I realized I needed to store a certain product attribute on a per-brand basis instead of per-product, and was able to make the schema change in five minutes with only a couple of updated rows in the database. In many other setups, particularly in a one-big-table setup, it could have been a lot more work, requiring as much as one or more updated rows per entity affected by the change.
