When I start making a diagram, I sometimes have a hard time seeing the difference between an entity and a column. I don't know when something is supposed to be an entity and when it is supposed to be a column. For example, say a game has a user, and that user can play alone or in a group. Would you make two different entities, User and GroupUser?
Also, say the User has levels, a status and badges that they earn as part of the game. Would these be entities too, or would they just be columns in the User entity?

An entity could be a person (e.g. Student), a place (e.g. Room), an object (e.g. Book) or an abstract concept (e.g. Course, Order) that is represented in your database, and it normally becomes a table in your database.
Columns, on the other hand, are the attributes of your entity.
So, in your case you have a User entity, and the possible columns or attributes (or fields) are
UserID, UserLevel, UserStatus, Badges, PlayStatus (values could be individual or group).
Badges, although a column here, could turn into an entity of its own if it violates the normalization rules.
For example, if you have this table for User:
Table: Users
UserID  UserName     UserStatus  PlayStatus  Badges
------  -----------  ----------  ----------  --------------------------------
1       Surefire     Active      Single      Private, Warrior, Platoon Leader
2       FastMachine  Active      Group       Private, Warrior
3       BeatTheGeek  Inactive    Group       Private
The Badges column here violates 1NF (First Normal Form), which says that there should be no repeating groups, or in this case no multi-valued columns. So this could be normalized like this:
Table: Users
UserID  UserName     UserStatus  PlayStatus
------  -----------  ----------  ----------
1       Surefire     Active      Single
2       FastMachine  Active      Group
3       BeatTheGeek  Inactive    Group
Table: Badges
BadgeID  BadgeName
-------  --------------
1        Private
2        Indie
3        Warrior
4        Platoon Leader
5        Colonel
6        1 Star General
7        2 Star General
8        3 Star General
9        4 Star General
10       5 Star General
11       Hero
Table: UserBadgesHistory
UserID  BadgeID  ReceiveDate
------  -------  -----------
1       1        12/01/2013
1       3        12/05/2013
1       4        1/5/2014
2       1        2/5/2014
2       3        2/10/2014
3       2        11/10/2013
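As a rough sketch, the three normalized tables above could be created and queried like this. It uses Python's built-in sqlite3 module purely for illustration; the column types, the composite primary key, the `badges_of` helper and the ISO date format (reading the dates above as month/day/year) are assumptions, not part of the original answer. Only a subset of the rows is loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users (
    UserID     INTEGER PRIMARY KEY,
    UserName   TEXT NOT NULL,
    UserStatus TEXT NOT NULL,
    PlayStatus TEXT NOT NULL
);
CREATE TABLE Badges (
    BadgeID   INTEGER PRIMARY KEY,
    BadgeName TEXT NOT NULL UNIQUE
);
-- One row per badge a user has earned (assumed: a badge is earned at most once).
CREATE TABLE UserBadgesHistory (
    UserID      INTEGER NOT NULL REFERENCES Users(UserID),
    BadgeID     INTEGER NOT NULL REFERENCES Badges(BadgeID),
    ReceiveDate TEXT NOT NULL,  -- ISO dates so that text sorting is chronological
    PRIMARY KEY (UserID, BadgeID)
);
INSERT INTO Users VALUES
    (1, 'Surefire', 'Active', 'Single'),
    (2, 'FastMachine', 'Active', 'Group');
INSERT INTO Badges VALUES (1, 'Private'), (3, 'Warrior'), (4, 'Platoon Leader');
INSERT INTO UserBadgesHistory VALUES
    (1, 1, '2013-12-01'), (1, 3, '2013-12-05'), (1, 4, '2014-01-05'),
    (2, 1, '2014-02-05'), (2, 3, '2014-02-10');
""")

def badges_of(user_name):
    """Return the badge names of one user, oldest first (hypothetical helper)."""
    rows = conn.execute(
        """SELECT b.BadgeName
           FROM UserBadgesHistory h
           JOIN Users  u ON u.UserID  = h.UserID
           JOIN Badges b ON b.BadgeID = h.BadgeID
           WHERE u.UserName = ?
           ORDER BY h.ReceiveDate""",
        (user_name,),
    )
    return [name for (name,) in rows]

surefire_badges = badges_of("Surefire")
```

The multi-valued Badges column is gone; a user's badges are now recovered with a join instead of by parsing a comma-separated string.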

In general, an entity has multiple columns (i.e. attributes) of its own, and a column (or attribute) does not.
In your example, if the only data you're interested in storing is a User's current level, then level is unlikely to be an entity. This is because it would have only a single attribute of name/number. If you wanted to find all Users currently at level 4, you would simply do a query with level = 4.
On the other hand, if you had a reason to add additional data about the level, such as what abilities are associated with that level or the date a given User achieved the level, then you would want to make Level a separate entity.
A Level entity would have an ID, a number or name, and whatever other attributes you need as data.
ID | Prerequisite | Ability
1 | NULL | May gain foos
2 | Gain 10 foos | May gain bars
3 | Gain 20 bars | 30 free foos
In a fully normalized state, you would have another entity called UserLevel in which you would store data about, for example, when a certain User gained a level.
The UserLevel entity would contain the LevelID and the UserID as foreign keys (links back to the other entities), and a DateAchieved column for when the User achieved the level.
LevelID | UserID | DateAchieved
1 | 1 | 2014-02-01
1 | 2 | 2014-02-01
2 | 1 | 2014-02-05
3 | 1 | 2014-02-09
2 | 2 | 2014-02-11
4 | 1 | 2014-02-13
This shows User 1 and User 2 starting at Level 1 on the same day and leveling up at different rates.
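To illustrate how such a UserLevel table is queried, here is a small sketch using Python's sqlite3 module with the rows shown above. Treating the row with the latest DateAchieved as the user's current level is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UserLevel (
    LevelID      INTEGER NOT NULL,
    UserID       INTEGER NOT NULL,
    DateAchieved TEXT NOT NULL,
    PRIMARY KEY (LevelID, UserID)
);
INSERT INTO UserLevel VALUES
    (1, 1, '2014-02-01'), (1, 2, '2014-02-01'), (2, 1, '2014-02-05'),
    (3, 1, '2014-02-09'), (2, 2, '2014-02-11'), (4, 1, '2014-02-13');
""")

# Current level per user = the level reached on that user's latest date.
current_levels = conn.execute(
    """SELECT ul.UserID, ul.LevelID
       FROM UserLevel ul
       WHERE ul.DateAchieved = (SELECT MAX(DateAchieved)
                                FROM UserLevel
                                WHERE UserID = ul.UserID)
       ORDER BY ul.UserID"""
).fetchall()
```

With the sample data, User 1 is currently at Level 4 and User 2 at Level 2, while the full history of when each level was reached stays available.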


How to deal with Variable data over time in associations

In linked models (say a drink transaction, a waiter, and a restaurant), when you want to display data, you look up information by following the links:
Where was that beer bought?
Fetch the drink transaction => fetch its waiter => fetch this waiter's restaurant: this is where the beer was purchased.
So at time T, when I display all transactions, I fetch my data by following associations, and I can display this:
TransactionID  Waiter  Restaurant
1              Julius  Caesar's palace
2              Cleo    Moe's tavern
Now let's say that my waiter Julius is moved to another restaurant.
If I refresh this table, the result will be:
TransactionID  Waiter  Restaurant
1              Julius  Moe's tavern
2              Cleo    Moe's tavern
But we know that transaction no. 1 was made in Caesar's palace!
Solution 1
Don't modify the waiter Julius; clone him instead.
Upside: I keep an association between models, and can still filter on every field of every associated model.
Downside: Every modification of every model duplicates content, which can add up to a LOT as time passes.
Solution 2
Keep a copy of the current state of the associated models when the transaction is created.
Upside: I don't duplicate the contents.
Downside: You can no longer use the fields of the associated content to display, sort or filter, because the real data now lives inside, say, a JSON field. So, if you use MySQL, you have to filter your data with plain-text search queries on that field.
What is your solution?
The problem goes further, because it is not only a matter of associations changing: a simple modification of an associated model causes a problem too.
What I mean:
What's the amount of this order?
Fetch the drink transaction => fetch its product => fetch this product's price => multiply by the order quantity: this is the total amount of the order.
So at time T, when I display all transactions, I fetch my data by following associations, and I can display this:
TransactionID  Qty  ProductId
1              2    1
ProductID  Title  Price
1          Beer   3
==> Amount of order no. 1: 6.
Now let's say that the beer costs 2.50.
If I refresh this table, the result will be:
TransactionID  Qty  ProductId
1              2    1
ProductID  Title  Price
1          Beer   2.50
==> Amount of order no. 1: 5.
So, once again, the same two solutions are available: do I clone the beer product when its price changes? Do I save a copy of the beer in my order when the order is made? Do you have a third solution?
I can't just add an "amount" attribute to my orders: yes, it solves that problem (partially), but it's not a scalable solution, because many other attributes will be in the same situation and I can't keep multiplying attributes like this.
Event Sourcing
This is a good use case for Event Sourcing. Martin Fowler wrote a very good article about it; I advise you to read it.
"there are times when we don't just want to see where we are, we also want to know how we got there."
The idea is to never overwrite data, but instead to create immutable transactions for everything you want to keep a history of. In your case you'll have WaiterRelocationEvents and PriceChangeEvents. You can recreate the state at any given time by applying every event in order.
If you don't use Event Sourcing, you lose information. Often it's acceptable to forget historic information, but sometimes it's not.
Lambda Architecture
As you don't want to recalculate everything on every single request, it's advisable to implement a Lambda Architecture. That architecture is often explained with BigData technology and frameworks, but you could implement it with Plain Old Java and CronJobs.
It consists of three parts: Batch Layer, Serving Layer and Speed Layer.
The Batch Layer regularly calculates an aggregated version of the data. For example, you'll calculate the monthly income once per day, so the current month's income will change every night until the month is over.
But now you want to know the income in real time. Therefore you add a Speed Layer, which applies all events of the current date immediately. Now, if a request for the current month's income arrives, you add up the last result of the Batch Layer and the result of the Speed Layer.
The Serving Layer allows more advanced queries by combining multiple batch results and the Speed Layer results into one query. For example, you can calculate the year's income by summing the monthly incomes.
But as said before, only use the Lambda approach if you need the data often and fast, because it adds extra complexity. Calculations which are rarely needed should be run on the fly. For example: which waiter generates the most income on Saturday evenings?
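The split between the Batch Layer and the Speed Layer can be sketched in a few lines of plain Python. This is only an illustration of the idea, not a real Lambda Architecture implementation: the order events, the amounts and the function names are all made up, and the "nightly batch" is simulated by a function call.

```python
from datetime import date

# Hypothetical order events: (day the order was placed, order amount).
order_events = [
    (date(2016, 5, 30), 12.0),  # previous month, must not be counted
    (date(2016, 6, 1), 6.0),
    (date(2016, 6, 13), 9.0),
    (date(2016, 6, 14), 3.0),   # placed "today", after the last nightly batch
]

def batch_month_income(events, today):
    """Batch layer: income of the current month up to, but excluding, today."""
    return sum(
        amount
        for day, amount in events
        if (day.year, day.month) == (today.year, today.month) and day < today
    )

def speed_income(events, today):
    """Speed layer: income from the events that arrived today."""
    return sum(amount for day, amount in events if day == today)

today = date(2016, 6, 14)
batch_view = batch_month_income(order_events, today)  # computed once per night
current_month_income = batch_view + speed_income(order_events, today)
```

A request for the current month's income only has to add the stored batch result to the handful of events from today, instead of re-reading the whole month.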
Restaurants
| Timestamp  | Id | Name            |
| ---------- | -- | --------------- |
| 2016-01-01 | 1  | Caesar's palace |
| 2016-11-01 | 2  | Moe's tavern    |
Waiters
| Timestamp  | Id | Name     | FirstRestaurant |
| ---------- | -- | -------- | --------------- |
| 2016-01-01 | 11 | Julius   | 1               |
| 2016-11-01 | 12 | Cleo     | 2               |
WaiterRelocationEvents
| Timestamp  | WaiterId | RestaurantId |
| ---------- | -------- | ------------ |
| 2016-06-01 | 11       | 2            |
Products
| Timestamp  | Id | Name     | FirstPrice |
| ---------- | -- | -------- | ---------- |
| 2016-01-01 | 21 | Beer     | 3.00       |
PriceChangeEvents
| Timestamp  | ProductId | NewPrice |
| ---------- | --------- | -------- |
| 2016-11-01 | 21        | 2.50     |
Orders
| Timestamp  | Id | ProductId | Quantity | WaiterId |
| ---------- | -- | --------- | -------- | -------- |
| 2016-06-14 | 31 | 21        | 2        | 11       |
Now let's get all information about order 31.
1. Get order 31.
2. Get the price of product 21 at 2016-06-14:
   - get the last PriceChangeEvent before the date, or use FirstPrice if none exists.
3. Calculate the total price by multiplying the retrieved price by the quantity.
4. Get waiter 11.
5. Get the waiter's restaurant at 2016-06-14:
   - get the last WaiterRelocationEvent before the date, or use FirstRestaurant if none exists.
6. Get the restaurant name from the retrieved restaurant id of the waiter.
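The lookup steps above can be sketched in plain Python using the sample rows from the tables. The data structures and the helper names (`latest_value`, `describe_order`) are hypothetical, and "before the date" is interpreted here as "on or before the order date", which is an assumption.

```python
from datetime import date

restaurants = {1: "Caesar's palace", 2: "Moe's tavern"}
waiters = {11: {"name": "Julius", "first_restaurant": 1},
           12: {"name": "Cleo", "first_restaurant": 2}}
products = {21: {"name": "Beer", "first_price": 3.00}}

# Events are (timestamp, id of the affected entity, new value).
waiter_relocation_events = [(date(2016, 6, 1), 11, 2)]
price_change_events = [(date(2016, 11, 1), 21, 2.50)]

orders = {31: {"timestamp": date(2016, 6, 14), "product_id": 21,
               "quantity": 2, "waiter_id": 11}}

def latest_value(events, entity_id, as_of, default):
    """Value of the most recent event for entity_id on or before as_of."""
    matching = [(ts, value) for ts, eid, value in events
                if eid == entity_id and ts <= as_of]
    if not matching:
        return default  # no event yet: fall back to the initial value
    return max(matching, key=lambda pair: pair[0])[1]

def describe_order(order_id):
    order = orders[order_id]
    when = order["timestamp"]
    product = products[order["product_id"]]
    waiter = waiters[order["waiter_id"]]
    price = latest_value(price_change_events, order["product_id"],
                         when, product["first_price"])
    restaurant_id = latest_value(waiter_relocation_events, order["waiter_id"],
                                 when, waiter["first_restaurant"])
    return {"total": price * order["quantity"],
            "waiter": waiter["name"],
            "restaurant": restaurants[restaurant_id]}

order_31 = describe_order(31)
```

For order 31 the price change of 2016-11-01 is ignored because it happened after the order, while the relocation of 2016-06-01 is applied because it happened before it.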
As you can see, it becomes complicated, so you should only keep the history of data that is actually useful.
I wouldn't involve the relocation events in the calculation. They could be stored, but I would store the restaurant id and the waiter id directly in the order.
The price history, on the other hand, could be interesting, for example to check whether orders went down after a price change. Here you could use the Lambda Architecture to calculate a full order with prices from the raw order and the price history.
Decide which data you want to keep a history of.
Implement Event Sourcing for that data.
Use the Lambda Architecture to speed up commonly used queries.
I like the question as it raises something very straightforward and also something more subtle.
The common principle in both cases is that ‘History must not change’, meaning if we run a query over a specified past date range today the results are the same as when we run that same query at any point in the future.
Waiters Case
When a waiter changes restaurants we must not change the history of sales. If waiter Julius sold a drink yesterday in restaurant 1 and then switches to selling drinks in restaurant 2 today, we must retain those details.
Thus we want to be able to answer queries such as ‘how many drinks has Julius sold in restaurant 1’ and ‘how many drinks has Julius sold in all restaurants’.
To achieve this you have to abstract away from Julius as a waiter by bringing in the concept of staff. Julius is a member of staff. Staff work as waiters. When working in restaurant 1 Julius is waiter A, and when he works in another restaurant he is waiter B, but he is always the same member of staff, Julius. With an entity ‘Staff’ the queries can be answered easily.
Upside: No loss of historic data and no excessive duplication.
Downside: The new entity Staff must be managed. But the content of the waiter table is reduced, so the net overhead in data storage is low.
In summary - abstract data subject to change into a new entity and refer back to it from transactions.
Value of Order Case
The extended use case regarding ‘what is the value of this order’ is more involved. I work in cross-currency transactions where value for the observer (user) in the price list changes from day to day as currency fluctuations occur.
But there are good reasons to lock the order value in place. For example invoice processing systems have tolerance for a small difference between their expected invoice value and that of the submitted invoice, but any large difference can lead to late payment whilst invoice handlers check the issue. Also, if customers run reports on their historic purchases then the values of those orders must remain consistent despite fluctuations in currency rates over time.
The solution is to save into the order line:
the value of product in the customers currency,
or the rate between the customer's and the supplier's currency,
but ideally do both to avoid rounding errors.
What this does is provide a statement that ‘on the date that this order was placed, line 1 cost $44.56 at an exchange rate of 1.1 $/£’. Having this data locked in allows you to invoice to the customer's expectation and provide consistent spend reports over time.
Upside: Consistent historic data. Fast database performance as no look-ups required against historic rate tables.
Downside: Some data duplication. However, when traded off against the storage and indexing overhead of a historic rate table, this is possibly an upside.
Regarding adding 'amount' to your order table: you have to do this if you want to achieve a consistent data history. If you only work in one currency, then amount is the only additional storage concern, and by adding this one attribute you have protected history. Your other alternative is to store a historic cost table for drinks, so you know that in January beer was $1, in February it was $1.10 and so on, and then store the cost-table key in the transaction so that you can look up the cost if anyone asks about a historic order. But the overhead of storing the key PLUS the indexes needed to make this practicable will outweigh the storage cost of cloning 'amount' onto the order record.
In summary - clone cost data that will change over time.
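A minimal sketch of cloning the cost onto the order, using Python's sqlite3 module and the beer example from the question. The table layout and the `place_order` helper are assumptions; a real system would probably store money as integer cents or a decimal type rather than REAL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (
    ProductID INTEGER PRIMARY KEY,
    Title     TEXT NOT NULL,
    Price     REAL NOT NULL      -- current price only
);
CREATE TABLE Orders (
    TransactionID INTEGER PRIMARY KEY,
    ProductID     INTEGER NOT NULL REFERENCES Products(ProductID),
    Qty           INTEGER NOT NULL,
    Amount        REAL NOT NULL  -- value locked in when the order is placed
);
INSERT INTO Products VALUES (1, 'Beer', 3.0);
""")

def place_order(transaction_id, product_id, qty):
    """Insert an order and copy price * qty from the product at that moment."""
    conn.execute(
        """INSERT INTO Orders (TransactionID, ProductID, Qty, Amount)
           SELECT ?, ProductID, ?, Price * ?
           FROM Products
           WHERE ProductID = ?""",
        (transaction_id, qty, qty, product_id),
    )

place_order(1, 1, 2)                                          # beer costs 3
conn.execute("UPDATE Products SET Price = 2.5 WHERE ProductID = 1")
place_order(2, 1, 2)                                          # beer costs 2.50

amounts = [amount for (amount,) in
           conn.execute("SELECT Amount FROM Orders ORDER BY TransactionID")]
```

The first order keeps its amount of 6 after the price change, and only orders placed afterwards use the new price.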

Modelling a voting poll in a graph database

I've modelled a voting poll for an RDBMS. The structure is a bit more complicated than a conventional voting poll, since users can choose either to vote for an option on the poll or to pass their vote on to another user for a given poll.
My structure looks something like this:
id | title
1 | Who should be president
id | poll_id | title
1 | 1 | Obama
2 | 1 | Bush
id | poll_id | user_id | vote_type | vote_id
1 | 1 | 1 | option | 1
2 | 1 | 2 | user | 1
In this case, option 1 would receive 2 votes since user 2 gave his vote to user 1 who votes for option 1.
I realize that the data I am going to store will be fairly complicated to query in an RDBMS if I want to visualise how the votes move between users. However, I don't have much experience with graph databases and would like some hints on how to go about modelling this.
It's always preferable, when making a DB model, to start with an information design model, and then transform this into a DB model.
In an information design model for your problem, options would be components of polls (so the UML class diagram would have a composition between Option and Poll), and votes would be relationships/links between users and options (so the UML class diagram would have a many-to-many association between Option and User, the instances of which are the votes). In addition, there is a ternary association User-delegates-his-vote-in-Poll-to-User, the instances of which are the delegations.
From this, I get the following DB model:
Poll( id, question)
Option( poll_id, option_sequence_no, possible_vote)
Vote( user_id, poll_id, option_sequence_no, nmr_of_votes)
Delegation( user_id, poll_id, delegate_id)
Of course, we have to add a constraint that the number of votes by a user in a poll is the number of delegations to that user plus 1.
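To make the vote counting concrete, here is a small Python sketch for a single poll. It resolves each delegation to the user who actually votes and then counts votes per option. Following chains of delegations (A delegates to B, who delegates to C) and dropping votes caught in a delegation cycle are assumptions; the original model does not say how those cases should be handled.

```python
from collections import Counter

def tally(votes, delegations):
    """Count votes per option for one poll.

    votes:       {user_id: option} for users who voted directly
    delegations: {user_id: delegate_id} for users who passed their vote on
    """
    counts = Counter()
    for user in set(votes) | set(delegations):
        current, seen = user, set()
        # Follow the delegation chain until we reach someone who voted.
        while current in delegations and current not in votes:
            if current in seen:      # delegation cycle: the vote is lost
                current = None
                break
            seen.add(current)
            current = delegations[current]
        if current is not None and current in votes:
            counts[votes[current]] += 1
    return counts

# The example from the question: user 1 votes for option 1,
# user 2 gives his vote to user 1.
result = tally(votes={1: 1}, delegations={2: 1})
```

In this example option 1 ends up with 2 votes, which matches the "number of delegations plus 1" constraint for user 1.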

Can a 'skinny table' design be compensated with a view?

Context: simple webapp game for personal learning purposes, using postgres. I can design it however I want.
2 tables, 1 view (the view references additional tables that aren't important here)
Table: Research
col: research_id (foreign key to an outside table)
col: category (integer foreign key to category table)
col: percent (integer)
constraint (unique combination of the three columns)
Table: Category
col: category_id (primary key auto inc)
col: name(varchar(255))
notes: this table exists to capture the 4 categories of research I want in business logic and which I assume is not best practice to hardcode as columns in the db
View: Research_view
col: research_id (from research table)
col: foo1 (one of the categories from category table)
col: foo2 (etc...)
col: other cols from other joins
notes: has insert/update/delete statements that use the above tables appropriately
I worry that the research table itself qualifies as a "Skinny Table" (I hadn't heard the term until I just saw it in the iBATIS book from Manning). For example, test data within it looks like:
| research_id | percent | category |
| 1 | 25 | 1 |
| 1 | 25 | 2 |
| 1 | 25 | 3 |
| 1 | 25 | 4 |
| 2 | 20 | 1 |
| 2 | 30 | 2 |
| 2 | 25 | 3 |
| 2 | 25 | 4 |
1) Does it make sense to have all columns in a table collectively define unique entries?
2) Does this 'smell' to you?
Couple of notes to start:
constraint (unique combination of the three columns)
It makes no sense to have a unique constraint that includes a single-column primary key. Including that column will cause every row to be unique.
notes: this table exists to capture the 4 categories of research I want in business logic and which I assume is not best practice to hardcode as columns in the db
If a research item/entity is required to have all four categories defined for it to be valid, they should absolutely be columns in the research table. I can't tell definitively from your statement whether this is the case or not, but your assumption is faulty if looked at in isolation. Let your model reflect reality as closely as possible.
Another factor is whether it's a requirement that additional categories may be added to the system post-deployment. Whether the categories are intended to be flexible vs. fixed should absolutely influence the design.
1) Does it make sense to have all columns in a table collectively define unique entries?
I would say it's not common, but I can imagine there are situations where it might be appropriate.
2) Does this 'smell' to you?
Hard to say without more details.
All that said, if the intent is to view and add research items with all four categories, I would say (again) that you should consider whether the four categories are semantically attributes of the research entity.
As a random example, things like height and weight might be considered categories of a person, but they would likely be stored flat on the person table, and not in a separate table.
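For comparison, the skinny layout from the question can be presented as one row per research item with a pivoting view. The sketch below runs the SQL through Python's sqlite3 module so that it is self-contained; the question uses Postgres, where the same view definition should work as well. The category names foo1 to foo4 are placeholders, and the unique constraint on (research_id, category) rather than on all three columns is an assumption about the intent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (
    category_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE research (
    research_id INTEGER NOT NULL,
    category    INTEGER NOT NULL REFERENCES category(category_id),
    percent     INTEGER NOT NULL,
    UNIQUE (research_id, category)   -- assumed: one percent per category
);
INSERT INTO category VALUES (1, 'foo1'), (2, 'foo2'), (3, 'foo3'), (4, 'foo4');
INSERT INTO research (research_id, percent, category) VALUES
    (1, 25, 1), (1, 25, 2), (1, 25, 3), (1, 25, 4),
    (2, 20, 1), (2, 30, 2), (2, 25, 3), (2, 25, 4);

-- Pivot: one row per research item, one column per category.
CREATE VIEW research_view AS
SELECT research_id,
       MAX(CASE WHEN category = 1 THEN percent END) AS foo1,
       MAX(CASE WHEN category = 2 THEN percent END) AS foo2,
       MAX(CASE WHEN category = 3 THEN percent END) AS foo3,
       MAX(CASE WHEN category = 4 THEN percent END) AS foo4
FROM research
GROUP BY research_id;
""")

pivoted = conn.execute(
    "SELECT * FROM research_view ORDER BY research_id"
).fetchall()
```

Note that the view hardcodes the four categories anyway, which supports the point above: if the categories are fixed, plain columns on the research table may be the simpler model.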

Database design - storing a sequence

Imagine the following: there is a "recipe" table and a "recipe-step" table. The idea is to allow different recipe-steps to be reused in different recipes. The problem I'm having relates to the fact that in the recipe context, the order in which the recipe-steps show up is important, even if it does not follow the recipe-step table primary-key order, because this order will be set by the user.
I was thinking of doing something like:
recipe-step table:
id | stepName | stepDescription
1 | step1 | description1
2 | step2 | description2
3 | step3 | description3
recipe table:
recipeId | step
1 | 1
1 | 2
1 | 3
This way, the order in which the steps show up in the step column is the order I need to maintain.
My concerns with this approach are:
If I have to add a new step between two existing steps, how do I do it? What if I just need to switch the order of two steps already in the sequence?
How do I make sure the order stays consistent? If I just insert or update something in the recipe table, it will pop up at the end of the table, right?
Is there any other way you would think of doing this? I also thought of having a previous-step and a next-step column in the recipe-step table, but I think that would make it more difficult to reuse the recipe-steps.
In SQL, tables are not ordered.
Unless you are using an ORDER BY clause, database engines are allowed to return records in any order they feel is fastest (for example, a covering index might have the data in a different order, and sometimes even SQLite creates temporary covering indexes automatically).
If the steps have a specific order in a specific recipe, then you have to store this information in the database.
I'd suggest adding a stepOrder column to the recipe table:
recipeId | step | stepOrder
1 | 1 | 1
1 | 2 | 2
1 | 3 | 3
2 | 4 | 1
2 | 2 | 2
The recipe table stores the relationship between recipes and steps, so it should be called recipe-step.
The recipe-step table is independent of recipes, so it should be called step.
You probably need a table that stores recipe information that is independent of steps; this table should be called recipe.

Normalizing a Table

I'm putting together a database that I need to normalize and I've run into an issue that I don't really know how to handle.
I've put together a simplified example of my problem to illustrate it:
Item ID  Mass  Procurement  Currency        Amount
1        13kg  bought       US dollars       47.20
2        5kg   bought       British Pounds    3.10
4        9kg   bought       US dollars        1.32
(My apologies for the awkward table; new users aren't allowed to paste images)
In the table above I have a property (Amount) which is functionally dependent on the Item ID (I think), but which does not exist for every Item ID (since inherited items have no monetary cost). I'm relatively new to databases, but I can't find a similar issue to this addressed in any beginner tutorials or literature. Any help would be appreciated.
I would just create two new tables ItemProcurement and Currencies.
If I'm not wrong, as per the data presented, the amount is part of the procurement of the item itself (when the item has not been inherited), for that reason I would group the Amount and CurrencyID fields in the new entity ItemProcurement.
As you can see, an inherited item wouldn't have an entry in the ItemProcurement table.
Concerning the main Item table, if you expect just two different values for the kind of procurement, then I would use a char(1) column (with the values B => bought, I => inherited).
It would look like this:
The data would then look like this:
TABLE Item
| ID | Mass | ProcurementMethod |
| 0 | 2 | I |
| 1 | 13 | B |
| 2 | 5 | B |
TABLE ItemProcurement
| ItemID | CurrencyID | Amount |
| 1 | 840 | 47.20 |
| 2 | 826 | 3.10 |
TABLE Currencies
| CurrencyID | ISOCode | Description |
| 840 | USD | US dollars |
| 826 | GBP | British Pounds |
Not only Amount, everything is dependent on ItemID, as this seems to be a candidate key.
The dependence you have is that Currency and Amount are NULL (I guess this means Unknown/Invalid) when the Procurement is 'inherited' (or 0 cost, as pointed out by @XIVsolutions and as you mention: "inherited items have no monetary cost").
In other words, items are divided into two types (of procurement), and items of one of the two types do not have all the attributes.
This can be solved with a supertype/subtype split. You have a supertype table (Item) and two subtype tables (ItemBought and ItemInherited), each of which has a 1 : 0..1 relationship with the supertype table. The attributes common to all items go in the supertype table and every other attribute goes in the respective subtype table:
Item
ItemID  Mass  Procurement
0       2kg   inherited
1       13kg  bought
2       5kg   bought
3       11kg  inherited
4       9kg   bought
ItemBought
ItemID  Currency        Amount
1       US dollars      47.20
2       British Pounds   3.10
4       US dollars       1.32
If there is no attribute that only inherited items have, you can even skip the ItemInherited table altogether.
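A sketch of this supertype/subtype layout with Python's sqlite3 module, using the rows shown above and leaving out the ItemInherited table. Storing the mass as a number of kilograms in a MassKg column and the CHECK constraint are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Supertype: attributes every item has.
CREATE TABLE Item (
    ItemID      INTEGER PRIMARY KEY,
    MassKg      REAL NOT NULL,
    Procurement TEXT NOT NULL CHECK (Procurement IN ('bought', 'inherited'))
);
-- Subtype: shares the primary key of Item, at most one row per item.
CREATE TABLE ItemBought (
    ItemID   INTEGER PRIMARY KEY REFERENCES Item(ItemID),
    Currency TEXT NOT NULL,
    Amount   REAL NOT NULL
);
INSERT INTO Item VALUES
    (0, 2, 'inherited'), (1, 13, 'bought'), (2, 5, 'bought'),
    (3, 11, 'inherited'), (4, 9, 'bought');
INSERT INTO ItemBought VALUES
    (1, 'US dollars', 47.20), (2, 'British Pounds', 3.10), (4, 'US dollars', 1.32);
""")

# All items, with purchase details where they exist (NULL for inherited items).
all_items = conn.execute(
    """SELECT i.ItemID, i.MassKg, i.Procurement, b.Currency, b.Amount
       FROM Item i
       LEFT JOIN ItemBought b ON b.ItemID = i.ItemID
       ORDER BY i.ItemID"""
).fetchall()

total_mass = conn.execute("SELECT SUM(MassKg) FROM Item").fetchone()[0]
```

No column in either table has to be nullable: an inherited item simply has no row in ItemBought, and the NULLs only appear in the result of the outer join.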
For other questions relating to this pattern, look up the tag Class-Table-Inheritance. While you're at it, look up Shared-Primary-Key as well. For a more conceptual treatment, google "ER Specialization".
Here is my off-the-cuff suggestion:
UPDATE: Mass would be a Float/Decimal/Double depending upon your Db, Cost would be whatever the optimal type is for handling money (in SQL Server 2008, it is "Money" but these things vary).
ANOTHER UPDATE: The cost of an inherited item should be zero, not null (and in fact, there sometimes IS an indirect cost, in the form of taxes, but I digress . . .). Therefore, your Item table should require a value for cost, even if that cost is zero. It should not be null.
Let me know if you have questions . . .
Why do you need to normalise it?
I can see some data integrity challenges, but no obvious structural problems.
The implicit dependency between "procurement" and the presence or not of the value/currency is tricky, but has nothing to do with the keys and so is not a big deal, practically.
If we are to be purists (e.g. this is for homework purposes), then we are dealing with two types of item, inherited items and bought items. Since they are not the same type of thing, they should be modelled as two separate entities i.e. InheritedItem and BoughtItem, with only the columns they need.
In order to get a combined view of all items (e.g. to get a total weight), you would use a view or a UNION SQL query.
If we are looking to object model in the database, then we can factor out the common supertype (Item) and model the subtypes (InheritedItem, BoughtItem) with foreign keys to the supertype table (ypercube's explanation is very good), but this is very complicated and less future-proof than only modelling the subtypes.
This last point is the subject of much argument, but practically, in my experience, modelling concrete supertypes in the database leads to more pain later than leaving them abstract. Okay, that's probably waaay beyond what you wanted :).
