In linked models (let's say a drink transaction, a waiter, and a restaurant), when you want to display data, you look for information in your linked content:
Where was that beer bought?
Fetch the drink transaction => fetch its waiter => fetch that waiter's restaurant: this is where the beer was purchased.
So at time T, when I display all transactions, I fetch my data following the associations, and I can display this:
| TransactionID | Waiter | Restaurant      |
| ------------- | ------ | --------------- |
| 1             | Julius | Caesar's palace |
| 2             | Cleo   | Moe's tavern    |
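For the sake of the example, the query behind this view could look roughly like this (table and column names are purely illustrative):

select t.id as TransactionID, w.name as Waiter, r.name as Restaurant
from transactions t
join waiters w     on w.id = t.waiter_id
join restaurants r on r.id = w.restaurant_id;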
Let's say now that my waiter is moved to another restaurant.
If I refresh this table, the result will be:
| TransactionID | Waiter | Restaurant   |
| ------------- | ------ | ------------ |
| 1             | Julius | Moe's tavern |
| 2             | Cleo   | Moe's tavern |
But we know that transaction n°1 was made in Caesar's palace!
Solution 1
Don't modify the waiter Julius, but clone him.
Upside: I keep an association between models and can still filter on every field of every associated model.
Downside: every modification of every model duplicates content, which can add up to a LOT of data as time passes.
Solution 2
Keep a copy of the current state of your associated models when you create the transaction.
Upside: I don't duplicate the content.
Downside: you can no longer use the fields of your content to display, sort or filter, because the original, real data now lives in, let's say, a JSON field. So with MySQL you have to filter your data by making plain-search queries in that field.
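For instance, if the copy lives in a MySQL JSON column (a rough sketch; requires MySQL 5.7+, names are illustrative), filtering has to dig into the document instead of a real column:

alter table transactions add column snapshot json;

-- filtering then means extracting from the JSON copy:
select *
from transactions
where json_unquote(json_extract(snapshot, '$.restaurant.name')) = 'Caesar''s palace';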
What is your solution?
[EDIT]
The problem goes further: it's not only a matter of the association changing, a simple modification of an associated model causes a problem too.
What I mean:
What's the amount of this order?
Fetch the drink transaction => fetch its product => fetch that product's price => multiply by the order quantity: this is the total amount of the order.
So at time T, when I display all transactions, I fetch my data following the associations, and I can display this:
| TransactionID | Qty | ProductId |
| ------------- | --- | --------- |
| 1             | 2   | 1         |

| ProductID | Title | Price |
| --------- | ----- | ----- |
| 1         | Beer  | 3     |
==> Amount of order n°1: 6.
Let's say now that the beer costs 2.5.
If I refresh this table, the result will be:
| TransactionID | Qty | ProductId |
| ------------- | --- | --------- |
| 1             | 2   | 1         |

| ProductID | Title | Price |
| --------- | ----- | ----- |
| 1         | Beer  | 2.5   |
==> Amount of order n°1: 5.
So, once again, the two solutions are available: do I clone the beer product when its price is changed? Do I save a copy of the beer in my order when the order is made? Do you have a third solution?
I can't just add an "amount" attribute to my orders: yes, it solves that problem (partially), but it's not a scalable solution, as many other attributes are in the same situation and I can't multiply attributes like this.
Event Sourcing
This is a good use case for Event Sourcing. Martin Fowler wrote a very good article about it; I advise you to read it. As he puts it:
"there are times when we don't just want to see where we are, we also want to know how we got there."
The idea is to never overwrite data but instead create immutable transactions (events) for everything you want to keep a history of. In your case you'd have WaiterRelocationEvents and PriceChangeEvents. You can recreate the state at any given time by applying every event in order.
If you don't use Event Sourcing, you lose information. Often it's acceptable to forget historic information, but sometimes it's not.
Lambda Architecture
As you don't want to recalculate everything on every single request, it's advisable to implement a Lambda Architecture. That architecture is often explained with Big Data technologies and frameworks, but you can implement it with plain old Java and cron jobs.
It consists of three parts: the Batch Layer, the Serving Layer and the Speed Layer.
The Batch Layer regularly calculates an aggregated version of the data; for example, you might calculate the monthly income once per day. So the current month's income changes every night until the month is over.
But now you want to know the income in real time. Therefore you add a Speed Layer, which applies all events of the current day immediately. Now if a request for the current month's income arrives, you add up the latest result of the Batch Layer and the Speed Layer.
The Serving Layer allows more advanced queries by combining multiple batch results and the Speed Layer results into one query. For example, you can calculate the year's income by summing the monthly incomes.
But as said before, only use the Lambda approach if you need the data often and fast, because it adds extra complexity. Calculations which are rarely needed should be run on the fly. For example: which waiter generates the most income on Saturday evenings?
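A very rough sketch of serving such a request in SQL, assuming a daily_income table maintained by the nightly batch job and an orders table holding the raw events (all names are illustrative):

SELECT
  (SELECT COALESCE(SUM(income), 0)
   FROM daily_income
   WHERE day >= '2016-11-01' AND day < CURDATE())     -- batch layer: pre-aggregated days
+
  (SELECT COALESCE(SUM(quantity * unit_price), 0)
   FROM orders
   WHERE DATE(order_ts) = CURDATE())                  -- speed layer: today's raw events
AS month_to_date_income;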
Example
Restaurants:
| Timestamp | Id | Name |
| ---------- | -- | --------------- |
| 2016-01-01 | 1 | Caesar's palace |
| 2016-11-01 | 2 | Moe's tavern |
Waiters:
| Timestamp | Id | Name | FirstRestaurant |
| ---------- | -- | -------- | --------------- |
| 2016-01-01 | 11 | Julius | 1 |
| 2016-11-01 | 12 | Cleo | 2 |
WaiterRelocationEvents:
| Timestamp | WaiterId | RestaurantId |
| ---------- | -------- | ------------ |
| 2016-06-01 | 11 | 2 |
Products:
| Timestamp | Id | Name | FirstPrice |
| ---------- | -- | -------- | ---------- |
| 2016-01-01 | 21 | Beer | 3.00 |
PriceChangeEvent:
| Timestamp | ProductId | NewPrice |
| ---------- | --------- | -------- |
| 2016-11-01 | 21 | 2.50 |
Orders:
| Timestamp | Id | ProductId | Quantity | WaiterId |
| ---------- | -- | --------- | -------- | -------- |
| 2016-06-14 | 31 | 21 | 2 | 11 |
Now let's get all information about order 31.
- get order 31
- get the price of product 21 at 2016-06-14
  - get the last PriceChangeEvent before that date, or use FirstPrice if none exists
- calculate the total price by multiplying the retrieved price by the quantity
- get waiter 11
- get the waiter's restaurant at 2016-06-14
  - get the last WaiterRelocationEvent before that date, or use FirstRestaurant if none exists
- get the restaurant name from the retrieved restaurant id
As you can see it becomes complicated, therefore you should only keep history of useful data.
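As a rough illustration, the price part of those steps could look like this in SQL (using the table and column names from the example above; the waiter's restaurant would be resolved the same way against WaiterRelocationEvents):

SELECT o.Id,
       o.Quantity * COALESCE(
         (SELECT pce.NewPrice
          FROM PriceChangeEvent pce
          WHERE pce.ProductId = o.ProductId
            AND pce.Timestamp <= o.Timestamp
          ORDER BY pce.Timestamp DESC
          LIMIT 1),
         p.FirstPrice) AS TotalAmount
FROM Orders o
JOIN Products p ON p.Id = o.ProductId
WHERE o.Id = 31;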
I wouldn't involve the relocation events in the calculation. They could be stored, but I would store the restaurant id and the waiter id in the order directly.
The price history on the other hand could be interesting, to check whether orders went down after a price change. Here you could use the Lambda Architecture to calculate a full order with prices from the raw order and the price history.
Summary
Decide which data you want to keep a history of.
Implement Event Sourcing for that data.
Use the Lambda Architecture to speed up commonly used queries.
I like the question as it raises something very straightforward and also something more subtle.
The common principle in both cases is that 'history must not change': if we run a query over a specified past date range today, the results are the same as when we run that same query at any point in the future.
Waiters Case
When a waiter changes restaurants we must not change the history of sales. If waiter Julius sells a drink yesterday in restaurant 1 and then switches to sell more drinks today in restaurant 2, we must retain those details.
Thus we want to be able to answer queries such as ‘how many drinks has Julius sold in restaurant 1’ and ‘how many drinks has Julius sold in all restaurants’.
To achieve this you have to abstract away from Julius as a waiter by introducing the concept of staff. Julius is a member of staff. Staff work as waiters. When working in restaurant 1 Julius is waiter A, and when he works in another restaurant he is waiter B, but always the same member of staff, Julius. With an entity 'Staff' the queries can be answered easily.
Upside:
No loss of historic data or excessive duplication.
Downside: a new entity, Staff, must be managed. But the waiter table content is reduced, so the net overhead of data storage is low.
In summary - abstract data subject to change into a new entity and refer back to it from transactions.
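A minimal sketch of that abstraction in SQL (table and column names are illustrative, and a restaurant table with an id column is assumed to exist):

CREATE TABLE staff (
  id   INT PRIMARY KEY,
  name VARCHAR(100)                      -- Julius, Cleo, ...
);

CREATE TABLE waiter (                    -- one row per posting of a staff member to a restaurant
  id            INT PRIMARY KEY,
  staff_id      INT NOT NULL,
  restaurant_id INT NOT NULL,
  FOREIGN KEY (staff_id)      REFERENCES staff(id),
  FOREIGN KEY (restaurant_id) REFERENCES restaurant(id)
);

-- Transactions keep referencing the waiter (the posting), so
-- "drinks Julius sold in restaurant 1" filters on one waiter row,
-- while "drinks Julius sold anywhere" joins through staff_id.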
Value of Order Case
The extended use case regarding 'what is the value of this order' is more involved. I work with cross-currency transactions, where the value shown to the observer (user) in the price list changes from day to day as currency fluctuations occur.
But there are good reasons to lock the order value in place. For example invoice processing systems have tolerance for a small difference between their expected invoice value and that of the submitted invoice, but any large difference can lead to late payment whilst invoice handlers check the issue. Also, if customers run reports on their historic purchases then the values of those orders must remain consistent despite fluctuations in currency rates over time.
The solution is to save into the order line:
the value of the product in the customer's currency,
or the rate between the customer and supplier currencies,
but ideally do both, to avoid rounding errors.
What this does is provide a statement that 'on the date this order was placed, line 1 cost $44.56 at exchange rate 1.1 $/£'. Having this data locked in allows you to invoice to the customer's expectation and provide consistent spend reports over time.
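A hedged sketch of such an order line (column names are purely illustrative):

CREATE TABLE order_line (
  order_id        INT,
  line_no         INT,
  product_id      INT,
  quantity        INT,
  unit_price      DECIMAL(10,2),   -- price in the supplier currency at order time
  customer_amount DECIMAL(10,2),   -- value in the customer's currency at order time
  fx_rate         DECIMAL(12,6),   -- customer/supplier exchange rate applied
  PRIMARY KEY (order_id, line_no)
);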
Upside: Consistent historic data. Fast database performance, as no look-ups are required against historic rate tables.
Downside: Some data duplication. However, trading this off against the storage and indexation overhead of keeping historic rate tables, it is possibly an upside.
Regarding adding 'amount' to your order table - you have to do this if you want a consistent data history. If you only work in one currency then amount is the only additional storage concern, and by adding this one attribute you have protected history. Your other alternative is to store a historic cost table for drinks, so you know that in January beer was $1, in February it was $1.10, etc., and then store the cost-table key in the transaction so that you can look up the cost if anyone asks about a historic order. But the overhead of storing the key PLUS the indexes needed to make this practicable will outweigh the storage cost of cloning 'amount' onto the order record.
In summary - clone cost data that will change over time.
Related
I'm designing this employee evaluation web page, and was wondering if my current database design is correct or if it could be improved.
This is my current design:
Table Agenda:
+--------------+----------+----------+-----------+------+-------+-------+
| idEvaluation | Location | Employee | #Employee | Date | Date1 | Date2 |
+--------------+----------+----------+-----------+------+-------+-------+
Date is the date scheduled for the evaluation to be performed.
Date1 and Date2 define a period of time used to retrieve some metrics from another database.
Table Evaluations:
+--------------+---------+------------+------+----------+
| idEvaluation | Manager | Department | Date | Comments |
+--------------+---------+------------+------+----------+
Table Scores:
+--------------+----------+-------+
| idEvaluation | idFactor | Score |
+--------------+----------+-------+
idFactor relates to another table which contains the factor and a description of it. Like I said, is this a correct design?
My concern is this: currently there are 60 employees, 11 managers and 12 factors, and each employee is evaluated twice a year by every manager. In the Agenda table there's not much trouble, since it's only one record per evaluation (60 employees = 60 records); however, in the Evaluations table there are 11 records for every evaluation, so it goes up to 660 records (60 employees * 11 managers = 660), and then the Scores table gets even bigger, since there are 12 factors for every evaluation: 7920 records (660 evaluations * 12 factors each = 7920).
Is this normal? Am I doing it wrong? Any input is appreciated.
EDIT
Location, Employee, #Employee, Manager and Department are loaded automatically by the vb.net page; they are "imported" from an Active Directory and checked before insertion, so duplicate names, misspelled names, and that sort of thing are not an issue.
The main idea is that you don't want to repeat string literals.
So if you have:
| id | Department |
| -- | ---------- |
| 1  | Sales      |
| 2  | IT         |
| 3  | Admin      |
then instead of repeating "Sales" many times you only store the id 1, which is smaller, so things also get faster.
Second, if you have users:
| id | user           |
| -- | -------------- |
| 1  | Jhon Alexander |
| 2  | Maria Jhonson  |
If Jhon decides to change his name, you will have to check all tables and change the name everywhere. There is also the problem that if two people have the same name, you won't know which one you are evaluating.
So go for a separate table and use the ID.
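A minimal sketch of that separation (illustrative names; the types are guesses):

CREATE TABLE department (
  id   INT PRIMARY KEY,
  name VARCHAR(50)                      -- 'Sales', 'IT', 'Admin'
);

CREATE TABLE employee (
  id            INT PRIMARY KEY,
  full_name     VARCHAR(100),           -- 'Jhon Alexander', 'Maria Jhonson'
  department_id INT,
  FOREIGN KEY (department_id) REFERENCES department(id)
);

-- Evaluations then store employee.id and department.id instead of the literal
-- strings, so renaming a person means updating a single row, and two people
-- with the same name stay distinguishable by id.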
Taking MySQL as an example DB to perform this in (although I'm not restricted to Relational flavours at this stage) and Java style syntax for model / db interaction.
I'd like the ability to allow versioning of individual column values (and their corresponding types) as and when users edit objects. This is primarily in an attempt to drop the amount of storage required for frequent edits of complex objects.
A simple example might be
- Food (Table)
- id (INT)
- name (VARCHAR(255))
- weight (DECIMAL)
So we could insert an object into the database that looks like...
Food banana = new Food("Banana",0.3);
giving us
+----+--------+--------+
| id | name | weight |
+----+--------+--------+
| 1 | Banana | 0.3 |
+----+--------+--------+
if we then want to update the weight we might use
banana.weight = 0.4;
banana.save();
+----+--------+--------+
| id | name | weight |
+----+--------+--------+
| 1 | Banana | 0.4 |
+----+--------+--------+
Obviously though this is going to overwrite the data.
I could add a revision column to this table, which could be incremented as items are saved, and set a composite key that combines id/revision, but this would still mean storing ALL attributes of this object for every single revision:
- Food (Table)
- id (INT)
- name (VARCHAR(255))
- weight (DECIMAL)
- revision (INT)
+----+--------+--------+----------+
| id | name | weight | revision |
+----+--------+--------+----------+
| 1 | Banana | 0.3 | 1 |
| 1 | Banana | 0.4 | 2 |
+----+--------+--------+----------+
But in this instance we're going to be storing every single piece of data about every single item. This isn't massively efficient if users are making minor revisions to larger objects where Text fields or even BLOB data may be part of the object.
What I'd really like would be the ability to selectively store data discretely, so the weight could possibly be saved in a separate DB in its own right, able to reference the table, row and column that it relates to.
This could then be smashed together with a VIEW of the table, which could impose any later revisions of individual column data into the mix to create the latest version, but without the need to store ALL the data for each small revision.
+----+--------+--------+
| id | name | weight |
+----+--------+--------+
| 1 | Banana | 0.3 |
+----+--------+--------+
+-----+------------+-------------+-----------+-----------+----------+
| ID | TABLE_NAME | COLUMN_NAME | OBJECT_ID | BLOB_DATA | REVISION |
+-----+------------+-------------+-----------+-----------+----------+
| 456 | Food | weight | 1 | 0.4 | 2 |
+-----+------------+-------------+-----------+-----------+----------+
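A purely hypothetical sketch of the sort of view I have in mind (assuming the revision table above is called column_revisions and the highest revision wins):

CREATE VIEW food_latest AS
SELECT f.id,
       COALESCE((SELECT r.BLOB_DATA
                 FROM column_revisions r
                 WHERE r.TABLE_NAME = 'Food' AND r.COLUMN_NAME = 'name'
                   AND r.OBJECT_ID = f.id
                 ORDER BY r.REVISION DESC LIMIT 1),
                f.name) AS name,
       COALESCE(CAST((SELECT r.BLOB_DATA
                      FROM column_revisions r
                      WHERE r.TABLE_NAME = 'Food' AND r.COLUMN_NAME = 'weight'
                        AND r.OBJECT_ID = f.id
                      ORDER BY r.REVISION DESC LIMIT 1) AS DECIMAL(10,2)),
                f.weight) AS weight
FROM Food f;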
Not sure how successful storing any data as a BLOB and then CASTing it back to the original data type might be, but since I was inventing functionality here, why not go nuts.
This method of storage would also be fairly dangerous, since table and column names are entirely subject to change, but hopefully this at least outlines the sort of behaviour I'm thinking of.
A table in 6NF has one CK (candidate key) (in SQL, a PK) and at most one other column. Essentially, 6NF allows each pre-6NF table column's update time/version and value to be recorded in an anomaly-free way. You decompose a table by dropping a non-prime column while adding a table with that column plus the old CK's columns. For temporal/versioning applications you further add a time/version column, and the new CK is the old one plus it.
Adding a column of time/whatever interval (in SQL, start time and end time columns) instead of a time to a CK allows a kind of data compression, by recording the longest uninterrupted stretches of time (or other dimension) through which a column had the same value. One queries by an original CK plus the time whose value you want. You don't need this for your purposes, but the initial process of normalizing to 6NF and the addition of a time/whatever column should be explained in temporal tutorials.
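As a hedged illustration only, the Food example from the question could be decomposed in that direction roughly like this (names and types are illustrative):

CREATE TABLE food        (id INT PRIMARY KEY);
CREATE TABLE food_name   (id INT, revision INT, name VARCHAR(255),
                          PRIMARY KEY (id, revision));
CREATE TABLE food_weight (id INT, revision INT, weight DECIMAL(10,2),
                          PRIMARY KEY (id, revision));
-- Changing only the weight inserts a single row into food_weight;
-- the name history is untouched, so nothing else is duplicated.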
Read about temporal databases (which deal both with "valid" data that is times and time intervals but also "transaction" times/versions of database updates) and 6NF and its role in them. (Snodgrass/TSQL2 is bad, Date/Darwen/Lorentzos is good and SQL is problematic.)
Your final suggested table is an example of EAV. This is usually an anti-pattern. It encodes a database into one or more tables that are effectively metadata. But since the DBMS doesn't know that, you lose much of its functionality. EAV is not called for if DDL is sufficient to manage the tables and columns that you need. Just declare appropriate tables in each database. Which is really one database, since you expect transactions affecting both. From that link:
You are using a DBMS anti-pattern EAV. You are (trying to) build part of a DBMS into your program + database. The DBMS already exists to manage data and metadata. Use it.
Do not have a class/table of metadata. Just have attributes of movies be fields/columns of Movies.
The notion that one needs to use EAV "so every entity type can be extended with custom fields" is mistaken. Just implement via calls that update metadata tables sometimes instead of just updating regular tables: DDL instead of DML.
I've modelled a voting poll for an RDBMS. The structure is a bit more complicated than a conventional voting poll, since users can choose either to vote for an option on the poll or to pass their vote on to another user for a given poll.
My structure looks something like this:
Polls
id | title
----------
1 | Who should be president
Options
id | poll_id | title
--------------------
1 | 1 | Obama
2 | 1 | Bush
Vote
id | poll_id | user_id | vote_type | vote_id
--------------------------------------------
1 | 1 | 1 | option | 1
2 | 1 | 2 | user | 1
In this case, option 1 would receive 2 votes, since user 2 gave his vote to user 1, who voted for option 1.
I realize that the data I am going to store will be fairly complicated to query in an RDBMS if I want to visualise how the votes move between users. However, I don't have much experience with graph databases and would like some hints as to how to go about modelling this.
It's always preferable, when making a DB model, to start with an information design model, and then transform this into a DB model.
In an information design model for your problem, options would be components of polls (so the UML class diagram would have a composition between Option and Poll), and votes would be relationships/links between users and options (so the UML class diagram would have a many-to-many association between Option and User, the instances of which are the votes). In addition, there is a ternary association User-delegates-his-vote-in-Poll-to-User, the instances of which are the delegations.
From this, I get the following DB model:
Poll( id, question)
Option( poll_id, option_sequence_no, possible_vote)
Vote( user_id, poll_id, option_sequence_no, nmr_of_votes)
Delegation( user_id, poll_id, delegate_id)
Of course, we have to add a constraint that the number of votes by a user in a poll is the number of delegations they received plus 1.
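For illustration, a hedged MySQL-flavoured sketch of that model (the Option table is renamed poll_option because OPTION is a reserved word; the constraints are indicative only):

CREATE TABLE poll (
  id       INT PRIMARY KEY,
  question VARCHAR(255)
);

CREATE TABLE poll_option (
  poll_id            INT,
  option_sequence_no INT,
  possible_vote      VARCHAR(255),
  PRIMARY KEY (poll_id, option_sequence_no),
  FOREIGN KEY (poll_id) REFERENCES poll(id)
);

CREATE TABLE vote (
  user_id            INT,
  poll_id            INT,
  option_sequence_no INT,
  nmr_of_votes       INT NOT NULL DEFAULT 1,
  PRIMARY KEY (user_id, poll_id),
  FOREIGN KEY (poll_id, option_sequence_no)
    REFERENCES poll_option(poll_id, option_sequence_no)
);

CREATE TABLE delegation (
  user_id     INT,
  poll_id     INT,
  delegate_id INT,
  PRIMARY KEY (user_id, poll_id),
  FOREIGN KEY (poll_id) REFERENCES poll(id)
);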
I'm putting together a database that I need to normalize and I've run into an issue that I don't really know how to handle.
I've put together a simplified example of my problem to illustrate it:
| Item ID | Mass | Procurement | Currency       | Amount |
| ------- | ---- | ----------- | -------------- | ------ |
| 0       | 2kg  | inherited   | null           | null   |
| 1       | 13kg | bought      | US dollars     | 47.20  |
| 2       | 5kg  | bought      | British Pounds | 3.10   |
| 3       | 11kg | inherited   | null           | null   |
| 4       | 9kg  | bought      | US dollars     | 1.32   |
(My apologies for the awkward table; new users aren't allowed to paste images)
In the table above I have a property (Amount) which is functionally dependent on the Item ID (I think), but which does not exist for every Item ID (since inherited items have no monetary cost). I'm relatively new to databases, but I can't find a similar issue to this addressed in any beginner tutorials or literature. Any help would be appreciated.
I would just create two new tables ItemProcurement and Currencies.
If I'm not wrong, as per the data presented, the amount is part of the procurement of the item itself (when the item has not been inherited). For that reason I would group the Amount and CurrencyID fields in the new entity ItemProcurement.
As you can see, an inherited item wouldn't have an entry in the ItemProcurement table.
Concerning the main Item table, if you expect just two different values for the kind of procurement, then I would use a char(1) column (B => bought, I => inherited).
The data would then look like this:
TABLE Items
+-------+-------+--------------------+
| ID | Mass | ProcurementMethod |
|-------+-------+--------------------+
| 0 | 2 | I |
+-------+-------+--------------------+
| 1 | 13 | B |
+-------+-------+--------------------+
| 2 | 5 | B |
+-------+-------+--------------------+
TABLE ItemProcurement
+--------+-------------+------------+
| ItemID | CurrencyID | Amount |
|--------+-------------+------------+
| 1 | 840 | 47.20 |
+--------+-------------+------------+
| 2 | 826 | 3.10 |
+--------+-------------+------------+
TABLE Currencies
+------------+---------+-----------------+
| CurrencyID | ISOCode | Description |
|------------+---------+-----------------+
| 840 | USD | US dollars |
+------------+---------+-----------------+
| 826 | GBP | British Pounds |
+------------+---------+-----------------+
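For completeness, a hedged sketch of how the original flat view could be reassembled from these tables (inherited items simply come back with NULL currency and amount):

SELECT i.ID, i.Mass, i.ProcurementMethod,
       c.Description AS Currency,
       p.Amount
FROM Items i
LEFT JOIN ItemProcurement p ON p.ItemID = i.ID
LEFT JOIN Currencies c      ON c.CurrencyID = p.CurrencyID;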
Not only Amount, everything is dependent on ItemID, as this seems to be a candidate key.
The dependence you have is that Currency and Amount are NULL (I guess this means Unknown/Invalid) when the Procurement is 'inherited' (or 0 cost, as pointed out by #XIVsolutions, and as you mention, "inherited items have no monetary cost").
In other words, items are divided into two types (of procurement), and items of one of the two types do not have all attributes.
This can be solved with a supertype/subtype split. You have a supertype table (Item) and two subtype tables (ItemBought and ItemInherited), where each of them has a 1::0..1 relationship with the supertype table. The attributes common to all items go in the supertype table and every other attribute in the respective subtype table:
Item
----------------------------
ItemID Mass Procurement
0 2kg inherited
1 13kg bought
2 5kg bought
3 11kg inherited
4 9kg bought
ItemBought
---------------------------------
ItemID Currency Amount
1 US dollars 47.20
2 British Pounds 3.10
4 US dollars 1.32
ItemInherited
-------------
ItemID
0
3
If there is no attribute that only inherited items have, you can even skip the ItemInherited table altogether.
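A minimal sketch of the supertype/subtype tables with a shared primary key (types are illustrative):

CREATE TABLE Item (
  ItemID      INT PRIMARY KEY,
  Mass        DECIMAL(10,2),
  Procurement VARCHAR(10)
);

CREATE TABLE ItemBought (
  ItemID   INT PRIMARY KEY,              -- same value as the supertype row's key
  Currency VARCHAR(30),
  Amount   DECIMAL(10,2),
  FOREIGN KEY (ItemID) REFERENCES Item(ItemID)
);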
For other questions relating to this pattern, look up the tag: Class-Table-Inheritance. While you're at it, look up Shared-Primary-Key as well. For a more conceptual treatment, google "ER Specialization".
Here is my off-the-cuff suggestion:
UPDATE: Mass would be a Float/Decimal/Double depending upon your Db, Cost would be whatever the optimal type is for handling money (in SQL Server 2008, it is "Money" but these things vary).
ANOTHER UPDATE: The cost of an inherited item should be zero, not null (and in fact there sometimes IS an indirect cost, in the form of taxes, but I digress . . .). Therefore, your Item table should require a value for cost, even if that cost is zero. It should not be null.
Let me know if you have questions . . .
Why do you need to normalise it?
I can see some data integrity challenges, but no obvious structural problems.
The implicit dependency between "procurement" and the presence or absence of the value/currency is tricky, but it has nothing to do with the keys and so is not a big deal, practically.
If we are to be purists (e.g. this is for homework purposes), then we are dealing with two types of item, inherited items and bought items. Since they are not the same type of thing, they should be modelled as two separate entities i.e. InheritedItem and BoughtItem, with only the columns they need.
In order to get a combined view of all items (e.g. to get a total weight), you would use a view or a UNION SQL query.
If we are looking to object-model in the database, then we can factor out the common supertype (Item) and model the subtypes (InheritedItem, BoughtItem) with foreign keys to the supertype table (ypercube's explanation is very good), but this is very complicated and less future-proof than modelling only the subtypes.
This last point is the subject of much argument, but practically, in my experience, modelling concrete supertypes in the database leads to more pain later than leaving them abstract. Okay, that's probably waaay beyond what you wanted :).
I'm writing a simple booking program for a car rental (a school assignment). My buddy and I are trying to make the system a little more advanced than the assignment dictates, but we're having some problems we hoped you could help us with.
The idea is that you can reserve a certain car type, and when you get the car it will be one of that type (you don't reserve a specific car, as our assignment dictates, but only a type). Only one customer can have the car on a specific date. As the reservations tick in, we have to make sure that we don't hire out more cars of each type than we've got. The reservations are basically stored with a start date, an end date, and a car type.
If we ignore the car type for now (let's say we only have one type), then the reservations could graphically look something like this:
1/12 2/12 3/12 4/12 5/12 6/12 7/12
|-------------------|
|-----------------|
|-----|
|-------|
|-----------|
|-------------|
If the rental only has three cars, it would be possible to rent a car from 3/12 to 5/12, since all those days have only 2 reservations. But how do we know this? Do we have to check each date and count() the number of reservations that span that date?
And what if somebody had reserved a car on 4/12? Then 3/12 and 5/12 would still only have 2 reservations, but 4/12 would have 3.
Would it be possible to do this with a query somehow, or do we have to step through each date in the program to check that the number of reservations doesn't exceed the number of cars?
(This is easy enough with only full dates, but consider the scenario where you could rent the cars on an hourly basis, not only daily as here. Then it could be a tough one to step through each hour if we have a lot of reservations and cars and the timespan is long...)
Hope you have some nice ideas that will help us along. Thanks for taking the time to read the question :)
Mikkel, Denmark
Assume you have the following reservation situation in real life:
1/12 2/12 3/12 4/12 5/12 6/12 7/12
Car1: |-------------------|
Car2: |-----------------|
Car3: |-------| |-----------| |-----|
Car4: |-------------|
Table car
| id | type | registration |
| 1 | 1 | HH1111 |
| 2 | 1 | HH3333 |
| 3 | 2 | HH77 |
| 4 | 3 | DD999 |
Table reservation
| car_id | date_from | date_to |
| 1 | 2013-12-01 | 2013-12-04 |
| 2 | 2013-12-04 | 2013-12-07 |
| 3 | 2013-12-01 | 2013-12-02 |
| 3 | 2013-12-03 | 2013-12-05 |
| 3 | 2013-12-06 | 2013-12-07 |
| 4 | 2013-12-01 | 2013-12-03 |
Now, by really simple logic, you must select all available cars for the period
from 2013-12-05 to 2013-12-06:
"Select ALL cars which do not have any reservation whose dates block them for that period"
with a brilliant MySQL select:
select * from car
where not exists ( select * from reservation
                   where car.id = reservation.car_id AND
                         date_from < '2013-12-06' AND
                         date_to > '2013-12-05' )
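Since the assignment actually reserves a car type rather than a specific car, the same NOT EXISTS pattern can be grouped by type to count how many cars of each type remain free for the requested period (a rough sketch using the tables above):

select c.type, count(*) as available_cars
from car c
where not exists ( select * from reservation
                   where c.id = reservation.car_id AND
                         date_from < '2013-12-06' AND
                         date_to > '2013-12-05' )
group by c.type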
"Would it be possible to do with a query some how, or do we have to step through each date in the program to check the number of reservations didn't exceed the number of cars? (This is easy enough with only full dates,"
The nature of your problem is that a violation of the constraint could appear on any individual date. So logically speaking, it is indeed necessary to do the check for each individual date comprised in a new reservation. The only possible optimisation is to do the check at the level of "smallest intervals". To do that, you must first compute all the intervals that already appear in the database and which overlap with your new reservation.
For example, a new reservation for 4/12-6/12 would have to be split into 4/12-5/12 (second line) and 5/12-6/12 (third line). Those individual intervals might be longer than one single day, and you can do the checks at the level of those individual intervals. (They are the same as individual days in this particular example, but a reservation 7/12-19/12 would not have to be split at all.)
However, computing this might prove difficult, and there's another caveat: when you're looking at multi-row inserts, you should also be splitting over the other rows to be inserted (and that requires you to record all the inserted rows in a temporary table, otherwise you won't be able to access them).
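As a rough sketch of that per-interval check (illustrative names; it assumes reservations are stored per car type, as in the original question, with half-open from/to dates): the number of overlapping reservations inside a requested period can only increase at the period start or at a reservation's start date, so it is enough to count at those points.

select max(cnt) as peak_reservations
from (
  select count(*) as cnt
  from reservation r
  join ( select '2013-12-03' as d                 -- requested start date
         union
         select date_from from reservation
         where type_id = 1
           and date_from > '2013-12-03'
           and date_from < '2013-12-05' ) pts
    on r.type_id = 1
   and r.date_from <= pts.d
   and r.date_to   >  pts.d
  group by pts.d
) per_point;
-- the new booking 3/12-5/12 fits if peak_reservations is still below the
-- number of cars of that type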