Dealing with not quite unique data? - database

What do people do if there is non-unique data on files which can and should be joined?
My example is for customer data. One file would track the start of an interaction and how long it took from a system perspective. The other file would keep track of the interaction when the employee logs it - this is typically done at the end of the interaction but there can be delays. So there is no way to match up the timestamps between File 1 and File 2. I would want to determine the Duration and Rating for specific Issue types across 3 files.
I typically create an index (in pandas) which is Date | CustomerID | EmployeeID, which works decently most of the time (that customer interacted with that employee on that date). But sometimes the same customer interacts with the same employee more than once on the same day, so I have a duplicate value. This didn't bother me before, until I noticed my joins (pd.merge) caused duplicate data and, by chance, an outlier interaction was duplicated, which threw off some analysis.
Should I completely drop any interaction with duplicates? Or should I create a more unique ID based on some kind of time interval - for example, matching rows where the EndDatetime is within X minutes of the Datetime in another file (which is normally close to the end of the interaction)?
File 1:
StartDatetime | CustomerID | EmployeeID | Duration | EndDatetime
File 2:
Datetime | CustomerID | EmployeeID | Issue
File 3:
Datetime | CustomerID | EmployeeID | Rating

I believe the correct answer to this question depends more on the use cases for your data than anything else. Personally, I deal with interaction data a lot, and in those cases I prefer indexing by interaction time as well, since two interactions at different times are truly distinct. However, if the analysis I'm performing doesn't take into account the number of interactions taking place, just the parties involved, dropping duplicate interactions is preferred. In other cases grouping is preferable, but since each interaction in your example appears to be truly independent, grouping seems ill-advised: the only criterion you could naturally group on would be Rating, and it seems like a bad decision to aggregate that separately from whatever analytics you're performing.
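If you do pursue the time-window idea from the question, the matching rule is straightforward to express as a windowed join (pd.merge_asof with a tolerance is the pandas analogue). A rough SQL sketch, assuming the files are loaded into tables file1 and file2 and using a hypothetical 15-minute tolerance:

select f1.StartDatetime, f1.CustomerID, f1.EmployeeID, f1.Duration, f2.Issue
from file1 f1
join file2 f2
  on f2.CustomerID = f1.CustomerID
 and f2.EmployeeID = f1.EmployeeID
 -- PostgreSQL-style interval syntax; adjust for your engine
 and f2.Datetime between f1.EndDatetime
                     and f1.EndDatetime + interval '15 minutes';

Anything that still matches more than one candidate after this is genuinely ambiguous and can then be reviewed or dropped, rather than dropping every duplicate outright.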

Related

What is the best way to indicate a value that has special handling in a database table?

The setup.
I have a table that stores a list of physical items for a game. Items also have a hierarchical list of categories. Example base table:
Items
id | parent_id | is_category | name | description
-- | --------- | ----------- | ------- | -----------
1 | 0 | 1 | Weapon | Something intended to cause damage
2 | 1 | 1 | Ranged | Attack from a distance
3 | 1 | 1 | Melee | Must be able to reach the target with arm
4 | 2 | 0 | Musket | Shoots hot lead.
5 | 2 | 0 | Bomb | Fire damage over area
6 | 0 | 1 | Mount | Something that carries a load.
7 | 6 | 0 | Horse | It has no name.
8 | 6 | 0 | Donkey | Don't assume or you become one.
The system is currently running on PHP and SQLite, but the database back-end is flexible and may use MySQL, and the front-end may eventually use JavaScript or Objective-C/Swift.
The problem.
In the sample above the program must apply different special handling to each of the top level categories and the items underneath them, e.g. Weapon and Mount are sold by different merchants, and weapons may be carried while a mount cannot.
What is the best way to flag the top level tiers in code for special handling?
While the top level categories are relatively fixed I would like to keep them in the DB so it is easier to generate the full hierarchy for visualization using a single (recursive) function.
Nearly all foreign keys that identify an item may also identify an item category so separating them into different tables seemed very clunky.
My thoughts.
I can use a string match on the name and store the id in an internal constant upon first execution. An ugly solution at best that I would like to avoid.
I can store the id in an internal constant at install time. Better, but still not quite what I prefer.
I can store an array in code of the top level elements instead of putting them in the table. This creates a lot of complications like how does a child point to the top level parent. Another id would have to be added to the table that is used by like 100 of the 10K rows.
I can store an array in code and enable identity insert at install time to add the top level elements sharing the identity of the static array. Probably my best idea, but I don't really like identity insert; it just doesn't feel "database" to me. Also, what if a new top level item appears? Maybe start the ids at 1 million for these categories?
I can add a flag column "varchar(1) top_category" or "int top_category" with a character or bit-map indicating the value. Again a column used on like 10 of 10k rows.
As a software person I tend to find software solutions, so I'm curious if there is a more DB-type solution out there.
Original table, with a join to actions.
Yes, you can put everything in a single table. You'd just need to establish unique rows for every scenario. This sqlfiddle gives you an example... but IMO it starts to become difficult to make sense of. This doesn't take care of all scenarios, due to not being able to do full joins (just a limitation of sqlfiddle, which is awesome otherwise).
IMO, breaking things out into tables makes more sense. Here's another example of how I'd start to approach a schema design for some of the scenarios you described.
The base tables themselves look clunky, but it gives you so much more flexibility in how the data is used.
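A minimal sketch of what such a breakout might look like (SQLite syntax; the table and column names here are illustrative, not the contents of the original fiddle):

CREATE TABLE categories (
  id        INTEGER PRIMARY KEY,
  parent_id INTEGER REFERENCES categories(id),  -- NULL for a top-level category
  name      TEXT NOT NULL
);

CREATE TABLE items (
  id          INTEGER PRIMARY KEY,
  category_id INTEGER NOT NULL REFERENCES categories(id),
  name        TEXT NOT NULL,
  description TEXT
);

CREATE TABLE category_traits (
  category_id INTEGER NOT NULL REFERENCES categories(id),
  trait       TEXT NOT NULL    -- e.g. 'carryable', 'sold_by_stable_master'
);

The special handling then hangs off data (a trait row on the category) rather than off a hard-coded id in the application.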
tl;dr analogy ahead
A database isn't a list of outfits, organized in rows. It's where you store the clothes that make up an outfit.
So the clunky feel of breaking things out into separate tables is actually the benefit of relational databases. Putting everything into a single table feels efficient and optimized at first... but as the complexity expands, it starts to become a pain.
Think of your schema as a dresser. Drawers are your tables. If you only have a few socks and a little underwear, putting them all in one drawer is efficient. But once you get enough socks, it can become a pain to have them all in the same drawer as your underwear. You have dress socks, crew socks, ankle socks, furry socks. So you put them in another drawer. Once you have shirts, shorts, and pants, you start putting them in drawers too.
The push to put all data into a single table is often driven by how you intend to use the data.
Assuming your dresser is fully stocked and neatly organized, you have several potential unique outfits, all neatly organized in your dresser. You just need to put them together. Selects and joins are how you would assemble those outfits. The fact that your favorite jeans/t-shirt/sock combo isn't all in one drawer doesn't make it clunky or inefficient. The fact that they are separated and organized allows you to:
1. Quickly know where to get each item
2. See potential other new favorite combos
3. Quickly see what you have of each component of your outfit
There's nothing wrong with choosing to think of the outfit first, then how you will put it away later. If you only have one outfit, putting everything in one drawer is way easier than putting each piece in a separate drawer. However, as you expand your wardrobe, the single drawer for everything starts to become inefficient.
You typically want to plan for expansion and versatility. Your program can put the data together however you need it; a well-organized schema can do that for you. Whether you use an ORM and do model-driven data storage, or start with the schema and then build models based on the schema, the more complex your data requirements become, the more similar both approaches become.
A relational database is meant to store entities in tables that relate to each other. Very often you'll see examples of a company database consisting of departments, employees, jobs, etc. or of stores holding products, clients, orders, and suppliers.
It is very easy to query such database and for example get all employees that have a certain job in a particular department:
select *
from employees
where job_id = (select id from job where name = 'accountant')
and dept_id = (select id from departments where name = 'buying');
You on the other hand have only one table containing "things". One row can relate to another meaning "is of type". You could call this table "something". And were it about company data, we would get the job thus:
select *
from something
where description = 'accountant'
and parent_id = (select id from something where description = 'job');
and the department thus:
select *
from something
where description = 'buying'
and parent_id = (select id from something where description = 'department');
These two would still have to be related by persons working in a department in a job. A mere "is type of" doesn't suffice then. The short query I've shown above would become quite big and complex with your type of database. Imagine the same with a more complicated query.
And your app would either not know anything about what it's selecting (well, it would know it's a something of some type, and another something of some type, and that the person, if you go so far as to introduce a person table, is somehow connected with these two things), or it would have to know what the description "department" means and what the description "job" means.
Your database is blind. It doesn't know what a "something" is. If you make a programming mistake some time (most of us do), you may even store wrong relations (a Donkey is of type Musket and hence "shoots hot lead" while you can ride it), and your app may crash at one point or another, unable to deal with a query result.
Don't you want your app to know what a weapon is and what a mount is? That a weapon enables you to fight and a mount enables you to travel? So why make this a secret? Do you think you gain flexibility? Well, then add food to your table without altering the app. What will the app do with this information? You see, you must code this anyway.
Separate entity from data. Your entities are weapons and mounts so far. These should be tables. Then you have instances (rows) of these entities that have certain attributes. A bomb is a weapon with a certain range for instance.
Tables could look like this:
person (person_id, name, strength_points, ...)
weapon (weapon_id, name, range_from, range_to, weight, force_points, ...)
person_weapon(person_id, weapon_id)
mount (mount_id, name, speed, endurance, ...)
person_mount(person_id, mount_id)
food (food_id, name, weight, energy_points, ...)
person_food (person_id, food_id)
armor (armor_id, name, protection_points, ...)
person_armor <= a table for m:n or a mere person.id_armor for 1:n
...
This is just an example, but it shows clearly what entities your app is dealing with. It knows weapons and food are something the person carries, so these can only have a maximum total weight for a person. A mount is something to use for transport and can make a person move faster (or carry weight, if your app and tables allow for that). Etc.
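As an illustration of that last point, a sketch of how the entity tables make the total carried weight easy to compute (column names taken from the example tables above):

select p.person_id, p.name, sum(x.weight) as carried_weight
from person p
join (
      select pw.person_id, w.weight
      from person_weapon pw
      join weapon w on w.weapon_id = pw.weapon_id
      union all
      select pf.person_id, f.weight
      from person_food pf
      join food f on f.food_id = pf.food_id
     ) x on x.person_id = p.person_id
group by p.person_id, p.name;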

What is the correct database structure to store historical data?

I am designing a database in Sqlite that is intended to help facilitate the creation of predictive algorithms for FIRST Robotics Competition. On the surface, things looked pretty easy, but I'm struggling with a single issue: How to store past ratings of a team. I have looked over previous questions pertaining to how historical data is stored, but I'm not sure any of it applies well to my situation (though it could certainly be that I just don't understand it well enough).
Each team has an individual rating, and after each match that team participates in the rating gets revised. Now, there are several ways I could go about storing them, but none of them seem particularly good. I'll go through the ones that I have thought through, in no particular order.
Option 1:
Each team has its own table. It would include the match_id and the rating after the match was done, and could possibly also include the rating before. The problem is that there would be bordering on 10,000 tables. I'm pretty sure that's inefficient, especially considering I believe it's also unnormalized (correct me if I'm wrong).
Table Name: Team_id
match_id | rating_after
Option 2:
Historical ratings for each team are stored in the match table, and current ratings are stored in the team table. A simplified version of the team table looks like this:
Table : Team_list
team_id | team_name | team_rating
That isn't really the problem; the problem is with the historical data. The historical data would be stored with the match. Likely, it would be each team's rating before the match.
The problem I have with this one is how tough a search it will be to find the previous ratings. This comes from the structure of how FRC works. There are 3 teams on each side (forming what is known as an alliance) for a total of 6 teams. (These alliances are normally designated by the colors Red and Blue.)
These alliances are randomly assigned ahead of time, and could include any team at the event, on either side. In other words, the match table would look like this (simplified):
Table: match_table
match_id | Red1 | Red2 | Red3 | Blue1 | Blue2 | Blue3 | RedScore | BlueScore | Red1Rating | Red2Rating | etc.....
So each team has to be included in the match info, as well as a rating for each team. If I were to create more than one rating (such as an updated rating design that I want to do a pure comparison test with), things could get clogged really fast.
In order to find the previous rating for team # 67, for instance, I'd have to search Red1, Red2, Red3, Blue1, etc. and then look at the column that pertains to the position, all while being sure that this really is the most recent match.
Note: This might involve knowing not only the year of the data, the week it was taken in (I would get this data from a join with an event table), but the match level(whether it was qualifications or playoffs), and match #(which is not match_id).
Sure, this option is normalized, but it's also got a weird search pattern, and isn't easy from a front-end standpoint (I might build a front-end for some of the data in the future, so I want to keep that in mind as well).
My question: Is there an easier/more efficient option that I am missing?
Because both designs feel somewhat inefficient. The first has too many tables, the other has a table that will have well over 100,000 entries and will have to be searched in a convoluted pattern. I feel as if there is some simple design solution that I simply haven't thought of.
There's only one sane answer:
team_rating:
team_id, rating, start_date, end_date
Make all ranges closed by using the creation date of the team as the first rating's start_date, and some arbitrarily distant future date (e.g. 2199-01-01) as the end_date for the current row, with all dates being inclusive.
Queries to find the rating at any date are then a simple
select rating
from team_rating
where team_id = $id
and $date between start_date and end_date
and rating history is just
select start_date, rating
from team_rating
where team_id = $id
order by start_date
It's key that both start and end dates are stored, otherwise the queries are trainwrecks.
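When a rating changes, maintenance is just a pair of statements. A sketch in SQLite syntax, using the same placeholder style as above; the old row is closed on the day before the new rating takes effect so the inclusive ranges never overlap:

update team_rating
   set end_date = date($change_date, '-1 day')
 where team_id = $id
   and end_date = '2199-01-01';

insert into team_rating (team_id, rating, start_date, end_date)
values ($id, $new_rating, $change_date, '2199-01-01');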

Change Data Capture and SQL Server Analysis Services

I'm designing a database application where data is going to change over time. I want to persist historical data and allow my users to analyze it using SQL Server Analysis Services, but I'm struggling to come up with a database schema that allows this. I've come up with a handful of schemas that could track the changes (including relying on CDC) but then I can't figure out how to turn that schema into a working BISM within SSAS. I've also been able to create a schema that translates nicely in to a BISM but then it doesn't have the historical capabilities I'm looking for. Are there any established best practices for doing this sort of thing?
Here's an example of what I'm trying to do:
I have a fact table called Sales which contains monthly sales figures. I also have a regular dimension table called Customers which allows users to look at sales figures broken down by customer. There is a many-to-many relationship between customers and sales representatives, so I can make a reference dimension called Responsibility that refers to the customer dimension and a Sales Representative reference dimension that refers to the Responsibility dimension. I now have the Sales facts linked to Sales Representatives by the chain of reference dimensions Sales -> Customer -> Responsibility -> Sales Representative, which allows me to see sales figures broken down by sales rep. The problem is that the Sales facts aren't the only things that change over time. I also want to be able to maintain a history of which Sales Representative was responsible for a Customer at the time of a particular Sales fact. I also want to know where the Sales Representative's office was located at the time of a particular Sales fact, which may be different than his current location. I might also want to know the size of a customer's organization at the time of a particular Sales fact, which might also be different than it is currently. I have no idea how to model this in a BISM-friendly way.
You mentioned that you currently have a fact table which contains monthly sales figures. So one record per customer per month. So each record in this fact table is actually an aggregation of individual sales "transactions" that occurred during the month for the corresponding dimensions.
So in a given month, there could be 5 individual sales transactions for $10 each for customer 123...and each individual sales transaction could be handled by a different Sales Rep (A, B, C, D, E). In the fact table you describe there would be a single record for $50 for customer 123...but how do we model the SalesReps (A-B-C-D-E)?
Based on your goals...
to be able to maintain a history of which Sales Representative was Responsible for a Customer at the time of a particular Sales fact
to know where the Sales Representative's office was located at the time of a particular sales fact
to know the size of a customer's organization at the time of a particular Sales fact
...I think it would be easier to model at a lower granularity...specifically a sales-transaction fact table which has a grain of 1 record per sales transaction. Each sales transaction would have a single customer and a single sales rep.
FactSales
DateKey (date of the sale)
CustomerKey (customer involved in the sale)
SalesRepKey (sales rep involved in the sale)
SalesAmount (amount of the sale)
Now for the historical change tracking...any dimension with attributes for which you want to track historical changes will need to be modeled as a "Slowly Changing Dimension" and will therefore require the use of "Surrogate Keys". So for example, in your customer dimension, Customer ID will not be the primary key...instead it will simply be the business key...and you will use an arbitrary integer as the primary key...this arbitrary key is referred to as a surrogate key.
Here's how I'd model the data for your dimensions...
DimCustomer
CustomerKey (surrogate key, probably generated via IDENTITY function)
CustomerID (business key, what you will find in your source systems)
CustomerName
Location (attribute we wish to track historically)
-- the following columns are necessary to keep track of history
BeginDate
EndDate
CurrentRecord
DimSalesRep
SalesRepKey (surrogate key)
SalesRepID (business key)
SalesRepName
OfficeLocation (attribute we wish to track historically)
-- the following columns are necessary to keep track of historical changes
BeginDate
EndDate
CurrentRecord
FactSales
DateKey (this is your link to a date dimension)
CustomerKey (this is your link to DimCustomer)
SalesRepKey (this is your link to DimSalesRep)
SalesAmount
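For concreteness, a minimal DDL sketch of DimCustomer (SQL Server syntax assumed; the column types are illustrative):

CREATE TABLE DimCustomer (
    CustomerKey   INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    CustomerID    INT           NOT NULL,         -- business key from the source system
    CustomerName  NVARCHAR(100) NOT NULL,
    Location      NVARCHAR(100) NOT NULL,         -- historically tracked attribute
    BeginDate     DATE          NOT NULL,
    EndDate       DATE          NOT NULL,
    CurrentRecord BIT           NOT NULL
);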
What this does is allow you to have multiple records for the same customer.
Ex. CustomerID 123 moves from NC to GA on 3/5/2012...
CustomerKey | CustomerID | CustomerName | Location | BeginDate | EndDate | CurrentRecord
1 | 123 | Ted Stevens | North Carolina | 01-01-1900 | 03-05-2012 | 0
2 | 123 | Ted Stevens | Georgia | 03-05-2012 | 01-01-2999 | 1
The same applies with SalesReps or any other dimension in which you want to track the historical changes for some of the attributes.
So when you slice the sales transaction fact table by CustomerID, CustomerName (or any other non-historically-tracked attribute), you should see a single record with the facts aggregated across all transactions for the customer. And if you instead decide to analyze the sales transactions by CustomerName and Location (the historically tracked attribute), you will see a separate record for each "version" of the customer location, corresponding to the sales amount while the customer was in that location.
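A sketch of that second case using the tables above: because each fact row carries the surrogate key of whichever customer "version" was current at the time of the sale, grouping by the tracked attribute splits the totals automatically.

SELECT c.CustomerName, c.Location, SUM(f.SalesAmount) AS TotalSales
FROM FactSales f
JOIN DimCustomer c ON c.CustomerKey = f.CustomerKey
GROUP BY c.CustomerName, c.Location;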
By the way, if you have some time and are interested in learning more, I highly recommend the Kimball bible "The Data Warehouse Toolkit"...which should provide a solid foundation on dimensional modeling scenarios.
The established best-practice way of doing what you want is a dimensional model with slowly changing dimensions. Sales reps are frequently used to illustrate the usefulness of SCDs. For example, sales managers with bonuses tied to the performance of their teams don't want their totals to go down if a rep transfers to a new territory. SCDs are perfect for tracking this sort of thing (and the situations you describe) and allow you to see what things looked like at any point historically.
Spend some time on Ralph Kimball's website to get started. The first 3 articles I'd recommend you read are Slowly Changing Dimensions, Slowly Changing Dimensions Part 2, and The 10 Essential Rules of Dimensional Modeling.
Here are a few things to focus on in order to be successful:
You are not designing a 3NF transactional database. Get comfortable with denormalization.
Make sure you understand what grain means and explicitly define the grain of your database.
Do not use natural keys as keys, and do not bake any intelligence into your surrogate keys (with the exception of your time keys).
The goals of your application should be query speed and ease of understanding and navigation.
Understand type 1 and type 2 slowly changing dimensions and know where to use them.
Make sure you have a sponsor on the business side with the power to "break ties". You will find different people in the organization with different definitions of the same thing, and you need an enforcer with the power to make decisions. To see what I mean, ask 5 different people in your organization to define "customer" or "gross profit". You'll be lucky to get 2 people to define either the same way.
Don't try to wing it. Read The Data Warehouse Lifecycle Toolkit and embrace the ideas, even if they seem strange at first. They work.
OLAP is powerful and can be life changing if implemented skillfully. It can be an absolute nightmare if it isn't.
Have fun!

Database Optimization - Store each day in a different column to reduce rows

I'm writing an application that stores different types of records by user and day. These records are divided into categories.
When designing the database, we create a table User and then for each record type we create a table RecordType and a table Record.
Example:
To store data related to user events we have the following tables:
Event
UserId | EventTypeId | Value | Day

EventType
Id | Name
Our boss pointed out (with some reason) that we're gonna store a lot of rows (Users * Days) and suggested an idea that seems a little crazy to me: create a table with a column for each day of the year, like so:
EventTypeId | UserId | Year | 1 | 2 | 3 | 4 | ... | 365 | 366
This way we only have 1 row per user per year, but we're gonna get pretty big rows.
Since most ORMs (we're going with Rails 3 for this project) use select * to get the database records, aren't we optimizing one thing only to "deoptimize" another?
What are the community's thoughts about this?
This is a violation of First Normal Form. It's an example of repeating groups across columns.
Example of why this is bad: Write a query to find which day a given event occurred. You'll need a WHERE clause with 366 terms, separated by OR. This is tedious to write, and impossible to index.
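A sketch of both versions, assuming the wide table's day columns are named d1 through d366:

-- wide table: which day held value 42 for this user/event/year?
SELECT *   -- you still have to inspect 366 columns to learn WHICH day matched
FROM EventWide
WHERE UserId = 1 AND EventTypeId = 5 AND Year = 2012
  AND (d1 = 42 OR d2 = 42 OR d3 = 42 -- ... 362 more terms ...
       OR d366 = 42);

-- normalized Event table: the day comes back directly, and Value can be indexed
SELECT Day
FROM Event
WHERE UserId = 1 AND EventTypeId = 5 AND Value = 42;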
Relational databases are designed to work well even if you have a lot of rows. Say you have 10000 users, and on average every user generates 10 events every day. After 10 years, you will have 10000*366*10*10 rows, or 366,000,000 rows. That's a fairly large database, but not uncommon.
If you design your indexes carefully to match the queries you run against this data, you should be able to have good performance for a long time. You should also have a strategy for partitioning or archiving old data.
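The archiving part can be as simple as periodically moving old rows to a twin table. A sketch, assuming an EventArchive table with the same layout as Event and that Day is a date column (run both statements in one transaction):

INSERT INTO EventArchive
SELECT * FROM Event WHERE Day < '2010-01-01';

DELETE FROM Event WHERE Day < '2010-01-01';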
That breaks the database normal form principles:
http://databases.about.com/od/specificproducts/a/normalization.htm
If it's applicable, why don't you replace the Day columns with a DateTime column in your Event table with a default value (GETDATE(), if we are talking about SQL Server)?
Then you could group by date...
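A sketch of that suggestion in SQL Server syntax (the column name EventDate is just an example):

-- one datetime column that defaults to "now" instead of 366 day columns
ALTER TABLE Event ADD EventDate DATETIME NOT NULL DEFAULT GETDATE();

-- daily counts per user, grouping on the date part
SELECT UserId, CAST(EventDate AS DATE) AS EventDay, COUNT(*) AS Events
FROM Event
GROUP BY UserId, CAST(EventDate AS DATE);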
I wouldn't do it. As long as you take the time to index the table appropriately, the database server should work well with tables that have lots of rows. If it's significantly slowing down your database performance, I'd start by making sure your queries aren't forcing a lot of full table scans.
Some other potential problems I see:
It probably will hurt ORM performance.
It's going to create maintainability problems down the road. You probably don't want to be working with objects that have 366 fields for every day of the year, so there's probably going to have to be a lot of boilerplate glue code to keep track of them.
Any query that wants to search against a range of dates is going to be an unholy mess.
It could be even more wasteful of space. These rows are big, and the number of rows you have to create for each customer is going to be the sum of the maximum number of times each different kind of event happened in a single day. Unless the rate at which all of these events happens is very constant and regular, those rows are likely to be mostly empty.
If anything, I'd suggest sharding the table based on some other column instead if you really do need to get the table size down. Perhaps by UserId or year?

Alternative approach to database design: "verticality"

I would like to ask someone who has experience in database design. This is my idea, and I can't assess the deeper consequences of such an approach to, let's say, a common problem. I appreciate your comments in advance...
Imagine:
- patients in hospital
- each patient should have:
1. personal data - Name, Surname, Street, SecurityID, contact, and many more (which could be changed over time)
2. personal records - a heap of various forms (also changing over time)
Typically I would design a table for the patient's personal data:
personaldata_tbl
| ID | SecurityID | Name | Surname ... | TimeOfEntry
and similar tables for each form in the program. This could be a very hard task, because it could reach several hundred such tables. In addition, their count will probably keep growing.
And yes, all of them should be relationally connected for example:
releaseform_tbl
| ID | personaldata_tbl_ID | DateOfRelease | CauseOfRelease ... | TimeOfEntry
My intention is to reduce the 2D tables to a single 1D table - all data about patients would be stored in one table! Other tables will describe (referentially) what kind of data is stored in the main table. Look at this:
data_info_tbl
| ID | Description |
| 1 | Name |
| 2 | Surname |
patient_data_tbl
| ID | patient_ID | data_info_ID | form_ID | TimeOfEntry | Value
| 1 | 121 | 1 | 7 | 17.12.2011 14:34 | John
| 2 | 121 | 2 | 7 | 17.12.2011 14:34 | Smith
The main reason, why this approach attracts me is:
- simplicity
- ability to store any data with appropriate specification and precision
- no table jungle
Contras:
- SQL querying could be problematic in some cases
- there should be a reliable algorithm to delete, update, and insert data (one way is to dynamically create a table, perform operations on it, and finally store it)
- data-aware controls won't be used.
So what would you say?
Thanks for your time and answers.
The most obvious problems:
You lose control of size. The "Value" column must be big enough to hold the largest type you use, which in the general case must be a blob. (X-ray images, in a hospital database.)
You lose data types. PostgreSQL, for example, includes the data types "point", bit string, internet address, cidr address, MAC address, and UUID. Storing all values in a column of a single type means you lose all the type-safety built into the specific data types.
You lose constraints. Some integers need to be constrained to between 1 and 10, others between 1000 and 3000. Some text strings need to be all numbers (ZIP codes), some need to be a particular mix of alpha and numerics (tire sizes).
You lose scalability. If there are 1000 attributes in a person's medical records, each person's data will take 1000 rows in the table. If you have 100,000 patients--an easily manageable number even in Microsoft Access and SQLite--your table suddenly balloons from a manageable 100,000 rows to 100,000,000 rows. Any query that does a table scan will have to scan 100 million rows, every time. Any single query that needs to return, say, 30 attributes will need 30 joins.
What you're proposing is the EAV anti-pattern. (Starts on slide 30.)
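To make the join explosion concrete, here is a sketch that reassembles just two attributes from the patient_data_tbl example (TimeOfEntry/versioning ignored for brevity); a 30-attribute report needs 30 of these self-joins:

SELECT p.patient_ID,
       n.Value AS Name,
       s.Value AS Surname
FROM (SELECT DISTINCT patient_ID FROM patient_data_tbl) p
JOIN patient_data_tbl n ON n.patient_ID = p.patient_ID AND n.data_info_ID = 1  -- Name
JOIN patient_data_tbl s ON s.patient_ID = p.patient_ID AND s.data_info_ID = 2; -- Surname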
I disagree with Bert Evans (in the sense that I don't find this terribly valid).
First of all it's not clear to me what problem you are trying to solve. The three "benefits" you list:
simplicity
ability to store any data with appropriate specification
and precision
no table jungle
don't really make a lot of sense if the application is small, and if it isn't (like the hospital records you mention in your example), any possible "gain" is lost when you factor in that this will make any sort of query very inefficient, and that human operators trying to design reports, do data extractions, or extend the DB will have to put in a lot of extra effort.
Example: I suppose your hospital patient has an address and therefore a ZIP code... have you considered what hoops you will have to jump through to create a foreign key to the zip code/state table?
Another example: as soon as you realize that the patient may have a middle name and that on the form it will be placed between the first and last name, what will you do? Renumber all the last name fields? Or place the middle name at the bottom of the pile, so that your form will have to add special logic to show it in the "correct" position?
You may want to check some alternatives to SQL DBs, like XML-based data stores, or even MUMPS, but I really can't see any benefit in the approach you are proposing (and please consider that I have seen an over-zealous DBA try to do something very similar when designing a web application backed by an Oracle DB: every field/label/image on the webpage had just a numeric reference to a sequence-based ID record in the DB, making the whole webapp a nightmare to maintain - so I am not just being a "purist" here).
