SQL Server building the basis for a decision tree - sql-server

I want to build the data basis for what could be a decision tree. The data I have is the following:
|Customer ID| Previous product| New product | Date |
I want to represent this so that I can drill down and analyze which product paths our customers take, for example product A -> product B -> product C, or whether it is more common to go product B -> product C -> product A.
I was thinking of making a column for each of the 5 latest products, but I am unsure whether this is the best approach. How can I represent this?
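One possible way to think about the data basis, sketched here with Python's built-in sqlite3 module as a stand-in for SQL Server: keep the transitions as rows and derive each customer's path by ordering on the date, instead of fixing five columns. The table name product_change, its column names, and the sample rows are assumptions made for illustration, and the sketch assumes each row's previous product matches the new product of that customer's preceding row.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
# Hypothetical table mirroring the columns described in the question.
conn.execute(
    """CREATE TABLE product_change (
           customer_id INTEGER,
           previous_product TEXT,
           new_product TEXT,
           change_date TEXT)"""
)
conn.executemany(
    "INSERT INTO product_change VALUES (?, ?, ?, ?)",
    [
        (1, "A", "B", "2020-01-01"),
        (1, "B", "C", "2020-02-01"),
        (2, "A", "B", "2020-01-15"),
        (2, "B", "C", "2020-03-01"),
        (3, "B", "C", "2020-01-10"),
        (3, "C", "A", "2020-04-01"),
    ],
)


def customer_paths(conn):
    """Return {customer_id: 'A -> B -> C'} built from date-ordered transitions."""
    rows = conn.execute(
        """SELECT customer_id, previous_product, new_product
           FROM product_change
           ORDER BY customer_id, change_date"""
    )
    paths = {}
    for customer_id, previous_product, new_product in rows:
        if customer_id not in paths:
            # The first transition contributes its starting product.
            paths[customer_id] = [previous_product]
        paths[customer_id].append(new_product)
    return {cid: " -> ".join(products) for cid, products in paths.items()}


paths = customer_paths(conn)
# Count how many customers followed each complete path.
path_counts = Counter(paths.values())
```

Because the path is derived rather than stored, it is not limited to five products, and a fixed-width view can still be produced from it later if a reporting tool needs one.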

Related

Many to Many Database Relationship Design - to enable Word Clouds

I'm relatively new to database design and struggling to introduce a many-to-many relationship in an SSAS Tabular model.
I have some 'WordGroup' performance data in one table, like so;
WordGroup | IndexedVolume
Dining | 1,000
Sports | 2,000
Movies | 1,600
... and so on
Then I have 'Words' contained within these 'WordGroups' sitting in another category table, like so;
WordGroup | Word
Dining | Restaurant
Dining | Food
Dining | Dinner
Sports | Football
Sports | Basketball
... and so on
I can't see performance data (IndexedVolume) at the 'Word' level, only for the 'WordGroup' that contains it. In the example above, I can't look at the IndexedVolume of 'Football' on its own; I can only choose the 'Sports' WordGroup that contains Football.
However, when analysing by 'WordGroup' I would still like users to understand what 'Words' are included (ideally in a Word Cloud Visualisation). Therefore, I wanted to develop a relationship between these two tables, so when someone chooses a Word Group (or multiple) we can return the Words that are contained within the Word Group(s) - i.e. below.
User selects Dining WordGroup
<<<Word Cloud or Flat Table would show Words below>>>
Restaurant
Food
Dinner
I looked at Concatenate / Strings etc, but was deterred as the detail here is much more complex and each WordGroup may contain 10+ Words, with translations.
Any advice would be greatly appreciated!
If analyzing by WordGroup is a hard requirement, you should use three tables: word, word_group, and a linking table word_group_has_word.
The many-to-many relationship applies because a word may be connected to one or more groups, e.g. tree is connected to environment, forest, etc.,
and obviously one word_group is connected to many words.
To see performance data by word, use:
select w.idword, w.name, sum(wg.index_volume) as total_index_volume
from word w
left join word_group_has_word wgw
on w.idword = wgw.word_id
left join word_group wg
on wg.idword_group = wgw.word_group_id
group by w.idword, w.name
This gives you, for each word, the sum of the index volumes of all the word groups connected to it. And if you want to see the words connected to a set of word groups, use:
select distinct w.idword , w.name
from word w
left join word_group_has_word wgw
on w.idword=wgw.word_id
where wgw.word_group_id in [listWordGroupsId]
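The three-table layout described in this answer can be tried out end to end. The following is a minimal sketch using Python's built-in sqlite3 module rather than SSAS; the table and key names follow the queries in the answer, while the name column on word_group, the sample rows, and the helper words_in_groups are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE word_group (
        idword_group INTEGER PRIMARY KEY,
        name TEXT,
        index_volume INTEGER);
    CREATE TABLE word (
        idword INTEGER PRIMARY KEY,
        name TEXT);
    -- Linking table: one row per (group, word) pair.
    CREATE TABLE word_group_has_word (
        word_group_id INTEGER REFERENCES word_group(idword_group),
        word_id INTEGER REFERENCES word(idword),
        PRIMARY KEY (word_group_id, word_id));
    INSERT INTO word_group VALUES (1, 'Dining', 1000), (2, 'Sports', 2000), (3, 'Movies', 1600);
    INSERT INTO word VALUES (1, 'Restaurant'), (2, 'Food'), (3, 'Dinner'),
                            (4, 'Football'), (5, 'Basketball');
    INSERT INTO word_group_has_word VALUES (1, 1), (1, 2), (1, 3), (2, 4), (2, 5);
    """
)


def words_in_groups(conn, group_ids):
    """Return the distinct words linked to any of the selected word groups."""
    placeholders = ", ".join("?" for _ in group_ids)
    sql = f"""
        SELECT DISTINCT w.name
        FROM word w
        JOIN word_group_has_word wgw ON w.idword = wgw.word_id
        WHERE wgw.word_group_id IN ({placeholders})
        ORDER BY w.name
    """
    return [row[0] for row in conn.execute(sql, list(group_ids))]


dining_words = words_in_groups(conn, [1])
dining_and_sports = words_in_groups(conn, [1, 2])
```

Selecting the Dining group returns exactly the three words from the question, which is the list a word cloud or flat table would display.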

Designing a schedule in a sports database

I will try to be as specific as possible, but I am having trouble conceptualizing the problem. As a hobby I am trying to design an NFL database that takes raw statistics and stores them for future evaluation in fantasy league analysis. One of the primary things I want to see is whether certain players/teams perform well against specific teams, and which defenses are susceptible to the pass or the run. The issue I am having is designing the schedule/event table. My current model is as follows.
TEAMS
TeamID, Team
SCHEDULE
ScheduleID, TeamID, OpponentID, Season, Week, Home_Away, PointsFor, PointsAgainst
In this scenario I will be duplicating every game, but when I use an event table where I use TeamAway and TeamHome I find my queries impossible to run since I have to query both AwayTeam and HomeTeam to find the event for a specific team.
In general though I cannot get a query to work where I have two relationships from a table back to one table, even in the schedule table my query does not work.
I have also considered dropping the team table and just storing NE, PIT, etc. for the Team and Opponent fields so I do not have to deal with the cross-relationships back to the team table.
How can I design this so I am not running queries for TeamID = OpponentID AND TeamID?
I am doing this in MS Access.
Edit
The issue I am having is when I query two tables, Team (TeamID, Team) and Event (TeamHomeID, TeamAwayID), that have relationships built between TeamID - TeamHomeID and TeamID - TeamAwayID. I had issues building the query in MS Access.
The SQL would look something like:
SELECT Teams.ID, Teams.Team, Event.HomeTeam
FROM Teams INNER JOIN (Event INNER JOIN Result ON Event.ID = Result.EventID)
ON (Teams.ID = Result.LosingTeamID) AND (Teams.ID = Result.WinningTeamID)
AND (Teams.Team = Event.AwayTeam) AND (Teams.Team = Event.HomeTeam);
It was looking for teams that had IDs of both the losing team and the winning team (which does not exist).
I think I might have fixed this problem. I didn't realize that the relationships defined in the database design are only defaults, and that within the Query builder I can change the joins on which a particular query is built. I discovered this by deleting all the AND portions of the SQL statement returned, and was then able to return the names of all winning teams.
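The underlying pattern, two relationships from one table back to the same table, is usually handled by joining the team table twice under different aliases. Below is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MS Access (Access writes its joins with nested parentheses, so the SQL would need adapting there); the EventID column and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Teams (TeamID INTEGER PRIMARY KEY, Team TEXT);
    CREATE TABLE Event (
        EventID INTEGER PRIMARY KEY,
        TeamHomeID INTEGER REFERENCES Teams(TeamID),
        TeamAwayID INTEGER REFERENCES Teams(TeamID));
    INSERT INTO Teams VALUES (1, 'NE'), (2, 'PIT'), (3, 'GB');
    INSERT INTO Event VALUES (1, 1, 2), (2, 3, 1);
    """
)

# Join Teams twice: alias h resolves the home team, alias a the away team.
matchups = conn.execute(
    """
    SELECT e.EventID, h.Team AS HomeTeam, a.Team AS AwayTeam
    FROM Event AS e
    JOIN Teams AS h ON h.TeamID = e.TeamHomeID
    JOIN Teams AS a ON a.TeamID = e.TeamAwayID
    ORDER BY e.EventID
    """
).fetchall()

# All events for one team, home or away, without duplicating rows per game.
events_for_team_1 = conn.execute(
    """
    SELECT EventID
    FROM Event
    WHERE ? IN (TeamHomeID, TeamAwayID)
    ORDER BY EventID
    """,
    (1,),
).fetchall()
```

Each alias carries its own join condition, so no single Teams row is ever required to match both the home and the away ID at once.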
This is an interesting concept - and good practice.
First off - it sounds like you need to narrow down exactly what kind of data you want so you know what to store. I mean, hell, what about storing the weather conditions?
I would keep Team, but I would also add City (because Teams could switch cities).
I would keep Games (Schedule) with columns GameID, HomeTeamID, AwayTeamID, ScheduleDate.
I would have another table Results with columns ResultID, GameID, WinningTeamID, LosingTeamID, Draw (Y/N).
Data could look like
Team:
TeamID | TeamName | City
------------------------
1      | PATS     | NE
2      | PACKERS  | GB
Games:
GameID | HomeTeamID | AwayTeamID | ScheduleDate | Preseason
-----------------------------------------------------------
1      | 1          | 2          | 1/1/2016     | N
Results:
ResultID | GameID | WinningTeamID | LosingTeamID | Draw
-------------------------------------------------------
1        | 1      | 1             | 2            | N
Given that, you could pretty easily give the W/L/D for any scheduled game and date. You could also easily SUM a team's wins, their wins at home or away, during preseason or regular season, their wins against a particular team, etc.
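As a rough illustration of those aggregates, here is a sketch using Python's built-in sqlite3 module in place of MS Access. The table and column names follow the answer; the second game and its result are made-up sample rows, and win_summary is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Teams (TeamID INTEGER PRIMARY KEY, TeamName TEXT, City TEXT);
    CREATE TABLE Games (
        GameID INTEGER PRIMARY KEY,
        HomeTeamID INTEGER, AwayTeamID INTEGER,
        ScheduleDate TEXT, Preseason TEXT);
    CREATE TABLE Results (
        ResultID INTEGER PRIMARY KEY,
        GameID INTEGER, WinningTeamID INTEGER, LosingTeamID INTEGER, Draw TEXT);
    INSERT INTO Teams VALUES (1, 'PATS', 'NE'), (2, 'PACKERS', 'GB');
    -- Game 2 is invented sample data so that home and away wins both appear.
    INSERT INTO Games VALUES (1, 1, 2, '2016-01-01', 'N'), (2, 2, 1, '2016-01-08', 'N');
    INSERT INTO Results VALUES (1, 1, 1, 2, 'N'), (2, 2, 1, 2, 'N');
    """
)


def win_summary(conn, team_id):
    """Count a team's total, home, and away wins (draws excluded)."""
    row = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(g.HomeTeamID = r.WinningTeamID), 0),
               COALESCE(SUM(g.AwayTeamID = r.WinningTeamID), 0)
        FROM Results AS r
        JOIN Games AS g ON g.GameID = r.GameID
        WHERE r.WinningTeamID = ? AND r.Draw = 'N'
        """,
        (team_id,),
    ).fetchone()
    return {"wins": row[0], "home_wins": row[1], "away_wins": row[2]}


pats = win_summary(conn, 1)
packers = win_summary(conn, 2)
```

Filtering on g.Preseason or on the opposing team's ID in the same query would give the preseason and head-to-head variants mentioned above.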
I guess if you wanted to get really technical you could even create a Season table that stores SeasonID, StartDate, EndDate. This would just make sure you were 100% able to tell what games were played in which season (between what dates) and from there you could infer weather statistics, whether or not a player was out during that time frame, etc.

What's the most effective way of storing this data?

Need help figuring out a good way to store data effectively and efficiently
I'm using Parse (JavaScript SDK), here's an example of what I'm trying to store
Predictions of football (soccer) matches so an example of one match would be;
Team A v Team B
EventID = "abc"
Categories = ["League-1","Sunday-League"]
User123 predicts the score will be Team A 2-0 Team B -> so 2-0
User456 predicts the score will be Team A 1-3 Team B -> so 1-3
Each event has information attached to it like an eventId, several categories, start time, end time, a result and more
I need to record a score prediction per user for each event (usually 10 events at a time so a lot of predictions will be coming in)
I need to store these so I can cross-reference the correct result against each user's prediction and award points based on the prediction, the teams in the match, and the categories of the event. Instead of adding to a single total, I need the awarded points stored separately per category and per user, so I can then filter on predictions between set dates and in certain categories, e.g.
Team A v Team B
EventID = "abc"
Categories = ["League-1","Sunday-League"]
User123 prediction = 2-0
Actual result = 2-0
So now I need to award X points to User123 for Team A, Team B, "League-1", and "Sunday-League" and record it to the event date too.
I would suggest you create a table for games and a table for users and then an associative table to handle the many to many relationship. This is a pretty standard many to many relationship.
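To make the suggestion concrete, here is a relational sketch using Python's built-in sqlite3 module. It is not Parse SDK code, and the table names, column names, and the idea of storing goals as two integer columns are all assumptions for illustration; categories and awarded points are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id TEXT PRIMARY KEY);
    CREATE TABLE events (
        event_id TEXT PRIMARY KEY,
        home_goals INTEGER,   -- actual result, NULL until the match ends
        away_goals INTEGER);
    -- Associative table: one prediction per user per event.
    CREATE TABLE predictions (
        user_id TEXT REFERENCES users(user_id),
        event_id TEXT REFERENCES events(event_id),
        home_goals INTEGER,
        away_goals INTEGER,
        PRIMARY KEY (user_id, event_id));
    INSERT INTO users VALUES ('User123'), ('User456');
    INSERT INTO events VALUES ('abc', 2, 0);
    INSERT INTO predictions VALUES ('User123', 'abc', 2, 0), ('User456', 'abc', 1, 3);
    """
)

# Cross-reference each prediction against the actual result of its event.
correct_predictions = conn.execute(
    """
    SELECT p.user_id, p.event_id
    FROM predictions AS p
    JOIN events AS e ON e.event_id = p.event_id
    WHERE p.home_goals = e.home_goals
      AND p.away_goals = e.away_goals
    ORDER BY p.user_id
    """
).fetchall()
```

The composite primary key stops a user from predicting the same event twice, and a separate points table keyed by user, event, and category could hang off the same keys.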

How to Store purchases of "N of the same Product" in an Orders table

So we have our basic tables for the categories, products, and variants of products
categories
id | name | active | parent_id
products
id | name | price | active
c_p_link
category_id | product_id
variants
id | product_id | price | price_override | active | stock
Which works great.
But I have two queries.
The first being how to structure the orders.
We have an orders table
id | customer_id | ordered | status
And we also have an order_products table
id | order_id ..?
this is the one I am curious about.
Say a customer orders 30 of a product. Do we:
1. Add 30 rows, and add the price for each individual item on each row.
2. Add one row, add the combined total onto the row.
3. Add one row, add the individual price onto the row.
The next part is that later we expect to add voucher support to the cart, e.g. 10% off, buy two get one free, etc. I am not too fussed about the overall design of this right now (it is a couple of months off at least), but I am wondering whether it will affect which version of the order_products table I should choose.
Disclaimer: I have never written a database model dealing with "shopping carts" or "orders".
I think the price at time of purchase should be encoded into the purchase data: just like a paper receipt from a store. Let's call this total_price which represents each itemized "line" on the receipt and should not be confused with total_purchase_price.
That is, the amount charged is fixed. It doesn't matter if the product price changes later and changes to prices should not reflect in how much was [to be] paid.
Thus I would have these fields: product, unit_price, quantity, total_price. A computed column of say, base_total_price (unit_price * quantity) can be easily added if required.
Now, total_price might be a computed value based on, say, base_total_price and a percent_discount field. Whatever it ends up being, I hold that total_price should exist and should be fixed at time of purchase. (This implies that, if it is a computed column, all of its inputs are also fixed at time of purchase.)
Addendum: As stated above, I've never designed a model like this before, but one thing I have observed at stores is discounts being applied as a separate itemized line with a negative cost. That is, items are bought "at full price" and then the register adds an entry to offset the cost according to whatever promotion is running. I do not know the merits or reasoning of such an approach.
Simply add the quantity of the product to your order_products table :)
I prefer the 3rd solution; I think it's the best for the performance of your database.
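A minimal sketch of the "one row per product, with quantity" option, written with Python's built-in sqlite3 module. The column names, storing money as integer cents, and the add_line helper with its percent_discount parameter are assumptions for illustration, not a recommendation for how vouchers must work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE order_products (
        id INTEGER PRIMARY KEY,
        order_id INTEGER NOT NULL,
        product_id INTEGER NOT NULL,
        quantity INTEGER NOT NULL CHECK (quantity > 0),
        unit_price_cents INTEGER NOT NULL,    -- price at time of purchase
        total_price_cents INTEGER NOT NULL);  -- fixed line total actually charged
    """
)


def add_line(conn, order_id, product_id, quantity, unit_price_cents, percent_discount=0):
    """Insert one order line and freeze its total at purchase time."""
    base_total = unit_price_cents * quantity
    total = base_total * (100 - percent_discount) // 100
    conn.execute(
        """INSERT INTO order_products
               (order_id, product_id, quantity, unit_price_cents, total_price_cents)
           VALUES (?, ?, ?, ?, ?)""",
        (order_id, product_id, quantity, unit_price_cents, total),
    )
    return total


# 30 units at 2.50 each with 10% off: 7500 cents base, 6750 cents charged.
line_total = add_line(conn, order_id=1, product_id=42, quantity=30,
                      unit_price_cents=250, percent_discount=10)
order_total = conn.execute(
    "SELECT SUM(total_price_cents) FROM order_products WHERE order_id = ?", (1,)
).fetchone()[0]
```

Because the unit price and the line total are copied onto the row, later changes to the products table do not alter what the customer was charged.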

representing graph using relational database

I need to represent graph information with a relational database.
Let's say, a is connected to b, c, and d.
a -- b
|_ c
|_ d
I can have a node table for a, b, c, and d, and I can also have a link table (FROM, TO) -> (a,b), (a,c), (a,d).
In another implementation there might be a way to store the link info as (a,b,c,d), but then the number of elements in a row is variable.
Q1 : Is there a way to represent variable elements in a table?
Q2 : Is there any general way to represent the graph structure using relational database?
Q1 : Is there a way to represent variable elements in a [database] table?
I assume you mean something like this?
from | to_1 | to_2 | to_3 | to_4 | to_5 | etc...
1 | 2 | 3 | 4 | NULL | NULL | etc...
This is not a good idea. It violates first normal form.
Q2 : Is there any general way to represent the graph structure using database?
For a directed graph you can use a table edges with two columns:
nodeid_from nodeid_to
1 2
1 3
1 4
If there is any extra information about each node (such as a node name) this can be stored in another table nodes.
If your graph is undirected you have two choices:
store both directions (i.e. store 1->2 and 2->1)
use a constraint that nodeid_from must be less than nodeid_to (i.e. store 1->2 but 2->1 is implied).
The former requires twice the storage space but can make querying easier and faster.
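The second option for undirected graphs (store each edge once, smaller ID first) can be sketched as follows with Python's built-in sqlite3 module. The column names follow the answer; the helpers add_edge and neighbors are hypothetical names introduced for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE edges (
        nodeid_from INTEGER NOT NULL,
        nodeid_to INTEGER NOT NULL,
        PRIMARY KEY (nodeid_from, nodeid_to),
        -- Each undirected edge is stored once, smaller ID first.
        CHECK (nodeid_from < nodeid_to));
    """
)


def add_edge(conn, a, b):
    """Store the undirected edge a-b in canonical (low, high) order."""
    low, high = min(a, b), max(a, b)
    # OR IGNORE makes re-adding an existing edge a no-op.
    conn.execute("INSERT OR IGNORE INTO edges VALUES (?, ?)", (low, high))


def neighbors(conn, node):
    """Return all nodes adjacent to `node`, looking at both columns."""
    rows = conn.execute(
        """
        SELECT nodeid_to FROM edges WHERE nodeid_from = ?
        UNION
        SELECT nodeid_from FROM edges WHERE nodeid_to = ?
        ORDER BY 1
        """,
        (node, node),
    )
    return [row[0] for row in rows]


for a, b in [(1, 2), (3, 1), (1, 4), (2, 1)]:
    add_edge(conn, a, b)

edge_count = conn.execute("SELECT COUNT(*) FROM edges").fetchone()[0]
```

The UNION in neighbors is the price of this layout: every lookup has to check both columns, which is why storing both directions can be simpler and faster to query.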
In addition to the two tables route mentioned by Mark take a look at the following link:
http://articles.sitepoint.com/article/hierarchical-data-database/2
This article basically preorders the elements in the tree assigning left and right values. You are then able to select portions or all of the tree using a single select statement.
Node | lft | rght
-----------------
A | 0 | 7
B | 1 | 2
C | 3 | 4
D | 5 | 6
EDIT: If you are going to be updating the tree heavily this is not an optimum solution as the whole tree must be re-numbered
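With the left and right values from the table above, selecting a subtree is a single range comparison. A minimal sketch using Python's built-in sqlite3 module; the table name tree and the helper descendants are assumptions, while the lft/rght values are the ones shown in the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tree (node TEXT PRIMARY KEY, lft INTEGER, rght INTEGER);
    INSERT INTO tree VALUES ('A', 0, 7), ('B', 1, 2), ('C', 3, 4), ('D', 5, 6);
    """
)


def descendants(conn, node):
    """Return every node whose (lft, rght) interval lies strictly inside `node`'s."""
    rows = conn.execute(
        """
        SELECT child.node
        FROM tree AS parent
        JOIN tree AS child
          ON child.lft > parent.lft AND child.rght < parent.rght
        WHERE parent.node = ?
        ORDER BY child.lft
        """,
        (node,),
    )
    return [row[0] for row in rows]


below_a = descendants(conn, "A")
below_b = descendants(conn, "B")
```

Reads are cheap because no recursion is needed, but as the edit above notes, inserting a node means shifting the lft and rght values of every node to its right.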
I have stored multiple "TO" nodes in a relational representation of a graph structure. I was able to do this because my graph was directed. This meant that if I wanted to know what nodes "A" was connected to, I only needed to select a single record from my table of connections. I stored the TO nodes in an easy-to-parse string and it worked great, with a class that could manage the conversion from string to collection and back.
I recommend looking at dedicated graph databases, as nawroth suggests. One example would be the "Trinity" Database, suited for very large datasets. But there are others.
Listen to the podcast by Scott Hanselman on Hanselminutes about Trinity. Here is the text transcript.
