What's the most effective way of storing this data? - database

Need help figuring out a good way to store data effectively and efficiently
I'm using Parse (JavaScript SDK), here's an example of what I'm trying to store
Predictions of football (soccer) matches so an example of one match would be;
Team A v Team B
EventID = "abc"
Categories = ["League-1","Sunday-League"]
User123 predicts the score will be Team A 2-0 Team B -> so 2-0
User456 predicts the score will be Team A 1-3 Team B -> so 1-3
Each event has information attached to it like an eventId, several categories, start time, end time, a result and more
I need to record a score prediction per user for each event (usually 10 events at a time so a lot of predictions will be coming in)
I need to store these so I can cross reference the correct result against the user's prediction and award points based on their prediction, the teams in the match and the categories of the event but instead of adding to a total I need all the awarded points stored separately per category and per user so I can then filter based on predictions between set dates and certain categories e.g.
Team A v Team B
EventID = "abc"
Categories = ["League-1","Sunday-League"]
User123 prediction = 2-0
Actual result = 2-0
So now I need to award X points to User123 for Team A, Team B, "League-1", and "Sunday-League" and record it to the event date too.

I would suggest you create a table for games and a table for users and then an associative table to handle the many to many relationship. This is a pretty standard many to many relationship.

Related

Database design for user daily food and workout table daily activity

I'm builing an app who manage foods and workout together. i'm not sure who to store user daily activity
Actually i have a table food :
id
name
energy
proteins
Table meals to store the meals made with some foods and store the total nutriments values :
id
name
cat
total_enery
total_proteins
Table meal_food to store the foods in the meal :
meal_id
food_id
nb_portion
Table Workout :
id
name
type
Table Exercices :
id
name
type
duration
Table workout_exercice
workout_id
exercice_id
Now my question is how could a design a table in order to store user
daily activity ? i need to store foods, meals, workout, exercice
I need for each day grab a food choose a portion or grab a meal already stored, a workout or an exercice.
I was strating doing something like this but i'm really really not sure here :
Table user_daily_activity
id
day
food_id
meal_id
food_id_portion
workout_id
exercice_id
Is it a good idea to try to group or should i need to separe things ? should i create tables like this to get more control ?
user_daily_food
user_daily_meal
user_daily_workout
user_daily_exercices
Assuming:
You can choose a specific portion for a food, or duration for an exercise, or choose one from an already pre-defined meal/workout routine.
You are interested in preserving a snapshot of the day, and can alter predefined meals and workout routines as necessary.
Then the simplest way would be to choose from the predefined meals and workout routines and store the results in daily activity:
id
day
food_id
food_id_portion
exercise_id
duration
If predefined meals and/or workout routines cannot change and/or you need to tie back to the original meals/workout routines (even if they can change), then I think you were left with the above tables you mentioned. You could store the results separately for simplicity/normalization, then union the results:
select food_id, food_id portion
from
(
select id, day, food_id, food_id_portion from user_daily_food
union all
select id, day, food_id, nb_portion from user_daily_meal
inner join meals on user_daily_meal.mealid = meals.mealid
) A
I would go with whatever produces the least redundancy, but that depends on the needs for the particular app.

Merging different granularity time series in influxdb

I want to store trades as well as best ask/bid data, where the latter updates much more rapidly than the former, in InfluxDB.
I want to, if possible, use a schema that allows me to query: "for each trade on market X, find the best ask/bid on market Y whose timestamp is <= the timestamp of the trade".
(I'll use any version of Influx.)
For example, trades might look like this:
Time Price Volume Direction Market
00:01.000 100 5 1 foo-bar
00:03.000 99 50 0 bar-baz
00:03.050 99 25 0 foo-bar
00:04.000 101 15 1 bar-baz
And tick data might look more like this:
Time Ask Bid Market
00:00.763 100 99 bar-baz
00:01.010 101 99 foo-bar
00:01.012 101 98 bar-baz
00:01.012 101 99 foo-bar
00:01:238 100 99 bar-baz
...
00:03:021 101 98 bar-baz
I would want to be able to somehow join each trade for some market, e.g. foo-bar, with only the most recent ask/bid data point on some other market, e.g. bar-baz, and get a result like:
Time Trade Price Ask Bid
00:01.000 100 100 99
00:03.050 99 101 98
Such that I could compute the difference between the trade price on market foo-bar and the most recently quoted ask or bid on market bar-baz.
Right now, I store trades in one time series and ask/bid data points in another and merge them on the client side, with logic along the lines of:
function merge(trades, quotes, data_points)
next_trade, more_trades = first(trades), rest(trades)
quotes = drop-while (quote.timestamp < next_trade.timestamp) quotes
data_point = join(next_trade, first(quotes))
if more_trades
return merge(more_trades, quotes, data_points + data_point)
return data_points + data_point
The problem is that the client has to discard tons of ask/bid data points because they update so frequently, and only the most recent update before the trade is relevant.
There are tens of markets whose most recent ask/bid I might want to compare a trade with, otherwise I'd simply store the most recent ask/bid in the same series as the trades.
Is it possible to do what I want to do with Influx, or with another time series database? An alternative solution that produces lower quality results is to group the ask/bid data by some time interval, say 250ms, and take the last from each interval, to at least impose an upper bound on the amount of quotes the client has to drop before finding the one that's closest to the next trade.
NB. Just a clarification on InfluxDB terminology. You're probably storing trade and tick data in different measurements(analogous to a table). Series is a subdivision withing a measurement based on tag values. e.g
Time Ask Bid Market
00:00.763 100 99 bar-baz
is one series
Time Ask Bid Market
00:01.010 101 99 foo-bar
is another series(assuming you are storing Market name/id as a tag and not a field)
Answer
InfluxQL https://docs.influxdata.com/influxdb/v1.7/query_language/spec/ - I can't think of a way to achieve what you need with InfluxQL (Influx Query Language) as it does not support joins.
Perhaps what you could do on the client side is instead of requesting all tick data for a period and discarding most of it, make a request per trade and market to get exactly the (the most recent with respect to the trade) ask/bid datapoint that you need. Something like:
function merge(trades, market)
points = <empty list>
for next_trade in trades
quote = db.query("select last(ask), last(bid) from tick_data where time<=next_trade.timestamp and Market=market and time>next_trade.timestamp - 1m")
// or to get a list per market with one query
// quote_per_market = db.query("select last(ask), last(bid) from tick_data where time<=next_trade.timestamp group by Market")
points = points + join(next_trade, quote)
return points
Of course you'd have the overhead of querying the database more frequently but depending on the number of trades and your resource constraints it may be more efficient. NB. A potential pitfall here is that ask and bid retrieved this way are not retrieved as a pair but independently and while they are returned as a pair it could happen that they have different timestamps. If for some timestamp for some reason you only have an ask or a bid price you might run into this problem. However, as long as you write them in pairs and have no missing data it should be ok.
Flux https://www.influxdata.com/products/flux/ - Flux is a more sophisticated query language that is part of Influxdb 1.7 and 2 that allows you to do joins and operations across different measurements. I can't give you any examples yet but it's worth having a look at.
Other (relational) Times Series DBs that you could have a look at that would also allow you to do joins are CrateDB https://crate.io/ or Postgres + TimescaleDB https://www.timescale.com/products

DB Schema: Versioned price model vs invoice-related data

I am creating some db model for rental invoice generation.
The invoice consists of N booking time ranges.
Each booking belongs to a price model. A price model is a set of rules which determine a final price (base price + season price + quantity discout + ...).
That means the final price for the N bookings within an invoice can be a complex calculation, and of course I want to keep track of every aspect of the final price calculation for later review of an invoice.
The problem is, that a price model can change in the future. So upon invoice generation, there are two possibilities:
(a) Never change a price model. Just make it immutable by versioning it and refer to a concrete version from an invoice.
(b) Put all the price information, discounts and extras into the invoice. That would mean alot of data, as an invoice contains N bookings which may be partly in the range of a season price.
Basically, I would break down each booking into its days and for each day I would have N rows calculating the base price, discounts and extra fees.
Possible table model:
Invoice
id: int
InvoiceBooking # Each booking. One invoice has N bookings
id: int
invoiceId: int
(other data, e.g. guest information)
InvoiceBookingDay # Days of a booking. Each booking has N days
id: int
invoiceBookingId: id
date: date
InvoiceBookingDayPriceItem # Concrete discounts, etc. One days has many items
id: int
invoiceBookingDayId: int
price: decimal
title: string
My question is, which way should I prefer and why.
My considerations:
With solution (a), the invoice would be re-calculated using the price model information each time the data is viewed. I don't like this, as algorithms can change. It does not feel natural for the "read-only" nature of an invoice.
Also the version handling of price models is not a trivial task and the user needs to know about the version concept, which adds application complexity.
With solution (b), I generate a bunch of nested data and it adds alot of complexity to the schema.
Which way would you prefer? Am I missing something?
Thank you
There is a third option which I recommend. I call it temporal (time) versioning and the layout of the table is really quite simple. You don't describe your pricing data so I'll just show a simple example.
Table: DailyPricing
ID EffDate Price ...
A 01/01/2015 17.50 ...
B 01/01/2015 20.00 ...
C 01/01/2015 22.50 ...
B 01/01/2016 19.50 ...
C 07/01/2016 24.00 ...
This shows that all three price schedules (A, B and C just represent whatever method you use to distinguish between price levels) were given a price on Jan 1, 2015. On Jan 1, 2016, the price of plan B was reduced. In July, the price of plan C was increased.
To get the current price of a plan, the query is this:
select dp.Price
from DailyPricing dp
where dp.ID = 'A'
and dp.Effdate =(
select Max( dp2.EffDate )
from DailyPricing dp2
where dp2.ID = dp.ID
and dp2.EffDate >= :DateOfInterest);
The DateOfInterest variable would be loaded with the current date/time. This query returns the one price that is currently in effect. In this case, the price set Jan 1, 2015 as that has never changed since taking effect. If the search had been for plan B, the price set on Jan 1, 2016 would have been returned and for plan C, the price set on July 1, 2016. These are the latest prices set for each plan; that is, the current prices.
Such a query would more likely be in a join with probably the invoice table so you could perform the price calculation.
select ...
from Invoices i
join DailyPricing dp
on dp.ID = i.ID
and dp.Effdate =(
select Max( dp2.EffDate )
from DailyPricing dp2
where dp2.ID = dp.ID
and dp2.EffDate >= i.InvoiceDate )
where i.ID = 1234;
This is a little more complex than a simple query but you are asking for more complex data (or, rather, a more complex view of the data). However, this calculation is probably only executed once and the final price stored back in to the invoice data or elsewhere.
It would be calculated again only if the customer made some changes or you were going through an audit, rechecking the calculation for accuracy.
Notice something, however, that is subtle but very important. If the query above were being executed for an invoice that had just been created, the InvoiceDate would be the current date and the price returned would be the current price. If, however, the query was being run as a verification on an invoice that was two years old, the InvoiceDate would be two years ago and the price returned would be the price that was in effect two years ago.
In other words, the query to return current data and the query to return past data is the same query.
That is because current data and past data remain in the same table, differentiated only by the date the data takes effect. This, I think, is about the simplest solution to what you want to do.
How about A and B?
It's not best practice to re-calculate any component of an invoice, especially if the component was printed. An invoice and invoice details should be immutable, and you should be able to reproduce it without re-calculating.
If you ever have a problem with figuring out how you got to a certain amount, or if there is a bug in your program, you'll be glad you have the details, especially if the calculations are complex.
Also, it's a good idea to keep a history of your pricing models so you can validate how you got to a certain price. You can make this simple to your users. They don't have to see the history -- but you should record their changes in the history log.

Many to Many Database Relationship Design - to enable Word Clouds

I'm relatively new to database design and struggling to introduce a many-to-many relationship in a SSAS Tabular model.
I have some 'WordGroup' performance data in one table, like so;
WordGroup | IndexedVolume
Dining | 1,000
Sports | 2,000
Movies | 1,600
... and so on
Then I have 'Words' contained within these 'WordGroups' sitting in another category table, like so;
WordGroup | Word
Dining | Restaurant
Dining | Food
Dining | Dinner
Sports | Football
Sports | Basketball
... and so on
I can't see Performance data (IndexedVolume) by 'Word' detail - only by the 'WordGroup' that it is contained within. For example above, I can't look at 'Football' IndexedVolume on it's own, I can only choose the 'Sports' WordGroup that contains Football.
However, when analysing by 'WordGroup' I would still like users to understand what 'Words' are included (ideally in a Word Cloud Visualisation). Therefore, I wanted to develop a relationship between these two tables, so when someone chooses a Word Group (or multiple) we can return the Words that are contained within the Word Group(s) - i.e. below.
User selects Dining WordGroup
<<<Word Cloud or Flat Table would show Words below>>>
Restaurant
Food
Dinner
I looked at Concatenate / Strings etc, but was deterred as the detail here is much more complex and each WordGroup may contain 10+ Words, with translations.
Any advice would be greatly appreciated!
If analizing by WordGroup is an obligatory requirement, you sholud use these tables:
The many-to-many aplies beacuse your words may be conected to one or more groups, e.g. tree is conected to enviroment, forest, etc.
and obviously one word_group is conected to many words.
To see performance data by Word use :
select w.idword , w.name, sum(wg.index_volume)
from word w
left join word_group_has_word wgw
on w.idword=wgw.word_id
left join word_group wg
on wg.idword_group=wgw.word_group_id
group by w.idword
So you will see the sum of all the index volume of all the group_words conected to the words. ANd if you wanna see the words conected to the word groups use:
select distinct w.idword , w.name
from word w
left join word_group_has_word wgw
on w.idword=wgw.word_id
where wgw.word_group_id in [listWordGroupsId]

How to keep track changing items in a stock portfolio?

I have a system where people can pick some stocks and it values their portfolios but I'm having trouble doing this in a efficient way on a daily basis because I'm creating entries for days that don't have any changes(think of it like I'm measuring the values and having version control so I can track changes to the way the portfolio is designed).
Here's a example(each day's portfolio with stock name and weight):
Day1:
ibm = 10%
microsoft = 50%
google = 40%
day5:
ibm = 20%
microsoft = 20%
google = 40%
cisco = 20%
I can measure the value of the portfolio on day1 and understand I need to measure it again on day5(when it changed) but how do I measure day2-4 without recreating day1's entry in the database?
My approach right now(which I don't like) is to create a temp entry in my database for when someone changes the portfolio and then at the end of the day when I calculate the values if there is a temp entry I use that otherwise I create a new entry(for day2-4) using the last days data. The issue is as data often doesn't change I'm creating entries that are basically duplicates. The catch is: my stock data is all daily. I also thought of taking the portfolio and if it hasn't been updated in 3 days to find the returns of the last 3 days for each stock but I wasn't sure if there was a better solution.
Any ideas? I think this is a straight forward problem but I just can't see a efficient way of doing it.
note: in finance terms, its called creating a NAV and most firms do it the inefficient way I'm doing it but its because the process was created like 50 years ago and hasn't changed. I think this problem is very similar to version control but I can't seem to make a solution.
In storage terms is makes most sense to just store:
UserId - StockId1 - 23% - 2012-06-25
UserId - StockId2 - 11% - 2012-06-26
UserId - StockId1 - 20% - 2012-06-30
So you see that stock 1 went down at 30th. Now if you want to know the StockId1 percentage at the 28th you just select:
SELECT *
FROM stocks
WHERE datecolumn<=DATE(2012-06-28)
ORDER BY datecolumn DESC LIMIT 0,1
If it gives nothing back you did not have it, otherwise you get the last position back.
BTW. if you need for example a graph of stock 1 you could left join against a table full of dates. Then you can fill in the gaps easily.
Found this post here for example:
UPDATE mytable
SET number = (#n := COALESCE(number, #n))
ORDER BY date;
SQL QUERY replace NULL value in a row with a value from the previous known value

Resources