Storing daily statistics in a relational database

I'm creating a game that needs to save each player's statistics (games played, exp & gold gained, current gold) on a daily basis, as well as all-time statistics.
My current approach uses 3 tables:
table `stats_current` -> for storing player's stats on CURRENT DAY
player_id | games_played | gold_earned | current_gold
table `stats_all_time` -> player's stats accumulated from the very beginning
player_id | games_played | gold_earned | current_gold
table `stats_history` -> player's stats daily, one record for one day
player_id | date | games_played | gold_earned | current_gold
Each player has one record in stats_current, one record in stats_all_time, and a limited number of records in stats_history (for example, only the last 30 days are recorded).
Then there's a daemon / cron job that does these operations on a daily basis (a sketch follows these steps):
For each player:
Search for the player's record in stats_current and get the values.
Insert a new record into stats_history, with values taken from stats_current.
Update the record in stats_all_time, incrementing its values by the values from stats_current.
In stats_current, reset games_played and gold_earned to 0, but leave current_gold as it is.
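A minimal sketch of that rollover as a single transaction (MySQL-flavored; the table and column names come from the schema above, everything else is an assumption):

START TRANSACTION;

-- Archive the finished day (run just after midnight, so yesterday's date labels it)
INSERT INTO stats_history (player_id, date, games_played, gold_earned, current_gold)
SELECT player_id, CURDATE() - INTERVAL 1 DAY, games_played, gold_earned, current_gold
FROM stats_current;

-- Fold the day into the running totals; current_gold is a snapshot, not an increment
UPDATE stats_all_time a
JOIN stats_current c ON c.player_id = a.player_id
SET a.games_played = a.games_played + c.games_played,
    a.gold_earned  = a.gold_earned  + c.gold_earned,
    a.current_gold = c.current_gold;

-- Reset the daily counters but keep the gold balance
UPDATE stats_current SET games_played = 0, gold_earned = 0;

COMMIT;

Wrapping all three steps in one transaction means a failed job leaves the tables consistent rather than partially rolled over.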
Solutions for common tasks:
Get player X's current gold: retrieve current_gold from stats_current.
Get player X's stats for the last 7 days: select the 6 most recent records from stats_history, plus today's record from stats_current (see the sketch below).
Get player X's total games played: retrieve the value from stats_all_time, plus today's count from stats_current.
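A sketch of the 7-day query under this schema (MySQL-flavored; player 42 is a placeholder):

SELECT CURDATE() AS date, games_played, gold_earned
FROM stats_current
WHERE player_id = 42
UNION ALL
SELECT date, games_played, gold_earned
FROM stats_history
WHERE player_id = 42
  AND date >= CURDATE() - INTERVAL 6 DAY;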
Questions:
Is this a viable approach?
What are the weaknesses?
Is there any way to optimize this solution?

Your approach fails to take advantage of the power of SQL. All you really need is the stats_history table.
To get today's stats, just use:
SELECT *
FROM stats_history
WHERE Date = CURDATE()  -- depending on your RDBMS you might need a different function to get the date
  AND PlayerId = <player_id>;
To get all-time stats, just use:
SELECT SUM(games_played) AS games_played, SUM(gold_earned) AS gold_earned
FROM stats_history
WHERE PlayerId = <player_id>;
You could pull current gold by selecting the top record from stats_history for that player, or by using any of a number of other RDBMS-specific strategies (an OVER clause for SQL Server, ordering the result set by date and reading current_gold for MySQL, etc.).
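A sketch of the top-record approach (MySQL-flavored; <player_id> is a placeholder):

SELECT current_gold
FROM stats_history
WHERE PlayerId = <player_id>
ORDER BY Date DESC
LIMIT 1;  -- assumes the most recent row carries the up-to-date gold balance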
Your approach is risky because if your cron job ever fails, the other two tables will be inaccurate. It's also unnecessary duplication of data.

Is there a better way to reconstruct a data from hundreds of millions of entries spread over a long period of time?

(first of all - apologies for the title but I couldn't come up with a better one)
Here is my problem - I have a table with 4 columns - entity::INT, entry::TEXT, state::INT and day::INT.
There could be anywhere from 50 to 1,000 entities. Each entity can have over 100 million entries. Each entry can have one or more states; the state changes when the data stored in the entry changes, but only one state can be written for any particular day. The day starts at one and is incremented each day.
Example:
entity | entry | state | day
-------------------------------------
1 | ABC123 | 1 | 1
1 | ABC124 | 2 | 1
1 | ABC125 | 3 | 1
...
1 | ABC999 | 999 | 1
2 | BCD123 | 1000 | 1
...
1 | ABC123 | 1001 | 2
2 | BCD123 | 1002 | 3
The index is set to (entity, day, state).
What I want to achieve is to efficiently select the most current state of each entry on day N.
Currently, every week I write all the entries with their latest state to the table, so as to minimize the number of days we need to scan. However, given the total number of entries (worst-case scenario: 1,000 entities times 100,000,000 entries is a lot of rows to write each week), the table slowly but surely bloats and everything becomes really slow.
I need to be able to stop writing this "full" version weekly and instead have a setup that will still be fast enough. I considered using DISTINCT ON with a different index on (entity, entry, day DESC, state) so that I could:
SELECT DISTINCT ON (entity, entry) entry, state
FROM table
WHERE entity = <entity> AND day <= <day>
ORDER BY entity, entry, day DESC, state;
Would that be the most efficient way to do it, or are there better ways? Or does entry, with possibly hundreds of millions of unique values, make a poor choice for the second column in the index, so that performance will eventually grind to a halt?
You want to rank the entries by time and take the latest one. That's the same as ranking them in reverse time order and taking the first one, and ROW_NUMBER() is one way to do that.
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY entity, entry
            ORDER BY day DESC
        ) AS entity_entry_rank
    FROM yourTable
)
SELECT *
FROM ranked
WHERE entity_entry_rank = 1
The day column can then become a timestamp, and you don't need to store a new copy every day.
The appropriate index would be (entity, entry, timestamp)
Also, it's common to have two tables: one with the history, one with the latest value. That makes reads of the current value quicker, at a minor cost in disk space.
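A sketch of that two-table pattern in Postgres (the table names entry_history and entry_latest are illustrative): append every change to the history, then upsert the latest row.

-- Append-only history of every state change
INSERT INTO entry_history (entity, entry, state, recorded_at)
VALUES (1, 'ABC123', 1001, now());

-- Keep exactly one current row per (entity, entry);
-- entry_latest needs a UNIQUE or PRIMARY KEY constraint on (entity, entry)
INSERT INTO entry_latest (entity, entry, state, recorded_at)
VALUES (1, 'ABC123', 1001, now())
ON CONFLICT (entity, entry)
DO UPDATE SET state = EXCLUDED.state, recorded_at = EXCLUDED.recorded_at;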
(Apologies for errors or formatting, I'm on my phone.)
DISTINCT ON is simple, and performance is great - for few rows per entry. See:
Select first row in each GROUP BY group?
Not for many rows per entry, though.
Each entity can have over 100 million entries
See:
Optimize GROUP BY query to retrieve latest row per user
Assuming an entry table that holds one row for each existing entry (each relevant distinct combination of (entity, entry)), this query is very efficient to get the latest state for a given day:
SELECT e.entity, e.entry, t.day, t.state
FROM entry e
LEFT JOIN LATERAL (
    SELECT day, state
    FROM tbl
    WHERE (entity, entry) = (e.entity, e.entry)
    AND day <= <day>  -- given day
    ORDER BY day DESC
    LIMIT 1
) t ON true
ORDER BY e.entity, e.entry;  -- optional
Use CROSS JOIN LATERAL instead of the LEFT JOIN if you only want entries that have at least one row in tbl.
The perfect index for this is on (entity, entry, day) INCLUDE (state).
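Spelled out (INCLUDE requires Postgres 11 or later; the index name is arbitrary):

CREATE INDEX tbl_entity_entry_day_idx ON tbl (entity, entry, day) INCLUDE (state);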
If you have no table entry, consider creating one. (Typically, there should be one.) The rCTE techniques outlined in the linked answer above can also be used to create such a table.
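If a one-time sequential scan is acceptable, a plain DISTINCT can build it (a sketch; at this scale the rCTE / loose index scan from the linked answer is much faster):

CREATE TABLE entry AS
SELECT DISTINCT entity, entry
FROM tbl;

ALTER TABLE entry ADD PRIMARY KEY (entity, entry);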

MS Access: Query difference of two records to artificially create a third record

This is for keeping track of market positions. Raw data is pulled from .csv files provided by brokers. Each row is a record. My issue comes into play when cross trading occurs. For example: owning 1,000 shares and, in one order, selling 2,000 shares, thereby opening a short position. I have two records from the raw data: a buy of 1,000 shares and a sell of 2,000 shares. But a new position has been created: short 1,000 shares. So that is three position changes: an open of 1,000 shares, a close of 1,000 shares, and an open of -1,000 shares.
I need a query that will recognize that 1,000 shares were bought (a long/buy position), then recognize that a sale of anything more than 1,000 shares closes the long position, take the difference, and artificially create a new short position, even though there isn't a record of it.
All of these records fall into the transactions table. This is where the .csv files are imported to.
My query uses four columns from the transactions table to group opening and closing amounts: date, time, symbol, buy/sell. It then sums the quantity field for each group.
I only know how to query the raw data, so I only get one row of information for the scenario above. But I need a second row of information for the new position.
So instead of this:
symbol | qty opened | qty closed
xyz | 1000 | 2000
I need this:
symbol | qty opened | qty closed
xyz | 1000 | 1000
xyz | -1000 | Null
Any guidance would be appreciated.
I have used a running sum, assuming the ID gives the order of the trades in this example. The query to get this result is:
SELECT
    T.ID,
    T.TradeSize,
    (SELECT SUM(TradeSize) FROM tblTrades TS WHERE TS.ID <= T.ID) AS [Position],
    IIf(Sgn([TradeSize])*Sgn([Position])<0, -[TradeSize],
        IIf(Abs([TradeSize])>Abs([Position]), [TradeSize]-[Position], 0)) AS Closed,
    IIf([TradeSize]=-[Closed], 0, [TradeSize]-[Closed]) AS [Open]
FROM tblTrades AS T;
The logic is: if the sign of the trade is opposite to the sign of the position, then the trade has closed out the trade size of the position; otherwise, if the number of shares traded is greater than the position size, then we have closed the previous position, which is the trade size less the current position. You may need to play with minus signs and absolute values to get the report you want, but this basis should give you the right numbers.
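For the cross-trade scenario above (buy 1,000, then sell 2,000), the running sum works out to:

ID | TradeSize | Position | Closed | Open
-----------------------------------------
 1 |      1000 |     1000 |      0 |  1000
 2 |     -2000 |    -1000 |  -1000 | -1000

Row 2 closes the existing 1,000-share long and opens the artificial -1,000 short the question asks for.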
EDIT:
PS A better name for the Open field would be 'Opened'.

SQL Server database design for evaluations

I'm designing this employee evaluation web page, and was wondering if my current database design is the correct one or if it could be improved.
This is my current design
Table Agenda:
+--------------+----------+----------+-----------+------+-------+-------+
| idEvaluation | Location | Employee | #Employee | Date | Date1 | Date2 |
+--------------+----------+----------+-----------+------+-------+-------+
Date is the date scheduled for the evaluation to be performed.
Date1 and Date2 define a period of time used to retrieve some metrics from another database.
Table Evaluations:
+--------------+---------+------------+------+----------+
| idEvaluation | Manager | Department | Date | Comments |
+--------------+---------+------------+------+----------+
Table Scores:
+--------------+----------+-------+
| idEvaluation | idFactor | Score |
+--------------+----------+-------+
idFactor relates to another table which contains the factor and a description of it. Like I said, is this a correct design?
My concern is this: currently there are 60 employees, 11 managers and 12 factors, and each employee is evaluated twice a year by every manager. The Agenda table is not much trouble, since there is only one record per evaluation (60 employees = 60 records). However, the Evaluations table holds 11 records for every evaluation, so it grows to 660 records (60 employees * 11 managers = 660), and the Scores table gets even bigger, since there are 12 factors for every evaluation: 7,920 records (660 evaluations * 12 factors each = 7,920).
Is this normal? Am I doing it wrong? Any input is appreciated.
EDIT
Location, Employee, #Employee, Manager and Department are loaded automatically by the vb.net page; they are "imported" from Active Directory and checked before insertion, so duplicate names, misspelled names, and that sort of thing are not an issue.
The main idea is that you don't want to repeat string literals.
So if you have
id Department
1 Sales
2 IT
3 Admin
then instead of repeating 'Sales' many times, you store only the id 1, which is smaller, so things also get faster.
Second, if you have users
id user
1 Jhon Alexander
2 Maria Jhonson
If Jhon decides to change his name, you will have to check all tables and change it everywhere. There is also the problem that if two people have the same name, you won't know which one you are evaluating.
So go for a separate table and use the ID.
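A minimal sketch of that normalization in T-SQL (table and column names are illustrative, not the poster's actual schema):

CREATE TABLE Department (
    id   INT IDENTITY PRIMARY KEY,
    name NVARCHAR(50) NOT NULL
);

CREATE TABLE Employee (
    id            INT IDENTITY PRIMARY KEY,
    name          NVARCHAR(100) NOT NULL,
    department_id INT NOT NULL REFERENCES Department(id)
);

-- Evaluations then store Employee.id instead of repeating names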

Designing a schedule in a sports database

I will try to be as specific as possible, but I am having trouble conceptualizing the problem. As a hobby I am trying to design a NFL database that takes raw statistics and stores it for future evaluation for fantasy league analysis. One of the primary things I want to see is if certain players/teams perform well against specific teams and which defenses are suspect to either pass/run. The issue I am having is trying to design the schedule/event table. My current model is as follows.
TEAMS
TeamID, Team
SCHEDULE
ScheduleID, TeamID, OpponentID, Season, Week, Home_Away, PointsFor, PointsAgainst
In this scenario I will be duplicating every game, but when I use an event table with TeamHome and TeamAway columns, I find my queries impossible to run, since I have to query both AwayTeam and HomeTeam to find the events for a specific team.
In general, though, I cannot get a query to work where I have two relationships from one table back to another table; even on the schedule table my query does not work.
I have also considered dropping the team table and just storing NE, PIT, etc. for the Team and Opponent fields so I do not have to deal with the cross-relationships back to the team table.
How can I design this so I am not running queries against both TeamID and OpponentID?
I am doing this in MS Access.
Edit
The issue I am having is when I query two tables, Team (TeamID, Team) and Event (TeamHomeID, TeamAwayID), with relationships built between TeamID - TeamHomeID and TeamID - TeamAwayID, I have trouble building the query in MS Access.
The SQL would look something like:
SELECT Teams.ID, Teams.Team, Event.HomeTeam
FROM Teams INNER JOIN (Event INNER JOIN Result ON Event.ID = Result.EventID)
ON (Teams.ID = Result.LosingTeamID) AND (Teams.ID = Result.WinningTeamID)
AND (Teams.Team = Event.AwayTeam) AND (Teams.Team = Event.HomeTeam);
It was looking for teams that had IDs of both the losing team and the winning team (which does not exist).
I think I might have fixed this problem. I didn't realize the relationships in the database design are only defaults, and that within the query builder I can change the joins on which a particular query is built. I discovered this by deleting all the AND portions of the returned SQL statement, and was able to return the names of all winning teams.
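For reference, the usual way to handle two relationships back to one table is to join Teams twice under different aliases. A sketch against the tables above (column names assumed from the question):

SELECT g.ID, home.Team AS HomeTeam, away.Team AS AwayTeam
FROM (Event AS g
INNER JOIN Teams AS home ON home.ID = g.TeamHomeID)
INNER JOIN Teams AS away ON away.ID = g.TeamAwayID;

Access requires the parentheses around the first join when chaining a second one.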
This is an interesting concept - and good practice.
First off - it sounds like you need to narrow down exactly what kind of data you want so you know what to store. I mean, hell, what about storing the weather conditions?
I would keep Team, but I would also add City (because Teams could switch cities).
I would keep Games (Schedule) with columns GameID, HomeTeamID, AwayTeamID, ScheduleDate.
I would have another table Results with columns ResultID, GameID, WinningTeamID, LosingTeamID, Draw (Y/N).
Data could look like
TeamID | TeamName | City
------------------------
1 | PATS | NE
------------------------
2 | PACKERS | GB
GameID | HomeTeamID | AwayTeamID | ScheduleDate | Preseason
-----------------------------------------------------------
1 | 1 | 2 | 1/1/2016 | N
ResultID | GameID | WinningTeamID | LosingTeamID | Draw
------------------------------------------------------------
1 | 1 | 1 | 2 | N
Given that, you could pretty easily get the W/L/D for any scheduled game and date; you could easily count a team's wins, their wins when they were home or away, during the preseason or regular season, their wins against a particular team, etc. (see the sketch below).
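As a sketch (table names as above; TeamID 1 stands in for the Pats):

SELECT COUNT(*) AS HomeWins
FROM Results AS r
INNER JOIN Games AS g ON g.GameID = r.GameID
WHERE r.WinningTeamID = 1
  AND g.HomeTeamID = 1;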
I guess if you wanted to get really technical you could even create a Season table that stores SeasonID, StartDate, EndDate. This would just make sure you were 100% able to tell what games were played in which season (between what dates) and from there you could infer weather statistics, whether or not a player was out during that time frame, etc.

How to optimize large database requests

I am working with a database that contains information (measurements) about ships. The ships send an update with their position, fuel use, etc. So an entry in the database looks like this
| measurement_id | ship_id | timestamp | position | fuel_use |
| key | f_key | dd-mm-yy hh:ss| lat-lon | in l/km |
A new entry gets added for every ship every second, so the number of entries in the database gets large very fast.
What I need for the application I am working on is not the information for one second but rather cumulative data for 1 minute, 1 day, or even 1 year. For example the total fuel use over a day, the distance traveled in a year, or the average fuel use per day over a month.
Getting and calculating that from the raw data is unfeasible: you would have to pull 31.5 million records from the server just to calculate the distance traveled in a year.
What I thought was the smart thing to do is to combine entries into one bigger entry: for example, take 60 measurements and combine them into a 1-minute measurement entry in a separate table, averaging the fuel use and summing the distance traveled between consecutive entries. A minute entry would then look like this.
| min_measurement_id | ship_id | timestamp | position | distance_traveled | fuel_use |
| new key |same ship| dd-mm-yy hh| avg lat-lon | sum distance_traveled | avg fuel_use |
This process could then be repeated for hours, days, months, and years. This way a query for a week could be answered from only 7 entries, or 168 entries if I want hourly detail. Those look like far more usable numbers to me.
The new tables can be filled by querying the original database every 10 minutes, that data then fills the minute table, which in turn updates the hours table, etc.
However, this seems like a lot of management and duplication of almost the same data, with the same operation being performed over and over.
So what I am interested in is whether there is some way of structuring this data. Could it be stored hierarchically (after all, seconds, minutes, and days are pretty hierarchical), or are there other ways to optimize this?
This is the first time I am using a database this size so I also did not really know what to look for on the internet.
Aggregates are common in data warehouses, so your approach of grouping data is fine. Yes, you are duplicating some of the data, but you'll get the speed benefit.
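A sketch of the minute-level rollup (Postgres-flavored, since the question doesn't name an RDBMS; measurements_minute is an assumed target table, and the distance calculation between consecutive positions is elided):

-- Run every 10 minutes: fold raw per-second rows into per-minute rows
-- (a production job would also guard against double-counting on re-runs)
INSERT INTO measurements_minute (ship_id, ts_minute, avg_fuel_use)
SELECT ship_id,
       date_trunc('minute', timestamp) AS ts_minute,
       AVG(fuel_use)                   AS avg_fuel_use
FROM measurements
WHERE timestamp >= now() - interval '10 minutes'
GROUP BY ship_id, date_trunc('minute', timestamp);

The hour, day, and month tables can then be filled from the minute table with the same GROUP BY pattern rather than from the raw data.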
