On Stack Overflow, you can earn a few badges that are based on streaks, and from those you can imagine other use cases for tracking streaks:
Visit the site every day for 30 days.
Vote on 30 posts per day, for 7 days in a row.
Etc.
How would you keep track of and implement this in a database? (It doesn't really matter which kind: SQL, NoSQL, etc.)
The first hard problem is: do you compute streaks in cron/background jobs, or on every change to the database (every page view or vote, for example)?
The second hard problem, depending on the answer to the first, is: how do you efficiently query the state of the database to figure out whether the conditions are met, and how do you avoid counting previous "object actions" in the new calculation?
So let's focus on the votes one, since that is a little more complicated. Say we are tracking "UTC days".
If we go with the "check on every database change" solution (to part 1), then on the next vote, after saving it to the DB, we query all of that user's votes since the start of the current UTC day. If the vote count is >= 30, we create a new "streak" record and count the day as 1. Then on the next vote (same day), we somehow need to know we already counted it, so the approach needs modifying. Maybe we track a "last_vote_id" for the current day, so we have (pseudocode):
table vote_streaks {
date last_date;
int last_vote_id;
int last_date_count;
int total_streak_count;
}
Then on the next vote on the same day, we check the last_date where we tracked the vote (today) and the last_date_count (say we had 30 votes at that point); since the total streak is 1, we know we should ignore the 31st vote for that day. Then on day 2, last_date doesn't match today, so we reset last_date_count to 0 before setting it to 1. I think something like that might work? I can't quite tell.
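To make that concrete, the per-vote bookkeeping might look roughly like the sketch below. This is my illustration, not a known implementation: it assumes the 30-votes-per-day threshold, one vote_streaks row per user (created on their first vote), and standard SQL semantics where every SET expression sees the pre-update values; detecting a broken streak (a missed day) is still left out, as above.

-- Run after each saved vote, inside the same transaction.
UPDATE vote_streaks
SET total_streak_count = CASE
        WHEN last_date = CURRENT_DATE AND last_date_count = 29
            THEN total_streak_count + 1   -- this vote is today's 30th: the day counts
        ELSE total_streak_count           -- day already counted, or still under 30
    END,
    last_date_count = CASE
        WHEN last_date = CURRENT_DATE THEN last_date_count + 1
        ELSE 1                            -- first vote of a new day: reset, then count it
    END,
    last_vote_id = :new_vote_id,
    last_date    = CURRENT_DATE
WHERE user_id = :user_id;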
If we went with the cron/background job approach, then we would query all votes (limit 30) for each user sometime after the start of the UTC day. If we can build a streak out of that (which is more straightforward), then we get the streak. But what if some problem occurs in the middle of the job and it cuts out and has to restart? I can't imagine how to solve that yet, but it seems way more complicated than the real-time approach.
How is this generally solved? Again, no need to get into the weeds of the actual SQL/NoSQL tables (unless you'd like to; pseudocode is fine). Or, instead of a general solution, if you know of a specific one (like how StackOverflow implements it), that would work too.
The smallest amount of data you need to keep track of here is the date of the user's most recent action and the current streak count.
When the user performs the streaked action, you need to check the date of their most recent action. If the most recent action was within the last day but on a different day (i.e., yesterday), you increment the streak count; if it was two or more days ago, you reset the streak count to zero; otherwise you just update the most-recent-action timestamp.
When checking for streaks, you need to check the most-recent-action timestamp as well. If the timestamp was last updated within the last day, the stored streak count is valid; otherwise the real streak count is zero.
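A minimal sketch of that approach, assuming MySQL (ON DUPLICATE KEY UPDATE evaluates assignments left to right, so streak is computed before last_date is overwritten; the table and column names are mine, not from the question):

CREATE TABLE user_streaks (
    user_id   INT  PRIMARY KEY,
    last_date DATE NOT NULL,
    streak    INT  NOT NULL
);

-- On each streaked action:
INSERT INTO user_streaks (user_id, last_date, streak)
VALUES (:user_id, CURRENT_DATE, 1)
ON DUPLICATE KEY UPDATE
    streak = CASE
        WHEN last_date = CURRENT_DATE THEN streak                       -- same day: no change
        WHEN last_date = CURRENT_DATE - INTERVAL 1 DAY THEN streak + 1  -- yesterday: extend
        ELSE 1                                                          -- gap: restart at 1
    END,
    last_date = CURRENT_DATE;

Note the reset lands on 1 rather than 0, since the action that triggers it also starts the new streak.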
Alternatively, you can simply use a document database and do the full streak calculation from the transaction log. This gets expensive for users with long streaks, though it is the simplest to implement. Depending on how often you expect people to have long streaks, that may or may not be acceptable.
Related
I have users, and each user gets assigned 12 events every 2 months (they can reschedule these events, etc.). Each event is an object with an id, name, description, date, and an is-completed flag.
I'm currently saving these events in the user's document so that I only do one document read: events: [{event} * 12]. After a year there will be 72 events in this array, and it will keep growing year after year.
I'm wondering, should I be concerned with the 1mb limit?
I'd like to preserve history, so that the user can also view events of the past.
Given that the calendar shows at most one month's worth of events, and say I lazy-load the previous month for speed, using a subcollection for events would mean 12-24 document reads. I fear this would get expensive very quickly.
Any advice would be appreciated, thanks.
Honestly, I wouldn't be too concerned about the 1MB limit; that is still a lot of characters (roughly 1 million, though it may be a bit less depending on data types), so unless the descriptions could be incredibly long, I think it's unlikely you will get anywhere near the limit.
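As a rough, back-of-envelope estimate (my numbers, not from the question): if an average event weighs in around 200 bytes (an id, short name and description strings, a date, and a flag), then 72 events add roughly 14 KB per year, so the array alone would take decades to approach 1 MB.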
That being said, if it is a concern, you could schedule a cloud function to run periodically (perhaps every 3 months) and archive or move events that are no longer in use to a subcollection, storing them across more documents (one per quarter, year, or whatever time period you decide on).
At my new job we have an old program, 10 to 15 years old I think, that is still in use. I am working on renewing the system, which has a major problem in its old data.
The program is part of a payment system. It allows the original payment to be split at paying time.
When it splits a payment, it updates the original record and keeps the original date on the new record. It keeps the value from before the last operation in a separate "original" field.
A $1000 original is split into two $500 payments by adding a new $500 record and updating the original record into a $500 payment, keeping $1000 as the "original" value.
The $500 is then split into $300 and $200 by adding a new $200 record and updating the original row into a $300 payment; now the "original" field is updated to $500 instead of $1000.
and so on.
The following image contains an example case based on a real one, with two original payments, 1000 and 600.
Whoever made the program did not use transactions, so sometimes the new record is not added (that's how the problem was discovered, but too late; 15 years too late).
How can I find the affected customers among 4.5 million records?
Is there a way to find the real original amount from the original field in the image? (I know that the answer might be no).
The database is Oracle and the program was developed in Oracle Forms.
Thank you.
Edit: a step-by-step example in a spreadsheet:
https://docs.google.com/spreadsheets/d/1I9jOlCeiVuGdNlgXpiF_-Ic0e-cqaalrpUCJIUM5oAk/edit?usp=sharing
The problem is that the date field keeps only the date, not the time. If the customer made several transactions on the same day, the error becomes hard to detect; it can only be detected when the customer made a single transaction that day, and even then it has to be reviewed case by case. That's hard for years' worth of data. Unfortunately, maybe they will have to take a loss for bad programming.
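For a first pass at finding candidates, something like the sketch below might help. All names here are guesses, since the real table fields haven't been posted yet, and as noted above it misjudges customers with several same-day transactions (and a sibling record that was itself split later will also be missed), so hits still need case-by-case review:

-- Every row that was split should have a sibling row on the same
-- customer and date whose amount covers the difference between the
-- recorded "original" and the current amount. Rows with no such
-- sibling are candidates for a lost split record.
SELECT p.*
FROM payments p
WHERE p.original_amount IS NOT NULL
  AND p.original_amount <> p.amount
  AND NOT EXISTS (
        SELECT 1
        FROM payments s
        WHERE s.customer_id  = p.customer_id
          AND s.payment_date = p.payment_date
          AND s.payment_id  <> p.payment_id
          AND s.amount       = p.original_amount - p.amount
      );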
I will provide all the table fields tomorrow for better understanding.
Thank you for the replies.
tl;dr general question about handling database data and design:
Is it ever acceptable (and are there any downsides) to derive data from other data at some point in time and then store that derived data in a separate table, in order to keep a history of the values at that time? Or should you never store data that is derived from other data, and instead derive the required data from the existing data only when you need it?
My specific scenario:
We have a database where we record people's vacation days and vacation day statuses. We track how many days they have left, how many days they've taken, and things like that.
One design requirement has changed and now asks that I be able to show how many days a person had left on December 31st of any given year. So I need to be able to say, "Bob had 14 days left on December 31st, 2010".
I see two ways we could do this:
1. A SQL Server Agent job that, on December 31st, captures the days remaining for everyone at that time and inserts them into a table like "YearEndHistories", which would have EmployeeID, Year, and DaysRemaining at that time.
2. We don't keep a YearEndHistories table; instead, when we want to find out the number of days possessed at a certain time, we loop through all vacations added and subtracted that exist UP TO that specific time.
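For scale, #1 could be as small as one scheduled statement, something like this sketch (T-SQL; the table and column names are illustrative, and DaysRemaining stands in for however the balance is computed today):

-- Runs once, on December 31st, via the Agent job.
INSERT INTO YearEndHistories (EmployeeID, [Year], DaysRemaining)
SELECT EmployeeID, YEAR(GETDATE()), DaysRemaining
FROM CurrentVacationBalances;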
I like the feeling of certainty that comes with #1: the recorded values would be reviewed by administration, and there would be no arguing about the number or possibility of it changing. With #2, I like the efficiency: one less table to maintain, and no derived data in the actual tables. But I have a weird fear of some unseen bug slipping by and people's historical value calculations getting screwed up. In 2020 I don't want to deal with, "I ended 2012 with 9.5 days, not 9.0! Where did my half day go?!"
One thing we have decided on is that it will not be possible to modify values in previous years. That means it will never be possible to go back to a previous calendar year and add a vacation day or anything like that. The value at the end of the year is THE value, regardless of whether or not there was a mistake in the past. If a mistake is discovered, it will be balanced out by awarding or subtracting vacation time in the current year.
Yes, it is acceptable, especially if the calculation is complex, frequently called, or doesn't change very often (e.g. a high-score table in a game: it's viewed very often, but the contents change only on the increasingly rare occasions when a player does very well).
As a general rule, I would normalise the data as far as possible, then add in derived fields or tables where necessary for performance reasons.
In your situation, the calculation seems relatively simple - a sum of employee vacation days granted - days taken, but that's up to you.
As an aside, I would encourage you to get out of thinking about "loops" when data is concerned - try to think about the data as a whole, as a set. Something like
SELECT StaffID, SUM(Vacation) AS DaysRemaining
FROM
(
    -- Days granted up to the cut-off date
    SELECT StaffID, SUM(VacationAllocated) AS Vacation
    FROM Allocations
    WHERE AllocationDate <= CONVERT(datetime, '2010-12-31', 120)
    GROUP BY StaffID

    UNION ALL  -- not UNION: both rows for a StaffID must be kept

    -- Days taken up to the cut-off date, as a negative count
    SELECT StaffID, -COUNT(DISTINCT HolidayDate)
    FROM HolidayTaken
    WHERE HolidayDate <= CONVERT(datetime, '2010-12-31', 120)
    GROUP BY StaffID
) totals
GROUP BY StaffID
Derived data seems to me like a transitive dependency, which is avoided in normalisation.
That's the general rule.
In your case I would go for #1, which gives you better auditability without a performance penalty.
I have a cube that shows order volume history, but showing the month to date numbers isn't that useful (since they're way out of whack with other full months). I want to include that data in some kind of display, but in a way that makes sense.
What's the best way to go about showing the last 30 days (or last month) over a previous period? If there's some way I can do that with my time dimension, great, but I figured I might need a calculation of some kind.
For example, since today is 7/12, I'd want to show the data for 6/13-7/12 as the most recent period, and then compare it to 5/13-6/12, and so on. That way, it would be easy to see where the current month is trending, but with values that are in line, size-wise, with previous periods. I also figure that would make an easier KPI value, since I can measure movement from last month's values using the rolling one-month period for comparison.
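If it helps to see the shape of the calculation, here is the same comparison sketched in plain SQL rather than as a cube calculation (table and column names are illustrative):

-- Trailing 30 days versus the 30 days before that, as of today.
SELECT
    SUM(CASE WHEN OrderDate >  DATEADD(day, -30, GETDATE())
             THEN Volume ELSE 0 END) AS Current30Days,
    SUM(CASE WHEN OrderDate <= DATEADD(day, -30, GETDATE())
              AND OrderDate >  DATEADD(day, -60, GETDATE())
             THEN Volume ELSE 0 END) AS Previous30Days
FROM OrderVolumeHistory
WHERE OrderDate > DATEADD(day, -60, GETDATE());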
We are designing a MySQL table to track the number of followers on a daily basis for 10,000s of Twitter accounts. We've been struggling to figure out the most efficient way to store this data. The two options we are consider are:
1) OPTION 1 - A table with columns: Twitter ID, Month, Day1, Day2, Day3, etc., where each Day column would contain the number of followers for that account on that day of the specified month
2) OPTION 2 - A table with columns: Twitter ID, Day, Followers
Option 1 would result in about 30x fewer rows than Option 2. What I'm not sure about, from a performance perspective, is whether it's preferable to have fewer columns or fewer rows.
In terms of the queries we will be using, we just want to be able to query the data to get the number of followers for a specific Twitter account for arbitrary time ranges.
I would appreciate suggestions on which approach is better and why. Also, if there is a much better option than the ones I present please feel free to suggest it.
Thanks in advance for your help!
Option 2, no question.
Imagine trying to write a query using each option. Let's give the best case for option 1: we know we want the total for all 31 days of the month. Then with option 1 the query is:
select twitterid, day1+day2+day3+day4+day5+day6+day7+day8+day9+day10
+day11+day12+day13+day14+day15+day16+day17+day18+day19+day20
+day21+day22+day23+day24+day15+day26+day27+day28+day29+day30
+day31 as total
from table1
where month='2010-12';
With option 2 it's:

select twitterid, sum(followers) as total
from table2
where day between '2010-12-01' and '2010-12-31'
group by twitterid;
The second looks way easier to me. If you don't think so, tell me if you immediately noticed the error in the option 1 version, and if you're confident that no programmer would ever make such an error.
Now imagine that the requirements change just slightly, and someone wants the total for one week. With the second version, that's easy: give a date range that describes the week. This could easily be done when building a query on the fly: just ask for the start date and add 6 days to it for the end date. But with the first version, what are you going to do? You'd have to figure out which days of the month fall in that week and change the list of fields retrieved. The week might span two calendar months. This would be a giant pain.
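For example, the week starting 2010-12-06 (dates illustrative):

select twitterid, sum(followers) as total
from table2
where day between '2010-12-06' and '2010-12-12'
group by twitterid;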
As to performance: Sure, more rows take more time to retrieve. But longer rows also take more time to retrieve. Lesson 1 on database design: Don't throw out normalization to do a micro-optimization when you don't even have a good reason to believe there's a problem. Build a normalized database first. Then if it turns out that there are performance problems, tune it afterwards. Odds are that you can buy a faster hard drive for a whole lot less than the cost of one day of programmer's time taken finding a mistake in an unnecessarily complex query.
Of course it depends on what queries you are going to do, but unless every query requires all 31 days of a month, use Option 2 for your operational data.
It's better from a logical perspective (say later on you don't want queries per "30 calendar days" but per "last X days").
It's better for writes, too (you only update one row with two fields instead of overwriting all fields).
You can always optimize later (partitioning comes to mind).
Your data warehouse can still be optimized for long-term aggregate statistics.
Use Option 2. Option 1 would be a nightmare for queries.
MySQL has good support for doing date ranges in queries, so it is easiest to just have row per day.
I would say option 2, but you would probably want to add a field for a primary key to speed up queries. And if that primary key is an integer value, even better.
Option 2 definitely (with a two-column unique key/constraint on Twitter ID and Day).
Option 1 will just be regrettable.
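As a sketch of that (MySQL; table and column names are illustrative), the composite primary key gives you the unique constraint and also serves range queries by account and date:

CREATE TABLE follower_counts (
    twitter_id BIGINT NOT NULL,
    day        DATE   NOT NULL,
    followers  INT    NOT NULL,
    PRIMARY KEY (twitter_id, day)  -- one row per account per day
);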