MySQL Database Table Structure Efficiency Advice

We are designing a MySQL table to track the number of followers on a daily basis for 10,000s of Twitter accounts. We've been struggling to figure out the most efficient way to store this data. The two options we are considering are:
1) OPTION 1 - Table with columns: Twitter ID, Month, Day1, Day2, Day3, etc., where each DayN column would contain the number of followers for that account on that day of the specified month
2) OPTION 2 - Table with columns: Twitter ID, Day, Followers
Option 1 would result in about 30x fewer rows than Option 2. What I'm not sure about, from a performance perspective, is whether it's preferable to have fewer columns or fewer rows.
In terms of the queries we will be using, we just want to be able to query the data to get the number of followers for a specific Twitter account for arbitrary time ranges.
I would appreciate suggestions on which approach is better and why. Also, if there is a much better option than the ones I present please feel free to suggest it.
Thanks in advance for your help!

Option 2, no question.
Imagine trying to write a query using each option. Let's give the best case for option 1: we know we want the total for all 31 days of the month. Then with option 1 the query is:
select twitterid, day1+day2+day3+day4+day5+day6+day7+day8+day9+day10
+day11+day12+day13+day14+day15+day16+day17+day18+day19+day20
+day21+day22+day23+day24+day15+day26+day27+day28+day29+day30
+day31 as total
from table1
where month='2010-12';
With option 2, the query is:
select twitterid, sum(followers) as total
from table2
where day between '2010-12-01' and '2010-12-31'
group by twitterid;
The second looks way easier to me. If you don't think so, tell me if you immediately noticed the error in the option 1 version, and if you're confident that no programmer would ever make such an error.
Now imagine that the requirements change just slightly, and someone wants the total for one week. With the second version, that's easy: give a date range that describes that week. This could easily be done when building a query on the fly: just ask for the start date and add 6 days to it for the end date. But with the first version, what are you going to do? You'd have to figure out which days of the month fall in that week and change the list of fields retrieved. The week might span two calendar months. This would be a giant pain.
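For example, building that one-week query on the fly might look like this - a minimal sketch, assuming the option 2 table and column names used above and a made-up start date:
-- Hypothetical: @start comes from the application; the end of the
-- week is simply the start date plus six days.
set @start = '2010-12-06';
select twitterid, sum(followers) as total
from table2
where day between @start and date_add(@start, interval 6 day)
group by twitterid;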
As to performance: Sure, more rows take more time to retrieve. But longer rows also take more time to retrieve. Lesson 1 on database design: Don't throw out normalization to do a micro-optimization when you don't even have a good reason to believe there's a problem. Build a normalized database first. Then if it turns out that there are performance problems, tune it afterwards. Odds are that you can buy a faster hard drive for a whole lot less than the cost of one day of programmer's time taken finding a mistake in an unnecessarily complex query.

Of course it depends on what queries you are going to do - but unless every query requires all 31 days of that month, use Option 2 for your operational data.
It's better from a logical perspective (say later on you don't want queries per "30 calendar days", but "last X days")
It's better for writes, too (you only update one row with two fields instead of overwriting all fields).
You can always optimize later (partitioning comes to mind)
Your data-warehouse can still be optimized for long-term aggregate statistics.

Use Option 2. Option 1 would be a nightmare for queries.
MySQL has good support for doing date ranges in queries, so it is easiest to just have a row per day.

I would say option 2, but you would probably want to add a field for a primary key to speed up queries. And if that primary key is an integer value, even better.

Option 2 definitely (with a two-column unique key/constraint on Twitter ID and Day).
Option 1 will just be regrettable.
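To make that concrete, here is a minimal sketch of the option 2 table; the table and column names are illustrative, not from the question:
-- Hypothetical DDL: the composite primary key doubles as the
-- unique constraint on (twitter_id, day).
create table follower_counts (
    twitter_id bigint unsigned not null,
    day        date            not null,
    followers  int unsigned    not null,
    primary key (twitter_id, day)
);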

Related

Hit Counter: Separate Date + Time Fields vs one DateTime2 field

I am planning out a hit counter, and I plan to make many report queries to show the total number of hits in a day, the past week, the past month, etc., as well as one that would feed a chart showing what time of day was most popular, within a specific date range, for a specific page.
With this in mind, would it be beneficial to store the DATE in a separate field from the TIME that the hit occurred, then add indexes? I would be using a where clause with a range (greater than x and less than y) for some of these queries. I do expect to have queries that ask about both the Date and the Time, such as "within the past 6 months, show me the number of hits grouped per hour of the day."
Am I overcomplicating it? Should I just use a single DateTime2(0) field, or is there some advantage to using two fields for this?
I think you are bordering on premature optimization with this approach.
Use Datetime. In due time (i.e. after your application has reached Production and you have a better idea of the actual requirements and how it performs) you can for example introduce views to aggregate your data in a way that proves more useful for any reporting/querying you have to perform frequently.
In the most extreme case you can even refactor your schema and migrate everything from Datetime to two distinct fields, but I doubt this will prove necessary.
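For what it's worth, the "hits per hour of day" report works fine against a single datetime2 column. A rough sketch follows; the table, column, and page filter are all hypothetical:
-- Buckets the last six months of hits for one page by hour of day.
select datepart(hour, hit_time) as hour_of_day,
       count(*) as hits
from hits
where hit_time >= dateadd(month, -6, sysdatetime())
  and page_id = 42
group by datepart(hour, hit_time)
order by hour_of_day;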

More fields in a database or more records?

I'm making a database for different rental investments for my employer. I want each record to have the investment code followed by expected monthly cashflows for the next 10 years.
I can set this up with 120 fields, one for each future monthly cashflow,
OR
I have only 3 fields - investment code, month and cashflow.
Which is better? I will probably have 5000 new investments each month. The first produces 5000 records, the second produces 600000. Is that a problem? I'll want to run queries and stuff based on relationships in the rest of the database. Which approach gives the best performance?
Thanks in advance!
I like the second approach; it might have a bigger table size, since the investment code repeats for each record, but it is more normalized. I don't think there should be any performance issue if you have proper primary keys set up. It also has another advantage for new investments: you don't need all 120 fields, since the cashflows seem to be able to start in any month.
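A minimal sketch of that narrow layout, with names invented for illustration (Access data types will differ slightly):
-- Hypothetical three-field table: one row per investment per month.
create table cashflows (
    investment_code varchar(20)   not null,
    cashflow_month  date          not null, -- first day of the month
    cashflow        decimal(12,2) not null,
    primary key (investment_code, cashflow_month)
);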
An update...
I did my own testing of both options - short and wide vs. tall and narrow - using a test set of many thousands of records.
The query performance of both is roughly the same (as long as the query is well designed and doesn't return individual records). If you use a stopwatch, the wider table is slightly better by a couple of seconds, but tall is still acceptable.
The main difference is the amount of data produced. The wide table takes up only 200 MB; the tall one takes up 1 GB for the same information! Writing data into the tall table also takes a lot longer.
Given that Access has a 2 GB limit, and that I don't need the flexibility to ever add past 5 years' worth (I know for a fact this will not be needed), I think I'll go with option 1 - short and wide tables.
I'm a little surprised at just how big the tall table was - I would have thought Access would be able to compress the data a little better.

How to improve the design of a huge table which changes every minute and requires lot of queries?

Schema - person_id, location_id, time_slot, availability
So, the table tells me during which periods of the day (divided into time_slots) a person was present at a particular location.
So, if time_slots are 10 minutes long, there are going to be 6*24 = 144 rows per day for 1 person at 1 location.
1 person may also be present at another 10 locations. So there are basically 144*10 = 1,440 rows per day for every person. Since I expect to have 100,000 people in the DB, this problem clearly looks unscalable to me (at least in MySQL).
The DB should perform well on queries like telling me if person X can be present in location Y from 11 am to 5 pm.
I cannot think much beyond the obvious solution. I thought of implementing a segment tree type solution, but that seemed very complex. Also, creating another table with larger time slots seemed like a solution.
I think the entire table design has to be changed and am looking for some help with this. Thanks a lot.
It sounds like your queries are user-specific. If that's the case, you could enable partitioning on the table by user. This will improve your query performance.
Just found a link with a similar use case:
http://www.chrismoos.com/2010/01/31/mysql-partitioning-tables-with-millions-of-rows
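As a rough sketch, hash partitioning by person might look like this in MySQL (column types are assumptions; note that MySQL requires the partitioning column to be part of every unique key, hence the composite primary key):
-- Hypothetical: spreads rows over 16 partitions so that queries
-- filtered on person_id only touch one partition.
create table availability (
    person_id    int unsigned not null,
    location_id  int unsigned not null,
    time_slot    smallint     not null,
    availability tinyint(1)   not null,
    primary key (person_id, location_id, time_slot)
)
partition by hash(person_id)
partitions 16;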

Deriving and saving the historical values into a separate table, or calculate the historical values from the existing data only when they're needed?

tl;dr general question about handling database data and design:
Is it ever acceptable (and are there any downsides) to derive data from other data at some point in time, and then store that derived data in a separate table in order to keep a history of the values at that time? Or should you never store data that is derived from other data, and instead derive the required data from the existing data only when you need it?
My specific scenario:
We have a database where we record peoples' vacation days and vacation day statuses. We track how many days they have left, how many days they've taken, and things like that.
One design requirement has changed and now asks that I be able to show how many days a person had left on December 31st of any given year. So I need to be able to say, "Bob had 14 days left on December 31st, 2010".
We could do this two ways, as I see it:
1) A SQL Server Agent job that, on December 31st, captures the days remaining for everyone at that time and inserts them into a table like "YearEndHistories", which would have your EmployeeID, Year, and DaysRemaining at that time.
2) We don't keep a YearEndHistories table; instead, if we want to find out the number of days possessed at a certain time, we loop through all vacations added and subtracted that exist UP TO that specific time.
I like the feeling of certainty that comes with #1 --- the recorded values would be reviewed by administration, and there would be no arguing or possibility about that number changing. With #2, I like the efficiency --- one less table to maintain, and there's no derived data present in the actual tables. But I have a weird fear about some unseen bug slipping by and peoples' historical value calculation start getting screwed up or something. In 2020 I don't want to deal with, "I ended 2012 with 9.5 days, not 9.0! Where did my half day go?!"
One thing we have decided on is that it will not be possible to modify values in previous years. That means it will never be possible to go back to the previous calendar year and add a vacation day or anything like that. The value at the end of the year is THE value, regardless of whether or not there was a mistake in the past. If a mistake is discovered, it will be balanced out by rewarding or subtracting vacation time in the current year.
Yes, it is acceptable, especially if the calculation is complex or frequently called, or doesn't change very often (e.g. a high-score table in a game - it's viewed very often, but the content only changes on the increasingly rare occasions when a player does very well).
As a general rule, I would normalise the data as far as possible, then add in derived fields or tables where necessary for performance reasons.
In your situation, the calculation seems relatively simple - a sum of employee vacation days granted - days taken, but that's up to you.
As an aside, I would encourage you to get out of thinking about "loops" when data is concerned - try to think about the data as a whole, as a set. Something like
SELECT StaffID, SUM(Vacation) AS DaysRemaining
FROM (
    -- Days granted up to the cut-off date
    SELECT StaffID, SUM(VacationAllocated) AS Vacation
    FROM Allocations
    WHERE AllocationDate <= CONVERT(datetime, '2010-12-31', 120)
    GROUP BY StaffID
    UNION ALL -- keep both branches even if their values coincide
    -- Days taken up to the cut-off date, counted as negatives
    SELECT StaffID, -COUNT(DISTINCT HolidayDate)
    FROM HolidayTaken
    WHERE HolidayDate <= CONVERT(datetime, '2010-12-31', 120)
    GROUP BY StaffID
) totals
GROUP BY StaffID
Derived data seems to me like a transitive dependency, which is avoided in normalisation.
That's the general rule.
In your case I would go for #1, which gives you better auditability, without a performance penalty.
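For illustration, the #1 snapshot could be as small as this, assuming a hypothetical CurrentBalances view that exposes each employee's present remaining days:
-- Run by a SQL Server Agent job on December 31st.
-- CurrentBalances is an assumed view with (EmployeeID, DaysRemaining).
INSERT INTO YearEndHistories (EmployeeID, [Year], DaysRemaining)
SELECT EmployeeID, YEAR(GETDATE()), DaysRemaining
FROM CurrentBalances;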

Printing the names of all the people older than age 18?

This was a pretty good question that was posed to me recently. Suppose we have a hypothetical (insert your favorite data storage tool here) database that consists of the names, ages and addresses of all the people residing on this planet. Your task is to print out the names of all the people whose age is greater than 18 within an HTML table. How would you go about doing that? Let's say that, hypothetically, the population is growing at a rate of 1200 per second and the database is updated accordingly (don't ask how). What would be your strategy to print the names of all these people and their addresses in an HTML table?
Storing the ages in a DB table sounds like a recipe for trouble to me - it would be impossible to maintain. You would be better off storing the birth dates, then building an index on that column/attribute.
You have to get an initial dump of the table for display. Just calculate the date 18 years ago (let's say D0) and use a query for any person born earlier than that.
Use DB triggers to receive notifications about deaths, so that you can remove them from the table immediately.
Since people only get older (unfortunately?), you can use ranged queries to get new additions (i.e. people that have become 18 years old since you last queried the table). E.g. if you want to update the display the next day, you issue a query only for the people that were born on day D0 + 1 - no need to request the whole table again.
You could even prefetch the people who reach 18 years of age the next day, keep the entries in memory, and add them to the display at the exact moment they reach that age.
BTW, even with 2 KB of data for each person, you get an 18 TB database (assuming 50% overhead). Any slightly beefed-up server should be able to handle this kind of DB size. On the other hand, the thought of a 12 TB HTML table terrifies me...
Oh, and beware of timezone and DST issues - time is such a relative thing these days...
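Putting the pieces above together, here is a quick sketch of the birth-date approach; the table and column names are invented for illustration (MySQL flavour):
-- Initial dump: everyone born at least 18 years ago (on or before D0).
-- An index on birth_date makes this a simple range scan.
select name, address
from people
where birth_date <= date_sub(curdate(), interval 18 year);

-- Incremental update the next day: only the people who turn 18 that day.
select name, address
from people
where birth_date = date_sub(curdate(), interval 18 year);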
I don't see what the problem is. You don't have to worry about new records being added at all, since none of them will be included in your query unless that query takes 18 or more years to run. If you have an index on age, and presumably any DB technology sufficient to handle that much data and 1200 inserts a second updates indexes on insert, it should just work.
In the real world, using existing technologies or something like it, I would create a daily snapshot once a day and do queries on that read-only snapshot that would not include records for that day. That table would certainly be good enough for this query, and most others.
Are you forced to aggregate all of the entries into one table?
It would be simpler if you were to create a table for each age group (only around 120 tables would be needed) and just insert entries into those, as it's computationally simpler to pick one of 120 tables when inserting an entry than to search over 6,000,000,000 rows when looking for entries.
