My application is going to be creating a growing number of materialized views.
The number of materialized views will increase by 8,000 - 10,000 per month.
Most of the views will hold very little information, around 100-1000 rows with small fields, but a few (10 per month) will hold from 100,000 to millions of rows with small fields.
I want to make sure this is a good idea before going too far into the implementation.
Can anyone tell me what hard limits I might hit, or whether this is a good idea at all?
If needed, I can explain the use case further. It may be possible to drop most (99%+) of the older views if necessary, but not all of them (the ~10 big ones per month must stay).
EDIT: Explanation Requested
The app allows users to vote on content, and then we make charts of the content with the most votes. Voting happens in 5-minute units, and charts are generated on request. A user can look at any length of time with a granularity of 5 minutes. For example, I could look at just the past 5 minutes of votes, 15 minutes, 2 hours, etc.
To tally votes and make the "Top Charts" we must run an expensive query, searching all the content that has received votes within the requested time units, tallying, and sorting. To mitigate this, I wanted to make a sorted materialized view with the vote results for EVERY time unit, since once voting closes for a 5-minute unit, its votes never change. This way, a popular search (like the latest 5-minute chart) will not have to be generated and sorted every time a user searches (there are about 8,760 5-minute units in an average month). I also wanted to make mviews for days, weeks, and months.
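For concreteness, each 5-minute mview would be something like the following (a sketch only - the vote_tallies name and every column except tuid are illustrative; my actual source table is shown below):

-- illustrative: one sorted materialized view per closed 5-minute unit
create materialized view chart_tuid_123456 as
select content_id, sum(votes) as total_votes
from vote_tallies            -- hypothetical source table, see below
where tuid = 123456
group by content_id
order by total_votes desc;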
This is the table I will be using to generate the mviews, with tuid being a reference to the 5-minute voting time unit
Perhaps there is a better way to make this efficient with caching?
I have users, and each user gets assigned 12 events every 2 months (they can reschedule these events, etc.). Each event is an object with an id, name, description, date, and an is-completed flag.
I'm currently saving these events in the user's document so that I only do one document read: events: [{event} * 12]. After a year there will be 72 events in this array, and it will keep growing year after year.
I'm wondering, should I be concerned about the 1 MB document limit?
I'd like to preserve history, so that the user can also view past events.
Given that the calendar shows at most one month's worth of events, and assuming I lazy-load the previous month for speed, using a subcollection for events would result in 12-24 document reads. I fear this would get expensive very quickly.
Any advice would be appreciated, thanks.
Honestly, I wouldn't be too concerned with the 1 MB limit - that is still a lot of characters (roughly 1 million, although it may be a bit less depending on data types) - so unless the descriptions could be incredibly long, I think it's unlikely you will get anywhere near that limit.
That being said, if it is a concern, you could schedule a Cloud Function to run periodically (perhaps every 3 months) and move events that are no longer in active use into a subcollection, spreading them across more documents (one per quarter, year, or whatever time period you decide on).
I'm maintaining a system where users create something called "books" that are accessed by other users.
I need a convenient, performant way to store these visit events in the database so that I can later display graphs with statistics. The graphs need to show a history where the owner of a book can see on which days of the week, and at which times, there is more visiting activity (across the months).
Using an ERD (Entity-Relationship Diagram), I can produce the following conceptual model:
At first the problem seems to be solved, as we have a very simple situation here. This gives me a table with 3 fields: one holds the timestamp of the visit event, and the other 2 are foreign keys, one referencing the user and the other referencing the book that was visited. In short, every record in this table is a visit:
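Something along these lines (just a sketch - the table and column names and types are illustrative, and the users and books tables are assumed):

create table visit (
    visited_at timestamp not null,                       -- when the visit occurred
    user_id    integer   not null references users (id), -- who visited
    book_id    integer   not null references books (id)  -- which book was visited
);

-- e.g. visiting activity per weekday and hour for one book
select extract(dow  from visited_at) as day_of_week,
       extract(hour from visited_at) as hour_of_day,
       count(*)                      as visits
from visit
where book_id = 42
group by 1, 2
order by 1, 2;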
However, considering that a user averages about 10 to 30 book visits per day, and that the system has 100,000 users, this table could add many gigabytes of new records in a single day. I'm not the most experienced person in database performance practices, but I'm pretty sure this is not the right solution.
Even if I periodically clean up the database and delete old records, I need to keep a history of at least the last 2 months of visits.
I've been looking for a way to solve this for days, and I have not found anything yet. Could someone help me, please?
Thank you.
Note: I'm using PostgreSQL 9.x, and the system is written in Java.
As mentioned in the comments, you might be overestimating the data size. Let's do the math: 100k users, 30 visits per day each, at, say, 30 bytes per record.
(100_000 * 30 * 30) / 1_000_000 # => 90 megabytes per day
Even if you add index size and some amount of overhead, this is still a few orders of magnitude lower than "many gigabytes per day".
I'm making a database for different rental investments for my employer. I want each record to have the investment code followed by expected monthly cashflows for the next 10 years.
I can set this up with 120 fields, one for each future monthly cashflow,
OR
I can have only 3 fields: investment code, month, and cashflow.
Which is better? I will probably have 5,000 new investments each month. The first approach produces 5,000 records per month, the second produces 600,000. Is that a problem? I'll want to run queries and so on based on relationships to the rest of the database. Which approach gives the best performance?
Thanks in advance!
I like the second approach. It might have a bigger table size, since the investment code repeats for each record, but it is more normalized. I don't think there should be any performance issue if you have a proper primary key set up. It also has another advantage for new investments: you don't need all 120 fields, since the cashflows can start in any month.
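A minimal sketch of that second layout (generic SQL - the table name, column names, and types are illustrative and will depend on your database):

create table investment_cashflow (
    investment_code varchar(20)   not null,  -- reference to the investment record
    cashflow_month  date          not null,  -- e.g. first day of the month
    cashflow        decimal(15,2) not null,  -- expected cashflow for that month
    primary key (investment_code, cashflow_month)
);

-- e.g. all expected monthly cashflows for one investment, in order
select cashflow_month, cashflow
from investment_cashflow
where investment_code = 'INV-0001'
order by cashflow_month;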
An update...
I did my own testing of both options - wide-and-short vs. narrow-and-tall - with a test data set of many thousands of records.
The query performance of both is roughly the same (as long as the query is well designed and doesn't return individual records). If you use a stopwatch, the wider table is slightly better by a couple of seconds, but tall is still acceptable.
The main difference is the amount of data produced. Wide takes up only 200 MB; tall takes up 1 GB for the same information! Writing data to the tall table also takes a lot longer.
Given that Access has a 2 GB limit, and that I don't need the flexibility to ever add past 5 years' worth (I know for a fact this will not be needed), I think I'll go with option 1 - the short and wide table.
I'm a little surprised at just how big the tall table was - I would have thought Access would be able to compress the data a little better.
We are designing a MySQL table to track the number of followers on a daily basis for tens of thousands of Twitter accounts. We've been struggling to figure out the most efficient way to store this data. The two options we are considering are:
1) OPTION 1 - Table with columns: Twitter ID, Month, Day1, Day2, Day3, ..., Day31, where each DayN column contains the number of followers for that account on that day of the specified month
2) OPTION 2 - Table with columns: Twitter ID, Day, Followers
Option 1 would result in about 30x fewer rows than option 2. What I'm not sure about, from a performance perspective, is whether it's preferable to have fewer columns or fewer rows.
In terms of the queries we will be using, we just want to be able to query the data to get the number of followers for a specific Twitter account for arbitrary time ranges.
I would appreciate suggestions on which approach is better and why. Also, if there is a much better option than the ones I present please feel free to suggest it.
Thanks in advance for your help!
Option 2, no question.
Imagine trying to write a query using each option. Let's give the best case for option 1: we know we want the total for all 31 days of the month. Then with option 1 the query is:
select twitterid, day1+day2+day3+day4+day5+day6+day7+day8+day9+day10
+day11+day12+day13+day14+day15+day16+day17+day18+day19+day20
+day21+day22+day23+day24+day15+day26+day27+day28+day29+day30
+day31 as total
from table1
where month='2010-12';
With option 2, it's:
select twitterid, sum(followers) as total
from table2
where day between '2010-12-01' and '2010-12-31'
group by twitterid;
The second looks way easier to me. If you don't think so, tell me if you immediately noticed the error in the option 1 version, and if you're confident that no programmer would ever make such an error.
Now imagine that the requirements change just slightly, and someone wants the total for one week. With the second version, that's easy: give a date range that describes that week. This could easily be done when building a query on the fly: just ask for a start date and add 6 days to it for the end date. But with the first version, what are you going to do? You'd have to figure out which days of the month fall in that week and change the list of fields retrieved. The week might span two calendar months. This would be a giant pain.
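For example, a week that straddles a month boundary is still just a date range with option 2 (same illustrative table as above):

select twitterid, sum(followers) as total
from table2
where day between '2010-12-27' and '2011-01-02'
group by twitterid;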
As to performance: Sure, more rows take more time to retrieve. But longer rows also take more time to retrieve. Lesson 1 on database design: Don't throw out normalization to do a micro-optimization when you don't even have a good reason to believe there's a problem. Build a normalized database first. Then if it turns out that there are performance problems, tune it afterwards. Odds are that you can buy a faster hard drive for a whole lot less than the cost of one day of programmer's time taken finding a mistake in an unnecessarily complex query.
Of course it depends on what queries you are going to run - but unless every query requires all 31 days of that month, use option 2 for your operational data.
It's better from a logical perspective (say later on you don't want queries per "30 calendar days", but "last X days").
It's better for writes, too (only update 1 row with 2 fields instead of overwriting all fields).
You can always optimize later (partitioning comes to mind - see the sketch after this list).
Your data-warehouse can still be optimized for long-term aggregate statistics.
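As a sketch of that later optimization (assuming the option-2 table is named follower_counts with a DATE column day and a primary key of (twitterid, day) - MySQL requires the partitioning column to be part of every unique key):

-- monthly range partitions on the day column
alter table follower_counts
  partition by range (to_days(day)) (
    partition p2010_12 values less than (to_days('2011-01-01')),
    partition p2011_01 values less than (to_days('2011-02-01')),
    partition pmax     values less than maxvalue
  );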
Use Option 2. Option 1 would be a nightmare for queries.
MySQL has good support for doing date ranges in queries, so it is easiest to just have row per day.
I would say option 2, but you would probably want to add a field for a primary key to speed up queries. And if that primary key is an integer value, even better.
Option 2 definitely (with a two-column unique key/constraint on Twitter ID and Day).
Option 1 will just be regrettable.
I need some inspiration for a solution...
We are running an online game with around 80,000 active users - we are hoping to expand this and are therefore setting a target of reaching up to 500,000 users.
The game includes a highscore for all the users, which is based on a large set of data. This data needs to be processed in code to calculate the values for each user.
After the values are calculated we need to rank the users, and write the data to a highscore table.
My problem is that in order to generate a highscore for 500,000 users we need to load on the order of 25-30 million rows from the database, totalling around 1.5-2 GB of raw data. Also, in order to rank the values we need to have the complete set of values.
Also we need to generate the highscore as often as possible - preferably every 30 minutes.
Now we could just use brute force - load the 30 million records every 30 minutes, calculate the values, rank them, and write them to the database - but I'm worried about the strain this will cause on the database, the application server and the network - and whether it's even possible.
I'm thinking the solution might be to break up the problem somehow, but I can't see how. So I'm seeking some inspiration on possible alternative solutions, based on this information:
We need a complete highscore of all ~500,000 teams - we can't (won't, unless absolutely necessary) shard it.
I'm assuming that there is no way to rank users without having the list of all users' values.
Calculating the value for each team has to be done in code - we can't do it in SQL alone.
Our current method loads each user's data individually (3 calls to the database) to calculate the value - it takes around 20 minutes to load the data and generate the highscore for 25,000 users, which is too slow if this is to scale to 500,000.
I'm assuming that hardware size will not be an issue (within reasonable limits).
We are already using memcached to store and retrieve cached data
Any suggestions, links to good articles about similar issues are welcome.
Interesting problem. In my experience, batch processes should only be used as a last resort. You are usually better off having your software calculate values as it inserts/updates the database with the new data. For your scenario, this would mean that it should run the score calculation code every time it inserts or updates any of the data that goes into calculating the team's score. Store the calculated value in the DB with the team's record. Put an index on the calculated value field. You can then ask the database to sort on that field and it will be relatively fast. Even with millions of records, it should be able to return the top n records in O(n) time or better. I don't think you'll even need a high scores table at all, since the query will be fast enough (unless you have some other need for the high scores table other than as a cache). This solution also gives you real-time results.
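As a rough sketch of that approach (generic SQL - the teams table and column names are assumptions, not your actual schema, and :new_score / :team_id are just placeholders for values calculated in code):

-- store the calculated score with the team's record and index it
alter table teams add column score bigint not null default 0;
create index idx_teams_score on teams (score desc);

-- whenever the underlying data changes, recalculate in code and persist
update teams set score = :new_score where team_id = :team_id;

-- the highscore is then a cheap index-backed query
select team_id, score
from teams
order by score desc
limit 100;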
Assuming that most of your 2 GB of data is not changing that frequently, you can calculate and cache (in the DB or elsewhere) the totals each day, and then just add the difference based on the new records received since the last calculation.
In PostgreSQL you could cluster the table on the column that records when each row was inserted, and create an index on that column. You can then run calculations on recent data without having to scan the entire table.
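For example (a sketch - the score_events table, its columns, and the :last_run_at parameter are assumptions; note that CLUSTER is a one-time physical reorder, so it has to be re-run periodically):

-- index on the insertion timestamp, then physically order the table by it
create index idx_score_events_inserted_at on score_events (inserted_at);
cluster score_events using idx_score_events_inserted_at;

-- incremental pass: only touch rows added since the last calculation
select team_id, sum(points) as score_delta
from score_events
where inserted_at > :last_run_at
group by team_id;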
First and foremost:
The computation has to take place somewhere.
User experience impact should be as low as possible.
One possible solution is:
Replicate (mirror) the database in real time.
Pull the data from the mirrored DB.
Do the analysis on the mirror or on a third, dedicated, machine.
Push the results to the main database.
Results are still going to take a while, but at least performance won't be impacted as much.
How about saving those scores in the database, and then simply querying it for the top scores (so that the computation is done on the server side, not on the client side, and thus there is no need to move millions of records)?
It sounds pretty straightforward... unless I'm missing your point... let me know.
Calculate and store the score of each active team on a rolling basis. Once you've stored the score, you should be able to do the sorting/ordering/retrieval in the SQL. Why is this not an option?
It might prove fruitless, but I'd at least take a gander at the way sorting is done on a lower level and see if you can't manage to get some inspiration from it. You might be able to grab more manageable amounts of data for processing at a time.
Have you run tests to see whether or not your concerns with the data size are valid? On a mid-range server throwing around 2GB isn't too difficult if the software is optimized for it.
Seems to me this is clearly a job for caching, because you should be able to keep the half-million score records semi-local, if not in RAM. Every time you update data in the big DB, make the corresponding adjustment to the local score record.
Sorting the local score records should be trivial. (They are nearly in order to begin with.)
If you only need to know the top 100 or so scores, then the sorting is even easier. All you have to do is scan the list and insertion-sort each element into a 100-element list. If the element is lower than the lowest score already in that list, which it is 99.98% of the time, you don't have to do anything.
Then run a big update from the whole DB once every day or so, just to eliminate any creeping inconsistencies.