I have to cache program schedule data based on zipcode. Each zipcode can have between 8-20k program schedule entries for a day. Each program schedule entry would look like this,
program_name,
start_time,
end_time,
channel_no,
..
..
There can be up to 10k zipcode entries.
Now, I want to cache this in such a way so that I can query at any instant to get currently running programs. For a particular zipcode, I want to query based on condition below,
start_time < current_time + 2 minutes AND end_time > current_time
So, I was thinking of a couple of approaches here.
a) Use a redis list for each zipcode. List would contain all the program schedule entries. Load all the program schedule entries in memory and filter them based on query condition above.
b) Use 2 sorted sets for each zipcode. One set will use start_time as score for each program schedule entry. Another one with end_time as score. Once we have 2 sets, I could use the zrangebyscore for both sets by passing the current_time for the score param. And then do the intersection between the resulting sets.
I was wondering if there are better ways?
The List approach (a) is likely to be less performant since you'll need to get the entire list on every query.
Sorted Sets are more suitable for this purpose, but instead of using two you can probably get away with just one by setting the score to a composite value (start_time as the integer part, end_time as the fractional part), doing a ZRANGEBYSCORE and then filtering the result on the fractional part.
Also, whether you're using two Sorted Sets or just one, consider using a Lua script to perform the query to avoid network traffic and to localize data processing.
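As a rough illustration (not from the original answer), here is a minimal redis-py sketch along those lines. To avoid packing a full end_time into the fractional part of a double score, where precision can run out, this variant keeps start_time as the score, stores end_time inside the member, and does the end_time filter in a Lua script on the server; the key naming, JSON member encoding and epoch-second timestamps are all assumptions.

import json
import time

import redis

r = redis.Redis()

def add_entry(zipcode, entry):
    # Score = start_time; end_time travels inside the member, so a single
    # ZRANGEBYSCORE plus a filter answers the "currently running" query.
    member = json.dumps(entry, sort_keys=True)
    r.zadd(f"sched:{zipcode}", {member: entry["start_time"]})

# The Lua script keeps filtering on the Redis side, so 8-20k entries are
# never shipped over the network per query. Expired entries still need a
# separate periodic cleanup (e.g. ZREMRANGEBYSCORE on old start_times).
RUNNING_LUA = """
local key, now, lookahead = KEYS[1], tonumber(ARGV[1]), tonumber(ARGV[2])
local candidates = redis.call('ZRANGEBYSCORE', key, '-inf', now + lookahead)
local result = {}
for _, member in ipairs(candidates) do
    local entry = cjson.decode(member)
    if tonumber(entry.end_time) > now then
        table.insert(result, member)
    end
end
return result
"""
running = r.register_script(RUNNING_LUA)

def currently_running(zipcode, lookahead=120):
    now = int(time.time())
    return [json.loads(m) for m in
            running(keys=[f"sched:{zipcode}"], args=[now, lookahead])]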
I solved this a bit differently a while back. Thought of coming back and adding my answer in case somebody runs into a similar design issue.
The problem was that each of the 10k zipcodes could have its own schedule, because the channel numbers can differ by zipcode, so the schedule entries for each zipcode are different. Here is what I did.
I load schedules for the next hour for all channels in the USA (about 25k channel numbers). I do this once an hour by loading the schedules from Redis into local memory.
I also store the zipcode <-> channel mapping within local memory.
When I need schedules for a particular zipcode, I get the list of channels for that zipcode and then pick out the schedule entries matching those channel numbers. Because I do this in local memory, the performance was pretty good!
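For illustration, a small sketch of that in-memory layout, shown in Python even though the original implementation presumably wasn't; the structures, field names and loader callbacks are assumptions, not the original code.

from collections import defaultdict

schedules_by_channel = defaultdict(list)   # channel_no -> next hour of entries
channels_by_zipcode = {}                   # zipcode -> set of channel_no

def refresh_hourly(load_schedules_from_redis, load_zip_channel_map):
    # Run once an hour: pull the next hour of schedules for all ~25k channels
    # out of Redis, plus the zipcode <-> channel mapping.
    schedules_by_channel.clear()
    for entry in load_schedules_from_redis():
        schedules_by_channel[entry["channel_no"]].append(entry)
    channels_by_zipcode.update(load_zip_channel_map())

def schedules_for_zipcode(zipcode):
    channels = channels_by_zipcode.get(zipcode, ())
    return [e for ch in channels for e in schedules_by_channel.get(ch, ())]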
I am trying to come up with a theoretical solution to an NxN problem for data aggregation and storage. As an example I have a huge amount of data that comes in via a stream. The stream sends the data in points. Each point has 5 dimensions:
Location
Date
Time
Name
Statistics
This data then needs to be aggregated and stored to allow another user to come along and query the data for both location and time. The user should be able to query like the following (pseudo-code):
Show me aggregated statistics for Location 1,2,3,4,....N between Dates 01/01/2011 and 01/03/2011 between times 11am and 4pm
Unfortunately, due to the scale of the data, it is not possible to aggregate all of it from the points on the fly, so the aggregation needs to be done beforehand. As you can see, though, there are multiple dimensions the data could be aggregated on.
They can query for any number of days or locations and so finding all the combinations would require huge pre-aggregation:
Record for Locations 1 Today
Record for Locations 1,2 Today
Record for Locations 1,3 Today
Record for Locations 1,2,3 Today
etc... up to N
Preprocessing all of these combinations prior to querying could result in an amount of processing that is not viable. If we have 200 different locations then we have 2^200 combinations, which would be nearly impossible to precompute in any reasonable amount of time.
I did think about creating records on 1 dimension and then merging them on the fly when requested, but this would also take time at scale.
Questions:
How should I go about choosing the right dimension and/or combination of dimensions, given that the user is just as likely to query on all of them?
Are there any case studies I could refer to, books I could read or anything else you can think of that would help?
Thank you for your time.
EDIT 1
When I say aggregating the data together I mean combining the statistics and name (dimensions 4 & 5) for the other dimensions. So for example if I request data for Locations 1,2,3,4..N then I must merge the statistics and counts of name together for those N Locations before serving it up to the user.
Similarly, if I request the data for dates 01/01/2015 - 01/12/2015 then I must aggregate all data between those periods (by summing the name counts and statistics).
Finally, if I ask for data between dates 01/01/2015 - 01/12/2015 for Locations 1,2,3,4..N then I must aggregate all data between those dates for all of those locations.
For the sake of this example, let's say that going through the statistics requires some sort of nested loop and does not scale well, especially on the fly.
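To make the EDIT concrete, here is a tiny sketch of what "aggregating" means for dimensions 4 and 5: summing per-name counts and statistics across whichever points fall in the selected locations/dates. The field names are assumptions.

from collections import Counter

def merge_points(points):
    # points: iterable of dicts like {"name": ..., "statistics": <number>}
    name_counts = Counter()
    stats_total = 0
    for p in points:
        name_counts[p["name"]] += 1
        stats_total += p["statistics"]
    return {"name_counts": dict(name_counts), "statistics": stats_total}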
Try a time-series database!
From your description it seems that your data is a time-series dataset.
The user seems to be mostly concerned with time when querying; after selecting a time frame, the user refines the results with additional conditions.
With this in mind, I suggest you try a time-series database like InfluxDB or OpenTSDB.
For example, Influx provides a query language that is capable of handling queries like the following, which comes quite close to what you are trying to achieve:
SELECT count(location) FROM events
WHERE time > '2013-08-12 22:32:01.232' AND time < '2013-08-13'
GROUP BY time(10m);
I am not sure what you mean by scale, but the time-series DBs have been designed to be fast for lots of data points.
I'd definitely suggest giving them a try before rolling your own solution!
Denormalization is a means of addressing performance or scalability in relational databases.
IMO having some new tables to hold aggregated data and using them for reporting will help you.
I have a huge amount of data that comes in via a stream. The stream sends the data in points.
There are multiple ways to achieve denormalization in this case:
Adding a new parallel endpoint for data aggregation functionality at the streaming level
Scheduling a job to aggregate data at the DBMS level.
Using the DBMS triggering mechanism (less efficient)
In an ideal scenario, when a message reaches the streaming level, two copies of the data message (containing the location, date, time, name and statistics dimensions) are dispatched for processing: one goes to OLTP (the current application logic), the second goes to an OLAP (BI) process.
The BI process will create denormalized aggregated structures for reporting.
I suggest having an aggregated data record per (location, date) group.
So the end user will query preprocessed data that won't need heavy recalculation, accepting some tolerable inaccuracy.
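As a hedged sketch of that suggestion, the OLAP side could maintain one rolled-up record per (location, date) group as messages arrive; the schema and field names are invented for illustration.

from collections import defaultdict

aggregates = defaultdict(lambda: {"count": 0, "statistics": 0})

def on_olap_message(point):
    # Called with the OLAP copy of each streamed message.
    key = (point["location"], point["date"])
    agg = aggregates[key]
    agg["count"] += 1
    agg["statistics"] += point["statistics"]

def report(locations, date_from, date_to):
    # Reporting only touches the pre-aggregated (location, date) records.
    return {key: agg for key, agg in aggregates.items()
            if key[0] in locations and date_from <= key[1] <= date_to}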
How should I go about choosing the right dimension and/or combination of dimensions, given that the user is just as likely to query on all of them?
That depends on your application logic. If possible, limit the user to predefined queries whose values the user can fill in (like dates from 01/01/2015 to 01/12/2015). In more complex systems, using a report generator on top of the BI warehouse is an option.
I'd recommend Kimball's The Data Warehouse ETL Toolkit.
You can at least reduce Date and Time to a single dimension, and pre-aggregate your data based on your minimum granularity, e.g. 1-second or 1-minute resolution. It could be useful to cache and chunk your incoming stream for the same resolution, e.g. append totals to the datastore every second instead of updating for every point.
What's the size and likelihood of change of the name and location domains? Is there any relation between them? You said that location could be as many as 200 values. I'm thinking that if name is a very small set and unlikely to change, you could hold counts of names in per-name columns in a single record, reducing the scale of the table to 1 row per location per unit of time.
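Not from the answer itself, but a minimal sketch of the pre-aggregation idea: buffer the incoming stream and append one row per (location, minute, name) instead of one row per point. The field names and the flush callback are assumptions.

from collections import defaultdict

BUCKET_SECONDS = 60
buffer = defaultdict(lambda: defaultdict(int))   # (location, minute) -> {name: count}

def ingest(point):
    # point: {"location": ..., "timestamp": <epoch seconds>, "name": ...}
    minute = point["timestamp"] - point["timestamp"] % BUCKET_SECONDS
    buffer[(point["location"], minute)][point["name"]] += 1

def flush(append_to_store):
    # Called once per bucket interval; append_to_store is a placeholder
    # for whatever datastore write you use.
    for (location, minute), name_counts in buffer.items():
        append_to_store(location, minute, dict(name_counts))
    buffer.clear()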
You have a lot of data, so any method will take a while given the volume you're trying to parse.
I have two methods to suggest.
The first is the brute-force one you probably already thought of:
id | location | date | time | name | statistics
0 | blablabl | blab | blbl | blab | blablablab
1 | blablabl | blab | blbl | blab | blablablab
etc.
With this one you can easily parse and get elements since they are all in the same table, but the parsing is slow and the table is enormous.
The second one is better, I think:
Multiple tables:
id | location
0 | blablabl
id | date
0 | blab
id | time
0 | blab
id | name
0 | blab
id | statistics
0 | blablablab
With this you could parse (a lot) faster, getting the IDs and then fetching all the needed information.
It also lets you pre-sort all the data:
You can have the locations sorted by location, the times sorted by time, the names sorted alphabetically, etc., because we don't care how the IDs are ordered:
Whether the IDs are 1 2 3 or 1 3 2 doesn't matter to anyone, and parsing goes a lot faster if your data is already sorted in its respective tables.
So, if you use the second method: the moment you receive a data point, give an ID to each of its columns.
You receive:
London 12/12/12 02:23:32 donut verygoodstatsblablabla
You attach the ID to each part and store them in their respective tables:
42 | London ==> goes with London location in the location table
42 | 12/12/12 ==> goes with 12/12/12 dates in the date table
42 | ...
With this, if you want all the London data, the entries are side by side: you just take all the IDs and fetch the other data with them. If you want all the data between 11/11/11 and 12/12/12, those entries are side by side too; again, you just take the IDs, etc.
Hope this helps.
You should check out Apache Flume and Hadoop
http://hortonworks.com/hadoop/flume/#tutorials
The Flume agent can be used to capture and aggregate the data into HDFS, and you can scale this as needed. Once it is in HDFS there are many options to visualize it, and you can use MapReduce or Elasticsearch to pull out the data sets you described in your examples.
I have worked with a point-of-sale database with a hundred thousand products and ten thousand stores (typically week-level aggregated sales, but also receipt-level stuff for basket analysis, cross sales etc.). I would suggest you have a look at these:
Amazon Redshift, highly scalable and relatively simple to get started, cost-efficient
Microsoft Columnstore Indexes, compresses data and has a familiar SQL interface, quite expensive (a 1-year reserved r3.2xlarge instance at AWS is about 37,000 USD), no experience of how it scales within a cluster
ElasticSearch is my personal favourite, highly scalable, very efficient searches via inverted indexes, nice aggregation framework, no license fees, has its own query language but simple queries are simple to express
In my experiments ElasticSearch was 20-50% faster than Microsoft's column store or clustered index tables for small and medium-size queries on the same hardware. To get fast response times you must have a sufficient amount of RAM so the necessary data structures can be held in memory.
I know I'm missing many other DB engines and platforms, but these are the ones I am most familiar with. I have also used Apache Spark, though not in a data aggregation context but for distributed mathematical model training.
Is there really likely to be a way of doing this without brute forcing it in some way?
I'm only familiar with relational databases, and I think that the only real way to tackle this is with a flat table as suggested before i.e. all your datapoints as fields in a single table. I guess that you just have to decide how to do this, and how to optimize it.
Unless you have to maintain 100% single-record accuracy, I think the question really needs to be: what can we throw away?
I think my approach would be to:
Work out what the smallest time fragment would be and quantise the time domain on that, e.g. each analysable record covers 15 minutes.
Collect raw records together into a raw table as they come in, but as the quantising window passes, summarize the rows into the analytical table (for the 15 minute window).
Deletion of old raw records can be done by a less time-sensitive routine.
Location looks like a restricted set, so use a table to convert these to integers.
Index all the columns in the summary table.
Run queries.
Obviously I'm betting that quantising the time domain in this way is acceptable. You could supply interactive drill-down by querying back onto the raw data by time domain too, but that would still be slow.
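A very rough sketch of that quantise-and-summarise flow, with in-memory dicts standing in for the raw and analytical tables; every name here is invented.

from collections import defaultdict

WINDOW = 15 * 60                  # quantise the time domain to 15 minutes
location_ids = {}                 # location -> small integer lookup
raw_rows = []                     # (timestamp, location_id, name, statistics)
summary = defaultdict(lambda: {"count": 0, "statistics": 0})

def location_id(location):
    return location_ids.setdefault(location, len(location_ids))

def collect(timestamp, location, name, statistics):
    raw_rows.append((timestamp, location_id(location), name, statistics))

def summarise(window_start):
    # Run once the quantising window has passed; rows inside the window are
    # rolled into the summary table and then dropped from the raw table.
    remaining = []
    for ts, loc, name, stats in raw_rows:
        if window_start <= ts < window_start + WINDOW:
            key = (loc, window_start, name)
            summary[key]["count"] += 1
            summary[key]["statistics"] += stats
        else:
            remaining.append((ts, loc, name, stats))
    raw_rows[:] = remaining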
Hope this helps.
Mark
TL;DR
I have a table with about 2 million WRITEs over the month and 0 READs. Every 1st day of a month, I need to read all the rows written on the previous month and generate CSVs + statistics.
How to work with DynamoDB in this scenario? How to choose the READ throughput capacity?
Long description
I have an application that logs client requests. It has about 200 clients. The clients need to receive on every 1st day of a month a CSV with all the requests they've made. They also need to be billed, and for that we need to calculate some stats with the requests they've made, grouping by type of request.
So at the end of the month, a client receives a report with their CSV and their request counts grouped by type.
I've already come up with two solutions, but I'm still not convinced by either of them.
1st solution: ok, every last day of the month I increase the READ throughput capacity and then I run a map reduce job. When the job is done, I decrease the capacity back to the original value.
Cons: not fully automated, risk of the DynamoDB capacity not being available when the job starts.
2nd solution: I can break the generation of CSVs + statistics to small jobs in a daily or hourly routine. I could store partial CSVs on S3 and on every 1st day of a month I could join those files and generate a new one. The statistics would be much easier to generate, just some calculations derived from the daily/hourly statistics.
Cons: I feel like I'm turning something simple into something complex.
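For concreteness, a hedged sketch of what one of the daily jobs in the 2nd solution might look like with boto3; the table name, key schema, bucket and CSV columns are all assumptions, since the real layout isn't given in the question.

import csv
import io

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

def export_day(client_id, day):            # day like "2015-01-07"
    table = dynamodb.Table("client_requests")
    resp = table.query(
        KeyConditionExpression=Key("client_id").eq(client_id)
        & Key("timestamp").begins_with(day)
    )                                      # pagination via LastEvaluatedKey omitted
    buf = io.StringIO()
    writer = csv.writer(buf)
    for item in resp["Items"]:
        writer.writerow([item["timestamp"], item["request_type"]])
    s3.put_object(Bucket="monthly-reports",
                  Key=f"partial/{client_id}/{day}.csv",
                  Body=buf.getvalue())

On the 1st of the month, the partial objects under partial/<client_id>/ can be concatenated into the final CSV, and the per-type statistics derived by summing the daily counts.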
Do you have a better solution? If not, what solution would you choose? Why?
Having been in a similar place myself before, here is what I did and now recommend to you: process the raw data:
as often as you reasonably can (start with daily)
to a format as close as possible to the desired report output
with as much calculation/CPU intensive work done as possible
leaving as little to do at report time as possible.
This approach is entirely scalable - the incremental frequency can be:
reduced to as small a window as needed
parallelised if required
It also makes it possible to re-run past months' reports on demand, as the report generation time should be quite small.
In my example, I shipped denormalized, pre-processed (financial calculations) data every hour to a data warehouse, then reporting just involved a very basic (and fast) SQL query.
This had the additional benefit of spreading the load on the production database server across lots of small bites, instead of bringing it to its knees once a week at invoice time (30,000 invoices produced every week).
I would use the Kinesis service to produce daily, almost real-time billing.
For this purpose I would create a special DynamoDB table just for the calculated data.
(Another option is to run it on flat files.)
Then I would add a process that sends events to the Kinesis service just after you update the regular DynamoDB table.
That way, when you reach the end of the month, you can just run whatever post-billing calculations you have and create your CSV files from the already-calculated table.
I hope that helps.
Take a look at Dynamic DynamoDB. It will increase/decrease the throughput when you need it without any manual intervention. The good news is you will not need to change the way the export job is done.
Currently I have a project (written in Java) that reads sensor output from a microcontroller and writes it across several Postgres tables every second using Hibernate. In total I write about 130 columns' worth of data every second. Once the data is written it will stay static forever. This system seems to perform fine under the current conditions.
My question is regarding the best way to query and average this data in the future. There are several approaches I think would be viable but am looking for input as to which one would scale and perform best.
Since we gather and write data every second, we end up generating more than 2.5 million rows per month. We currently plot this data via a JDBC select statement writing to a JChart2D (i.e. SELECT pressure, temperature, speed FROM data WHERE time_stamp BETWEEN startTime AND endTime). The user must be careful not to specify too long a time period (startTime and endTime delta < 1 day) or else they will have to wait several minutes (or longer) for the query to run.
The future goal would be to have a user interface similar to the Google Visualization API that powers Google Finance with regard to time scaling, i.e. the longer the time period, the "smoother" (or more averaged) the data becomes.
Options I have considered are as follows:
Option A: Use the SQL avg function to return the averaged data points to the user. I think this option would get expensive if the user asks to see the data for, say, half a year. I imagine the interface in this scenario would scale the number of rows to average based on the user request, i.e. if the user asks for a month of data the interface will request an avg of every 86400 rows, which would return ~30 data points, whereas if the user asks for a day of data the interface will request an avg of every 2880 rows, which will also return 30 data points but at finer granularity.
Option B: Use SQL to return all of the rows in a time interval and use the Java interface to average out the data. I have briefly tested this for kicks and I know it is expensive because I'm returning 86400 rows/day of interval time requested. I don't think this is a viable option unless there's something I'm not considering when performing the SQL select.
Option C: Since all this data is static once it is written, I have considered using the Java program (with Hibernate) to also write tables of averages along with the data it is currently writing. In this option, I have several java classes that "accumulate" data then average it and write it to a table at a specified interval (5 seconds, 30 seconds, 1 minute, 1 hour, 6 hours and so on). The future user interface plotting program would take the interval of time specified by the user and determine which table of averages to query. This option seems like it would create a lot of redundancy and take a lot more storage space but (in my mind) would yield the best performance?
Option D: Suggestions from the more experienced community?
Option A won't tend to scale very well once you have large quantities of data to pass over; Option B will probably tend to start relatively slow compared to A and scale even more poorly. Option C is a technique generally referred to as "materialized views", and you might want to implement this one way or another for best performance and scalability. While PostgreSQL doesn't yet support declarative materialized views (but I'm working on that this year, personally), there are ways to get there through triggers and/or scheduled jobs.
To keep the inserts fast, you probably don't want to try to maintain any views off of triggers on the primary table. What you might want to do is periodically summarize detail into summary tables from crontab jobs (or similar). You might also want to create views that show summary data by using the summary tables which have been created, combined with the detail table where summary rows don't exist yet.
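One way the crontab summarization step could look, sketched in Python with psycopg2 even though the project itself is Java; the summary table, column names and connection string are assumptions based on the columns mentioned in the question.

import psycopg2

SUMMARY_SQL = """
INSERT INTO data_hourly (hour, pressure_avg, temperature_avg, speed_avg)
SELECT date_trunc('hour', time_stamp),
       avg(pressure), avg(temperature), avg(speed)
FROM data
WHERE time_stamp >= %s AND time_stamp < %s
GROUP BY 1
"""

def summarize(dsn, window_start, window_end):
    # Run from cron for each completed hour (or other interval).
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(SUMMARY_SQL, (window_start, window_end))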
The materialized view approach would probably work better for you if you partition your raw data by date range. That's probably a really good idea anyway.
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
I was asked by the interviewer to design a system to store gigabytes of data and the system also has to support some kind of query.
Description:
A massive number of records are generated in an IDC; each record is composed of a url, the IP which visited the url, and the time when the visit occurred. The record can probably be stated as a struct like this, but I'm not sure which data types I should pick to represent them:
struct Record {
url; //char *
IP; //int?
visit_time; //time_t or simply a number?
}
Requirements:
Design a system to store 100 billion records, where the system has to support at least 2 kinds of query:
First, given a time period (t1, t2) and an IP, query how many urls this IP has visited in the given period.
Second, given a time period (t1, t2) and a url, query how many times this url has been visited.
I was stumped, and here is my stupid solution:
Analysis:
Because every query is performed over a given period of time:
1. Create a set, put all visit times into the set, and keep the set ordered by time value from oldest to latest.
2. Create a hash table using hash(visit_time) as the key; this hash table is called the time-hash-table, and each node in a bucket has 2 pointers pointing to 2 further hash tables respectively.
3. The other 2 hash tables would be an ip-hash-table and a url-hash-table.
The ip-hash-table uses hash(ip) as the key, and all the ips in the same ip-hash-table have the same visit-time;
the url-hash-table uses hash(url) as the key, and all the urls in the same url-hash-table have the same visit-time.
Give a drawing as follows:
time_hastbl
[]
[]
[]-->[visit_time_i]-->[visit_time_j]...[visit_time_p]-->NIL
[]        |                 |
[]    ip_hastbl        url_hastbl
[]       []                []
         :                  :
         []                []
         []                []
So, when doing the query upon (t1, t2):
find the closest match in the time set; let's say the match is (t1', t2'), then all the valid visit times fall into the part of the set from t1' to t2';
for each visit-time t in the time set[t1':t2'], do hash(t) and find t's ip_hastbl or url_hastbl, then count and log how many times the given ip or url appears.
Questions:
1. My solution is stupid; I hope you can give me another solution.
2. With respect to how to store the massive number of records on disk, any advice? I thought of a B-tree, but how would I use it, and is a B-tree even applicable in this system?
I believe the interviewer was expecting a distributed-computing based solution, especially when "100 billion records" are involved. With the limited knowledge of distributed computing I have, I would suggest you look into Distributed Hash Tables and map-reduce (for parallel query processing).
In my opinion, create a B+ tree using time as the key to help you quickly locate the range of records for a given time period (t1, t2) on disk. Then use the records in (t1, t2) to build the IP and URL hash tables respectively.
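A toy, in-memory sketch of that shape (mine, not the answerer's): a time-ordered structure stands in for the on-disk B+ tree keyed by time, and the hash tables are built only over the (t1, t2) slice. IPs and URLs are plain strings here.

import bisect

records = []   # kept sorted by visit_time: (visit_time, ip, url)

def add_record(visit_time, ip, url):
    bisect.insort(records, (visit_time, ip, url))   # O(n) here; a real B+ tree is not

def _slice(t1, t2):
    lo = bisect.bisect_left(records, (t1, "", ""))
    hi = bisect.bisect_right(records, (t2, "\uffff", "\uffff"))
    return records[lo:hi]

def urls_visited_by_ip(t1, t2, ip):
    # Query 1: how many distinct urls this IP visited in (t1, t2).
    return len({url for _, r_ip, url in _slice(t1, t2) if r_ip == ip})

def url_visit_count(t1, t2, url):
    # Query 2: how many times this url was visited in (t1, t2).
    return sum(1 for _, _, r_url in _slice(t1, t2) if r_url == url)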
Old question, but recently bumped so here's a few other things to think about:
What you need to consider is a few very simple boundary limits beyond your listed requirements, assuming you have no further indexes:
First, given a time period (t1, t2) and an IP, query how many urls this IP has visited in the given period.
If you have 10k users then you can expect that, at worst, a scan of all records in a time window would return only about 1 in 10k of the records accessed (on average).
Second, given a time period (t1, t2) and a url, query how many times this url has been visited.
Depending on how many urls you have in the system say 1000, then this again means that a simple scan results in 999 of 1000 records scanned not being returned.
Let's say you have only 100,000 unique urls; you could greatly reduce the space consumed by the database (by using a guid/int foreign key instead), and it also means the average url is accessed 1M times across your 100Bn records.
Even with all this, it still tells us nothing definitive, because we don't have numbers/statistics on how clustered by time the records are for the given search times. Are we getting 1,000 page requests every second and searching a 12-month time range, or are we getting 100 requests per second and searching a 1-hour time block (360k requests)?
Assuming the 100Bn represents 12 months of data that's 3170 requests per second. Does that sound reasonable?
Why is this important? Because it highlights one key thing you overlooked in your answer.
With 100Bn records covering the past 12 months, in another 12 months' time you'll have 200Bn records to deal with. If the 100Bn records span 20 years then it's not such an issue, since you can expect to grow by only another 25-30Bn in the next 5 years... but it's unlikely that your existing data covers such a long time frame.
Your solution only answers one side of the equation (reading data); you don't consider any complications with writing that much data. The vast majority of the time you will be inserting data into whatever data store you create; will it be able to handle a constant 3k insert requests per second?
If you insert 3k records per second and each record is just 3x 64-bit integers representing the time (in ticks), the IP address and a foreign key to the url, then that is only ~75 KB/s of writes, which is easy to sustain. If every URL is assumed unique, however, you could easily run into performance issues due to IO speed (ignoring the space requirements).
One other thing the interviewer would be interested in seeing is your thoughts on supporting IPv6.
Lastly, if you provided a solution like yours, then the interviewer should have asked a follow-up question: "How would your system perform if I now want to know when a specific IP address last accessed a specific URL?"
So yes, if you don't know about MapReduce and other distributed processing query systems then yours should be a reasonable answer.
I would use an interval tree that is also a B-tree: an interval tree because all the queries take a time interval as input, and a B-tree because of the size of the input (billions).
I need some inspiration for a solution...
We are running an online game with around 80,000 active users. We are hoping to expand this and are therefore setting a target of up to 500,000 users.
The game includes a highscore for all the users, which is based on a large set of data. This data needs to be processed in code to calculate the values for each user.
After the values are calculated we need to rank the users, and write the data to a highscore table.
My problem is that in order to generate a highscore for 500,000 users we need to load data from the database on the order of 25-30 million rows, totalling around 1.5-2 GB of raw data. Also, in order to rank the values we need to have the total set of values.
Also we need to generate the highscore as often as possible - preferably every 30 minutes.
Now we could just use brute force - load the 30 million records every 30 minutes, calculate the values, rank them and write them to the database - but I'm worried about the strain this will cause on the database, the application server and the network, and whether it's even possible.
I'm thinking the solution to this might be to break up the problem somehow, but I can't see how. So I'm seeking some inspiration on possible alternative solutions based on this information:
We need a complete highscore of all ~500,000 teams - we can't (won't, unless absolutely necessary) shard it.
I'm assuming that there is no way to rank users without having a list of all users values.
Calculating the value for each team has to be done in code - we can't do it in SQL alone.
Our current method loads each user's data individually (3 calls to the database) to calculate the value - it takes around 20 minutes to load the data and generate the highscore for 25,000 users, which is too slow if this should scale to 500,000.
I'm assuming that hardware size will not be an issue (within reasonable limits).
We are already using memcached to store and retrieve cached data
Any suggestions, links to good articles about similar issues are welcome.
Interesting problem. In my experience, batch processes should only be used as a last resort. You are usually better off having your software calculate values as it inserts/updates the database with the new data. For your scenario, this would mean that it should run the score calculation code every time it inserts or updates any of the data that goes into calculating the team's score. Store the calculated value in the DB with the team's record. Put an index on the calculated value field. You can then ask the database to sort on that field and it will be relatively fast. Even with millions of records, it should be able to return the top n records in O(n) time or better. I don't think you'll even need a high scores table at all, since the query will be fast enough (unless you have some other need for the high scores table other than as a cache). This solution also gives you real-time results.
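A hedged sketch of the calculate-on-write idea; sqlite3 is used only to keep the example self-contained, and the schema plus the score function are invented stand-ins for the game's real logic.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT, score REAL)")
conn.execute("CREATE INDEX idx_teams_score ON teams (score)")

def calculate_score(raw_rows):
    # Placeholder for the game-specific calculation that must happen in code.
    return sum(raw_rows)

def upsert_score(team_id, name, raw_rows):
    # Run whenever data affecting this team's score is inserted/updated.
    conn.execute(
        "INSERT INTO teams (id, name, score) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET score = excluded.score",
        (team_id, name, calculate_score(raw_rows)),
    )

def top_n(n=100):
    # The indexed score column lets the database do the ranking.
    return conn.execute(
        "SELECT id, name, score FROM teams ORDER BY score DESC LIMIT ?", (n,)
    ).fetchall()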
Assuming that most of your 2GB of data is not changing that frequently you can calculate and cache (in db or elsewhere) the totals each day and then just add the difference based on new records provided since the last calculation.
In postgresql you could cluster the table on the column that represents when the record was inserted and create an index on that column. You can then make calculations on recent data without having to scan the entire table.
First and foremost:
The computation has to take place somewhere.
User experience impact should be as low as possible.
One possible solution is:
Replicate (mirror) the database in real time.
Pull the data from the mirrored DB.
Do the analysis on the mirror or on a third, dedicated, machine.
Push the results to the main database.
Results are still going to take a while, but at least performance won't be impacted as much.
How about saving those scores in a database, and then simply querying the database for the top scores (so that the computation is done on the server side, not on the client side, and thus there is no need to move the millions of records)?
It sounds pretty straightforward... unless I'm missing your point... let me know.
Calculate and store the score of each active team on a rolling basis. Once you've stored the score, you should be able to do the sorting/ordering/retrieval in the SQL. Why is this not an option?
It might prove fruitless, but I'd at least take a gander at the way sorting is done on a lower level and see if you can't manage to get some inspiration from it. You might be able to grab more manageable amounts of data for processing at a time.
Have you run tests to see whether or not your concerns with the data size are valid? On a mid-range server throwing around 2GB isn't too difficult if the software is optimized for it.
Seems to me this is clearly a job for caching, because you should be able to keep the half-million score records semi-local, if not in RAM. Every time you update data in the big DB, make the corresponding adjustment to the local score record.
Sorting the local score records should be trivial. (They are nearly in order to begin with.)
If you only need to know the top 100-or-so scores, then the sorting is even easier. All you have to do is scan the list and insertion-sort each element into a 100-element list. If the element is lower than the lowest element already in that list, which it is 99.98% of the time, you don't have to do anything.
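A quick sketch of that pass using a bounded min-heap, which does the same job as the insertion-sort described above; the 99.98% skip shows up as the comparison against the heap's smallest element.

import heapq

def top_scores(scores, n=100):
    heap = []                       # min-heap holding at most the n best scores
    for s in scores:
        if len(heap) < n:
            heapq.heappush(heap, s)
        elif s > heap[0]:           # almost every score fails this test and is skipped
            heapq.heapreplace(heap, s)
    return sorted(heap, reverse=True)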
Then run a big update from the whole DB once every day or so, just to eliminate any creeping inconsistencies.