design a system supporting massive data storage and query - c

I was asked in an interview to design a system that stores a massive amount of data; the system also has to support certain kinds of queries.
Description:
A massive number of records are generated in an IDC. Each record is composed of a URL, the IP that visited the URL, and the time when the visit occurred. A record could probably be declared as a struct like this, but I'm not sure which data types I should pick to represent the fields:
#include <stdint.h>
#include <time.h>

struct Record {
    char    *url;        /* or an integer ID into a URL dictionary */
    uint32_t ip;         /* an IPv4 address fits in 32 bits (IPv6 would need 128) */
    time_t   visit_time; /* or simply an integer timestamp */
};
Requirements:
Design a system to store 100 billion records; the system also has to support at least two kinds of query:
First, given a time period (t1, t2) and an IP, query how many URLs this IP visited in that period.
Second, given a time period (t1, t2) and a URL, query how many times this URL was visited.
I was stumped; here is my (admittedly naive) solution:
Analysis:
Because every query is performed over a given period of time:
1. Create a set, put every visit time into it, and keep the set ordered from oldest to newest.
2. Create a hash table keyed by hash(visit_time); call it the time-hash-table. Each node in a given bucket holds two pointers to two further hash tables.
3. Those two hash tables are an ip-hash-table and a url-hash-table:
the ip-hash-table is keyed by hash(ip), and all the IPs in one ip-hash-table share the same visit time;
the url-hash-table is keyed by hash(url), and all the URLs in one url-hash-table share the same visit time.
Here is a drawing:
time_hashtbl
[]
[]
[]-->[visit_time_i]-->[visit_time_j]...[visit_time_p]-->NIL
[]         |                |
[]    ip_hashtbl       url_hashtbl
[]        []               []
[]         :                :
[]        []               []
[]        []               []
So, when answering a query over (t1, t2):
find the closest matches in the time set, say (t1', t2'); all valid visit times then fall into the part of the set from t1' to t2';
for each visit time t in the time set [t1':t2'], compute hash(t) and find t's ip_hashtbl or url_hashtbl, then count how many times the given IP or URL appears.
Questions:
1. My solution is naive; I hope you can suggest a better one.
2. Regarding how to store the massive number of records on disk, any advice? I thought of a B-tree, but how would I use it, and is a B-tree even applicable to this system?

I believe the interviewer was expecting a distributed-computing-based solution, especially when "100 billion records" are involved. With the limited knowledge of distributed computing I have, I would suggest you look into distributed hash tables and MapReduce (for parallel query processing).

In my opinion, you should create a B+ tree keyed on time so you can quickly locate the range of records for a given time period (t1, t2) on disk. Then use the records in (t1, t2) to build the IP and URL hash tables respectively.
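To make the read path concrete, here is a minimal in-memory sketch of that idea. The sample records, tuple layout and function names are illustrative assumptions, and the sorted list merely stands in for the time-keyed B+ tree:

import bisect

# Toy stand-in for the leaf level of a B+ tree keyed on visit_time;
# each record is (visit_time, ip, url).
records = [
    (1000, "1.2.3.4", "/a"),
    (1005, "1.2.3.4", "/b"),
    (1010, "5.6.7.8", "/a"),
    (1020, "1.2.3.4", "/a"),
]
times = [t for t, _, _ in records]

def in_period(t1, t2):
    # Range lookup: all records with t1 <= visit_time <= t2.
    return records[bisect.bisect_left(times, t1):bisect.bisect_right(times, t2)]

def distinct_urls_for_ip(t1, t2, ip):
    # Query 1: how many distinct URLs this IP visited in (t1, t2).
    return len({url for _, rec_ip, url in in_period(t1, t2) if rec_ip == ip})

def visits_for_url(t1, t2, url):
    # Query 2: how many times this URL was visited in (t1, t2).
    return sum(1 for _, _, rec_url in in_period(t1, t2) if rec_url == url)

print(distinct_urls_for_ip(1000, 1015, "1.2.3.4"))  # 2
print(visits_for_url(1000, 1025, "/a"))             # 3

On disk, the same counting step would run over the B+ tree leaves for the requested time range instead of a Python list.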

Old question, but recently bumped, so here are a few other things to think about:
What you need to consider are a few very simple boundary limits beyond your listed requirements, assuming you have no further indexes:
First, given a time period (t1, t2) and an IP, query how many URLs this IP visited in that period.
If you have 10k users, then at worst a scan of all records in a time window will, on average, need to return only about 1 in every 10,000 of the records it touches.
Second, given a time period (t1, t2) and a URL, query how many times this URL was visited.
Depending on how many URLs you have in the system, say 1,000, this again means a simple scan returns only 1 in every 1,000 records scanned.
Let's say you have only 100,000 unique URLs: you could greatly reduce the space the database consumes (by storing a GUID/int foreign key instead of the URL string), and it also means the average URL is accessed 1M times across your 100Bn records.
Even with all this it still tells us nothing on its own, because we have no numbers or statistics on how clustered in time the records are for the given search windows. Are we getting 1,000 page requests every second and searching a 12-month range, or are we getting 100 requests per second and searching a 1-hour block (360k requests)?
Assuming the 100Bn records represent 12 months of data, that's roughly 3,170 requests per second. Does that sound reasonable?
Why is this important? Because it highlights one key thing you overlooked in your answer.
With 100Bn records covering the past 12 months, in another 12 months' time you'll have 200Bn records to deal with. If the 100Bn records span 20 years then it's not such an issue: you'd expect to grow by only another 25-30Bn in the next 5 years... but it's unlikely that your existing data covers such a long time frame.
Your solution only answers one side of the equation (reading data); you don't consider any of the complications of writing that much data. The vast majority of the time you will be inserting data into whatever data store you create, so will it be able to handle a constant 3k insert requests per second?
If you insert 3k records per second and each record is just three 64-bit integers representing the time (in ticks), the IP address and a foreign key to the URL, then that is only ~75 KB/s of writes, which is easy to sustain. If every URL is assumed to be unique, however, you could easily run into performance problems due to IO speed (never mind the space requirements).
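As a rough back-of-the-envelope check of those numbers (assuming, as above, that the time in ticks, the IP and the URL foreign key are each stored as a 64-bit integer):

SECONDS_PER_YEAR = 365 * 24 * 3600        # 31,536,000
TOTAL_RECORDS = 100_000_000_000           # 100Bn over 12 months

inserts_per_second = TOTAL_RECORDS / SECONDS_PER_YEAR   # ~3,171 records/s
bytes_per_record = 3 * 8                                # three 64-bit integers
write_rate_kb_s = inserts_per_second * bytes_per_record / 1024

print(round(inserts_per_second), round(write_rate_kb_s, 1))  # 3171 74.3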
One other thing the interviewer would be interested in seeing is your thoughts on supporting IPv6.
Lastly, if you provided a solution like yours, the interviewer should have asked a follow-up question: "How would your system perform if I now want to know when a specific IP address last accessed a specific URL?"
So yes, if you don't know about MapReduce and other distributed query-processing systems, then yours should be a reasonable answer.

I would use an interval tree that is also a B-tree: an interval tree because every query takes a time interval as input, and a B-tree because of the size of the input (billions of records).

Related

Choosing proper database in AWS when all items must be read from the table

I have an AWS application where DynamoDB is used for most data storage and it works well for most cases. I would like to ask you about one particular case where I feel DynamoDB might not be the best option.
There is a simple table with customers. Each customer can collect virtual coins, so each customer has a balance attribute. The balance is managed by a 3rd-party service that keeps the up-to-date value, and the balance attribute in my table is just a cached copy of it. The 3rd-party service requires its own ID for the customer as input, so the customers table also contains an externalId attribute, which is used to query the balance.
I need to run the following process once per day:
Update the balance attribute for all customers in the database.
Find all customers with a balance greater than some specified constant value, sorted by balance.
Perform some processing for all of those customers; the processing must be done in order, starting from the customer with the greatest balance and descending by balance.
Question: which database is the most suitable one for this use case?
My analysis:
In terms of cost it looks quite similar, i.e. paying for capacity units in the case of DynamoDB vs paying for hours of micro instances in the case of RDS. I'm not sure whether a micro RDS instance is enough for this purpose; I'm going to check, but I guess it should be.
In terms of performance, I'm not sure. It's something I will need to check, but I wanted to ask here beforehand. Some analysis from my side:
It involves two scan operations in the case of DynamoDB, which is something I really don't want. The first scan can be limited to the externalId attribute; the balances are then queried from the 3rd-party service and updated in the table. The second scan requires a range key defined on the balance attribute so it can return customers sorted by balance.
I'm not convinced that any kind of index helps here. There won't be many reads of the balance; occasionally it will be queried for a single customer by primary key. The number of reads won't be much greater than the number of writes, so indexes may just slow the process down.
Additional assumptions in case they matter:
There are ca. 500,000 customers in the database and the average size of a single customer item is 200 bytes, so the total size of the customer data is about 100 MB.
I need to repeat step 1 of the above procedure (updating the balance of all customers) several times during the day (ca. 20-30 times per day), but the sorted retrieval is needed only once per day.
There is only one application (and one instance of the application) performing the above procedure. Besides that, I need to handle simple CRUD which can read/update other attributes of the customers.
I think people are overly afraid of DynamoDB scan operations. They're bad if used for regular queries but for once-in-a-while bulk operations they're not so bad.
How much does it cost to scan a 100 MB table? That's 25,000 4 KB blocks, or 12,500 read units if you use eventually consistent reads (they are half price). If we assume a cost of $0.25 per million read request units (On-Demand mode), that's 12,500 / 1,000,000 * $0.25 = roughly $0.003 per full table scan. Want to do it 30 times per day? That costs you less than a dime a day.
The thing to consider is the cost of updating every item in the database. That's 500,000 write units, which in On-Demand mode at $1.25 per million comes to about $0.63 per full table update.
If you can go Provisioned for that duration it'll be cheaper.
Regarding performance, DynamoDB can scan a full table faster than any server-oriented database, because it's supported by potentially thousands of back-end servers operating in parallel. For example, you can do a parallel scan with up to a million segments, each with a client thread reading data in 1 MB chunks. If you write a single-threaded client doing a scan it won't be as fast. It's definitely possible to scan slowly, but it's also possible to scan at speeds that seem ludicrous.
If your table is 100 MB, was created in On Demand mode, has never hit a high water mark to auto-increase capacity (just the starter capacity), and you use a multi-threaded pull with 4+ segments, I predict you'll be done in low single digit seconds.
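For reference, a parallel scan along those lines might look like this boto3 sketch; the table name ("customers"), the segment count of 8, and running the segments from a simple thread pool are assumptions for illustration only:

import boto3
from concurrent.futures import ThreadPoolExecutor

TABLE_NAME = "customers"   # hypothetical table name
SEGMENTS = 8               # 4+ segments, per the advice above

table = boto3.resource("dynamodb").Table(TABLE_NAME)

def scan_segment(segment):
    # Read one segment of the parallel scan, following pagination.
    items, kwargs = [], {"Segment": segment, "TotalSegments": SEGMENTS}
    while True:
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            return items
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

with ThreadPoolExecutor(max_workers=SEGMENTS) as pool:
    all_items = [item for chunk in pool.map(scan_segment, range(SEGMENTS)) for item in chunk]
print(len(all_items))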

What is HyperLogLog and what is it good for?

I was studying the data structures supported by Redis and I was not able to find an explanation that made me understand what HyperLogLog is.
How do I use it, and what is it good for?
Basically, it is a kind of Redis Set that uses an optimized algorithm to count elements while avoiding heavy memory consumption. The difference between a Set and a HyperLogLog is that with a HyperLogLog you can only add elements, count the unique elements, and merge several HyperLogLogs into another one. You don't store the members themselves, as you would in a SET, and you can't retrieve them; you only keep track of how many distinct members have been seen. That is why HyperLogLog doesn't provide a command to retrieve its stored members.
A clear use case is when you have a huge SET-like collection and you repeatedly need to count the number of unique entries in it: you don't care which elements are inside, you only care about keeping memory consumption low even as the collection grows a lot. For instance, imagine you run a high-traffic system with a large number of very active users and you want to know the number of unique visitors of every webpage. You want it in near real time, so every second you query the unique-visitor count for every page. You could create a HyperLogLog for every URI in your system, representing the webpage, and every time a user visits a URL you PFADD the user_id:
PFADD /api/show/concerts id789989
then every second you iterate over every URL HyperLogLog to get the number of unique visitors:
PFCOUNT /api/show/concerts
145542
PFCOUNT /api/show/open-airs
25565223
You might say: yes, but I can get the same functionality with a SET, with the added benefit of having the user_ids as members of every set. True, but sets consume far more memory, and every second you would be hitting every set with SCARD just to get the number of unique visitors. So unless you actually need to store the user_ids for some other reason, HyperLogLogs are the better option as counters of unique elements. For our use case, imagine having 200-300 sets with around 20-30k users in each.
The correspondence between HyperLogLog and Set commands:
PFADD = SADD
PFCOUNT = SCARD
PFMERGE = SUNION
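In redis-py the same flow looks roughly like this; the key naming scheme and the IDs are purely illustrative, and a local Redis instance is assumed:

import redis

r = redis.Redis()  # assumes Redis running on localhost:6379

# One HyperLogLog per page; record each visit (duplicates don't inflate the count).
r.pfadd("hll:/api/show/concerts", "id789989")
r.pfadd("hll:/api/show/concerts", "id123456")
r.pfadd("hll:/api/show/concerts", "id789989")   # same visitor again

# Approximate number of unique visitors for that page.
print(r.pfcount("hll:/api/show/concerts"))      # 2

# Combined unique visitors across several pages.
r.pfmerge("hll:all-shows", "hll:/api/show/concerts", "hll:/api/show/open-airs")
print(r.pfcount("hll:all-shows"))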
I do not think it is really a data type; it is an algorithm, but Redis exposes it as a type.
It is a fairly complex algorithm that looks at the string, does some parsing on it, does some very involved math, and in a sense remembers that string without actually storing it.
It has nothing to do with logging (I initially thought it did). It is used whenever we want to keep track of the number of unique elements in a collection, and specifically the approximate number.
It is similar to a set, but it does not store the elements.
It runs in O(1) constant time and uses a very small amount of memory: at most 12 KB per key.
The HyperLogLog algorithm is probabilistic, which means it does not guarantee 100 percent accuracy, because it does not actually store the individual items. The Redis implementation has a standard error of 0.81 percent, so if it reports 1000 views the real count is typically somewhere around 992-1008. That error is fine for counting views, but if you need to track exact unique usernames or emails, you should store them in sets.
Here are a few examples of where HyperLogLogs can be used:
• Counting the number of unique users who visited a website
• Counting the number of distinct terms that were searched for on your website on a specific date or time
• Counting the number of distinct hashtags that were used by a user
• Counting the number of distinct words that appear in a book

Time to retrieve a single record via a SQL Server index in a large table

Short version of the question:
If you have a table with a large number of small rows and you want to retrieve a single record from it via an index (probably consisting of two columns), is this likely to be low cost and fast, or high cost and slow?
Longer version of question and background:
I am a consultant working with a software development company and I have an argument with them about the performance implications of a piece of functionality that I want to add to the application they are building (and I am designing).
At the moment, we write out a log record every time somebody retrieves a client record. I want to put the name and time of the last person to have previously accessed that record onto the client page each time the record is retrieved.
They are saying that the performance cost of this will be high, but based on my reasonable (though not expert) knowledge of how B-trees work, this doesn't seem right, even if the table is very large.
If you create an index on the GUID of the client record and the date/time of access (descending), then you ought to be able to retrieve the required record via an index seek that just finds the first entry for that GUID and stops? And with a B-tree index, most of the upper levels of the index would be cached, so the number of physical disk accesses needed would be very small and the query time therefore well under 1 s.
Or have I got this completely wrong?
You will have some GUID index fragmentation, but because your rows do not increase in size (as you said in the comments) you will not have page-splitting problems. The random-insert issue is fixable by periodically reorganizing and rebuilding the index.
Besides that, there is nothing wrong with your approach. If the table is larger than RAM you will likely have a single disk IO per access (the intermediate index levels will be cached). If your data fits in RAM you will pay about 0.2 to 0.5ms per query. If your data is on a magnetic disk a seek will likely require 8-12ms. On an SSD you are back to 0.2ms to 0.5ms (maybe 0.05ms more).
Why don't you just create some test data (by selecting a cross product from sys.objects to get 1M rows) and measure it? It takes little time and you will know for sure.
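If it helps, here is a quick sketch of the index and the lookup the question describes, driven through pyodbc; the connection string, table and column names are hypothetical placeholders, not the real schema:

import pyodbc

# Hypothetical connection string; point it at your own test database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=TestDb;Trusted_Connection=yes")
cur = conn.cursor()

# Index on (client GUID, access time descending) so the newest access for a
# client is the first entry the seek finds.
cur.execute("""
    CREATE INDEX IX_AccessLog_Client_Time
    ON dbo.AccessLog (ClientId, AccessedAt DESC)
    INCLUDE (AccessedBy)
""")
conn.commit()

# Last person to access a given client record: one index seek, one row.
client_id = "9D2B0228-4D0D-4C23-8B49-01A698857709"   # example GUID
cur.execute("""
    SELECT TOP (1) AccessedBy, AccessedAt
    FROM dbo.AccessLog
    WHERE ClientId = ?
    ORDER BY AccessedAt DESC
""", client_id)
print(cur.fetchone())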
It should be low cost and fast since the columns are indexed; an index lookup like that is O(log n), I think.
You say last person to access? You mean that for every read you will have a write?
And that write is going to change an indexed date time column?
Then I would be worried too.
Writing on every record read will cause lots of extra disk writes. This will block reads, and it may hurt your caching too. You also need to update the index constantly, and since you change the indexed data, the index will become very fragmented.
It depends.
A single retrieval will be low cost and fast
on a decently indexed table
running on decent hardware
over a decent network
On the other hand, it takes time nonetheless.
If we are talking about one retrieval per hour, don't sweat it. If we are talking about thousands of retrievals per second (as opposed to the current none), it will start to add up to the point where it is noticeable.
Some questions you need to address:
Is my hardware up to spec
Does adding two fields result in a page split (unlikely)
How many extra pages need to be read for your regular result sets
How many retrievals/sec will be made
How many inserts/sec (triggering an index update) will be made
After you've addressed these questions, you should be able to make the determination yourself. As far as my gut feeling goes, I would be surprised if you noticed the performance difference.

Even partitioning of nonuniform ranged data in cassandra

I've got a rather tricky one; bear with me as I try not to stumble over my words here. I'm doing some research, and my group is transitioning to a Cassandra database. Our research used MySQL before, but the data outgrew the database (192 million rows held in memory at 16 GB -- it was the only way to query the data fast enough). The data itself is kinda-sorta static: there's a whole lot of it, but new data is a fairly slow trickle at this point.
The data consists of a boatload of classifier-score pairs. We formulate queries for the database which basically say, "give me the top 500 for the following classifiers". The database then returns that many scores. For example, if we ask for the top 500 scores for 2 classifiers, we get back 1000 rows (each row consisting of a classifier ID and a score, e.g. [4, 9100]). The scores themselves are non-uniform (the distribution tends to clump toward one end of the range, which by the way is -10000 to 10000).
As we transition to Cassandra, there are a number of requirements. First of all, we need to be able to query for the top and bottom N scores on a per-classifier basis. Normally I can see that an ordered partitioner would be appropriate for this; however, like I said, the scores tend to clump at the extremes (which would put too much of a burden on one node). So my first question is: how do I evenly distribute the classifier/score pairs while still being able to query for the top or bottom N?
There is a secondary requirement which pretty much screws up the first one. Sometimes it is necessary to find all scores that are near another score. So if I see classifier 6 with a score of 400, I might ask: show me the 500 scores closest to that (all within classifier 6). I'm absolutely stumped about this one. I've read that Cassandra supports secondary indices (yay), but only hash type (boo, no ranges). Do we create a separate ColumnFamily for this use case?
And finally, speed is paramount. The data is being used in an interactive GUI application. Ideally, queries should only take a few seconds. And if data all gets stuck on one particular node, it will slow things down.
We've tried all kinds of clever tricks. Our best idea was to put the data into buckets, so that the top 500 went into bucket 1, the next 500 went into bucket 2, and so on. The advantage is that to get the top 500 we just ask for bucket 1. Also, all of the data WOULD be evenly distributed using a random partitioner. However, since MOST of our queries are only interested in bucket 1, it would put a lot of burden on just one node (remember, if N classifiers are involved, it's actually 500 * N scores per bucket). The real disadvantage of this scheme is that it falls apart when we need to query based on nearness to a score (we'd have to do some kind of weird binary search over the buckets to find our starting value).
At this point we're running low on ideas. Everything I've seen about Cassandra makes me wonder whether it's even appropriate for this task. We chose it mainly because of its horizontal scalability, which is important (it's much easier to add a node than to shard an RDBMS). So I suppose my overall question is: how would you approach this? If Cassandra, please address any of the above issues. Otherwise any insight or wisdom would be appreciated. Thanks.
Why not store the classifier as the column family row key and the score as the column name? Since columns are sorted, it is really fast to query the top or bottom 500 columns for a given classifier. The second type of query is also possible: when you are looking for the scores near s, you can, for instance, select the 500 columns before s and the 500 columns after s, and then filter down to the 500 columns nearest s.
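In CQL terms (newer Cassandra versions) that maps to classifier as the partition key and score as a clustering column. Below is a rough sketch with the Python driver, where the keyspace name, the table layout and the extra item_id column (added so duplicate scores can coexist) are my own assumptions:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("research")   # keyspace name is hypothetical

# Partition by classifier (partitions are spread across nodes by the default
# partitioner); cluster by score so each partition is stored sorted by score.
session.execute("""
    CREATE TABLE IF NOT EXISTS scores (
        classifier int,
        score int,
        item_id uuid,
        PRIMARY KEY (classifier, score, item_id)
    ) WITH CLUSTERING ORDER BY (score DESC, item_id ASC)
""")

# Top 500 for classifier 6 (clustering order is already score DESC).
top = session.execute(
    "SELECT score, item_id FROM scores WHERE classifier = 6 LIMIT 500")

# 500 scores nearest 400: take up to 500 on each side, then trim client-side.
above = session.execute(
    "SELECT score, item_id FROM scores "
    "WHERE classifier = 6 AND score >= 400 ORDER BY score ASC LIMIT 500")
below = session.execute(
    "SELECT score, item_id FROM scores "
    "WHERE classifier = 6 AND score < 400 LIMIT 500")
nearest = sorted(list(above) + list(below), key=lambda r: abs(r.score - 400))[:500]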

Inspiration needed: Selecting large amounts of data for a highscore

I need some inspiration for a solution...
We are running an online game with around 80,000 active users. We are hoping to expand this and are therefore setting a target of up to 500,000 users.
The game includes a highscore for all the users, which is based on a large set of data. This data needs to be processed in code to calculate the values for each user.
After the values are calculated we need to rank the users, and write the data to a highscore table.
My problem is that in order to generate a highscore for 500,000 users we need to load on the order of 25-30 million rows from the database, totalling around 1.5-2 GB of raw data. Also, in order to rank the values we need the complete set of values.
Also we need to generate the highscore as often as possible - preferably every 30 minutes.
Now we could just use brute force: load the 30 million records every 30 minutes, calculate the values, rank them, and write them back to the database. But I'm worried about the strain this will put on the database, the application server and the network, and whether it's even possible.
I'm thinking the solution might be to break the problem up somehow, but I can't see how. So I'm looking for some inspiration on possible alternative solutions, given the following:
We need a complete highscore of all ~500,000 teams; we can't (and won't, unless absolutely necessary) shard it.
I'm assuming that there is no way to rank users without having a list of all users' values.
Calculating the value for each team has to be done in code - we can't do it in SQL alone.
Our current method loads each user's data individually (3 calls to the database) to calculate the value. It takes around 20 minutes to load the data and generate the highscore for 25,000 users, which is far too slow if this is to scale to 500,000.
I'm assuming that hardware size will not be an issue (within reasonable limits).
We are already using memcached to store and retrieve cached data
Any suggestions, links to good articles about similar issues are welcome.
Interesting problem. In my experience, batch processes should only be used as a last resort. You are usually better off having your software calculate values as it inserts/updates the database with the new data. For your scenario, this would mean that it should run the score calculation code every time it inserts or updates any of the data that goes into calculating the team's score. Store the calculated value in the DB with the team's record. Put an index on the calculated value field. You can then ask the database to sort on that field and it will be relatively fast. Even with millions of records, it should be able to return the top n records in O(n) time or better. I don't think you'll even need a high scores table at all, since the query will be fast enough (unless you have some other need for the high scores table other than as a cache). This solution also gives you real-time results.
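A tiny sketch of that calculate-on-write pattern, using SQLite purely as a stand-in for the real game database and a simple additive delta as a stand-in for the real score calculation (both are assumptions, not the actual schema or scoring logic):

import sqlite3

db = sqlite3.connect(":memory:")   # stand-in for the game database
db.execute("CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT, score INTEGER DEFAULT 0)")
db.execute("CREATE INDEX idx_teams_score ON teams (score DESC)")

def apply_event(team_id, delta):
    # Fold each game event into the stored score as it happens,
    # instead of recomputing everything in a 30-minute batch.
    db.execute("UPDATE teams SET score = score + ? WHERE id = ?", (delta, team_id))

def highscore(n=100):
    # The index on score makes this an ordered read of the top n rows.
    return db.execute(
        "SELECT id, name, score FROM teams ORDER BY score DESC LIMIT ?", (n,)
    ).fetchall()

db.executemany("INSERT INTO teams (id, name) VALUES (?, ?)",
               [(1, "red"), (2, "blue"), (3, "green")])
apply_event(2, 50)
apply_event(1, 20)
print(highscore(2))   # [(2, 'blue', 50), (1, 'red', 20)]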
Assuming that most of your 2 GB of data is not changing that frequently, you can calculate and cache (in the DB or elsewhere) the totals each day and then just add the difference based on the new records received since the last calculation.
In PostgreSQL you could cluster the table on the column that records when each row was inserted and create an index on that column. You can then run calculations on recent data without having to scan the entire table.
First and foremost:
The computation has to take place somewhere.
User experience impact should be as low as possible.
One possible solution is:
Replicate (mirror) the database in real time.
Pull the data from the mirrored DB.
Do the analysis on the mirror or on a third, dedicated, machine.
Push the results to the main database.
Results are still going to take a while, but at least performance won't be impacted as much.
How about saving those scores in the database, and then simply querying the database for the top scores (so that the computation is done on the server side, not on the client side, and thus there is no need to move millions of records)?
It sounds pretty straightforward... unless I'm missing your point; let me know.
Calculate and store the score of each active team on a rolling basis. Once you've stored the score, you should be able to do the sorting/ordering/retrieval in SQL. Why is this not an option?
It might prove fruitless, but I'd at least take a gander at the way sorting is done on a lower level and see if you can't manage to get some inspiration from it. You might be able to grab more manageable amounts of data for processing at a time.
Have you run tests to see whether or not your concerns with the data size are valid? On a mid-range server throwing around 2GB isn't too difficult if the software is optimized for it.
Seems to me this is clearly a job for caching, because you should be able to keep the half-million score records semi-local, if not in RAM. Every time you update data in the big DB, make the corresponding adjustment to the local score record.
Sorting the local score records should be trivial. (They are nearly in order to begin with.)
If you only need the top 100 or so scores, then the sorting is even easier. All you have to do is scan the list and insertion-sort each element into a 100-element list. If an element is lower than the smallest element currently in that list, which it is 99.98% of the time, you don't have to do anything.
Then run a big update from the whole DB once every day or so, just to eliminate any creeping inconsistencies.
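The top-100 pass described above is essentially a bounded selection; here is a small sketch of how it might look over the cached score records (the (team_id, score) record format is an assumption):

import heapq

def top_scores(score_records, n=100):
    # Keep only the n best scores while streaming through the local records;
    # anything below the current n-th best is skipped immediately.
    best = []                                  # min-heap of the best n seen so far
    for team_id, score in score_records:
        if len(best) < n:
            heapq.heappush(best, (score, team_id))
        elif score > best[0][0]:               # beats the current smallest of the top n
            heapq.heapreplace(best, (score, team_id))
    return sorted(best, reverse=True)          # highest score first

print(top_scores([("a", 10), ("b", 99), ("c", 5), ("d", 42)], n=2))
# [(99, 'b'), (42, 'd')]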

Resources