What is HyperLogLog and what is it good for?

I was studying the data structures supported by Redis and I was not able to find an explanation that made me understand what HyperLogLog is.
How do I use it, and what is it good for?

Basically, it is a kind of Redis Set that uses an optimized algorithm to count elements while avoiding heavy memory consumption. The difference between a Set and a HyperLogLog is that with a HyperLogLog you can only add elements, count the unique elements, and merge several HyperLogLogs into another one. You don't store the members themselves, as you would in a Set, and you can't retrieve them; you only keep enough information to count the distinct members that were added, which is why HyperLogLog doesn't provide a command to retrieve its stored members.
A clear use case is a huge Set where you frequently need to count the number of unique elements, you don't care which elements are in it, and you want memory usage to stay low even as the set grows a lot. For instance, imagine a high-traffic system with a large number of very active users, where you want to know the number of unique visitors for every webpage. You want this in real time, so every second you query the unique visitors for every page. You could create a HyperLogLog for every URI in your system (each one representing a webpage), and every time a user visits a URL you PFADD the user_id:
PFADD /api/show/concerts id789989
Then, every second, you iterate over every URL HyperLogLog to get the number of unique visitors:
PFCOUNT /api/show/concerts
145542
PFCOUNT /api/show/open-airs
25565223
You might say: yes, but I can get the same functionality with a Set, with the benefit of having the user_ids stored as members. You can, but you will consume much more memory with Sets, and querying every Set each second with the SCARD command to get the number of unique visitors stays expensive. So unless you need to store the user_ids for some other reason, HyperLogLogs are the better option as counters of unique elements. For our use case, imagine having 200-300 sets with around 20-30k users in each.
The correspondence between HyperLogLog and Set commands:
PFADD = SADD
PFCOUNT = SCARD
PFMERGE = SUNION
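To make the workflow above concrete, here is a minimal sketch using redis-py; the key names and user id are just the ones from the example, and the connection details are assumptions:

import redis

r = redis.Redis()  # assumes a local Redis instance

def record_visit(uri, user_id):
    # PFADD: register the visitor; re-adding the same id never changes the count
    r.pfadd(uri, user_id)

def unique_visitors(uris):
    # PFCOUNT per URI, polled as often as you like (e.g. every second)
    return {uri: r.pfcount(uri) for uri in uris}

record_visit("/api/show/concerts", "id789989")
print(unique_visitors(["/api/show/concerts", "/api/show/open-airs"]))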

I do not think it is really a data type; it is an algorithm, but Redis exposes it as if it were a type.
It is a fairly complex algorithm that looks at the string, does some parsing on it, does some clever math, and in a sense "remembers" that string without actually storing it.
It has nothing to do with logging (I initially thought it did). It is used whenever we want to keep track of the uniqueness of a collection of elements, and specifically the approximate number of unique elements.
It is similar to a Set but does not store the elements.
It runs in O(1), constant time, and uses a very small amount of memory, at most 12 KB per key.
The HyperLogLog algorithm is probabilistic, which means it does not guarantee 100 percent accuracy, because it does not actually store the individual items. The Redis implementation of HyperLogLog has a standard error of 0.81 percent: if it reports 1000 views, the real count is typically somewhere around 992-1008. That error is fine for counting views, but if you need to keep exact track of unique usernames or emails, you should store them in Sets.
Here are a few examples of where HyperLogLogs can be used:
• Counting the number of unique users who visited a website
• Counting the number of distinct terms that were searched for on your website on a specific date or time
• Counting the number of distinct hashtags that were used by a user
• Counting the number of distinct words that appear in a book
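As a further hedged sketch, PFMERGE makes it cheap to roll per-day counters (for example, unique visitors per day) up into a weekly figure; the key names and dates below are invented for illustration:

import redis

r = redis.Redis()

daily_keys = [f"visitors:2023-05-{day:02d}" for day in range(1, 8)]
r.pfmerge("visitors:week", *daily_keys)   # PFMERGE dest src1 src2 ...
print(r.pfcount("visitors:week"))         # approximate unique visitors for the week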

Related

How can I store the date with datastore?

Datastore documentation is very clear that there is an issue with "hotspots" if you index 'monotonically increasing values' (like the current unix time); however, there isn't a good alternative mentioned, nor is it addressed whether storing the exact same value (rather than increasing values) would create "hotspots":
"Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates."
https://cloud.google.com/datastore/docs/best-practices
I would like to store the time when each particular entity is inserted into the datastore, if that's not possible though, storing just the date would also work.
That almost seems more likely to cause "hotspots" though, since every new entity for 24 hours would get added to the same index (that's my understanding anyway).
Perhaps there's something more going on with how indexes work (I am having trouble finding great explanations of exactly how they work), and having the same value indexed over and over again is fine, but incrementing values are not.
I would appreciate if anyone has an answer to this question, or else better documentation for how datastore indexes work.
Is your application actually planning on querying the date? If not, consider simply not indexing that property. If you only need to read that property infrequently, consider writing a mapreduce rather than indexing.
That advice is given due to the way BigTable tablets work, which is described here: https://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
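If you go the "don't index it" route, a minimal sketch (assuming the Python ndb client; the model and property names are made up) looks like this -- the timestamp is stored but kept out of the indexes, so it can't create an index hotspot, at the cost of not being queryable:

from google.cloud import ndb

class Visit(ndb.Model):
    user_id = ndb.StringProperty()
    # indexed=False keeps the monotonically increasing value out of the indexes
    created_at = ndb.DateTimeProperty(auto_now_add=True, indexed=False)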
To the best of my knowledge, it's more important to have the primary key of an entity not be a monotonically increasing number. It would be better to have a string key, so the entity can be stored with better distribution.
But saying this as a non-expert, I can't imagine that indexes on individual properties with monotonic values would be as problematic, if it's legitimately needed. I know with the Nomulus codebase for example, we had a legitimate need for an index on time, because we wanted to delete commit logs older than a specific time.
One cool thing I think happens with these monotonic indexes is that, when these tablet splits don't happen, fetching the leftmost or rightmost element in the index actually has better latency properties than fetching stuff in the middle of the index. For example, if you do a query that just grabs the first result in the index, it can actually go faster than a key lookup.
There is a key quote in the page that Justine linked to that is very helpful:
As a developer, what can you do to avoid this situation? ... Lower your write rate, or figure out how to better distribute values.
It is ok to store an indexed time stamp as long as that entity has a low write rate.
If you have an entity where you want to store an indexed time stamp and the entity has a high write rate, then the solution is to split the entity into two entities. Entity A will have properties that need to be updated frequently and entity B will have the time stamp and properties that don't get updated often.
When I do this, I have a common ID for the two entities to make it really easy to get from one to the other.
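A rough sketch of that split (Python ndb client; entity and property names are hypothetical): the frequently written entity carries no indexed timestamp, the rarely written one carries the indexed creation time, and both share the same string ID:

from google.cloud import ndb

class GameStateHot(ndb.Model):
    # updated constantly; nothing monotonically increasing is indexed here
    score = ndb.IntegerProperty(indexed=False)
    health = ndb.IntegerProperty(indexed=False)

class GameStateMeta(ndb.Model):
    # written once, so indexing the timestamp is safe
    created = ndb.DateTimeProperty(auto_now_add=True)  # indexed by default

def create_pair(client, common_id):
    with client.context():
        GameStateHot(id=common_id, score=0, health=100).put()
        GameStateMeta(id=common_id).put()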
You could try storing just the date and put random hours, minutes, and seconds into the timestamp, then throw away that extra data later. (Or keep the hours and minutes and use random seconds, for example). I'm not 100% sure this would work but if you need to index the date it's worth trying.

GAE — Performance of queries on indexed properties

If I had an entity with an indexed property, say "name," what would the performance of == queries on that property be like?
Of course, I understand that no exact answers are possible, but how does the performance correlate with the total number of entities for which name == x for some x, the total number of entities in the datastore, etc.?
How much slower would a query on name == x be if I had 1000 entities with name equalling x, versus 100 entities? Has any sort of benchmarking been done on this?
Some not very strenuous testing on my part indicated response times increased roughly linearly with the number of results returned. Note that even if you have 1000 entities, if you add a limit=100 to your query, it'll perform the same as if you only had 100 entities.
This is in line with the documentation which indicates that perf varies with the number of entities returned.
When I say not very strenuous, I mean that the response times were all over the place, and it was a very very rough estimate to draw a line through. I'd often see an order of magnitude difference in perf on the same request.
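For reference, a query with a limit looks roughly like this (Python ndb client, hypothetical Person model); the point is that the work done tracks the 100 rows fetched, not the 1000 that match:

from google.cloud import ndb

class Person(ndb.Model):
    name = ndb.StringProperty()  # indexed by default, so == queries use the index

def first_hundred_named(client, x):
    with client.context():
        # walks the name index from the first occurrence of x and stops after 100 rows
        return Person.query(Person.name == x).fetch(100)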
AppEngine does queries in a very optimized way, so it is virtually irrelevant from a performance stand-point whether you do a query on the name property vs. just doing a batch-get with the keys only. Either will be linear in the number of entities returned. The total number of entities stored in your database does not make a difference. What does make a tiny difference, though, is the number of different values for "name" that occur in your database (so, 1000 entities returned will be pretty much exactly 10 times slower than 100 entities returned).
The way this is done is via the indices (or indexes as preferred) stored along with your data. An index for the "name" property consists of a table that has all names sorted in alphabetical order (and a second one sorted in reverse alphabetical order, if you use descending order in any of your queries) and a query will then simply find the first occurrence of the name you are querying in the table and start returning results in order. This is called a "scan".
This video is a bit technical, but it explains in detail how all this works and if you're concerned about coding for maximum performance, might be a good time investment:
Google I/O 2008: Under the Covers of the Google App Engine Datastore
(the video quality is fairly bad, but they also have the slides online (see link above video))

design a system supporting massive data storage and query

I was asked by the interviewer to design a system to store gigabytes of data and the system also has to support some kind of query.
Description:
There are a massive number of records generated in an IDC; each record is composed of a URL, the IP that visited the URL, and the time the visit occurred. The record can probably be stated as a struct like this, but I'm not sure which data types I should pick to represent the fields:
struct Record {
    char  *url;
    int    ip;          /* is int the right type for an IP? */
    time_t visit_time;  /* or simply a number? */
};
Requirements:
Design a system to store 100 billion records; the system also has to support at least 2 kinds of query:
First, given a time period (t1, t2) and an IP, query how many URLs this IP has visited in the given period.
Second, given a time period (t1, t2) and a URL, query how many times this URL has been visited.
I was stumped, and here is my (admittedly naive) solution:
Analysis:
Because every query is performed over a given period of time:
1. Create a set, put all visit times into it, and keep the set ordered by time from oldest to newest.
2. Create a hash table using hash(visit_time) as the key (the time hash table); each node in a given bucket has 2 pointers pointing to 2 other hash tables.
3. Those 2 hash tables are an IP hash table and a URL hash table.
The IP hash table uses hash(ip) as the key, and all the IPs in the same IP hash table share the same visit time;
the URL hash table uses hash(url) as the key, and all the URLs in the same URL hash table share the same visit time.
Give a drawing as follows:
time_hashtbl
[]
[]
[]-->[visit_time_i]-->[visit_time_j]-->...-->[visit_time_p]-->NIL
[]       /      \
[]  ip_hashtbl  url_hashtbl
[]      []          []
[]      []          []
[]       :           :
So, when doing a query over (t1, t2):
find the closest matches in the time set, say (t1', t2'); all the valid visit times then fall into the part of the set from t1' to t2';
for each visit time t in the time set[t1':t2'], compute hash(t) and find t's ip_hashtbl or url_hashtbl, then count how many times the given IP or URL appears.
Questions:
1. My solution is naive; I hope you can suggest a better one.
2. With respect to storing the massive number of records on disk, any advice? I thought of a B-tree, but how would I use it, and is a B-tree even applicable in this system?
I believe the interviewer was expecting a distributed-computing-based solution, especially when "100 billion records" are involved. With the limited knowledge of distributed computing I have, I would suggest you look into distributed hash tables and MapReduce (for parallel query processing).
In my opinion, create a B+ tree using time as the key to quickly locate the range of records for a given time period (t1, t2) on disk. Then use the records in (t1, t2) to build the IP and URL hash tables respectively.
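An in-memory toy of that idea (not the on-disk B+ tree itself, and the record layout is assumed): keep the records sorted by time, binary-search the (t1, t2) window, and build the per-IP / per-URL counts from just that slice:

import bisect
from collections import namedtuple

Record = namedtuple("Record", "time ip url")

def build_index(records):
    # the sorted order is the role the B+ tree plays on disk
    records = sorted(records, key=lambda r: r.time)
    times = [r.time for r in records]
    return records, times

def time_slice(records, times, t1, t2):
    lo = bisect.bisect_left(times, t1)
    hi = bisect.bisect_right(times, t2)
    return records[lo:hi]

def urls_visited_by_ip(records, times, t1, t2, ip):
    return len({r.url for r in time_slice(records, times, t1, t2) if r.ip == ip})

def times_url_visited(records, times, t1, t2, url):
    return sum(1 for r in time_slice(records, times, t1, t2) if r.url == url)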
Old question, but recently bumped so here's a few other things to think about:
What you need to consider is a few very simple boundary limits beyond your listed requirements, assuming you have no further indexes:
First, given a time period (t1, t2) and a IP, query how many urls this IP has visited in the given period.
If you have 10k users, then at worst a scan of all records in a time window would, on average, return only about 1 in 10k of the records it accesses.
Second, given a time period (t1, t2) and a url, query how many times this url has been visited.
Depending on how many URLs you have in the system, say 1000, this again means a simple scan leaves 999 out of every 1000 records scanned unreturned.
Let's say you have only 100,000 unique URLs; you could greatly reduce the space consumed by the database (by using a GUID / int foreign key instead), and it also means the average URL is accessed 1M times across your 100Bn records.
Even with all this, it tells us nothing definite, because we don't have numbers / statistics on how clustered by time the records are for the given search times. Are we getting 1000 page requests every second and searching a 12-month time range, or are we getting 100 requests per second and searching a 1-hour time block (360k requests)?
Assuming the 100Bn represents 12 months of data, that's about 3,170 requests per second. Does that sound reasonable?
Why is this important? Because it highlights one key thing you overlooked in your answer.
With 100Bn records in the past 12 months, that means in another 12 months you'll have 200Bn records to deal with. If 100Bn records covers 20 years then it's not such an issue, since you can expect to grow by only another 25-30Bn in the next 5 years... but it's unlikely that your existing data spans such a long time frame.
Your solution only answers one side of the equation (reading data); you don't consider any of the complications of writing that much data. The vast majority of the time you will be inserting data into whatever data store you create, so will it be able to handle a constant 3k insert requests per second?
If you insert 3k records per second and each record is just 3x 64-bit integers representing the time (in ticks), the IP address, and a foreign key to the URL, that is only ~75 KB/s of write traffic, which is easy to sustain. If every URL is assumed to be unique, however, you could easily run into performance issues due to IO speeds (ignoring the space requirements).
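Spelling out that back-of-the-envelope arithmetic (same assumptions as above):

records = 100_000_000_000                   # 100Bn records over 12 months
seconds_per_year = 365 * 24 * 3600
writes_per_second = records / seconds_per_year        # ~3,170 inserts/s

record_bytes = 3 * 8                        # time, IP, URL foreign key as 64-bit ints
write_bandwidth = writes_per_second * record_bytes    # ~75 KB/s of raw record data
print(round(writes_per_second), round(write_bandwidth / 1024))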
One other thing the interviewer would be interested in seeing is your thoughts on supporting IPv6.
Lastly, if you provided a solution like you have then the interviewer should have asked a followup question. "How would your system perform if I now want to know when a specific ip address last accessed a specific url?"
So yes, if you don't know about MapReduce and other distributed processing query systems then yours should be a reasonable answer.
I would use an interval tree that is also a B-tree: an interval tree because every query takes a time interval as input, and a B-tree because of the size of the input (billions of records).

Even partitioning of nonuniform ranged data in cassandra

I've got a rather tricky one, bear with me as I try not to stumble over my words here. I'm doing some research, and my group is transitioning to a cassandra database. Our research used MySQL before, but the data outgrew the database (192 million rows in memory @ 16 GB -- it was the only way to query the data fast enough). The data itself is kinda-sorta static. There's a whole lot of it, but any new data is a somewhat slow trickle at this point.
The data consists of a boatload of classifier-score pairs. We formulate queries for the database which basically say, "give me the top 500 for the following classifiers". Then the database returns that many scores. For example, if we ask for the top 500 scores for 2 classifiers, we get back 1000 rows (each row consisting of a classifier ID and a score -- i.e. [4, 9100]). The scores themselves are non-uniform (the distribution tends to clump toward one end of the values -- which by the way are from -10000 to 10000)
As we transition to cassandra, there are a number of requirements. First of all, we need to be able to query for the top and bottom N scores on a per-classifier basis. Normally I can see that an ordered partitioner would be appropriate for this; however, like I said, the scores tend to clump at the extremes (which would put too much of a burden on one node). So my first question is: how do I evenly distribute the classifier/score pairs while still being able to query for the top or bottom N?
There is a secondary requirement which pretty much screws up the first one. Sometimes it is necessary to find all scores that are near another score. So if I see classifier 6 with a score of 400, I might ask, show me 500 scores that are the closest to that (all within classifier 6). I'm absolutely stumped about this one. I've read that cassandra supports secondary indices (yay) but only hash type (boo - no ranges). Do we create a separate ColumnFamily for this use case?
And finally, speed is paramount. The data is being used in an interactive GUI application. Ideally, queries should only take a few seconds. And if data all gets stuck on one particular node, it will slow things down.
We've tried all kinds of clever tricks. Our best idea was to put the data into buckets, so that the top 500 went into bucket 1, the next 500 went into bucket 2, and so on. The advantage is that to get the top 500 we just ask for bucket 1. Also all of the data WOULD be evenly distributed using a random partitioner. However since MOST of our queries are interested only in bucket 1, it would put a lot of burden on just one node (remember, if N classifiers are involved, it's actually 500 * N scores per bucket). The real disadvantage of this scheme is that it falls apart when we need to query based on nearness to a score (we'd have to do some kind of weird binary search over the buckets to find our starting value).
At this point we're running low on ideas. Everything I've seen about cassandra makes me wonder if it's even appropriate for this task. We chose it mainly because of its horizontal scalability, which is important (much easier to add a node than to shard an RDBMS). So I suppose my overall question is: how would you approach this? If cassandra, please address any of the above issues. Otherwise any insight or wisdom would be appreciated. Thanks.
Why not store the classifier as the column family row key and the score as the column name? Since columns are sorted, it is really fast to query the top/bottom 500 columns for a given classifier. The second type of query is also possible: when you are looking for the scores near s, you can, for instance, select the 500 columns before s and the 500 columns after s, and then filter down to the 500 columns nearest to s.
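In present-day CQL terms, that layout is a partition key per classifier with score as a clustering column. A sketch with the DataStax Python driver follows; the keyspace, table, and column names are assumptions, and a third key column is added so equal scores don't collide:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("research")

session.execute("""
    CREATE TABLE IF NOT EXISTS scores (
        classifier int,
        score int,
        item_id uuid,
        PRIMARY KEY (classifier, score, item_id)
    ) WITH CLUSTERING ORDER BY (score DESC, item_id ASC)
""")

# top 500 for classifier 6 (for the bottom 500, add ORDER BY score ASC)
top = session.execute("SELECT score FROM scores WHERE classifier = 6 LIMIT 500")

# "near 400": grab 500 on each side and keep the closest 500 client-side
above = session.execute(
    "SELECT score FROM scores WHERE classifier = 6 AND score >= 400 "
    "ORDER BY score ASC LIMIT 500")
below = session.execute(
    "SELECT score FROM scores WHERE classifier = 6 AND score < 400 LIMIT 500")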

Inspiration needed: Selecting large amounts of data for a highscore

I need some inspiration for a solution...
We are running an online game with around 80,000 active users - we are hoping to expand this and are therefore setting a target of up to 500,000 users.
The game includes a highscore for all the users, which is based on a large set of data. This data needs to be processed in code to calculate the values for each user.
After the values are calculated we need to rank the users, and write the data to a highscore table.
My problem is that in order to generate a highscore for 500,000 users we need to load data from the database on the order of 25-30 million rows, totalling around 1.5-2 GB of raw data. Also, in order to rank the values we need to have the complete set of values.
Also we need to generate the highscore as often as possible - preferably every 30 minutes.
Now we could just use brute force - load the 30 million records every 30 minutes, calculate the values, rank them, and write them to the database, but I'm worried about the strain this will put on the database, the application server and the network - and whether it's even possible.
I'm thinking the solution might be to break up the problem somehow, but I can't see how. So I'm looking for inspiration on possible alternative solutions, based on this information:
We need a complete highscore of all ~500.000 teams - we can't (won't unless absolutely necessary) shard it.
I'm assuming that there is no way to rank users without having a list of all users' values.
Calculating the value for each team has to be done in code - we can't do it in SQL alone.
Our current method loads each user's data individually (3 calls to the database) to calculate the value - it takes around 20 minutes to load the data and generate the highscore for 25,000 users, which is too slow if this is to scale to 500,000.
I'm assuming that hardware size will not be an issue (within reasonable limits)
We are already using memcached to store and retrieve cached data
Any suggestions, links to good articles about similar issues are welcome.
Interesting problem. In my experience, batch processes should only be used as a last resort. You are usually better off having your software calculate values as it inserts/updates the database with the new data. For your scenario, this would mean that it should run the score calculation code every time it inserts or updates any of the data that goes into calculating the team's score. Store the calculated value in the DB with the team's record. Put an index on the calculated value field. You can then ask the database to sort on that field and it will be relatively fast. Even with millions of records, it should be able to return the top n records in O(n) time or better. I don't think you'll even need a high scores table at all, since the query will be fast enough (unless you have some other need for the high scores table other than as a cache). This solution also gives you real-time results.
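A toy version of that suggestion (sqlite3 with an illustrative schema, just to show the shape of it): the score lives on the team row, it is indexed, and the top N comes straight from the database:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)")
db.execute("CREATE INDEX idx_teams_score ON teams (score)")

def update_score(team_id, new_score):
    # called from whatever code path already writes the underlying data
    db.execute("UPDATE teams SET score = ? WHERE id = ?", (new_score, team_id))
    db.commit()

def top_n(n=100):
    # the index on score makes this an index walk of n rows, not a full sort
    return db.execute(
        "SELECT id, name, score FROM teams ORDER BY score DESC LIMIT ?", (n,)
    ).fetchall()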
Assuming that most of your 2 GB of data is not changing that frequently, you can calculate and cache (in the DB or elsewhere) the totals each day and then just add the difference based on new records added since the last calculation.
In postgresql you could cluster the table on the column that represents when the record was inserted and create an index on that column. You can then make calculations on recent data without having to scan the entire table.
First and foremost:
The computation has to take place somewhere.
User experience impact should be as low as possible.
One possible solution is:
Replicate (mirror) the database in real time.
Pull the data from the mirrored DB.
Do the analysis on the mirror or on a third, dedicated, machine.
Push the results to the main database.
Results are still going to take a while, but at least performance won't be impacted as much.
How about saving those scores in a database and then simply querying the database for the top scores (so that the computation is done on the server side, not on the client side, and thus there is no need to move the millions of records)?
It sounds pretty straightforward... unless I'm missing your point... let me know.
Calculate and store the score of each active team on a rolling basis. Once you've stored the score, you should be able to do the sorting/ordering/retrieval in the SQL. Why is this not an option?
It might prove fruitless, but I'd at least take a gander at the way sorting is done on a lower level and see if you can't manage to get some inspiration from it. You might be able to grab more manageable amounts of data for processing at a time.
Have you run tests to see whether or not your concerns with the data size are valid? On a mid-range server throwing around 2GB isn't too difficult if the software is optimized for it.
Seems to me this is clearly a job for caching, because you should be able to keep the half-million score records semi-local, if not in RAM. Every time you update data in the big DB, make the corresponding adjustment to the local score record.
Sorting the local score records should be trivial. (They are nearly in order to begin with.)
If you only need to know the top 100-or-so scores, then the sorting is even easier. All you have to do is scan the list and insertion-sort each element into a 100-element list. If the element is lower than the first element, which it is 99.98% of the time, you don't have to do anything.
Then run a big update from the whole DB once every day or so, just to eliminate any creeping inconsistencies.
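A sketch of the "scan and keep a small leaderboard" part, using a min-heap of size 100 from the standard library instead of hand-rolled insertion sort (same effect: most records fail the cheap comparison and are skipped):

import heapq

def top_100(score_records):
    heap = []                      # smallest score of the current top 100 sits at heap[0]
    for team_id, score in score_records:
        if len(heap) < 100:
            heapq.heappush(heap, (score, team_id))
        elif score > heap[0][0]:   # the vast majority of records fail this test
            heapq.heapreplace(heap, (score, team_id))
    return sorted(heap, reverse=True)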
