B+ Tree CPU search time - database

I was just wondering how you would calculate the worst-case search time for an unclustered and a clustered B+ tree.
For example, say I had 1,000,000 records (1 row = 100 bytes), disk pages are 4000 bytes, a key is 20 bytes, and the access time of a page is 40 ms. How would I use these variables to calculate the unclustered and clustered B+ tree worst-case times?
I know that to calculate the height/levels of a B+ tree you use the following (I think):
logF(keys)
where F = the fan-out, i.e. the number of branches per node.
With the height, you can use it to calculate the final worst-case time, but I don't know how to do that... I've tried searching around, but all I could find were times for average cases or examples that weren't very clear.
Any help is appreciated!

I would say logF(keys) is the worst case for finding the leaf page, but after that the worst case would be an unclustered index with all of its rids pointing to different pages, which means
logF(keys) + N, where N is the number of rids in the leaf/index node.
So in the end it would be:
H = height of the tree, which would be something like 3 or 4.
H + N = 4 + (4000/20) = 204 I/Os
If, say, the pages are in memory and you want the CPU time, then it would be
CPU = 204 * 0.04 = 8.16 secs. Although 40 ms for moving a page in memory is quite a lot of time, I think (for a disk read it could make sense), but I think the calculations are fine.
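For concreteness, here is a rough Python sketch of that accounting with the question's numbers plugged in. Treat the variable names and the rounding as assumptions; rounding the height up to 4 as the answer above does is what yields its 204 I/Os and 8.16 s.
import math

# Assumed inputs taken straight from the question.
page_size, key_size, n, access_s = 4000, 20, 1_000_000, 0.040

fanout = page_size // key_size               # about 200 index entries fit on one page
height = math.ceil(math.log(n, fanout))      # ceil(log_200(1,000,000)) = 3 index levels

clustered = height + 1                       # walk the index, then read one data page
unclustered = height + fanout                # worst case: every rid in a leaf hits a different data page

print(clustered, unclustered)                          # about 4 and about 203 page accesses
print(clustered * access_s, unclustered * access_s)    # about 0.16 s and about 8.1 s
# Rounding the height up to 4 (as above) gives 204 I/Os and 8.16 s.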

Related

How does searching a million keys organized as a B-tree take 114 comparisons?

Please explain how it will take 114 comparisons. The following is a screenshot from my book (page 350, Data Structures Using C, 2nd Ed., Reema Thareja, Oxford Univ. Press). My reasoning is that in the worst case each node will have just the minimum number of children (i.e. 5), so I took log base 5 of a million, and it comes to 9. So, assuming at each level of the tree we search the minimum number of keys (i.e. 4), it comes to somewhere around 36 comparisons, nowhere near 114.
Consider a situation in which we have to search an un-indexed and
unsorted database that contains n key values. The worst case running
time to perform this operation would be O(n). In contrast, if the data
in the database is indexed with a B tree, the same search operation
will run in O(log n). For example, searching for a single key on a set
of one million keys will at most require 1,000,000 comparisons. But if
the same data is indexed with a B tree of order 10, then only 114
comparisons will be required in the worst case.
Page 350, Data Structures Using C, 2nd Ed. Reema Thareja, Oxford Univ. Press
The worst case tree has the minimum number of keys everywhere except on the path you're searching.
If the size of each internal node is in [5,10), then in the worst case a tree with a million items will be about 10 levels deep, since most nodes have only 5 keys.
The worst-case path, however, might have 10 keys in each of its nodes. The statement seems to assume that you'll do a linear search instead of a binary search inside each node (I would advise doing a binary search instead), so that can lead to around 10*10 = 100 comparisons.
If you carefully consider the details, the real number might very well come out to 114.
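As a rough, hedged reconstruction of that reasoning (not the book's exact derivation), something like the following Python sketch lands in the same ballpark; the exact 114 depends on the book's conventions for order and node occupancy.
import math

# Assumed numbers for the rough reconstruction described above.
n = 1_000_000
min_children = 5        # the thinnest allowed fan-out assumed in the question
max_keys_scanned = 10   # the answer's assumption for a linear scan of a full node

depth = math.ceil(math.log(n, min_children))   # about 9 levels in the thinnest tree
comparisons = depth * max_keys_scanned         # about 90 comparisons with linear search
print(depth, comparisons)                      # the book's exact 114 depends on its conventions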
(This is not an Answer to the question asked, but a related discussion.)
Sounds like a textbook question, not a real-life question.
Counting comparisons is likely to be the best way to judge an in-memory tree, but not for a disk-based dataset.
Even so, the "average" number of comparisons (for in-memory) or disk hits (for disk-based) is likely to be the metric to compute.
(Sure, it is good to compute the maximum numbers as a useful exercise for understanding the structures.)
Perhaps the optimal "tree" for in-memory searching is a binary tree, but with 3-way fan-out, keeping the tree balanced with 2 or 3 elements in each node.
For disk-based searching -- think databases -- the optimal structure is likely to be a B-tree with the block size based on what is efficient to read from disk. Counting comparisons is a poor second when it comes to the overall time taken to fetch a row.

K nearest elements

This is an interview question.
There are billions and billions of stars in the universe. Which data structure would you use to answer the query
"Give me the k stars nearest to Earth".
I thought of heaps, since we can heapify in O(n) and extract the smallest in O(log n). Is there a better data structure suited for this purpose?
Assuming the input could not all be stored in memory at the same time (that would be a challenge!), but would instead be a stream of the stars in the universe -- as if you were given an iterator or something -- you could benefit from using a max-heap (instead of a min-heap, which might come to mind first).
At the start you would just push the stars onto the heap, keyed by their distance to Earth, until your heap has k entries.
From then on, you ignore any new star when it is farther away than the root of your heap. When it is closer than the root star, replace the root with the new star and sift it down to restore the heap property.
Your heap will not grow beyond k elements, and at all times it will hold the k closest stars among those you have processed so far.
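If it helps to see it concretely, here is a minimal Python sketch of this approach, assuming the input is an iterable of (distance, star) pairs (the names k_nearest, stars and k are just placeholders):
import heapq
from itertools import count

def k_nearest(stars, k):
    # Python's heapq is a min-heap, so distances are negated to simulate a max-heap
    # keyed by distance; the counter breaks ties so star objects are never compared.
    heap = []
    tie = count()
    for distance, star in stars:
        entry = (-distance, next(tie), star)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif distance < -heap[0][0]:          # closer than the current farthest kept star
            heapq.heapreplace(heap, entry)
    return [star for _, _, star in sorted(heap, key=lambda e: -e[0])]   # nearest first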
Some remarks:
Since it is a max-heap, you don't know which is the closest star (in constant time). But when you stop the algorithm and pull out the root node repeatedly, you get the k closest stars in order of descending distance.
As the observable(!) universe has an estimated 10^21 stars, you would need one of the best supercomputers (1 exaFLOPS) to hope to process all of them in a reasonable time. But at least this algorithm only needs to keep k stars in memory.
The first problem you're going to run into is scale. There are somewhere between 100 billion and 400 billion stars in the Milky Way galaxy alone, and an estimated 10 billion galaxies. If we assume an average of 100 billion stars per galaxy, that's about 10^21 stars in the universe. It's unlikely you'll have the memory for that. And even if you did have enough memory, you probably don't have the time. Assuming your heapify operation could do a billion iterations per second, it would take a trillion seconds (about 31,700 years). And then you have to add the time it would take to remove the k smallest from the heap.
It's unlikely that you could get a significant improvement by using multiple threads or processes to build the heap.
The key here will be to pre-process the data and store it in a form that lets you quickly eliminate the majority of possibilities. The easiest way would be to have a sorted list of stars, ordered by their distance from Earth. So Sol would be at the top of the list, Proxima Centauri would be next, etc. Then, getting the nearest k stars would be an O(k) operation: just read the top k items from the list.
A sorted list would be pretty hard to update, though. A better alternative would be a k-d tree. It's easier to update, and getting the k nearest neighbors is still reasonably quick.

Generated unique id with 6 characters - handling when too much ids already used

In my program you can book an item. This item has an id of 6 characters drawn from 32 possible characters.
So there are 32^6 possible ids. Every id must be unique.
func tryToAddItem() {
    if !db.contains(generateId()) {
        addItem()
    } else {
        tryToAddItem()    // collision: try again with a new id
    }
}
For example, say 90% of my ids are used. Then the probability that I call tryToAddItem 5 times is 0.9^5 * 100 ≈ 59%, isn't it?
That is quite high, and it means 5 database queries against a lot of data.
When the probability gets that high I want to introduce a prefix („A-xxxxxx“).
What is a good condition for that? At what point will I need a prefix?
In my example 90% of the ids were used. What about the rest? Do I throw them away?
What about database performance when I call tryToAddItem 5 times? I imagine that this is not best practice.
For example, say 90% of my ids are used. Then the probability that I call tryToAddItem 5 times is 0.9^5 * 100 ≈ 59%, isn't it?
Not quite. Let's represent the number of calls you make with the random variable X, and let's call the probability that a generated id collides p. You want the probability that you make the call at most five times, or in general at most k times:
P(X≤k) = P(X=1) + P(X=2) + ... + P(X=k)
       = (1-p) + (1-p)*p + (1-p)*p^2 + ... + (1-p)*p^(k-1)
       = (1-p)*(1 + p + p^2 + ... + p^(k-1))
If we expand this out, all but two terms cancel and we get:
       = 1 - p^k
which we want to be greater than some probability x:
1 - p^k > x
Or, with p in terms of k and x:
p < (1-x)^(1/k)
where you can adjust x and k for your specific needs.
If you want less than a 50% probability of needing more than 5 calls, then no more than (1-0.5)^(1/5) ≈ 87% of your ids should be taken.
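A tiny Python sketch of that bound, if you want to plug in your own numbers (the function name is just illustrative):
# Largest fraction p of ids that may already be taken so that the chance of needing
# more than k generation attempts stays below 1 - x.
def max_load_factor(x, k):
    return (1 - x) ** (1 / k)

print(max_load_factor(0.5, 5))     # ~0.87: at most ~87% of ids taken
print(max_load_factor(0.999, 20))  # ~0.71: the "less than 0.1% chance of 20 calls" case below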
First of all make sure there is an index on the id columns you are looking up. Then I would recommend thinking more in terms of setting a very low probability of a very bad event occurring. For example maybe making 20 calls slows down the database for too long, so we'd like to set the probability of this occurring to <0.1%. Using the formula above we find that no more than 70% of ids should be taken.
But you should also consider alternative solutions. Is remapping all ids to a larger space one time only a possibility?
Or if adding ids with prefixes is not a big deal then you could generate longer ids with prefixes for all new items going forward and not have to worry about collisions.
Thanks for the response. I looked into some alternatives and want to show three possibilities.
First possibility: Create an UpcomingItemIdTable with 200 (more or less) valid itemIds. A background task can generate them every minute (or whatever interval you need), so the tryToAddItem action will always get a valid itemId.
Second possibility:
Is remapping all ids to a larger space one time only a possibility?
In my case, yes. I think for other problems the answer will be: it depends.
Third possibility: Try to generate an itemId, and when there is a collision, try again.
Possible collision handling: do some tests beforehand. Measure the time needed to generate itemIds when there are already 1,000, 10,000, 100,000, 1,000,000, etc. entries in the table. When the tryToAddItem method needs more than 100 ms (or whatever limit you prefer), increase the id length from 6 to 7, 8, or 9 characters.
Some thoughts:
every request must be atomic
create an index on itemId
Disadvantages of long UUIDs in an API: see https://zalando.github.io/restful-api-guidelines/#144
less usable, because...
-cannot be memorized and easily communicated by humans
-harder to use in debugging and logging analysis
-less convenient for consumer facing usage
-quite long: readable representation requires 36 characters and comes with higher memory and bandwidth consumption
-not ordered along their creation history and no indication of used id volume
-may be in conflict with additional backward compatibility support of legacy ids
[...]
TL;DR: For my case every possibility works. As so often, it depends on the problem. Thanks for the input.

How does the HyperLogLog algorithm work?

I've been learning about different algorithms in my spare time recently, and one that I came across which appears to be very interesting is called the HyperLogLog algorithm - which estimates how many unique items are in a list.
This was particularly interesting to me because it brought me back to my MySQL days, when I saw that "Cardinality" value (which until recently I always assumed was calculated, not estimated).
So I know how to write an algorithm in O(n) that will calculate how many unique items are in an array. I wrote this in JavaScript:
function countUniqueAlgo1(arr) {
    var Table = {};
    var numUnique = 0;
    var numDataPoints = arr.length;
    for (var j = 0; j < numDataPoints; j++) {
        var val = arr[j];
        if (Table[val] != null) {
            continue;
        }
        Table[val] = 1;
        numUnique++;
    }
    return numUnique;
}
But the problem is that my algorithm, while O(n), uses a lot of memory (storing values in Table).
I've been reading this paper about how to count duplicates in a list in O(n) time and using minimal memory.
It explains that by hashing and counting bits or something one can estimate within a certain probability (assuming the list is evenly distributed) the number of unique items in a list.
I've read the paper, but I can't seem to understand it. Can someone give a more layperson's explanation? I know what hashes are, but I don't understand how they are used in this HyperLogLog algorithm.
The main trick behind this algorithm is that if you, observing a stream of random integers, see an integer whose binary representation starts with some known prefix, there is a higher chance that the cardinality of the stream is 2^(size of the prefix).
That is, in a random stream of integers, ~50% of the numbers (in binary) start with "1", 25% start with "01", and 12.5% start with "001". This means that if you observe a random stream and see a "001", there is a higher chance that this stream has a cardinality of 8.
(The prefix "00..1" has no special meaning. It's there just because it's easy to find the most significant bit of a binary number on most processors.)
Of course, if you observe just one integer, the chance that this estimate is wrong is high. That's why the algorithm divides the stream into "m" independent substreams and keeps the maximum length of a "00...1" prefix seen in each substream. The final value is then estimated by taking the mean of the substream estimates.
That's the main idea of this algorithm. There are some missing details (the correction for low estimates, for example), but it's all well written in the paper.
A HyperLogLog is a probabilistic data structure. It counts the number of distinct elements in a list, but in comparison to the straightforward way of doing it (keeping a set and adding elements to it), it does this in an approximate way.
Before looking at how the HyperLogLog algorithm does this, one has to understand why you need it. The problem with the straightforward way is that it consumes O(distinct elements) of space. Why is there big-O notation here instead of just the number of distinct elements? Because elements can be of different sizes: one element might be the number 1, another the string "is this big string". So if you have a huge list (or a huge stream of elements), it will take a lot of memory.
Probabilistic Counting
How can one get a reasonable estimate of the number of unique elements? Assume that you have a string of length m consisting of {0, 1} with equal probability. What is the probability that it starts with a 0, with 2 zeros, with k zeros? It is 1/2, 1/4 and 1/2^k. This means that if you have encountered a string starting with k zeros, you have approximately looked through 2^k elements. So this is a good starting point: given a list of elements that are evenly distributed between 0 and 2^k - 1, you can track the maximum length of the zero prefix in the binary representations, and this will give you a reasonable estimate.
The problem is that the assumption of evenly distributed numbers from 0 to 2^k - 1 is too hard to achieve (the data we encounter is mostly not numbers, is almost never evenly distributed, and can take any values). But using a good hashing function you can assume that the output bits are evenly distributed, and most hash functions have outputs between 0 and 2^k - 1 (SHA1 gives you values between 0 and 2^160). So what we have achieved so far is that we can estimate cardinalities up to about 2^k unique elements while storing only one number of log(k) bits. The downside is that the estimate has a huge variance. A cool thing is that we have almost recreated the 1984 probabilistic counting paper (it is a little bit smarter with the estimate, but still we are close).
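Here is a toy Python sketch of that single-register idea (an illustration only, not production HyperLogLog; the use of SHA-1 and 32 bits is an arbitrary choice):
import hashlib

def rough_estimate(elements, bits=32):
    # Hash every element, remember the longest run of leading zero bits seen,
    # and estimate the cardinality as 2^(that maximum). Expect a large variance.
    max_zeros = 0
    for e in elements:
        h = int(hashlib.sha1(str(e).encode()).hexdigest(), 16) & ((1 << bits) - 1)
        zeros = bits - h.bit_length()          # leading zeros in a bits-wide word
        max_zeros = max(max_zeros, zeros)
    return 2 ** max_zeros

print(rough_estimate(range(100_000)))          # right order of magnitude, but noisy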
LogLog
Before moving further, we have to understand why our first estimate is not that great. The reason is that one random element with an unusually long prefix of zeros can spoil everything. One way to improve this is to use many hash functions, take the max for each hash function, and average the results at the end. This is an excellent idea which will improve the estimate, but the LogLog paper used a slightly different approach (probably because hashing is kind of expensive).
They used one hash but divided it into two parts. One part is called a bucket (the total number of buckets is 2^x) and the other is basically the same as our hash. It was hard for me to grasp what was going on, so I will give an example. Assume you have two elements, and your hash function, which gives values from 0 to 2^10, produced two values: 344 and 387. You decided to have 16 buckets. So you have:
0101 011000 bucket 5 will store 1
0110 000011 bucket 6 will store 4
By having more buckets you decrease the variance (you use slightly more space, but it is still tiny). Using math skills they were able to quantify the error (which is 1.3/sqrt(number of buckets)).
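A small Python sketch of that bucket split, again only as an illustration (the choice of SHA-1, 32 bits and 16 buckets is arbitrary):
import hashlib

def loglog_buckets(elements, x=4, bits=32):
    # The first x bits of the hash choose a bucket, the remaining bits feed the
    # leading-zero count, and each bucket keeps only the maximum count it has seen.
    # LogLog/HyperLogLog then combine the buckets with an average plus corrections.
    buckets = [0] * (1 << x)                    # 2^x buckets (16 here)
    for e in elements:
        h = int(hashlib.sha1(str(e).encode()).hexdigest(), 16) & ((1 << bits) - 1)
        bucket = h >> (bits - x)                # top x bits select the bucket
        rest = h & ((1 << (bits - x)) - 1)      # remaining bits
        zeros = (bits - x) - rest.bit_length()  # leading zeros of the remainder
        buckets[bucket] = max(buckets[bucket], zeros)
    return buckets

print(loglog_buckets(range(100_000)))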
HyperLogLog
HyperLogLog does not introduce any new ideas, but mostly uses a lot of math to improve the previous estimate. Researchers have found that if you remove 30% of the biggest numbers from the buckets you significantly improve the estimate. They also used another algorithm for averaging numbers. The paper is math-heavy.
And I want to finish with a recent paper which shows an improved version of the HyperLogLog algorithm (up until now I didn't have time to fully understand it, but maybe later I will improve this answer).
The intuition is that if your input is a large set of random numbers (e.g. hashed values), they should be distributed evenly over the range. Say the range goes up to 1024 (10 bits), and the minimum value observed is 10. Then the cardinality is estimated to be about 100 (10 × 100 ≈ 1024).
Read the paper for the real logic of course.
Another good explanation with sample code can be found here:
Damn Cool Algorithms: Cardinality Estimation - Nick's Blog

Cache oblivious lookahead array

I am trying to understand the simplified cache-oblivious lookahead array described here, and on page 35 of this presentation:
Analysis of Insertion into Simplified Fractal Tree:
1. Cost to merge 2 arrays of size X is O(X/B) block I/Os. Merge is very I/O efficient.
2. Cost per element to merge is O(1/B) since O(X) elements were merged.
3. Max # of times each element is merged is O(logN).
4. Average insert cost is O(logN/B)
I can understand #1, #2 and #3, but I can't understand #4. From the paper, a merge can be viewed as a carry in binary addition; for example, 31 would be represented as:
11111
When inserting a new item (adding 1), there should be 5 = log(32) merges (5 carries). But in this situation we have to merge 32 elements! In addition, if we add 1 each time, how many carries are performed going from 0 to 2^k? The answer should be 2^k - 1. In other words, one merge per insertion!
So how is #4 computed?
You are right that the number of merged elements (and so transfers) is N in the worst case, and that the total number of merges is of the same order; still, the average insertion cost is logarithmic. It comes from two facts: merges vary in cost, and the number of low-cost merges is much higher than the number of high-cost ones.
It might be easier to see by example.
Let's set B=1 (i.e. 1 element per block, worst case of each merge having a cost) and N=32 (e.g. we insert 32 elements into an initially empty array).
Half of the insertions (16) put an element into the empty subarray of size 1, and so do not cause a merge. Of the remaining insertions, one (the last) needs to merge (move) 32 elements, one (the 16th) moves 16, two (the 8th and 24th) move 8 elements, four move 4 elements, and eight move 2 elements. Thus the overall number of element moves is 96, giving an average of 3 moves per insertion.
Hope that helps.
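A tiny Python experiment reproducing that count under the same accounting (an insertion into an empty level-0 slot is free; a cascade merges everything below the first empty level into it in one pass, moving 2^i elements including the new one):
def total_moves(n):
    levels = []                      # levels[i] is True if a run of size 2^i is present
    moves = 0
    for _ in range(n):
        i = 0
        while i < len(levels) and levels[i]:
            i += 1                   # find the first empty level
        if i == len(levels):
            levels.append(False)
        if i > 0:
            moves += 2 ** i          # new element plus all smaller runs merged into level i
        for j in range(i):
            levels[j] = False
        levels[i] = True
    return moves

print(total_moves(32), total_moves(32) / 32)   # 96 moves, an average of 3.0 per insertion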
The first log B levels fit in (a single page of) memory, and so any stuff that happens in those levels does not incur an I/O. (This also fixes the problem with rrenaud's analysis that there's O(1) merges per insertion, since you only start paying for them after the first log B merges.)
Once you are merging at least B elements, then Fact 2 kicks in.
Consider the work from an element's point of view. It gets merged O(log N) times. It gets charged O(1/B) each time that happens. Its total cost of insertion is O((log N)/B) (you need the extra parens to differentiate it from O(log N/B), which would be quite bad insertion performance -- even worse than a B-tree).
The "average" cost is really the amortized cost -- it's the amount you charge to that element for its insertion. A little more formally, it's the total work for inserting N elements, divided by N. An amortized cost of O((log N)/B) really means that inserting N elements is O((N log N)/B) I/Os -- for the whole sequence. This compares quite favorably with B-trees, which for N insertions do a total of O((N log N)/log B) I/Os. Dividing by B is obviously a whole lot better than dividing by log B.
You may complain that the work is lumpy, that you sometimes do an insertion that causes a big cascade of merges. That's ok. You don't charge all the merges to the last insertion. Each element pays its own small amount for each merge it participates in. Since (log N)/B will typically be much less than 1, each element is charged way less than a single I/O over the course of all the merges it participates in.
What happens if you don't like amortized analysis, and you say that even though the insertion throughput goes up by a couple of orders of magnitude, you don't like it when a single insertion can cause a huge amount of work? Aha! There are standard ways to deamortize such a data structure, where you do a bit of preemptive merging during each insertion. You get the same I/O complexity (you'll have to take my word for it), but it's pretty standard stuff for people who care about amortized analysis and deamortizing the result.
Full disclosure: I'm one of the authors of the COLA paper. Also, rrenaud was in my algorithms class. Also, I'm a founder of Tokutek.
In general, the amortized number of changed bits per increment is 2 = O(1).
Here is a proof by logic/reasoning. http://www.cs.princeton.edu/courses/archive/spr11/cos423/Lectures/Binary%20Counting.pdf
Here is a "proof" by experimentation. http://codepad.org/0gWKC3rW
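If you prefer to see it empirically, here is a small Python sketch (not the linked proof) that reproduces the roughly 2 changed bits per increment:
def average_bit_flips(k):
    # Average number of bits that change per increment of a binary counter,
    # measured over 2^k increments starting from zero.
    total, counter = 0, 0
    for _ in range(1 << k):
        total += bin(counter ^ (counter + 1)).count("1")   # bits flipped by this +1
        counter += 1
    return total / (1 << k)

print(average_bit_flips(16))   # ~2.0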

Resources