Understanding the time complexities of the R-Tree?

A quick search on Wikipedia reveals that the R-Tree's worst case performance for a search is undefined and the average case is O(log_M n).
I suppose the worst case is this way because we can't know how many subtrees a search has to visit in this structure before we find the item. Indeed, Guttman does say that "more than one subtree under a node visited may need to be searched, hence it is not possible to guarantee good worst-case performance." Can we express the worst case in terms of the number of searches that have to be performed?
Regarding the average case, I do not understand how this is calculated. And what about the best case?

I'd say worst case is O(n + log_M n): Imagine you store lots of overlapping rectangles in the R-Tree. Now store a single small rectangle that is located in the area where all other rectangles overlap. A query for that rectangle will have to traverse all subtrees: nodes -> O(log_M n) and entries -> O(n).
Best case is O(log_M n) nodes visited. An R-Tree has the same depth in every branch, and data is only stored in leaf nodes, so you will always have to traverse O(log_M n) nodes. If you also count scanning up to M entries in each of those nodes, it should be O(M * log_M n).
I'm not sure you can really calculate an average of O(log_M n). But if you have some average normally distributed data (whatever that means) with few overlaps (whatever few means), then your average query (whatever average is) should not have to traverse more than a few (1 or 2?) subtrees. I'd actually say the average is O(M * log_M n), because of the traversal of M entries in a node.
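To make the overlap argument concrete, here is a minimal sketch in Python. It is not Guttman's insertion algorithm: the `build` helper is a hypothetical bottom-up packer that groups rectangles M at a time in the order given, and `visited` just counts how many nodes and data rectangles a query touches. All names are made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    box: tuple                                     # (xmin, ymin, xmax, ymax)
    children: list = field(default_factory=list)   # empty for a data rectangle

def union(boxes):
    """Smallest box that encloses all the given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def build(rects, M):
    """Pack rectangles bottom-up, M per node, in the given order (an assumption)."""
    level = [Node(r) for r in rects]
    while len(level) > 1:
        groups = [level[i:i + M] for i in range(0, len(level), M)]
        level = [Node(union([c.box for c in g]), g) for g in groups]
    return level[0]

def visited(node, query):
    """Count nodes and data rectangles whose box intersects the query."""
    if not intersects(node.box, query):
        return 0
    return 1 + sum(visited(c, query) for c in node.children)

n, M = 64, 4
disjoint = build([(2 * i, 0, 2 * i + 1, 1) for i in range(n)], M)
overlapping = build([(0, 0, 10 + i, 10 + i) for i in range(n)], M)

few = visited(disjoint, (10.2, 0.2, 10.8, 0.8))   # follows one root-to-leaf path
many = visited(overlapping, (1, 1, 2, 2))         # touches every node and rectangle
print(few, many)
```

With disjoint rectangles the query follows a single path, so the count grows like log_M n; with fully overlapping rectangles it touches everything, so the count grows like n.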

Related

How searching a million keys organized as B-tree will need 114 comparisons?

Please explain how it will take 114 comparisons. The following is the screenshot taken from my book (Page 350, Data Structures Using C, 2nd Ed. Reema Thareja, Oxford Univ. Press). My reasoning is that in the worst case each node will have just the minimum number of children (i.e. 5), so I took log base 5 of a million, and it comes to 9. So assuming that at each level of the tree we search the minimum number of keys (i.e. 4), it comes to something like 36 comparisons, nowhere near 114.
Consider a situation in which we have to search an un-indexed and
unsorted database that contains n key values. The worst case running
time to perform this operation would be O(n). In contrast, if the data
in the database is indexed with a B tree, the same search operation
will run in O(log n). For example, searching for a single key on a set
of one million keys will at most require 1,000,000 comparisons. But if
the same data is indexed with a B tree of order 10, then only 114
comparisons will be required in the worst case.
Page 350, Data Structures Using C, 2nd Ed. Reema Thareja, Oxford Univ. Press
The worst case tree has the minimum number of keys everywhere except on the path you're searching.
If the size of each internal node is in [5,10), then in the worst case, a tree with a million items will be about 10 levels deep, when most nodes have 5 keys.
The worst case path to a node, however, might have 10 keys in each node. The statement seems to assume that you'll do a linear search instead of a binary search inside each node (I would advise to do a binary search instead), so that can lead to around 10*10 = 100 comparisons.
If you carefully consider the details, the real number might very well come out to 114.
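As a rough check of that reasoning, here is a small Python sketch. The parameters are assumptions: an order-10 B-tree is taken to mean at most 10 children (so at most 9 keys) and at least 5 children per internal node, and each node is scanned linearly. It does not reproduce the book's 114; it only shows that the estimate lands in the same range.

```python
def worst_case_estimate(n_keys, min_children, max_keys_per_node):
    """Return (levels, comparisons) for a rough upper-bound estimate."""
    levels, reachable = 0, 1
    while reachable < n_keys:        # integer loop avoids floating-point log
        reachable *= min_children    # sparsest tree: minimum fan-out everywhere
        levels += 1
    # Pessimistic: every node on the search path is full and scanned linearly.
    return levels, levels * max_keys_per_node

levels, comparisons = worst_case_estimate(1_000_000, min_children=5,
                                          max_keys_per_node=9)
print(levels, comparisons)
```

Swapping the linear scan for a binary search inside each node would replace the per-node factor with roughly log2 of the number of keys per node.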
(This is not an Answer to the question asked, but a related discussion.)
Sounds like a textbook question, not a real-life question.
Counting comparisons is likely to be the best way to judge an in-memory tree, but not for a disk-based dataset.
Even so, the "average" number of comparisons (for in-memory) or disk hits (for disk-based) is likely to be the metric to compute.
(Sure, it is good to compute the maximum numbers as a useful exercise for understanding the structures.)
Perhaps the optimal "tree" for in memory searching is a Binary tree, but with 3-way fan out. And keep the tree balanced with 2 or 3 elements in each node.
For disk-based searching -- think databases -- the optimal is likely to be a BTree with the size of a block being based on what is efficient to read from disk. Counting comparisons is a poor second when it comes to the overall time taken to fetch a row.

Why is the size of an AVL tree O(n)?

Every operation on an AVL tree is O(log n) because it is a balanced tree, and its height is O(log n) as well. So how come the size of the AVL tree itself is O(n)? Can someone explain that to me? I know that you have to calculate left subtree + 1 (for the root) + right subtree to get the size of the whole tree. However, the operation to get, for example, the size of the right subtree is log(n), and log n + log n + 1 doesn't equal O(n).
When we talk about time complexity or space complexity, we mean the rate at which the time or space requirements change with respect to the size of input. Eg. when we say O(1), we mean regardless of the size of input, the time (in case of time complexity) or space (in case of space complexity) is constant. So O(1) does not mean 1 second or 1 minute. It just means constant with respect to input size. If you plot the execution time against different input sizes, you'd get a horizontal line. Similar is the case for O(n) or O(log n).
Now with this understanding, let's talk about AVL tree. AVL tree is a balanced binary search tree. Therefore the average time complexity to search for a node in the tree is O(log n). Note that to search a node, you don't visit every single node of the tree (unlike a LinkedList). If you had to visit every single node, you'd have said the time complexity is O(n). In case of AVL tree, every time you find a mismatch, you discard one half of the tree and move on to search in the remaining half.
In the worst case you'd make one comparison at each level of the tree, i.e. a number equal to the height of the tree, so the search time complexity is O(log n). The size of the left subtree is not O(log n).
Talking about size, you do need space to store each node. If you have to store 1 node, you'd need 1 unit of space; for 2 nodes, 2 units; for 3 nodes, 3 units; and so on. This unit could be anything: 10 bytes, 1 KB, 5 KB. The point is that if you plot the space requirement in computer memory against the number of nodes, all you get is a straight line starting at zero. That's O(n).
To further clarify, while computing the time or space complexity of an algorithm, if the complexity comes out as O(1 + log n + 4n + 2^n + 100), we call it O(2^n), i.e. we keep the fastest-growing term, because we are not calculating the absolute value; we are calculating the rate of change with respect to the size of the input, and so the fastest-growing term is what matters.
If you talk about the time complexity of the algorithm to calculate the size of the tree, you need to visit every node in the tree. Since the total number of nodes is n, it will be O(n).
To calculate the size of a tree you have to traverse each node in the tree once. Hence, if there are n nodes in the tree, traversing each node once leads to a time complexity of O(n).
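The difference between height and size is easy to see in a short Python sketch. This is a plain balanced binary search tree built from sorted keys, not a full AVL implementation with rotations; the `Node`, `build_balanced`, `size` and `height` names are made up for illustration.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def build_balanced(keys):
    """Build a height-balanced BST from an already sorted list of keys."""
    if not keys:
        return None
    mid = len(keys) // 2
    return Node(keys[mid], build_balanced(keys[:mid]), build_balanced(keys[mid + 1:]))

def size(node):
    """Visits every node exactly once, so it takes O(n) time."""
    return 0 if node is None else size(node.left) + 1 + size(node.right)

def height(node):
    """Number of levels; O(log n) levels for a balanced tree."""
    return 0 if node is None else 1 + max(height(node.left), height(node.right))

root = build_balanced(list(range(1023)))
print(size(root), height(root))
```

The recursion in `size` does follow the "left + 1 + right" formula from the question, but each of the two recursive calls has to count its whole subtree, which is why the total work is proportional to n and not to the height.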

Sorting algorithm vs. Simple iterations

I'm just getting started in algorithms and sorting, so bear with me...
Let's say I have an array of 50000 integers.
I need to select the smallest 30000 of them.
I thought of two methods :
1. I iterate the entire array and find each smallest integer
2. I first sort the entire array , and then simply select the first 30000.
Can anyone tell me what's the difference, which method would be faster, and why?
What if the array was smaller or bigger? Would the answer change?
Option 1 sounds like the naive solution. It would involve passing through the array to find the smallest item 30000 times. Each time it finds the smallest, presumably it would swap that item to the beginning or end of the array. In basic terms, this is O(n^2) complexity.
The actual number of operations involved would be less than n^2 because n reduces every time. So you would have roughly 50000 + 49999 + 49998 + ... + 20001, which amounts to just over 1 billion (1000 million) iterations.
Option 2 would employ an algorithm like quicksort or similar, which is commonly O(n.logn).
Here it's harder to provide actual figures, because some efficient sorting algorithms can have a worst-case of O(n^2). But let's say you use a well-behaved one that is guaranteed to be O(n.logn). This would amount to 50000 * 15.61 which is about 780 thousand.
So it's clear that Option 2 wins in this case.
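The two figures above can be reproduced with a few lines of Python. This only counts idealized "operations" under the same assumptions as the text (one operation per element scanned for Option 1, n.log2(n) for Option 2); it says nothing about real running time.

```python
import math

n, k = 50_000, 30_000

# Option 1: scan 50000 items, then 49999, ... down to 20001 (k passes in total).
option1 = sum(range(n - k + 1, n + 1))

# Option 2: idealized n * log2(n) comparison sort.
option2 = n * math.log2(n)

print(option1, round(option2))
```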
What if the array was smaller or bigger? Would the answer change?
Unless the array became trivially small, the answer would still be Option 2. And the larger your array becomes, the more beneficial Option 2 becomes. This is the nature of time complexity. O(n^2) grows much faster than O(n.logn).
A better question to ask is "what if I want fewer smallest values, and when does Option 1 become preferable?". Although the answer is slightly more complex because of numerous factors (such as what constitutes "one operation" in Option 1 vs Option 2, plus other issues like memory access patterns etc), you can get the simple answer directly from time complexity. Selecting k values with Option 1 costs about k.n operations against n.logn for Option 2, so Option 1 becomes preferable when the number of smallest values to select drops below logn. In the case of a 50000-element array, that means if you want to select 15 or fewer smallest elements, then Option 1 wins.
Now, consider an Option 3, where you transform the array into a min-heap. Building a heap is O(n), and removing one item from it is O(logn). You are going to remove 30000 items. So you have the cost of building plus the cost of removal: 50000 + 30000 * 15.6 = approximately 520 thousand. And this is ignoring the fact that n gets smaller every time you remove an element. It's still O(n.logn), like Option 2 but it is probably faster: you've saved time by not bothering to sort the elements you don't care about.
I should mention that in all three cases, the result would be the smallest 30000 values in sorted order. There may be other solutions that would give you these values in no particular order.
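Option 3 could look something like this in Python, using the standard `heapq` module. The function name `smallest_k_heap` is just for illustration.

```python
import heapq
import random

def smallest_k_heap(values, k):
    """Return the k smallest values in ascending order using a min-heap."""
    heap = list(values)       # copy so the caller's list is left untouched
    heapq.heapify(heap)       # O(n) heap construction
    return [heapq.heappop(heap) for _ in range(k)]   # k pops, O(log n) each

data = [random.randrange(1_000_000) for _ in range(50_000)]
result = smallest_k_heap(data, 30_000)
print(result[:5])
```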
30k is close to 50k. Just sort the array and get the smallest 30k e.g., in Python: sorted(a)[:30000]. It is O(n * log n) operation.
If you needed to find the 100 smallest items instead (100 << 50k), then a heap might be more suitable, e.g., in Python: heapq.nsmallest(100, a). It is O(n * log k).
If the range of integers is limited, you could consider O(n) sorting methods such as counting sort and radix sort.
The simple iterative method is O(n**2) (quadratic) here. Even for a moderate n of around a million, it leads to ~10**12 operations, which is much worse than ~10**6 for a linear algorithm.
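For the limited-range case mentioned above, a counting-sort style selection might look like this. It is a sketch that assumes all values are non-negative integers no larger than a known `max_value`; the function name is made up.

```python
import random

def smallest_k_counting(values, k, max_value):
    """k smallest values in ascending order; O(n + max_value) time.

    Assumes every value is an integer in the range 0..max_value.
    """
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1
    out = []
    for v, c in enumerate(counts):
        take = min(c, k - len(out))   # never take more than we still need
        out.extend([v] * take)
        if len(out) == k:
            break
    return out

a = [random.randrange(1000) for _ in range(50_000)]
smallest = smallest_k_counting(a, 30_000, max_value=999)
print(len(smallest))
```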
For nearly all practical purposes, sorting and taking the first 30,000 is likely to be best. In most languages, this is one or two lines of code. Hard to get wrong.
If you have a truly demanding application or are just out to fiddle, you can use a selection algorithm to find the 30,000th smallest number. Then one more pass through the array will find the 29,999 that are no bigger.
There are several well known selection algorithms that require only O(n) comparisons and some that are sub-linear for data with specific properties.
The fastest in practice is QuickSelect, which - as its name implies - works roughly like a partial QuickSort. Unfortunately, if the data happens to be very badly ordered, QuickSelect can require O(n^2) time (just as QuickSort can). There are various tricks for selecting pivots that make it virtually impossible to hit the worst-case run time.
QuickSelect will finish with the array reordered so the smallest 30,000 elements are in the first part (unsorted) followed by the rest.
Because standard selection algorithms are comparison-based, they'll work on any kind of comparable data, not just integers.
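A minimal QuickSelect sketch in Python, assuming a random pivot and a Lomuto-style partition (one of several possible choices). The function name `partition_smallest` is made up; it reorders the list in place so the k smallest values end up in front, unsorted.

```python
import random

def partition_smallest(a, k):
    """Reorder list a in place so that a[:k] holds its k smallest values."""
    if not 0 < k < len(a):
        return                          # nothing to rearrange
    lo, hi, target = 0, len(a) - 1, k - 1
    while lo < hi:
        p = random.randint(lo, hi)      # random pivot makes the O(n^2) case unlikely
        a[p], a[hi] = a[hi], a[p]
        pivot, store = a[hi], lo
        for i in range(lo, hi):         # Lomuto partition of a[lo..hi]
            if a[i] < pivot:
                a[i], a[store] = a[store], a[i]
                store += 1
        a[store], a[hi] = a[hi], a[store]   # pivot lands at its final index
        if store == target:
            return
        if store < target:
            lo = store + 1              # k-th smallest lies to the right
        else:
            hi = store - 1              # k-th smallest lies to the left

values = random.sample(range(1_000_000), 50_000)
partition_smallest(values, 30_000)
print(max(values[:30_000]) <= min(values[30_000:]))
```

With many duplicate values this simple partition degrades, so a production version would want a three-way partition or a library routine instead.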
You can do this in potentially O(N) time with radix sort or counting sort, given that your input is integers.
Another method is to get the 30000th smallest integer by quickselect and simply iterate through the original array. This takes Θ(N) expected time, but quickselect is O(N^2) in the worst case.

Cache oblivious lookahead array

I am trying to understand the simplified cache-oblivious lookahead array which is described here, and on page 35 of this presentation:
Analysis of Insertion into Simplified Fractal Tree:
1. Cost to merge 2 arrays of size X is O(X/B) block I/Os. Merge is very I/O efficient.
2. Cost per element to merge is O(1/B) since O(X) elements were merged.
3. Max # of times each element is merged is O(logN).
4. Average insert cost is O(logN/B)
I can understand #1, #2 and #3, but I can't understand #4. From the paper, a merge can be thought of as a binary addition carry. For example, 31 in binary is:
11111
When inserting a new item (adding 1), there should be 5 = log(32) merges (5 carries). But in this situation, we have to merge 32 elements! In addition, if we add 1 each time, how many carries will be performed going from 0 to 2^k? The answer should be 2^k - 1. In other words, one merge per insertion!
So how is #4 computed?
While you are right on both that the number of merged elements (and so transfers) is N in worst case and that the number of total merges is also of the same order, the average insertion cost is still logarithmic. It comes from two facts: merges vary in cost, and the number of low-cost merges is much higher than the number of high-cost ones.
It might be easier to see by example.
Let's set B=1 (i.e. 1 element per block, worst case of each merge having a cost) and N=32 (e.g. we insert 32 elements into an initially empty array).
Half of the insertions (16) put an element into the empty subarray of size 1, and so do not cause a merge. Of the remaining insertions, one (the last) needs to merge (move) 32 elements, one (16th) moves 16, two (8th and 24th) move 8 elements, four move 4 elements, and eight move 2 elements. Thus, overall number of element moves is 96, giving the average of 3 moves per insertion.
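The count of 96 moves can be checked with a small simulation. This Python sketch only tracks which levels (of size 1, 2, 4, ...) are occupied, exactly like a binary counter, and charges 2**i element moves whenever a merge lands in level i > 0. It assumes B = 1 as in the example; the function name is made up.

```python
def total_moves(n_inserts):
    """Total element moves for n_inserts insertions into an empty structure."""
    full = []      # full[i] is True when the level of size 2**i is occupied
    moves = 0
    for _ in range(n_inserts):
        i = 0
        while i < len(full) and full[i]:   # carry: empty each full level on the way
            full[i] = False
            i += 1
        if i == len(full):
            full.append(False)
        full[i] = True
        if i > 0:                          # a merge wrote 2**i elements into level i
            moves += 2 ** i
    return moves

N = 32
print(total_moves(N), total_moves(N) / N)
```

Doubling N adds roughly one more move per insertion on average, which is the logarithmic growth the answer describes.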
Hope that helps.
The first log B levels fit in (a single page of) memory, and so any stuff that happens in those levels does not incur an I/O. (This also fixes the problem with rrenaud's analysis that there's O(1) merges per insertion, since you only start paying for them after the first log B merges.)
Once you are merging at least B elements, then Fact 2 kicks in.
Consider the work from an element's point of view. It gets merged O(log N) times. It gets charged O(1/B) each time that happens. Its total cost of insertion is O((log N)/B) (need the extra parens to differentiate from O(log N/B), which would be quite bad insertion performance -- even worse than a B-tree).
The "average" cost is really the amortized cost -- it's the amount you charge to that element for its insertion. A little more formally, it's the total work for inserting N elements, divided by N. An amortized cost of O((log N)/B) really means that inserting N elements is O((N log N)/B) I/Os -- for the whole sequence. This compares quite favorably with B-trees, which for N insertions do a total of O((N log N)/log B) I/Os. Dividing by B is obviously a whole lot better than dividing by log B.
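To get a feel for "dividing by B versus dividing by log B", here is a tiny Python sketch. It plugs numbers into the two asymptotic expressions with all constant factors ignored, so the outputs are only indicative; N = 2**30 and B = 4096 are arbitrary values chosen for illustration.

```python
import math

def cola_total_ios(n, b):
    """(N log N) / B, constants ignored."""
    return n * math.log2(n) / b

def btree_total_ios(n, b):
    """(N log N) / log B, constants ignored."""
    return n * math.log2(n) / math.log2(b)

N, B = 2 ** 30, 4096
per_insert = math.log2(N) / B                            # amortized cost, well below 1 I/O
ratio = btree_total_ios(N, B) / cola_total_ios(N, B)     # equals B / log2(B)
print(per_insert, ratio)
```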
You may complain that the work is lumpy, that you sometimes do an insertion that causes a big cascade of merges. That's ok. You don't charge all the merges to the last insertion. Everyone is paying its own small amount for each merge they participate in. Since (log N)/B will typically be much less than 1, everyone is being charged way less than a single I/O over the course of all of the merges it participates in.
What happens if you don't like amortized analysis, and you say that even though the insertion throughput goes up by a couple of orders of magnitude, you don't like it when a single insertion can cause a huge amount of work? Aha! There are standard ways to deamortize such a data structure, where you do a bit of preemptive merging during each insertion. You get the same I/O complexity (you'll have to take my word for it), but it's pretty standard stuff for people who care about amortized analysis and deamortizing the result.
Full disclosure: I'm one of the authors of the COLA paper. Also, rrenaud was in my algorithms class. Also, I'm a founder of Tokutek.
In general, the amortized number of changed bits per increment is 2 = O(1).
Here is a proof by logic/reasoning. http://www.cs.princeton.edu/courses/archive/spr11/cos423/Lectures/Binary%20Counting.pdf
Here is a "proof" by experimentation. http://codepad.org/0gWKC3rW
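In case the links above stop working, a similar experiment is easy to write in Python. It counts how many bits change on each increment of a counter going from 0 to 2**k; the function name is made up.

```python
def total_bit_flips(k):
    """Total number of bits flipped while counting from 0 up to 2**k."""
    flips = 0
    for i in range(2 ** k):
        flips += bin(i ^ (i + 1)).count("1")   # bits that differ between i and i + 1
    return flips

k = 10
flips = total_bit_flips(k)
print(flips, flips / 2 ** k)
```

The total is 2**(k+1) - 1, so the average per increment stays just under 2 no matter how large k gets.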

Exhaustive searches vs sorting followed by binary search

This is a direct quote from the textbook, Invitation to Computer Science by G. Michael Schneider and Judith L. Gersting.
At the end of Section 3.4.2, we talked about the tradeoff between using sequential search on an unsorted list as opposed to sorting the list and then using binary search. If the list size is n=100,000 about how many worst-case searches must be done before the second alternative is better in terms of number of comparisons?
I don't really get what the question is asking for.
Sequential search is of order (n) and binary search is of order (lg n), and lg n will always be less than n. And in this case n is already given, so what am I supposed to find?
This is one of my homework assignment but I don't really know what to do. Could anyone explain the question in plain English for me?
and binary is of order (lgn) which in any case lgn will always be less than n
This is where you're wrong. In the assignment, you're asked to consider the cost of sorting the array too.
Obviously, if you need only one search, first approach is better than sorting array and doing binary search: n < n*logn + logn. And you're asked, how many searches you need for second approach to become more effective.
End of hint.
The question is how to decide which approach to choose - to just use linear search or to sort and then use binary search.
If you only search a couple of times linear search is better - it is O(n), while sorting is already O(n*logn). If you search very often on the same collection sorting is better - searching multiple times can become O(n*n) but sorting and then searching with binary search is again O(n*logn) + NumberOfSearches*O(logn) which can be less or more than using linear search depending on how NumberOfSearches and n relate.
The task is to determine the exact value of NumberOfSearches (not the exact number, but a function of n) which will make one of the options preferable:
NumberOfSearches * O(n) <> O(n*logn) + NumberOfSearches * O(logn)
don't forget that each O() can have a different constant value.
The order of the methods is not important here. It tells you something about how well algorithms scale when the problem becomes bigger and bigger. You can't do any exact calculations if you only know O(n), which just says that the complexity grows linearly in the size of the problem. It won't give you any numbers.
This can well mean that an algorithm with O(n) complexity is faster than a O(logn) algorithm, for some n. Because O(log(n)) scales better when it gets larger, we know for sure, there is a n (a problem size) where the algorithm with O(logn) complexity is faster. We just don't know when (for what n).
In plain English:
If you want to know 'how many searches', you need exact equations to solve, you need exact numbers. How many comparisons does it take to search sequential? (Remember n is given, so you can give a number.) How many comparisons (in the worst case!) does it take to search with a binary search? Before you can do a binary search, you have to sort. Let's add the number of comparisons needed to sort to the cost of binary search. Now compare the two numbers, which one is less?
The binary search is fast, but the sorting is slow. The sequential search is slower than binary search, but faster than sorting. However the sorting needs to be done only once, no matter how many times you search. So, when does one heavy sort outweigh having to do a slow (sequential) search every time?
Good luck!
For sequential search, the worst case is n = 100000 comparisons per search, so p searches require p × 100000 comparisons.
Using a Θ(n^2) sorting algorithm would require 100000 × 100000 comparisons.
Binary search would require 1 + floor(log2 n) = 1 + floor(log2 100000) = 17 comparisons for each search,
so together there would be 100000 × 100000 + 17p comparisons.
The first expression is larger than the second when
100000p > 100000^2 + 17p
that is, for p > 100017.
The question is about estimating the number NUM_SEARCHES needed to compensate for the cost of sorting. So we'll have:
time( NUM_SEARCHES * O(n) ) > time( NUM_SEARCHES * O(log(n)) + O(n* log(n)) )
Thank you guys. I think I get the point now. Could you take a look at my answer and see whether I'm on the right track.
For worst case searches
Number of comparison for sequential search is n = 100,000.
Number of comparison for binary search is lg(n) = 17.
Number of comparison for sorting is (n-1)/2 * n = (99999)(50000).
(I'm following my textbook and used the selection sort algorithm covered in my class)
So let p be the number of worst-case searches; then 100,000p > (99999)(50000) + 17p,
or p > 50008.
In conclusion, I need just over 50,008 worst-case searches to make sorting and then using binary search better than sequential search for a list of n=100,000.
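The break-even point for the different sorting costs used in this thread can be computed directly. This Python sketch finds the smallest whole number of searches p for which n*p > sort_cost + b*p, where b = 17 is the worst-case number of binary-search comparisons; the n*log2(n) variant is an extra assumption added for comparison, and the function name is made up.

```python
import math

def break_even_searches(n, sort_cost):
    """Smallest integer p with n*p > sort_cost + b*p (strict inequality)."""
    b = n.bit_length()                 # 1 + floor(log2 n) binary-search comparisons
    return math.floor(sort_cost / (n - b)) + 1

n = 100_000
with_n_squared = break_even_searches(n, n * n)                   # Θ(n^2) sort
with_selection_sort = break_even_searches(n, n * (n - 1) // 2)   # n(n-1)/2 comparisons
with_n_log_n = break_even_searches(n, n * math.log2(n))          # assumed n*log2(n) sort
print(with_n_squared, with_selection_sort, with_n_log_n)
```

Because the inequality is strict, the selection-sort case first holds at one search more than the bound p > 50008.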
