I have a question about the R-Tree data structure. What is fan-out in an R-Tree? Is it the maximum number of entries?
How can we determine the minimum and maximum number of entries in an R-Tree? Let's say I have 10,000 points and my page size is 8 KiB.
Thanks
Fan-out, in any tree, is the number of pointers to child nodes in a node.
Different trees have different fan-out.
A binary tree has fanout 2.
A B-tree has a fan-out B, with all nodes except the root and the leaves having between B/2 and B children. External (on-disk) implementations often relax the minimum-number-of-children restriction to save some updates.
In databases, B-trees, or their variant B+-trees, are often used so that each node has the size of one page, with the fan-out determined by the number of sort keys and pointers that fit in that space.
An R-tree is a search tree whose indices are multi-dimensional intervals. These may possibly overlap. It may have any fan-out. A usual choice is 2 to the power of the number of dimensions (so 4 for 2-dimensional, 8 for 3-dimensional, etc.), but it may have a higher fan-out too, and organizing it similarly to a B-tree is certainly possible.
How can we determine the minimum and maximum number of entries in R-Tree? Let say if I have 10000 points and my page size is 8KiB.
The size of the tree node does not have to match the page size. If it does (as is usual for external, i.e. on-disk, implementations), you still need to know how large the sort key is and how large the pointer is. An R-tree needs two coordinate values, minimum and maximum, per dimension. So a 2-dimensional R-tree with double-precision coordinates (the common case in mapping applications) will have four 64-bit values describing the rectangle plus a child pointer, for which an external implementation probably wants to use 64 bits as well. That is 40 B per child, so you can squeeze about 204 of these into an 8 KiB page. The number of points does not matter; the dimension and precision of the coordinate system do.
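A minimal sketch of that arithmetic in C, assuming the layout above (2-D double-precision rectangles, 64-bit child pointers, 8 KiB pages) and ignoring any node-header overhead:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    const size_t page_size = 8 * 1024;          /* 8 KiB page            */
    const int    dims      = 2;                 /* 2-D rectangles        */
    const size_t coord     = sizeof(double);    /* 8 B per coordinate    */
    const size_t ptr       = sizeof(uint64_t);  /* 8 B child pointer     */

    size_t entry = 2 * dims * coord + ptr;      /* 2*2*8 + 8 = 40 B      */
    printf("bytes per entry: %zu\n", entry);
    printf("max fan-out per %zu B page: %zu\n", page_size, page_size / entry);
    /* prints 40 and 204; the minimum is then usually set to some  */
    /* fraction of the maximum, e.g. 40-50%                        */
    return 0;
}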
In memory, trees with a low fan-out are more efficient: although they are deeper, they need fewer comparisons per search. On disk (in databases), however, the slow operation is reading, and since reads can only be done in blocks, it is faster to reduce the number of nodes by having each node fill a whole block, with a correspondingly higher fan-out.
"Fanout" refers to the number of pointers per node which R-Tree is having
This is an interview question.
There are billions and billions of stars in the universe. Which data structure would you use to answer the query
"Give me the k stars nearest to Earth".
I thought of heaps. We can heapify in O(n) and extract each minimum in O(log n). Is there a better data structure suited for this purpose?
Assuming the input could not all be stored in memory at the same time (that would be a challenge!), but would instead be a stream of the stars in the universe -- like you would get an iterator or something -- you could benefit from using a Max Heap (instead of the Min Heap which might come to mind first).
At the start you would just push the stars in the heap, keyed by their distance to earth, until your heap has k entries.
From then on, you ignore any new star when it has a greater distance than the root of your heap. When it is closer than the root-star, substitute the root with that new star and sift it down to restore the heap property.
Your heap will not grow greater than k elements, and at all times it will consist of the k closest stars among those you have processed.
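For illustration, here is a minimal C sketch of that bounded max-heap (the kheap type, offer() and the toy distance stream are made up for the example; real code would carry a star id or record alongside each distance):

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    double *dist;   /* heap array of distances; dist[0] is the largest kept */
    size_t  size;   /* current number of entries                            */
    size_t  k;      /* capacity: how many nearest stars we keep             */
} kheap;

static void sift_down(kheap *h, size_t i) {
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < h->size && h->dist[l] > h->dist[m]) m = l;
        if (r < h->size && h->dist[r] > h->dist[m]) m = r;
        if (m == i) return;
        double t = h->dist[i]; h->dist[i] = h->dist[m]; h->dist[m] = t;
        i = m;
    }
}

static void sift_up(kheap *h, size_t i) {
    while (i > 0) {
        size_t p = (i - 1) / 2;
        if (h->dist[p] >= h->dist[i]) return;
        double t = h->dist[i]; h->dist[i] = h->dist[p]; h->dist[p] = t;
        i = p;
    }
}

/* Offer one star; keeps only the k smallest distances seen so far. */
static void offer(kheap *h, double d) {
    if (h->size < h->k) {                 /* heap not full yet: just push  */
        h->dist[h->size++] = d;
        sift_up(h, h->size - 1);
    } else if (d < h->dist[0]) {          /* closer than the current worst */
        h->dist[0] = d;                   /* replace the root ...          */
        sift_down(h, 0);                  /* ... and restore heap property */
    }                                     /* otherwise: ignore the star    */
}

int main(void) {
    double stream[] = { 9.3, 4.2, 7.7, 1.5, 8.8, 0.4, 6.1, 3.3 };
    kheap h = { malloc(3 * sizeof(double)), 0, 3 };
    for (size_t i = 0; i < sizeof stream / sizeof *stream; i++)
        offer(&h, stream[i]);
    for (size_t i = 0; i < h.size; i++)   /* printed unordered; popping the */
        printf("%.1f ", h.dist[i]);       /* root repeatedly sorts them     */
    printf("\n");                         /* keeps 0.4 1.5 3.3              */
    free(h.dist);
    return 0;
}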
Some remarks:
Since it is a Max Heap, you don't know which is the closest star (in constant time). If you stop the algorithm and then pull out the root node one after the other, you get the k closest stars in order of descending distance.
As the observable(!) universe has an estimated 10^21 stars, you would need one of the best supercomputers (1 exaFLOPS) to hope to process all those stars in a reasonable time. But at least this algorithm only needs to keep k stars in memory.
The first problem you're going to run into is scale. There are somewhere between 100 billion and 400 billion stars in the Milky Way galaxy alone. There are an estimated 10 billion galaxies. If we assume an average of 100 billion stars per galaxy, that's 10^21 stars in the universe. It's unlikely you'll have the memory for that. And even if you did have enough memory, you probably don't have the time. Assuming your heapify operation could do a billion iterations per second, it would take a trillion seconds (about 31,700 years). And then you have to add the time it would take to remove the k smallest from the heap.
It's unlikely that you could get a significant improvement by using multiple threads or processes to build the heap.
The key here will be to pre-process the data and store it in a form that lets you quickly eliminate the majority of possibilities. The easiest way would be to have a sorted list of stars, ordered by their distance from Earth. So Sol would be at the top of the list, Proxima Centauri would be next, etc. Then, getting the nearest k stars would be an O(k) operation: just read the top k items from the list.
A sorted list would be pretty hard to update, though. A better alternative would be a k-d tree. It's easier to update, and getting the k nearest neighbors is still reasonably quick.
I have arrays of 1024 bytes (8192 bits) which are mostly zero.
Between 0.01% and 10% of bits will be set (random, no pattern).
How could these be compressed, given the lack of structure and the relatively small size?
(My first thought was to store the distances between set bits. I need 13 bits for each distance, but at worst-case 10% occupancy this needs 13 * 819 / 8 ≈ 1331 bytes, which is not an improvement.)
This is for ultra-low bandwidth comms, so every byte matters.
I've dealt deeply with a similar problem, but my sets are much bigger (30 million possible values with between 1 and 30 million elements in each set), so they both gain much more from compression and the compression metadata is insignificant compared to the size of the data. I have never gone down to squeezing things into units smaller than uint16_t, so the things I write below might not apply if you start chopping up 13 bit values into pieces. It feels like it should work, but caveat emptor.
What I've found works is to employ several strategies that depend on the particular data we have. The good news is that the count of elements in each set is a very good indicator of which compression strategy will work best for a particular set. So all the metadata you need is a count of elements in the set. In my data format the first and only metadata value (I'll be unspecific and just call it "value", you can squeeze things in bytes, 16 bit values or 13 bit values however you feel) is the count of elements in the set, the rest is just the encoding of the set elements.
The strategies are:
If very few elements are in the set, you can't do better than an array that says "1, 4711, 8140", so in this case the data is encoded as: [3, 1, 4711, 8140]
If almost all elements are in the set, you can just keep track of elements that aren't. For example [8190, 17, 42].
If around half of the elements are in the set, you pretty much can't do better than a bitmap, so you get [4000, {bitmap}]; this is the only case where your data ends up longer than the strictly uncompressed form.
If more than "a few" but many fewer than "around half" elements are set, I found another strategy. Divide the bits of your possible values in the set in half. Let's say we have 2^16 (it's easier to describe, it should probably work for 2^13) possible values. The values are divided into 256 ranges with each range with 256 possible values. We then have an array with 256 bytes, each of these bytes describes how many values are in each range (so byte 0 tells us how many elements are [0,255], byte 1 gives us [256,511], etc.) immediately after follow arrays with the values in each range mod 256. The trick here is that while every element in the set encoded as an array (strategy 1) would be 2 bytes, in this scheme each element is only 1 bytes + 256 static bytes for the counts of elements. This means that as soon as we have more than 256 elements in the set this saves us space by switching from strategy 1 to 4.
Strategy 4 can be refined (probably meaningless if your data is random, as you mention, but my data sometimes had more patterns, so it worked for me). Since we still need 8 bits per element in the previous encoding, as soon as a sub-array goes over 32 elements (32 bytes, the size of a 256-bit range bitmap), we can store that range as a bitmap instead. This is also a good breakpoint for switching from strategy 4/5 to strategy 3: if all the sub-arrays in this strategy end up as bitmaps, we should just use strategy 3 (it's more complicated than that, but the breakpoints between strategies can be precomputed accurately enough that you'll end up picking the most efficient strategy almost every time).
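To make the trade-offs concrete, here is a rough size model in C for the 8192-bit case from the question. The 2-byte count header, the 16-bit element encoding and the 32 ranges of 256 values are illustrative assumptions, not part of the original scheme; note that with only 32 ranges the breakpoint for switching from strategy 1 to the two-level scheme drops to about 32 elements rather than 256:

#include <stdio.h>
#include <stddef.h>

enum { UNIVERSE = 8192, RANGES = UNIVERSE / 256 };   /* 32 ranges of 256 */

/* All sizes in bytes, assuming a 2-byte element-count header. */
static size_t size_array_present(size_t n) { return 2 + 2 * n; }
static size_t size_array_absent (size_t n) { return 2 + 2 * (UNIVERSE - n); }
static size_t size_bitmap       (size_t n) { (void)n; return 2 + UNIVERSE / 8; }
static size_t size_two_level    (size_t n) { return 2 + RANGES + n; }
/* two_level: one count byte per 256-value range + one low byte per element */

int main(void) {
    size_t counts[] = { 3, 80, 800, 4000, 8100 };
    for (size_t i = 0; i < sizeof counts / sizeof *counts; i++) {
        size_t n = counts[i];
        printf("n=%4zu  present=%5zu absent=%5zu bitmap=%4zu two-level=%4zu\n",
               n, size_array_present(n), size_array_absent(n),
               size_bitmap(n), size_two_level(n));
    }
    return 0;
}

Running this shows each strategy winning in its own regime: the present-element array for tiny sets, the two-level scheme for small-to-medium sets, the bitmap around half occupancy, and the absent-element array for nearly full sets.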
I have only vaguely tried saving deltas between the numbers in the set. Quick experiments showed that they weren't really much more efficient than the strategies above and had unpredictable degenerate cases; most importantly, the application I work with really prefers not having to deserialise its data, just using it raw straight from disk (mmap).
The raw data can be described as a fixed number of columns (on the order of a few thousand) and a large (on the order of billions) and variable number of rows. Each cell is a bit. The desired query would be "find all rows where bits 12, 329, 2912 and 3020 are set", i.e. something like:
for (i = 0; i < max_ents; i++)
    if ((entry[i].data & mask) == mask)   /* all queried bits set in this row */
        add_result(i);
In a typical case not many (e.g. 5%) bits are set in any particular row, but that's not guaranteed, there's a degree of variability.
On a higher level the data describes a bitwise fingerprint of entries and the data itself is a kind of search index so maximal speed is desired. What algorithm would be good for this kind of search? At the moment I'm thinking of having separate sparse (packed/compressed) bit vectors for each column separately. I doubt it's optimal though.
This looks similar to "text search", in particular to that of intersecting reverse indexes. Let me go through the simplest algorithm for doing that.
First, you should create sorted lists of numbers where each bit is set. E.g., for the table of numbers:
Row 1 -> 10110
Row 2 -> 00111
Row 3 -> 11110
Row 4 -> 00011
Row 5 -> 01010
Row 6 -> 10101
you can create a reverse index:
Bit 0 is set in -> 2, 4, 6
Bit 1 is set in -> 1, 2, 3, 4, 5
Bit 2 is set in -> 1, 2, 3, 6
etc.
Now, for a query (let's say bits 0 & 1 & 2), you just have to merge these sorted lists using a merge-sort-like algorithm. You can do this by first merging lists 0 and 1, giving you {2, 4}, and then merging this with list 2, giving you {2}.
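As a concrete sketch, the linear merge of two sorted postings lists looks like this (the intersect() helper and the toy lists mirror the example above):

#include <stdio.h>
#include <stddef.h>

/* Intersect two sorted postings lists (row ids in ascending order).  */
/* Returns the number of ids written to out, which must have room for */
/* min(na, nb) entries.                                               */
static size_t intersect(const int *a, size_t na,
                        const int *b, size_t nb, int *out) {
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if      (a[i] < b[j]) i++;
        else if (a[i] > b[j]) j++;
        else { out[n++] = a[i]; i++; j++; }   /* present in both lists */
    }
    return n;
}

int main(void) {
    int bit0[] = { 2, 4, 6 };                 /* rows where bit 0 is set */
    int bit1[] = { 1, 2, 3, 4, 5 };           /* rows where bit 1 is set */
    int bit2[] = { 1, 2, 3, 6 };              /* rows where bit 2 is set */
    int tmp[8], res[8];

    size_t n = intersect(bit0, 3, bit1, 5, tmp);   /* {2, 4} */
    n = intersect(tmp, n, bit2, 4, res);           /* {2}    */
    for (size_t k = 0; k < n; k++) printf("%d\n", res[k]);
    return 0;
}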
Several optimizations are possible, including, but not limited to, compressing these lists (since the difference between consecutive items is typically small), doing more efficient merging, etc.
But, to save more hassle, why not reuse work that others have already done? ;)... You can readily use (should be possible in less than 1 day of coding) any open source text search engine (I suggest Lucene) to perform this task, and it should contain several optimizations which people have built over a long time ;). (Hint: You should treat each row as a "doc" in text search parlance, and each bit as a "token").
Edit (adding some of the algorithms by request of the question author):
a) Compression: One of the most effective things you can do is compress the postings lists (the sorted list corresponding to each bit position). Most algorithms take the differences between consecutive entries and then compress those gaps with some encoding (gamma coding and varint encoding, to name a few). This shrinks the inverted list so that it either consumes less file space (thus less file I/O) or uses less memory for the same set of numbers. In your case, I can estimate that each postings list will contain ~5% * 1e9 = 5e7 elements. If they are uniformly distributed across 0 - 1e9, the gaps should be around 20, so let us say encoding each gap takes ~8 B on average (a large overestimation), adding up to roughly 500 MB per list. So for 1000 lists you will need on the order of 500 GB of space, which definitely has to live on disk. This in turn means that you should go for as good a compression algorithm as possible, since better compression means less file I/O, and you are going to be I/O bound.
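A sketch of the delta-plus-varint idea in (a): store each posting as the gap to its predecessor, 7 data bits per byte with a continuation flag on all but the last byte of each gap (the encode() helper and the sample ids are illustrative):

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Gap + varint compression of a sorted postings list. Small gaps cost */
/* one byte each, so dense lists compress very well.                   */
static size_t encode(const uint32_t *ids, size_t n, uint8_t *out) {
    size_t pos = 0;
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t gap = ids[i] - prev;          /* consecutive ids -> small gap */
        prev = ids[i];
        while (gap >= 0x80) {                  /* emit 7 bits at a time        */
            out[pos++] = (uint8_t)(gap & 0x7F) | 0x80;
            gap >>= 7;
        }
        out[pos++] = (uint8_t)gap;             /* final byte, high bit clear   */
    }
    return pos;                                /* encoded size in bytes        */
}

int main(void) {
    uint32_t ids[] = { 17, 35, 49, 301, 305, 1000000 };
    uint8_t buf[64];
    size_t bytes = encode(ids, sizeof ids / sizeof *ids, buf);
    printf("%zu ids -> %zu bytes (vs %zu raw)\n",
           sizeof ids / sizeof *ids, bytes, sizeof ids);
    return 0;
}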
b) Intersection Order: You should always intersect lists starting from the smallest, since that is guaranteed to create the smallest sized intermediate lists, which means less comparisons later, by techniques shown below.
c) Merge algorithm: Since your index almost certainly spills to disk, there is probably not much you can do at an algorithmic level. But one idea that is used is a binary-search-based procedure for merging two lists instead of the straightforward linear merge, in case one of the lists is much smaller than the other (this leads to O(N*log(M)) complexity instead of O(N+M), where M >> N). But for file-based indices this is almost never a good idea, since binary search makes many random accesses, which can completely screw up your disk latency, whereas the linear merge procedure is strictly sequential.
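A sketch of the binary-search-based intersection in (c), for the in-memory case where one list is much smaller than the other (contains() and intersect_small_big() are illustrative names):

#include <stdio.h>
#include <stddef.h>

/* Binary search for x in a sorted list of m ids. */
static int contains(const int *big, size_t m, int x) {
    size_t lo = 0, hi = m;                    /* search [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if      (big[mid] < x) lo = mid + 1;
        else if (big[mid] > x) hi = mid;
        else                   return 1;
    }
    return 0;
}

/* O(N log M) intersection: look up each element of the small list. */
static size_t intersect_small_big(const int *small, size_t n,
                                  const int *big, size_t m, int *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (contains(big, m, small[i]))
            out[k++] = small[i];
    return k;
}

int main(void) {
    int small[] = { 3, 42, 97 };
    int big[]   = { 1, 3, 7, 9, 15, 42, 50, 80, 97, 120, 121, 300 };
    int out[3];
    size_t k = intersect_small_big(small, 3, big, 12, out);
    for (size_t i = 0; i < k; i++) printf("%d\n", out[i]);   /* 3 42 97 */
    return 0;
}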
d) Skip Lists: This is another great data structure for storing sorted postings lists, which can then also support the efficient "binary search" mentioned before. The key idea is that the upper levels of the skip list can be kept in memory, which can greatly speed up the last stages of your intersection algorithm: you simply search through the in-memory upper levels to get a disk offset, and then do the disk access from there. There is a point at which a binary-search/skip-list-based merge becomes more efficient than a linear merge, and it can be found by experimentation.
e) Caching: No-brainer. If some of your terms occur frequently, cache them in-memory so that you can get them more efficiently in the future. Note that the cache can also be, e.g. a faster flash based disk, which can give you better throughput as well as probably cache a significant number of the more frequent terms (a 32GB memory can only hold ~ 64 of these lists, whereas a 256GB flash disk can hold ~ 512).
Given a k-dimensional continuous (euclidean) space filled with rather unpredictably moving/growing/shrinking hyperspheres I need to repeatedly find the hypersphere whose surface is nearest to a given coordinate. If some hyperspheres are of the same distance to my coordinate, then the biggest hypersphere wins. (The total count of hyperspheres is guaranteed to stay the same over time.)
My first thought was to use a KDTree but it won't take the hyperspheres' non-uniform volumes into account.
So I looked further and found BVH (Bounding Volume Hierarchies) and BIH (Bounding Interval Hierarchies), which seem to do the trick. At least in 2-/3-dimensional space. However while finding quite a bit of info and visualizations on BVHs I could barely find anything on BIHs.
My basic requirement is a k-dimensional spatial data structure that takes volume into account and is either super fast to build (off-line) or dynamic with barely any unbalancing.
Given my requirements above, which data structure would you go with? Any other ones I didn't even mention?
Edit 1: Forgot to mention: hyperspheres are allowed (actually highly expected) to overlap!
Edit 2: Looks like instead of "distance" (and "negative distance" in particular) my described metric matches the power of a point much better.
I'd expect a quadtree/octree, generalized to a 2^k-tree for your dimensionality k, would do the trick; these recursively partition space, and presumably you can stop when a k-subcube (or k-rectangular brick, if the splits aren't even) does not contain a hypersphere, or contains one or more hyperspheres such that partitioning doesn't separate any, or alternatively contains the center of just a single hypersphere (probably easier).
Inserting and deleting entities in such trees is fast, so a hypersphere changing size just causes a delete/insert pair of operations. (I suspect you can optimize this if your sphere size changes by local additional recursive partition if the sphere gets smaller, or local K-block merging if it grows).
I haven't worked with them, but you might also consider binary space partitions. These let you use binary trees instead of k-trees to partition your space. I understand that KDTrees are a special case of this.
But in any case, I thought the insertion/deletion algorithms for 2^K trees and/or BSP/KD-trees were well understood and fast. So hypersphere size changes cause deletion/insertion operations, but those are fast. So I don't understand your objection to KD-trees.
I think the performance of all these are asymptotically the same.
I would use the R*Tree extension for SQLite. A table would normally have 1 or 2 dimensional data. SQL queries can combine multiple tables to search in higher dimensions.
The formulation with negative distance is a little weird. Distance is positive in geometry, so there may not be much helpful theory to use.
A different formulation that uses only positive distances may be helpful. Read about hyperbolic spaces. This might help to provide ideas for other ways to describe distance.
I have some data, somewhere between a million and a billion records, each of which is represented by a bitfield, about 64 bits per key. The bits are independent; you can imagine them basically as random bits.
If I have a test key and I want to find all values in my data with the same key, a hash table will spit those out very easily, in O(1).
What algorithm/data structure would efficiently find all records most similar to the query key? Here, similar means that most bits are identical but a minimal number are allowed to differ. This is traditionally measured by Hamming distance, which just counts the number of mismatched bits.
There are two ways this query might be made: by specifying a mismatch rate, like "give me a list of all existing keys which have fewer than 6 bits that differ from my query", or simply by best matches, like "give me a list of the 10,000 keys with the lowest number of differing bits from my query."
You might be tempted to run to k-nearest-neighbor algorithms, but here we're talking about independent bits, so it doesn't seem likely that structures like quadtrees are useful.
The problem can be solved by simple brute force: testing a hash table for low numbers of differing bits. If we want to find all keys that differ by one bit from our query, for example, we can enumerate the 64 keys that differ in a single bit and test them all. But this explodes quickly: if we wanted to allow two bits of difference, we'd have to probe another 64*63/2 = 2016 keys. It gets exponentially worse for higher numbers of bits.
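To make the growth concrete, a small C sketch of that brute-force enumeration up to distance 2 (probe() stands in for whatever hash-table lookup you already have):

#include <stdio.h>
#include <stdint.h>

static long probes;
static void probe(uint64_t key) { (void)key; probes++; /* hash lookup here */ }

/* Enumerate every key within Hamming distance 2 of the query. */
static void probe_within_two(uint64_t q) {
    probe(q);                                   /* distance 0              */
    for (int i = 0; i < 64; i++) {
        probe(q ^ (1ULL << i));                 /* distance 1: 64 keys     */
        for (int j = i + 1; j < 64; j++)        /* distance 2: C(64,2)     */
            probe(q ^ (1ULL << i) ^ (1ULL << j));
    }
}

int main(void) {
    probe_within_two(0x0123456789ABCDEFULL);
    printf("%ld probes\n", probes);             /* 1 + 64 + 2016 = 2081    */
    return 0;
}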
So is there another data structure or strategy that makes this kind of query more efficient?
The database/structure can be preprocessed as much as you like, it's the query speed that should be optimized.
What you want is a BK-Tree. It's a tree that's ideally suited to indexing metric spaces (your problem is one), and supports both nearest-neighbour and distance queries. I wrote an article about it a while ago.
BK-Trees are generally described with reference to text, using Levenshtein distance to build the tree, but it's straightforward to write one in terms of binary strings and Hamming distance.
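Here is a minimal sketch of such a BK-tree over 64-bit keys with Hamming distance (the node layout, function names and toy keys are illustrative, not taken from the article):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Each node has up to 65 children, one per possible distance (0..64) */
/* to the key stored at that node.                                    */
typedef struct bknode {
    uint64_t key;
    struct bknode *child[65];
} bknode;

static int hamming(uint64_t a, uint64_t b) {
    uint64_t x = a ^ b;
    int d = 0;
    while (x) { x &= x - 1; d++; }     /* clear lowest set bit per step */
    return d;
}

static bknode *bk_new(uint64_t key) {
    bknode *n = calloc(1, sizeof *n);
    n->key = key;
    return n;
}

static void bk_insert(bknode *root, uint64_t key) {
    for (;;) {
        int d = hamming(key, root->key);
        if (d == 0) return;                    /* already present       */
        if (!root->child[d]) { root->child[d] = bk_new(key); return; }
        root = root->child[d];
    }
}

/* Report every stored key within distance r of the query. */
static void bk_query(const bknode *n, uint64_t q, int r) {
    int d = hamming(q, n->key);
    if (d <= r) printf("match: %016llx (distance %d)\n",
                       (unsigned long long)n->key, d);
    int lo = d - r < 1 ? 1 : d - r;            /* triangle inequality   */
    int hi = d + r > 64 ? 64 : d + r;
    for (int i = lo; i <= hi; i++)
        if (n->child[i]) bk_query(n->child[i], q, r);
}

int main(void) {
    uint64_t keys[] = { 0xFF00FF00FF00FF00ULL, 0xFF00FF00FF00FF01ULL,
                        0x0000000000000000ULL, 0x123456789ABCDEF0ULL };
    bknode *root = bk_new(keys[0]);
    for (int i = 1; i < 4; i++) bk_insert(root, keys[i]);
    bk_query(root, 0xFF00FF00FF00FF03ULL, 2);  /* finds the first two keys */
    return 0;
}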
This sounds like a good fit for an S-Tree, which is like a hierarchical inverted file. Good resources on this topic include the following papers:
Hierarchical Bitmap Index: An Efficient and Scalable Indexing Technique for Set-Valued Attributes.
Improved Methods for Signature-Tree Construction (2000)
Quote from the first one:
The hierarchical bitmap index efficiently supports different classes of queries, including subset, superset and similarity queries. Our experiments show that the hierarchical bitmap index outperforms other set indexing techniques significantly.
These papers include references to other research that you might find useful, such as M-Trees.
Create a binary tree (specifically, a trie) representing each key in your start set in the following way: the root node is the empty word, moving down the tree to the left appends a 0, and moving down to the right appends a 1. The tree will only have as many leaves as your start set has elements, so the size should stay manageable.
Now you can do a recursive traversal of this tree, allowing at most n "deviations" from the query key in each recursive line of execution, until you have found all of the nodes in the start set which are within that number of deviations.
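A small sketch of that idea for fixed-length 64-bit keys (the trie layout and function names are made up for the example; the search follows the matching branch for free and spends one unit of budget on each deviation):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Binary trie: one bit per level, 0 = left, 1 = right. */
typedef struct trie {
    struct trie *child[2];
} trie;

static void trie_insert(trie *root, uint64_t key) {
    for (int depth = 0; depth < 64; depth++) {
        int bit = (key >> (63 - depth)) & 1;      /* most significant first */
        if (!root->child[bit])
            root->child[bit] = calloc(1, sizeof(trie));
        root = root->child[bit];
    }
}

/* Print every stored key within Hamming distance `budget` of `query`. */
static void trie_search(const trie *node, uint64_t query, int depth,
                        int budget, uint64_t prefix) {
    if (!node) return;
    if (depth == 64) {                            /* reached a leaf: report */
        printf("match: %016llx\n", (unsigned long long)prefix);
        return;
    }
    int bit = (query >> (63 - depth)) & 1;
    /* follow the matching branch for free ... */
    trie_search(node->child[bit], query, depth + 1, budget,
                (prefix << 1) | (uint64_t)bit);
    /* ... and the other branch only if we still have deviations left */
    if (budget > 0)
        trie_search(node->child[!bit], query, depth + 1, budget - 1,
                    (prefix << 1) | (uint64_t)!bit);
}

int main(void) {
    trie *root = calloc(1, sizeof(trie));
    uint64_t keys[] = { 0xFF00FF00FF00FF00ULL, 0xFF00FF00FF00FF01ULL, 0ULL };
    for (int i = 0; i < 3; i++) trie_insert(root, keys[i]);
    trie_search(root, 0xFF00FF00FF00FF03ULL, 0, 2, 0);  /* first two match */
    return 0;
}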
I'd go with an inverted index, like a search engine. You've basically got a fixed vocabulary of 64 words. Similarity is then measured by Hamming distance instead of the cosine similarity a search engine would use. Constructing the index will be slow, but you ought to be able to query it at normal search-engine speeds.
The book Introduction to Information Retrieval covers the efficient construction, storage, compression and querying of inverted indexes.
"Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions", from 2008, seems to be the best result as of then. I won't try to summarize since I read it over a year ago and it's hairy. That's from a page on locality-sensitive hashing, along with an implementation of an earlier version of the scheme. For more general pointers, read up on nearest neighbor search.
This kind of question has been asked before: Fastest way to find most similar string to an input?
The database/structure can be preprocessed as much as you like
Well... if that is true, then all you need is a similarity matrix of your Hamming distances. Make the matrix sparse by pruning out large distances. It doesn't get any faster, and it's not that much of a memory hog.
Well, you could insert all of the neighbor keys along with the original key. That would mean storing (64 choose k) times as much data for k differing bits, and it requires that you decide k beforehand. Though you could always extend k by brute-force querying neighbors, which will automatically query the neighbors of the neighbors that you inserted. This also gives you a time-space trade-off: for example, if you accept a 64x data blow-up and queries that are 64 times slower, you can get two bits of distance.
I haven't completely thought this through, but I have an idea of where I'd start.
You could divide the search space up into a number of buckets where each bucket has a bucket key and the keys in the bucket are the keys that are more similar to this bucket key than any other bucket key. To create the bucket keys, you could randomly generate 64 bit keys and discard any that are too close to any previously created bucket key, or you could work out some algorithm that generates keys that are all dissimilar enough. To find the closest key to a test key, first find the bucket key that is closest, and then test each key in the bucket. (Actually, it's possible, but not likely, for the closest key to be in another bucket - do you need to find the closest key, or would a very close key be good enough?)
If you're ok with doing it probabilistically, I think there's a good way to solve question 2. I assume you have 2^30 data points and a cutoff, and you want to find all points within cutoff distance of test.
One_Try()
1. Generate randomly a 20-bit subset S of 64 bits
2. Ask for a list of elements that agree with test on S (about 2^10 elements)
3. Sort that list by Hamming distance from test
4. Discard the part of list after cutoff
You repeat One_Try as much as you need while merging the lists. The more tries you have, the more points you find. For example, if x is within 5 bits, you'll find it in one try with about (2/3)^5 = 13% probability. Therefore if you repeat 100 tries you find all but roughly 10^{-6} of such x. Total time: 100*(1000*log 1000).
The main advantage of this is that you're able to output answers to question 2 as you proceed, since after the first few tries you'll certainly find everything within distance not more than 3 bits, etc.
If you have many computers, you give each of them several tries, since they are perfectly parallelizable: each computer saves some hash tables in advance.
Data structures for large sets are described in "Detecting Near-Duplicates for Web Crawling", or, for an in-memory trie, Judy-arrays at sourceforge.net.
Assuming you have to visit each row to test its value (or, if you index on the bitfield, each index entry), you can write the actual test quite efficiently using
A XOR B
to find the differing bits, then bit-count the result using a technique like this.
This effectively gives you the Hamming distance.
Since this can compile down to tens of instructions per test, this can run pretty fast.
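For reference, a small sketch of that test in C (the portable bit-count loop can be replaced with a compiler intrinsic such as __builtin_popcountll on GCC/Clang):

#include <stdio.h>
#include <stdint.h>

/* Hamming distance via XOR + bit count. */
static int hamming64(uint64_t a, uint64_t b) {
    uint64_t x = a ^ b;          /* 1-bits mark positions where a and b differ */
    int d = 0;
    while (x) {
        x &= x - 1;              /* clear the lowest set bit                   */
        d++;
    }
    return d;
}

int main(void) {
    printf("%d\n", hamming64(0xFF00FF00FF00FF00ULL,
                             0xFF00FF00FF00FF03ULL));   /* prints 2 */
    return 0;
}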
If you are okay with a randomized algorithm (Monte Carlo in this case), you can use MinHash.
If the data weren't so sparse, a graph with keys as the vertices and edges linking 'adjacent' (Hamming distance = 1) nodes would probably be very efficient time-wise. The space would be very large though, so in your case, I don't think it would be a worthwhile tradeoff.