Which data structure should I use to represent a large number of records, each representing a range of items? - database

I'm looking for a software representation for a very large number of records (more than 400K records).
Each record has two keys: one for the lower bound and one for the upper bound. These numbers represent a range. Each record also has some piece of information, let's call it I. In other words, each record aggregates a range of item indexes and has some common description about them.
My software is given an item number, and I have to retrieve the info about it.
I thought about AVL trees, B-trees or Fibonacci heaps, but I'm not sure which will be best for that large a number of records. I would definitely go for an AVL / balanced AVL tree for a small database.

From a data structure point of view, you are looking for an interval tree.
The Wikipedia article is quite good. The idea is to augment a (balanced) binary search tree such as an AVL tree or a red-black tree. Interval trees based on binary search trees have their own section in the classical data structures book by Cormen et al.
A good data structure scales well to large amounts of data. The complexity of the major dictionary operations is O(k + log n), where n is the number of intervals in the tree and k is the number of intervals overlapping the query. This is usually pretty good: it grows slowly with the number of intervals, except in cases where a lot or most intervals overlap all others.
If you cannot hold your data in main memory, a B-Tree would be a good choice.
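If the ranges never overlap (which the question's description of aggregated item indexes suggests, though that is an assumption), a minimal sketch can even avoid a full interval tree and use sorted lower bounds with a binary search, e.g. via Python's bisect:
import bisect
# records: (lower, upper, info), sorted by lower bound and assumed non-overlapping
records = [(1, 99, "A"), (100, 249, "B"), (250, 400, "C")]
lowers = [lo for lo, hi, info in records]
def lookup(item):
    # index of the last range whose lower bound is <= item
    i = bisect.bisect_right(lowers, item) - 1
    if i >= 0 and records[i][0] <= item <= records[i][1]:
        return records[i][2]
    return None   # the item falls into a gap between ranges
print(lookup(120))   # -> "B"
If intervals can overlap, an actual interval tree (the augmented BST described above) is the right tool.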

Any database will do what you want just fine.
If you are searching on an index, the increase in look-up cost when going from 2 to 4 records is the same as when going from 2 million to 4 million records: one more level in the tree. The number of records a tree of a given depth can hold grows exponentially, so lookups stay logarithmic.


Why does searching a million keys organized as a B-tree need 114 comparisons?

Please explain how it will take 114 comparisons. The following is a passage from my book (Page 350, Data Structures Using C, 2nd Ed. Reema Thareja, Oxford Univ. Press). My reasoning is that in the worst case each node will have just the minimum number of children (i.e. 5), so I took log base 5 of a million, and it comes to 9. So assuming at each level of the tree we search the minimum number of keys (i.e. 4), it comes to somewhere like 36 comparisons, nowhere near 114.
Consider a situation in which we have to search an un-indexed and unsorted database that contains n key values. The worst case running time to perform this operation would be O(n). In contrast, if the data in the database is indexed with a B tree, the same search operation will run in O(log n). For example, searching for a single key on a set of one million keys will at most require 1,000,000 comparisons. But if the same data is indexed with a B tree of order 10, then only 114 comparisons will be required in the worst case.
Page 350, Data Structures Using C, 2nd Ed. Reema Thareja, Oxford Univ. Press
The worst case tree has the minimum number of keys everywhere except on the path you're searching.
If each internal node has between 5 and 10 children, then in the worst case a tree with a million items will be about 10 levels deep, since most nodes have only the minimum of 5 children.
The worst-case path to a node, however, might have close to 10 keys in each node. The statement seems to assume that you do a linear search inside each node rather than a binary search (I would advise a binary search instead), so that can lead to around 10*10 = 100 comparisons.
If you carefully consider the details, the real number might very well come out to 114.
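As a rough sanity check of those numbers (assuming "order 10" means between 5 and 10 children per node and a linear scan within each node), a back-of-the-envelope calculation in Python:
import math
n = 1_000_000
min_children, max_children = 5, 10
# worst-case depth: every node on the search path has the minimum fan-out
depth = math.ceil(math.log(n, min_children))        # 9 levels
# worst-case comparisons: linear scan of a nearly full node at every level
comparisons = depth * (max_children - 1)            # 81
print(depth, comparisons)
Depending on exactly how the order and the per-node search are counted, the constant lands somewhere between roughly 80 and 120, which is how the book's 114 can arise.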
(This is not an Answer to the question asked, but a related discussion.)
Sounds like a textbook question, not a real-life question.
Counting comparisons is likely to be the best way to judge an in-memory tree, but not for a disk-based dataset.
Even so, the "average" number of comparisons (for in-memory) or disk hits (for disk-based) is likely to be the metric to compute.
(Sure, it is good to compute the maximum numbers as a useful exercise for understanding the structures.)
Perhaps the optimal "tree" for in-memory searching is a binary-search-style tree but with 3-way fan-out, kept balanced with 2 or 3 elements in each node (i.e., a 2-3 tree).
For disk-based searching -- think databases -- the optimum is likely to be a B-tree with the node size based on what block size is efficient to read from disk. Counting comparisons is a poor second when it comes to the overall time taken to fetch a row.

What is fan-out in an R-Tree?

I have a question about the R-Tree data structure. What is the fan-out of an R-Tree? Is it the maximum number of entries?
How can we determine the minimum and maximum number of entries in an R-Tree? Let's say I have 10,000 points and my page size is 8 KB.
Thanks
Fan-out, in any tree, is the number of pointers to child nodes in a node.
Different trees have different fan-out.
A binary tree has fanout 2.
A B-tree has fan-out B, with all nodes except leaves having between B/2 and B children. External (on-disk) implementations often relax the minimum-children restriction to save some updates.
In databases, B-trees or their variant, B+-trees, are often used so that each node has the size of one page, and the fan-out is determined by the number of sort keys and pointers that fit in that space.
An R-tree is a search tree whose indices are multi-dimensional intervals, which may possibly overlap. It may have any fan-out. A usual choice is 2 to the power of the number of dimensions (so 4 for 2-dimensional, 8 for 3-dimensional, etc.), but it may have a higher fan-out too, and organizing it similarly to a B-tree is certainly possible.
How can we determine the minimum and maximum number of entries in an R-Tree? Let's say I have 10,000 points and my page size is 8 KiB.
The size of the tree node does not have to match the page size. If it does (as usually done for external, i.e. on-disk, implementations), you still need to know how large the sort key is and how large the pointer is. An R-tree needs two coordinate values, minimum and maximum, per dimension. So a 2-dimensional R-tree with double-precision coordinates (the common case in mapping applications) has four 64-bit values describing the rectangle plus a child pointer, for which an external implementation probably wants to use 64 bits as well. That is 40 B per child, and you can squeeze about 204 of these into an 8 KiB page. The number of points does not matter; the dimension and precision of the coordinate system do.
In memory, trees with low fan-out are more efficient: although they are deeper, they need fewer comparisons per node visited. On disk (in databases), however, the slow operation is reading, and since reads can only be done in whole blocks, it is faster to reduce the number of nodes by having each node fill a whole block, with a correspondingly higher fan-out.
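For illustration, the fan-out arithmetic from the paragraph above (2-D rectangles, double-precision coordinates, 64-bit child pointers, 8 KiB pages) works out as:
PAGE_SIZE = 8 * 1024     # 8 KiB page
DIMENSIONS = 2
COORD_SIZE = 8           # double-precision coordinate, 8 bytes
POINTER_SIZE = 8         # 64-bit child pointer
# each entry stores a min and a max per dimension plus one child pointer
entry_size = 2 * DIMENSIONS * COORD_SIZE + POINTER_SIZE   # 40 bytes
max_fanout = PAGE_SIZE // entry_size                      # about 204 entries
print(entry_size, max_fanout)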
"Fanout" refers to the number of pointers per node which R-Tree is having

What is a good representation for a searchable bit matrix with fixed number of columns?

The raw data can be described as a fixed number of columns (on the order of a few thousand) and a large (on the order of billions) and variable number of rows. Each cell is a bit. The desired query would be something like: find all rows where bits 12, 329, 2912 and 3020 are set. Something like
for (i = 0; i < max_ents; i++)
    if ((entry[i].data & mask) == mask)   /* all bits in the mask are set */
        add_result(i);
In a typical case not many (e.g. 5%) bits are set in any particular row, but that's not guaranteed, there's a degree of variability.
On a higher level the data describes a bitwise fingerprint of entries and the data itself is a kind of search index so maximal speed is desired. What algorithm would be good for this kind of search? At the moment I'm thinking of having separate sparse (packed/compressed) bit vectors for each column separately. I doubt it's optimal though.
This looks similar to "text search", in particular to intersecting inverted (reverse) indexes. Let me go through the simplest algorithm for doing that.
First, you should create sorted lists of numbers where each bit is set. E.g., for the table of numbers:
Row 1 -> 10110
Row 2 -> 00111
Row 3 -> 11110
Row 4 -> 00011
Row 5 -> 01010
Row 6 -> 10101
you can create a reverse index:
Bit 0 is set in -> 2, 4, 6
Bit 1 is set in -> 1, 2, 3, 4, 5
Bit 2 is set in -> 1, 2, 3, 6
etc.
Now, for a query (let's say bits 0 & 1 & 2), you just have to intersect these sorted lists using a merge-sort-like algorithm. You can do this by first intersecting lists 0 and 1, giving you {2, 4}, and then intersecting the result with list 2, giving you {2}.
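A minimal sketch of that intersection in Python, using the toy reverse index above:
def intersect(a, b):
    # classic merge-style intersection of two sorted lists, O(len(a) + len(b))
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out
bit0 = [2, 4, 6]
bit1 = [1, 2, 3, 4, 5]
bit2 = [1, 2, 3, 6]
print(intersect(intersect(bit0, bit1), bit2))   # -> [2]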
Several optimizations are possible, including, but not limited to, compressing these lists (since the difference between consecutive items is typically small), doing more efficient merging, etc.
But, to save yourself the hassle, why not reuse work that others have already done? ;)... You can readily use (it should be possible in less than one day of coding) any open-source text search engine (I suggest Lucene) to perform this task, and it will contain several optimizations which people have built over a long time ;). (Hint: you should treat each row as a "doc" in text-search parlance, and each bit as a "token".)
Edit (adding some of the algorithms by request of the question author):
a) Compression: One of the most effective things you can do is compress the postings lists (the sorted list corresponding to each position). Most algorithms take the differences between consecutive entries and then compress them according to some encoding (Gamma coding and varint encoding, to name a few); a toy encoder is sketched after this list. This compresses the inverted list so that it either consumes less file space (and thus less file I/O) or uses less memory for encoding the same set of numbers. In your case, I estimate that each postings list will contain ~5% * 1e9 = 5e7 elements. If they are uniformly distributed across 0 - 1e9, the gaps should be around 20, so let us say encoding each gap takes ~8 bytes on average (a large overestimate), adding up to roughly 500 MB per list. So for 1000 lists you will need around 500 GB of space, which definitely means going to disk. This in turn means that you should go for as good a compression algorithm as possible, since better compression means less file I/O, and you are going to be I/O bound.
b) Intersection order: You should always intersect lists starting from the smallest, since that is guaranteed to create the smallest intermediate lists, which means fewer comparisons later, using the techniques shown below.
c) Merge algorithm: Since your index almost certainly spills to disk, there is probably not much you can do at an algorithmic level. But one idea that is used is a binary-search-based procedure for merging two lists instead of the straightforward linear merge when one of the lists is much smaller than the other (this leads to O(N*log(M)) complexity instead of O(N+M), where M >> N). For file-based indices, however, this is almost never a good idea, since binary search makes many random accesses, which can completely ruin your disk latency, whereas the linear merge is strictly sequential.
d) Skip lists: This is another great data structure for storing sorted postings lists, which can also support the efficient "binary search" mentioned before. The key idea is that the upper levels of the skip list can be kept in memory, which can greatly speed up the last stages of your intersection algorithm: you can simply search through the in-memory upper levels to get to a disk offset and then do the disk access from there. There is a point at which binary search + skip-list-based merging becomes more efficient than a linear merge, and it can be found by experimentation.
e) Caching: No-brainer. If some of your terms occur frequently, cache them in memory so that you can get at them more efficiently in the future. Note that the cache can also be, e.g., a faster flash-based disk, which can give you better throughput and also cache a significant number of the more frequent terms (32 GB of memory can only hold ~64 of these lists, whereas a 256 GB flash disk can hold ~512).
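As referenced in point (a), here is a toy delta + varint codec for a sorted postings list (a sketch only; production engines use more elaborate encodings):
def encode_postings(sorted_ids):
    # store the gaps between consecutive ids as variable-length integers:
    # 7 data bits per byte, high bit set on every byte except the last of a value
    out, prev = bytearray(), 0
    for doc_id in sorted_ids:
        gap, prev = doc_id - prev, doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)
def decode_postings(data):
    ids, prev, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += gap
            ids.append(prev)
            gap, shift = 0, 0
    return ids
postings = [3, 25, 26, 110, 4099]
assert decode_postings(encode_postings(postings)) == postings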

Efficient comparison of 1 million vectors containing (float, integer) tuples

I am working in a chemistry/biology project. We are building a web-application for fast matching of the user's experimental data with predicted data in a reference database. The reference database will contain up to a million entries. The data for one entry is a list (vector) of tuples containing a float value between 0.0 and 20.0 and an integer value between 1 and 18. For instance (7.2394 , 2) , (7.4011, 1) , (9.9367, 3) , ... etc.
The user will enter a similar list of tuples and the web-app must then return the - let's say - top 50 best matching database entries.
One thing is crucial: the search algorithm must allow for discrepancies between the query data and the reference data because both can contain small errors in the float values (NOT in the integer values). (The query data can contain errors because it is derived from a real-life experiment and the reference data because it is the result of a prediction.)
Edit - Moved text to answer -
How can we get an efficient ranking of 1 query on 1 million records?
You should add a physicist to the project :-) This is a very common problem when comparing functions, e.g. look here:
http://en.wikipedia.org/wiki/Autocorrelation
http://en.wikipedia.org/wiki/Correlation_function
In the first link you can read: "The SEQUEST algorithm for analyzing mass spectra makes use of autocorrelation in conjunction with cross-correlation to score the similarity of an observed spectrum to an idealized spectrum representing a peptide."
An efficient linear scan of 1 million records of that type should take a fraction of a second on a modern machine; a compiled loop should be able to do it at about memory bandwidth, which would transfer that in two or three milliseconds.
But, if you really need to optimise this, you could construct a hash table of the integer values, which would divide the job by the number of integer bins. And, if the data is stored sorted by the floats, that improves the locality of matching on those; you know you can stop once you're out of tolerance. Storing the offsets of each of a number of bins would give you a position to start from.
I guess I don't see the need for a fancy algorithm yet... describe the problem a bit more, perhaps (you can assume a fairly high level of chemistry and physics knowledge if you like; I'm a physicist by training)?
Ok, given the extra info, I still see no need for anything better than a direct linear search, if there's only 1 million reference vectors and the algorithm is that simple. I just tried it, and even a pure Python implementation of linear scan took only around three seconds. It took several times longer to make up some random data to test with. This does somewhat depend on the rather lunatic level of optimisation in Python's sorting library, but that's the advantage of high level languages.
import random
# a million random reference entries: (float in [0, 20], integer in [1, 18])
r = [(random.uniform(0, 20), random.randint(1, 18)) for i in range(1000000)]
# this is a decorate-sort-undecorate pattern
# look for matches to (7, 9)
# obviously, you can use whatever distance expression you want
zz = [(abs((7 - x) + (9 - y)), x, y) for x, y in r]
zz.sort()
# return the 50 best matches
[(x, y) for a, x, y in zz[:50]]
Can't you sort the tuples and perform a binary search on the sorted array?
I assume your database is built once and for all, and the positions of the entries are not important. You can sort this array so that the tuples are in a given order. When a tuple is entered by the user, you just look at the middle of the sorted array. If the query value is larger than the centre value, you repeat the work on the upper half, otherwise on the lower one.
Worst case is O(log n).
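A rough sketch of that idea with Python's bisect module, assuming the reference tuples are kept sorted by their float value (this retrieves candidates near one query value; ranking whole entries still requires scoring the candidates):
import bisect
# reference tuples sorted by float value
reference = sorted([(7.2394, 2), (7.4011, 1), (9.9367, 3)])
floats = [f for f, n in reference]
def candidates(query_float, tolerance):
    # all reference tuples whose float lies within +/- tolerance of the query
    lo = bisect.bisect_left(floats, query_float - tolerance)
    hi = bisect.bisect_right(floats, query_float + tolerance)
    return reference[lo:hi]
print(candidates(7.3, 0.1))   # -> [(7.2394, 2)]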
If you can "map" your reference data to x-y coordinates on a plane there is a nifty technique which allows you to select all points under a given distance/tolerance (using Hilbert curves).
Here is a detailed example.
One approach we are trying ourselves, which allows for the discrepancies between query and reference, is binning the float values. We are testing this and want to offer the user the choice of different bin sizes: 0.1, 0.2, 0.3 or 0.4. Binning leaves us with between 50 and 200 bins, each with a corresponding integer value between 0 and 18, where 0 means there was no value within that bin. The reference data can be pre-binned and stored in the database. We can then take the binned query data and compare it with the reference data. One approach could be: for each bin, subtract the query integer value from the reference integer value; summing up all the differences gives the similarity score, with the most similar reference entries resulting in the lowest scores.
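A sketch of that binning and scoring under the stated assumptions (bin size 0.1, floats between 0.0 and 20.0), reading "summing up all the differences" as summing absolute differences:
def bin_spectrum(tuples, bin_size=0.1, max_value=20.0):
    # one integer per bin; 0 means no value fell into that bin
    bins = [0] * int(round(max_value / bin_size))
    for value, integer in tuples:
        bins[min(int(value / bin_size), len(bins) - 1)] = integer
    return bins
def score(query_bins, reference_bins):
    # lower score = more similar
    return sum(abs(q - r) for q, r in zip(query_bins, reference_bins))
query = bin_spectrum([(7.2394, 2), (9.9367, 3)])
reference = bin_spectrum([(7.4011, 1), (9.9367, 3)])
print(score(query, reference))   # -> 3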
Another (simpler) search option we want to offer is where the user only enters the float values. The integer values in both the query and the reference list are then set to 1. We then use Hamming distance to compute the difference between the binned query and reference values. I have previously asked about an efficient algorithm for that search.
This binning is only one way of achieving our goal. I am open to other suggestions. Perhaps we can use Principal Component Analysis (PCA), as described here

Data structure for finding nearby keys with similar bitvalues

I have some data, between a million and a billion records, each of which is represented by a bitfield of about 64 bits per key. The bits are independent; you can imagine them basically as random bits.
If I have a test key and I want to find all values in my data with the same key, a hash table will spit those out very easily, in O(1).
What algorithm/data structure would efficiently find all records most similar to the query key? Here similar means that most bits are identical but a minimal number are allowed to differ. This is traditionally measured by Hamming distance, which just counts the number of mismatched bits.
There are two ways this query might be made: by specifying a mismatch rate, like "give me a list of all existing keys which have fewer than 6 bits that differ from my query", or simply by best matches, like "give me a list of the 10,000 keys with the lowest number of differing bits from my query."
You might be tempted to run to k-nearest-neighbor algorithms, but here we're talking about independent bits, so it doesn't seem likely that structures like quadtrees are useful.
The problem can be solved by simple brute force, testing a hash table for low numbers of differing bits, as sketched below. If we want to find all keys that differ by one bit from our query, for example, we can enumerate all 64 possible keys and test them all. But this explodes quickly: if we wanted to allow two bits of difference, we'd have to probe 64*63/2 = 2016 times. It gets exponentially worse for higher numbers of bits.
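For illustration, the brute-force probing described above might look like this sketch (the neighbors helper is made up for the example):
from itertools import combinations
def neighbors(key, distance, bits=64):
    # enumerate every key obtained by flipping exactly `distance` bits
    for positions in combinations(range(bits), distance):
        probe = key
        for p in positions:
            probe ^= 1 << p
        yield probe
# 64 probes at distance 1, C(64, 2) = 2016 at distance 2, and so on
print(sum(1 for _ in neighbors(0x0123456789ABCDEF, 2)))   # -> 2016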
So is there another data structure or strategy that makes this kind of query more efficient?
The database/structure can be preprocessed as much as you like, it's the query speed that should be optimized.
What you want is a BK-Tree. It's a tree that's ideally suited to indexing metric spaces (your problem is one), and supports both nearest-neighbour and distance queries. I wrote an article about it a while ago.
BK-Trees are generally described with reference to text and using Levenshtein distance to build the tree, but it's straightforward to write one in terms of binary strings and Hamming distance.
This sounds like a good fit for an S-Tree, which is like a hierarchical inverted file. Good resources on this topic include the following papers:
Hierarchical Bitmap Index: An Efficient and Scalable Indexing Technique for Set-Valued Attributes.
Improved Methods for Signature-Tree Construction (2000)
Quote from the first one:
The hierarchical bitmap index efficiently supports different classes of queries, including subset, superset and similarity queries. Our experiments show that the hierarchical bitmap index outperforms other set indexing techniques significantly.
These papers include references to other research that you might find useful, such as M-Trees.
Create a binary tree (specifically a trie) representing each key in your start set in the following way: the root node is the empty word, moving down the tree to the left appends a 0, and moving down to the right appends a 1. The tree will only have as many leaves as your start set has elements, so the size should stay manageable.
Now you can do a recursive traversal of this tree, allowing at most n "deviations" from the query key in each recursive line of execution, until you have found all of the nodes in the start set which are within that number of deviations.
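A sketch of that traversal in Python, using a plain dict-based binary trie and short 8-bit keys for illustration (the helper names are made up for this example):
def insert(trie, key, bits=8):
    # walk from the most significant bit down, creating child nodes as needed
    node = trie
    for i in reversed(range(bits)):
        node = node.setdefault((key >> i) & 1, {})
    node["key"] = key                      # store the full key at the leaf
def search(node, key, bits, max_dev):
    # recursive traversal; taking the "wrong" branch costs one deviation
    if bits == 0:
        return [node["key"]]
    want = (key >> (bits - 1)) & 1
    results = []
    for branch in (0, 1):
        cost = 0 if branch == want else 1
        if branch in node and max_dev - cost >= 0:
            results += search(node[branch], key, bits - 1, max_dev - cost)
    return results
trie = {}
for k in (0b10110101, 0b10110100, 0b01010101):
    insert(trie, k)
print(search(trie, 0b10110111, bits=8, max_dev=2))   # -> [180, 181]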
I'd go with an inverted index, like a search engine. You've basically got a fixed vocabulary of 64 words. Then similarity is measured by Hamming distance, instead of the cosine similarity a search engine would want to use. Constructing the index will be slow, but you ought to be able to query it at normal search-engine speeds.
The book Introduction to Information Retrieval covers the efficient construction, storage, compression and querying of inverted indexes.
"Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions", from 2008, seems to be the best result as of then. I won't try to summarize since I read it over a year ago and it's hairy. That's from a page on locality-sensitive hashing, along with an implementation of an earlier version of the scheme. For more general pointers, read up on nearest neighbor search.
This kind of question has been asked before: Fastest way to find most similar string to an input?
The database/structure can be preprocessed as much as you like
Well... IF that is true, then all you need is a similarity matrix of your Hamming distances. Make the matrix sparse by pruning out large distances. It doesn't get any faster, and it's not that much of a memory hog.
Well, you could insert all of the neighbor keys along with the original key. That would mean that you store (64 choose k) times as much data for k differing bits, and it requires that you decide k beforehand. Though you could always extend k by brute-force querying of neighbors, and this will automatically query the neighbors of the neighbors that you inserted. This also gives you a time-space trade-off: for example, if you accept a 64x data blow-up and 64-times-slower queries, you can get two bits of distance.
I haven't completely thought this through, but I have an idea of where I'd start.
You could divide the search space up into a number of buckets where each bucket has a bucket key and the keys in the bucket are the keys that are more similar to this bucket key than any other bucket key. To create the bucket keys, you could randomly generate 64 bit keys and discard any that are too close to any previously created bucket key, or you could work out some algorithm that generates keys that are all dissimilar enough. To find the closest key to a test key, first find the bucket key that is closest, and then test each key in the bucket. (Actually, it's possible, but not likely, for the closest key to be in another bucket - do you need to find the closest key, or would a very close key be good enough?)
If you're ok with doing it probabilistically, I think there's a good way to solve question 2. I assume you have 2^30 data points and a cutoff, and you want to find all points within the cutoff distance from test.
One_Try()
1. Generate randomly a 20-bit subset S of 64 bits
2. Ask for a list of elements that agree with test on S (about 2^10 elements)
3. Sort that list by Hamming distance from test
4. Discard the part of list after cutoff
You repeat One_Try as many times as you need, merging the lists. The more tries you have, the more points you find. For example, if x is within 5 bits, you'll find it in one try with about (2/3)^5 = 13% probability. Therefore if you repeat 100 tries you find all but roughly 10^{-6} of such x. Total time: 100*(1000*log 1000).
The main advantage of this is that you're able to output answers to question 2 as you proceed, since after the first few tries you'll certainly find everything within distance not more than 3 bits, etc.
If you have many computers, you give each of them several tries, since they are perfectly parallelizable: each computer saves some hash tables in advance.
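A sketch of a single try in Python (the parameter values are illustrative; a real implementation would precompute one hash table per random subset in advance, as the answer notes, rather than scanning all keys on each query):
import random
def one_try(keys, test, cutoff, subset_size=20, bits=64):
    # step 1: pick a random subset S of bit positions and build a mask for it
    mask = 0
    for p in random.sample(range(bits), subset_size):
        mask |= 1 << p
    # step 2: collect the keys that agree with `test` on S
    # (in the real scheme this bucket comes from a precomputed hash table)
    bucket = [k for k in keys if (k & mask) == (test & mask)]
    # steps 3-4: keep only the keys within the Hamming-distance cutoff
    return [k for k in bucket if bin(k ^ test).count("1") <= cutoff]
# repeat one_try() many times and merge the result lists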
Data structures for large sets are described here: Detecting Near-Duplicates for Web Crawling
or
in-memory trie: Judy arrays at sourceforge.net
Assuming you have to visit each row to test its value (or, if you index on the bitfield, each index entry), then you can write the actual test quite efficiently using
A xor B
to find the differing bits, then bit-count the result using a technique like this.
This effectively gives you the Hamming distance.
Since this can compile down to tens of instructions per test, this can run pretty fast.
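For example, in Python the XOR-and-count test is a one-liner (int.bit_count needs Python 3.10+; bin(x).count("1") works on older versions):
def hamming(a, b):
    # XOR keeps only the differing bits; counting them gives the Hamming distance
    return (a ^ b).bit_count()
print(hamming(0b10110101, 0b10010111))   # -> 2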
If you are okay with a randomized algorithm (Monte Carlo in this case), you can use MinHash.
If the data weren't so sparse, a graph with keys as the vertices and edges linking 'adjacent' (Hamming distance = 1) nodes would probably be very efficient time-wise. The space would be very large though, so in your case, I don't think it would be a worthwhile tradeoff.
