High-Performance Hierarchical Text Search - database

I'm now in the final stages of upgrading the hierarchy design in a major transactional system, and I have been staring for a while at this 150-line query (which I'll spare you all the tedium of reading) and thinking that there has got to be a better way.
A quick summary of the question is as follows:
How would you implement a hierarchical search that matches several search terms at different levels in the hierarchy, optimized for fastest search time?
I found a somewhat related question, but it's really only about 20% of the answer I actually need. Here is the full scenario/specification:
The end goal is to find one or several arbitrary items at arbitrary positions in the hierarchy.
The complete hierarchy is about 80,000 nodes, projected to grow up to 1M within a few years.
The full text of an entire path down the hierarchy is unique and descriptive; however, the text of an individual node may not be. This is a business reality, and not a decision that was made lightly.
Example: a node might have a name like "Door", which is meaningless by itself, but the full context, "Aaron > House > Living Room > Liquor Cabinet > Door", has a clear meaning: it describes a specific door in a specific location. (Note that this is just an example; the real design is far less trivial.)
In order to find this specific door, a user might type "aaron liquor door", which would likely turn up only one result. The query is translated as a sequence: An item containing the text "door", under an item containing the text "liquor", under another item containing the text "aaron."
Or, a user might just type "house liquor" to list all the liquor cabinets in people's houses (wouldn't that be nice). I mention this example explicitly to indicate that the search need not match any particular root or leaf level. This user knows exactly which door he is looking for, but can't remember offhand who owns it, and would remember if the name popped up in front of him.
All terms must be matched in the specified sequence, but as the above examples suggest, levels in the hierarchy can be "skipped." The term "aaron booze cabinet" would not match this node.
The platform is SQL Server 2008, but I believe that this is a platform-independent problem and would prefer not to restrict answers to that platform.
The hierarchy itself is based on hierarchyid (materialized path), indexed both breadth-first and depth-first. Each hierarchy node/record has a Name column which is to be queried on. Hierarchy queries based on the node are extremely fast, so don't worry about those.
There is no strict hierarchy - a root may contain no nodes at all or may contain 30 subtrees fanning out to 10,000 leaf nodes.
The maximum nesting is arbitrary, but in practice it tends to be no more than 4-8 levels.
The hierarchy can and does change, although infrequently. Any node can be moved to any other node, with the obvious exceptions (parent can't be moved into its own child, etc.)
In case this wasn't already implied: I do have control over the design and can add indexes, fields, tables, whatever might be necessary to get the best results.
My "dream" is to provide instant feedback to the user, as in a progressive search/filter, but I understand that this may be impossible or extremely difficult. I'd be happy with any significant improvement over the current method, which usually takes between 0.5s to 1s depending on the number of results.
For the sake of completeness, the existing query (stored procedure) starts by gathering all leaf nodes containing the final term, then joins upward and excludes any whose paths don't match with the earlier terms. If this seems backward to anyone, rest assured, it is a great deal more efficient than starting with the roots and fanning out. That was the "old" way and could easily take several seconds per search.
So my question again: Is there a better (more efficient) way to perform this search?
I'm not necessarily looking for code, just approaches. I have considered a few possibilities but they all seem to have some problems:
Create a delimited "path text" column and index it with Full-Text Search. The trouble is that a search on this column would return all child nodes as well; "aaron house" also matches "aaron house kitchen" and "aaron house basement".
Create a NamePath column that is actually a nested sequence of strings, using a CLR type, similar to hierarchyid itself. The problem is, I have no idea how Microsoft is able to "translate" queries on this type into index operations, and I'm not even sure it's possible on a UDT. If the net result is just a full index scan, I've gained nothing by this approach.
It's not really the end of the world if I can't do better than what I already have; the search is "pretty fast" and nobody has complained about it. But I'm willing to bet that somebody has tackled this problem before and has some ideas. Please share them!

Take a look at Apache Lucene. You can implement very flexible yet efficient searches with it. It may be useful.
Also take a look at Search Patterns - what you are describing may fit the Faceted Search pattern.
It is quite easy to implement a trivial "Aaron House Living Door" algorithm, but I'm not sure the regular SVM/classification/entropy-based algorithms would scale to a large data set. You may also want to look at the "approximation search" concepts by Motwani and Raghavan.
Please post back what you find, if possible :-)

Hi Aaron, I have the following idea:
From your description I have the following image in my mind:
Aaron
    House
        Living Room
            Liquor Cabinet
                Table
                Door
                Window
    Cars
        Ferrari
        Mercedes
This is how your search tree might look. Now I would sort the nodes within every level:
Level 0: Aaron
Level 1: Cars, House
Level 2: Ferrari, Living Room, Mercedes   (Ferrari and Mercedes still hang under Cars, Living Room under House, so the edges now cross)
Level 3: Liquor Cabinet
Level 4: Door, Table, Window
Now it should be easy and fast to process a query:
Start with the last word in the query and the lowest node level (the leaves).
Since all the nodes are sorted within one level, you can use binary search and therefore find a match in O(log N) time, where N is the node count.
Do this for every level. There are O(log N) levels in the tree.
Once you find a match, process all of its parent nodes to see if the path matches your query. The path has length O(log N). If it matches, store it in the results to be shown to the user.
Let M be the overall number of matches (the number of nodes matching the last word in the query). Then the processing time is O( (log N)^2 + M * (log N) ):
Binary search takes O(log N) time per level and there are O(log N) levels, therefore we have to spend at least O( (log N)^2 ) time. Now, for every match, we have to test whether the complete path from the matching node up to the root matches the complete query. The path has length O(log N). Thus, given M matches overall, we spend another M * O(log N) time, and the resulting execution time is O( (log N)^2 + M * (log N) ).
When you have few matches, the processing time approaches O( (log N)^2 ), which is pretty good. In the opposite, worst case (every single path matches the query, i.e. M = N), the processing time approaches O(N log N), which is not great, but also not very likely.
Implementation:
You said that you only wanted an idea. Furthermore, my knowledge of databases is very limited, so I won't write much here, just outline some ideas.
The node table could look like this:
ID : int
Text : string
Parent : int -> node ID
Level : int // I don't expect this to change too often, so you can store it and update it as the database changes.
This table would have to be sorted by the "Text" column. Using the algorithm described above, an SQL query inside the loop might look like:
SELECT ID FROM node WHERE Level = $i AND Text LIKE $text
I hope this gets the point across.
One could speed things up even more by sorting the table not only by the "Text" column but by the combined "Level" and "Text" columns, that is, all entries within Level=20 sorted, all entries within Level=19 sorted, etc. (no overall sorting of the complete table is necessary). However, the node count PER LEVEL is in O(N), so there is no asymptotic runtime improvement, but I think it's worth trying, considering the lower constants you get in practice.
Edit: Improvement
I just noticed that the iterative algorithm is completely unnecessary (thus the Level information can be abandoned). It is fully sufficient to:
Store all nodes sorted by their text value
Find all matches for the last word of the query at once using binary search over all nodes.
From every match, trace the path up to the root and check if the path matches the whole query.
This improves the runtime to O(log N + M * (log N)).
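You only asked for approaches, but here is a minimal in-memory C++ sketch of that improved version, just to make the steps concrete. The Node layout and the function name are my own; it assumes the last term is matched as a name prefix (a plain binary search over sorted names can only anchor on prefixes or exact values), while the ancestor terms are matched as substrings on the way up:

#include <algorithm>
#include <string>
#include <vector>

struct Node {
    int id;
    int parent;            // id of the parent node, -1 for a root
    std::string name;      // lower-cased for case-insensitive matching
};

// byName: all nodes sorted by name; byId: the same nodes indexed so that byId[n.id] == n.
// terms: the query terms in order, already lower-cased, e.g. {"aaron", "liquor", "door"}.
std::vector<int> findMatches(const std::vector<Node>& byName,
                             const std::vector<Node>& byId,
                             const std::vector<std::string>& terms)
{
    std::vector<int> results;
    const std::string& last = terms.back();

    // Binary search for the contiguous block of nodes whose name starts with the last term.
    auto it = std::lower_bound(byName.begin(), byName.end(), last,
        [](const Node& n, const std::string& t) { return n.name < t; });

    for (; it != byName.end() && it->name.compare(0, last.size(), last) == 0; ++it) {
        // Walk up the ancestors, consuming the remaining terms from right to left.
        int t = (int)terms.size() - 2;
        for (int p = it->parent; p != -1 && t >= 0; p = byId[p].parent)
            if (byId[p].name.find(terms[t]) != std::string::npos)
                --t;                       // this ancestor matches the next term up
        if (t < 0)                         // every earlier term was found on the path
            results.push_back(it->id);
    }
    return results;
}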

Related

Why does searching a million keys organized as a B-tree need 114 comparisons?

Please explain how it will take 114 comparisons. The following is the screenshot taken from my book (Page 350, Data Structures Using C, 2nd Ed. Reema Thareja, Oxford Univ. Press). My reasoning is that in the worst case each node will have just the minimum number of children (i.e. 5), so I took log base 5 of a million, which comes to 9. So assuming at each level of the tree we search the minimum number of keys (i.e. 4), it comes to somewhere around 36 comparisons, nowhere near 114.
Consider a situation in which we have to search an un-indexed and unsorted database that contains n key values. The worst case running time to perform this operation would be O(n). In contrast, if the data in the database is indexed with a B tree, the same search operation will run in O(log n). For example, searching for a single key on a set of one million keys will at most require 1,000,000 comparisons. But if the same data is indexed with a B tree of order 10, then only 114 comparisons will be required in the worst case.
Page 350, Data Structures Using C, 2nd Ed., Reema Thareja, Oxford Univ. Press
The worst case tree has the minimum number of keys everywhere except on the path you're searching.
If the size of each internal node is in [5,10), then in the worst case a tree with a million items will be about 10 levels deep, given that most nodes have only 5 keys.
The worst-case path to a node, however, might have 10 keys in each node. The statement seems to assume that you'll do a linear search instead of a binary search inside each node (I would advise doing a binary search instead), so that can lead to around 10*10 = 100 comparisons.
If you carefully consider the details, the real number might very well come out to 114.
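A back-of-the-envelope version of that argument (my own arithmetic, following the reasoning above rather than the book's exact derivation):

\[
\text{height} \approx \log_5 10^6 \approx 8.6 \;\Rightarrow\; \text{roughly 9--11 levels}
\]
\[
\text{comparisons} \approx \text{levels} \times \text{keys scanned linearly per node} \approx 10 \times 10 \approx 100\text{--}120
\]

which brackets the book's figure of 114.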
(This is not an Answer to the question asked, but a related discussion.)
Sounds like a textbook question, not a real-life question.
Counting comparisons is likely to be the best way to judge an in-memory tree, but not for a disk-based dataset.
Even so, the "average" number of comparisons (for in-memory) or disk hits (for disk-based) is likely to be the metric to compute.
(Sure, it is good to compute the maximum numbers as a useful exercise for understanding the structures.)
Perhaps the optimal "tree" for in-memory searching is a binary-tree-like structure, but with 3-way fan-out, kept balanced with 2 or 3 elements in each node.
For disk-based searching -- think databases -- the optimum is likely to be a B-tree with the block size based on what is efficient to read from disk. Counting comparisons is a poor second when it comes to the overall time taken to fetch a row.

General Big-Data principles for finding pairs of similar objects - "fuzzy inner join"

Firstly, sorry for the vague title and if this question has been asked before, but I was not entirely sure how to phrase it.
I am looking for general design principles for finding pairs of 'similar' objects from two different data sources.
Let's, for simplicity, say that we have two databases, A and B, both containing large volumes of objects, each with a time-stamp and geo-location, along with some other data that we don't care about here.
Now I want to perform a search along these lines:
Within a certain time-frame and location dictated by the search terms, find pairs of objects from A and B respectively, ordered by some similarity score: for example, some scalar 'time/space distance' function, distance(a,b), that calculates the distance in time and space between the objects.
I am expecting to get a (potentially ginormous) set of results where the first result is a pair of data points which has the minimum 'distance'.
I realize that the full search space is cardinality(A) x cardinality(B).
Are there any general guidelines on how to do this in a reasonably efficient way? I assume that I would need to replicate the two databases into a common repository like Hadoop? But then what? I am not sure how to perform such a query in Hadoop either.
What is this type of query called?
To me, this is some kind of "fuzzy inner join" that I struggle to wrap my head around how to construct, let alone efficiently and at scale.
SQL joins don't have to be based on equality. You can use ">", "<", "BETWEEN".
You can even do something like this:
select a.val aval, b.val bval, a.val - b.val diff
from A join B on abs(a.val - b.val) < 100
What you need is a way to divide your objects into buckets in advance, without comparing them (or at least making a linear, rather than square, number of comparisons). That way, at query time, you will only be comparing a small number of items.
There is no "one-size-fits-all" way to bucket your items. In your case the bucketing can be based on time, geolocation, or both. Time-based bucketing is very natural and can also scale elastically (increase or decrease the bucket size). Geo-clustering buckets can be based on distance from a particular point in space (if the space is abstract), or on some finite division of the space (for example, if you divide the Earth's map into tiles, which can also scale nicely if done right).
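As a rough sketch of such bucketing (a toy C++ example of my own; the record shape, bucket widths and container choices are all assumptions, not anything prescribed by a particular store):

#include <cmath>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical record shape; the real A and B objects would carry more fields.
struct Obj { double t; double lat; double lon; };

using BucketKey = std::tuple<int64_t, int64_t, int64_t>;  // (time slot, lat cell, lon cell)

constexpr double TIME_SLOT = 3600.0;  // bucket width in seconds (tuning knob)
constexpr double CELL_DEG  = 0.01;    // bucket width in degrees (tuning knob)

BucketKey bucketOf(const Obj& o) {
    return { (int64_t)std::floor(o.t   / TIME_SLOT),
             (int64_t)std::floor(o.lat / CELL_DEG),
             (int64_t)std::floor(o.lon / CELL_DEG) };
}

// Index B once by bucket; afterwards each a in A is compared only against the
// B items in a's own bucket and its neighbouring buckets, not against all of B.
std::map<BucketKey, std::vector<const Obj*>> indexByBucket(const std::vector<Obj>& B) {
    std::map<BucketKey, std::vector<const Obj*>> idx;
    for (const Obj& b : B) idx[bucketOf(b)].push_back(&b);
    return idx;
}

The same idea carries over to a distributed setting: the bucket key becomes the partition/shuffle key, so each worker only ever sees candidate pairs that share a bucket.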
A good question to ask is "if my data starts growing rapidly, can I handle it by just adding servers?" If not, you might need to rethink the design.

Sorting in Beam Search

Although I have a good understanding of beam search, I have a question about it. When we select the n best paths, should we sort them, or should we simply keep them in the order in which they occur and just discard the other, more expensive nodes?
I have searched a lot about this, but everywhere it only says to keep the best; I found nothing about whether we should sort them or not.
I think that we should sort them, because by sorting we will reach the goal node more quickly, but I would like confirmation of this idea and have not found it so far.
I would be thankful if you could help me improve my understanding.
When we select the n best paths, should we sort them, or should we simply keep them in the order in which they occur and just discard the other, more expensive nodes?
We just sort them and keep the top k.
At each step after the initialization you sort the beam_size * vocabulary_size hypotheses and choose the top k. For each of the beam_size * vocabulary_size hypotheses, its weight/probability is the product of all probabilities along its history, normalized by the length (length normalization).
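A minimal sketch of that pruning step (the types and names are my own; it only illustrates "score every candidate, sort, keep the top k"):

#include <algorithm>
#include <cstddef>
#include <vector>

struct Hypothesis {
    std::vector<int> tokens;  // the partial output sequence
    double logProb;           // sum of log-probabilities along its history
};

// Keep the k best of the beam_size * vocabulary_size candidates,
// ranked by length-normalized log-probability (higher is better).
std::vector<Hypothesis> prune(std::vector<Hypothesis> candidates, std::size_t k) {
    auto score = [](const Hypothesis& h) {
        return h.logProb / (double)std::max<std::size_t>(1, h.tokens.size());  // length normalization
    };
    k = std::min(k, candidates.size());
    std::partial_sort(candidates.begin(), candidates.begin() + k, candidates.end(),
                      [&](const Hypothesis& a, const Hypothesis& b) {
                          return score(a) > score(b);
                      });
    candidates.resize(k);
    return candidates;
}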
One problem arises from the fact that the completed hypotheses may have different lengths. Because models generally assign lower probabilities to longer strings, a naive algorithm would also choose shorter strings for y. This was not an issue during the earlier steps of decoding; due to the breadth-first nature of beam search all the hypotheses being compared had the same length. The usual solution to this is to apply some form of length normalization to each of the hypotheses, for example simply dividing the negative log probability by the number of words:
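In the usual notation (as in the SLP3 chapter cited below), with T the number of words in hypothesis y and x the input, the length-normalized score is:

\[
\operatorname{score}(y) = \frac{1}{T} \sum_{i=1}^{T} \log P\left(y_i \mid y_1,\ldots,y_{i-1},\, x\right)
\]

(Dividing the negative log probability by the number of words is the same thing up to sign.)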
For more information please refer to this answer.
Reference:
https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
Beam search uses breadth-first search to build its search tree. At each level of the tree, it generates all successors of the states at the current level, sorting them in increasing order of heuristic cost. However, it only stores a predetermined number of best states at each level (called the beam width). Only those states are expanded next. The greater the beam width, the fewer states are pruned. With an infinite beam width, no states are pruned and beam search is identical to breadth-first search.
(Note: I got this from Wikipedia during my search; maybe it's helpful.)

runtime optimization of a matching algorithm

I made the following matching algorithm, but of course it has long runtimes...
Does anybody have an idea for making this matching faster (by changing the code or changing the algorithm)?
// For every entity and every positive GID, scan the whole grid for the
// matching (Module, ID) pair and copy its coordinates.
// Worst case: AnzEntity * 8 * AnzGrid comparisons.
for (i = 0; i < AnzEntity; i++) {
    for (j = 0; j < 8; j++) {
        if (Entity[i].GID[j] > 0) {
            for (k = 0; k < AnzGrid; k++) {
                if (Entity[i].Module == Grid[k].Module && Entity[i].GID[j] == Grid[k].ID) {
                    Entity[i].GIDCoord[j][0] = Grid[k].XYZ[0];
                    Entity[i].GIDCoord[j][1] = Grid[k].XYZ[1];
                    Entity[i].GIDCoord[j][2] = Grid[k].XYZ[2];
                    continue;
                }
            }
        }
    }
}
A very general question... for which one can give only a very general answer.
All faster search algorithms come down to divide and conquer. There's a whole family of searches which start by sorting the data to be searched, so that you can progressively halve (or better) the number of things you are searching through (eg: binary search lists, trees of all kinds, ...). There's a whole family of searches where you use some property of each value to cut the search to some (small) subset of the data (hashing). There are searches which cache recent results, which can work in some cases (eg: bring to front lists). Which of these may be suitable depends on the data.
The big thing to look at, however, is whether the data being searched changes, and if so how often. If the data does not change, then you can hit it with a big hammer and crunch out a simple structure to search. If the data changes all the time, then you need a more complicated structure so that changes are not prohibitively expensive and search speed is maintained. Depending on the circumstances, the trade-off will vary.
You are exhaustively comparing all Entity[i] (with a positive GID[j]) to all Grid[k]. This implies a total of AnzEntity * AnzGrid comparisons.
Instead, you can sort the Entity and Grid elements in increasing lexicographic order (by ID value, then by Module value to break ties). You will need one sort entry per nonzero Entity.GID[j], so an Entity can appear up to 8 times.
Exploiting the sorted order, the number of comparisons drops to about 8 * AnzEntity + AnzGrid (a single merge-like pass over both sorted sequences).
Taking the sorts into account, O(N*M) turns into O(N log N + M log M).
ALTERNATIVE:
Another option is to enter either the Entity or the Grid items into a hash table, using the ID/Module pair as the key, and use the hash table for fast lookups. This should result in behavior close to linear, O(N + M).
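A rough sketch of that hash-table variant, reusing the arrays and field names from the question (assumed to be the same globals; the 64-bit key packing is my own choice and assumes Module and ID each fit in 32 bits):

#include <cstdint>
#include <unordered_map>

// Pack (Module, ID) into a single 64-bit key.
static inline uint64_t makeKey(int module, int id) {
    return ((uint64_t)(uint32_t)module << 32) | (uint32_t)id;
}

void matchEntitiesToGrid() {
    // Build the lookup table once: O(AnzGrid).
    std::unordered_map<uint64_t, int> gridIndex;
    for (int k = 0; k < AnzGrid; k++)
        gridIndex[makeKey(Grid[k].Module, Grid[k].ID)] = k;

    // Each (Entity, GID) pair then costs one expected O(1) lookup: O(8 * AnzEntity) total.
    for (int i = 0; i < AnzEntity; i++) {
        for (int j = 0; j < 8; j++) {
            if (Entity[i].GID[j] <= 0) continue;
            auto it = gridIndex.find(makeKey(Entity[i].Module, Entity[i].GID[j]));
            if (it == gridIndex.end()) continue;
            int k = it->second;
            Entity[i].GIDCoord[j][0] = Grid[k].XYZ[0];
            Entity[i].GIDCoord[j][1] = Grid[k].XYZ[1];
            Entity[i].GIDCoord[j][2] = Grid[k].XYZ[2];
        }
    }
}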

Need algorithm for fast storage and retrieval (search) of sets and subsets

I need a way of storing sets of arbitrary size for fast query later on.
I'll be needing to query the resulting data structure for subsets or sets that are already stored.
===
Later edit: To clarify, an accepted answer to this question would be a link to a study that proposes a solution to this problem. I'm not expecting for people to develop the algorithm themselves.
I've been looking over the tuple clustering algorithm found here, but it's not exactly what I want, since from what I understand it 'clusters' the tuples into simpler, discrete/approximate forms and loses the original tuples.
Now, an even simpler example:
[alpha, beta, gamma, delta] [alpha, epsilon, delta] [gamma, niu, omega] [omega, beta]
Query:
[alpha, delta]
Result:
[alpha, beta, gamma, delta] [alpha, epsilon, delta]
So the set elements are just that, unique, unrelated elements. Forget about types and values. The elements can be tested among them for equality and that's it. I'm looking for an established algorithm (which probably has a name and a scientific paper on it) more than just creating one now, on the spot.
==
Original examples:
For example, say the database contains these sets
[A1, B1, C1, D1], [A2, B2, C1], [A3, D3], [A1, D3, C1]
If I use [A1, C1] as a query, these two sets should be returned as a result:
[A1, B1, C1, D1], [A1, D3, C1]
Example 2:
Database:
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
[number of car seats: 2, Gasoline amount: 2L]
Query:
[Distance to Berlin: 240km]
Result
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
There can be an unlimited number of 'fields' such as Gasoline amount. A solution would probably involve the database grouping and linking sets having common field values (such as Distance to Berlin: 240km) in such a way that the query is as efficient as possible.
What algorithms are there for such needs?
I am hoping there is already an established solution to this problem instead of just trying to find my own on the spot, which might not be as efficient as one tested and improved upon by other people over time.
Clarifications:
If it helps answer the question, I'm intending on using them for storing states:
Simple example:
[Has milk, Doesn't have eggs, Has Sugar]
I'm thinking such a requirement might require graphs or multidimensional arrays, but I'm not sure
Conclusion
I've implemented the two algorithms proposed in the answers, that is, Set-Trie and Inverted Index, and did some rudimentary profiling on them, measuring the duration of a query for a given set for each algorithm. Both algorithms worked on the same randomly generated data set consisting of sets of integers, and they appear to be (nearly) equivalent performance-wise.
I'm confident that I can now contribute to the solution. One possible quite efficient way is a:
Trie invented by Franklin Mark Liang
Such a special tree is used for example in spell checking or autocompletion and that actually comes close to your desired behavior, especially allowing to search for subsets quite conveniently.
The difference in your case is that you're not interested in the order of your attributes/features. For your case a Set-Trie was invented by Iztok Savnik.
What is a Set-Trie? A tree where each node except the root contains a single attribute value (a number) and a marker (bool) indicating whether a data entry ends at this node. Each subtree contains only attributes whose values are larger than the attribute value of the parent node. The root of the Set-Trie is empty. The search key is the path from the root to a certain node of the tree. The search result is the set of paths from the root to all marked nodes that you reach when you go down the tree and along the search key simultaneously (see below).
But first a drawing by me:
The attributes are {1,2,3,4,5}, which can be anything really, but we just enumerate them and therefore naturally obtain an order. The data is {{1,2,4}, {1,3}, {1,4}, {2,3,5}, {2,4}}, which in the picture is the set of paths from the root to any circle. The circles are the markers for the data in the picture.
Please note that the right subtree from root does not contain attribute 1 at all. That's the clue.
Searching, including subsets: Say you want to search for attributes 4 and 1. First you order them; the search key is {1,4}. Now, starting from the root, you go simultaneously along the search key and down the tree. This means you take the first attribute in the key (1) and go through all child nodes whose attribute is smaller than or equal to 1. There is only one, namely 1. Inside it, you take the next attribute in the key (4) and visit all child nodes whose attribute value is smaller than or equal to 4, which here is all of them. You continue until there is nothing left to do and collect all circles (data entries) that have the attribute value exactly 4 (the last attribute in the key). These are {1,2,4} and {1,4}, but not {1,3} (no 4) or {2,4} (no 1).
Insertion: Very easy. Go down the tree and store a data entry at the appropriate position. For example, data entry {2,5} would be stored as a child of {2}.
Adding attributes dynamically: This works naturally; you could immediately insert {1,4,6}, and it would go below {1,4}, of course.
I hope you understand what I want to say about Set-Tries. In the paper by Iztok Savnik it's explained in much more detail. They probably are very efficient.
I don't know if you still want to store the data in a database. I think this would complicate things further and I don't know what is the best to do then.
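For illustration, here is a minimal C++ sketch of a Set-Trie with insertion and an "all supersets of the query" search, following the description above (the names and the int element type are mine; stored sets and queries are assumed sorted ascending without duplicates):

#include <map>
#include <vector>

struct SetTrieNode {
    std::map<int, SetTrieNode> children;   // keyed by element value, kept ordered
    bool isEnd = false;                    // marker: a stored set ends here
};

struct SetTrie {
    SetTrieNode root;

    void insert(const std::vector<int>& set) {
        SetTrieNode* n = &root;
        for (int v : set) n = &n->children[v];
        n->isEnd = true;
    }

    // Collect every stored set that is a superset of `query`.
    void supersets(const std::vector<int>& query,
                   std::vector<std::vector<int>>& out) const {
        std::vector<int> path;
        search(root, query, 0, path, out);
    }

private:
    static void collectAll(const SetTrieNode& n, std::vector<int>& path,
                           std::vector<std::vector<int>>& out) {
        if (n.isEnd) out.push_back(path);
        for (const auto& [v, child] : n.children) {
            path.push_back(v);
            collectAll(child, path, out);
            path.pop_back();
        }
    }

    static void search(const SetTrieNode& n, const std::vector<int>& q, std::size_t i,
                       std::vector<int>& path, std::vector<std::vector<int>>& out) {
        if (i == q.size()) { collectAll(n, path, out); return; }
        for (const auto& [v, child] : n.children) {
            if (v > q[i]) break;   // larger values cannot contain q[i] further down
            path.push_back(v);
            search(child, q, i + (v == q[i] ? 1 : 0), path, out);
            path.pop_back();
        }
    }
};

With the example data above, inserting {1,2,4}, {1,3}, {1,4}, {2,3,5}, {2,4} and querying {1,4} yields {1,2,4} and {1,4}, matching the walkthrough.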
How about having an inverted index built of hashes?
Suppose you have your values int A, char B, bool C of different types. With std::hash (or any other hash function) you can create numeric hash values size_t Ah, Bh, Ch.
Then you define a map that maps an index to a vector of pointers to the tuples
std::map<size_t,std::vector<TupleStruct*> > mymap;
or, if you can use global indices, just
std::map<size_t,std::vector<size_t> > mymap;
For retrieval by queries X and Y, you need to
get hash value of the queries Xh and Yh
get the corresponding "sets" out of mymap
intersect the sets mymap[Xh] and mymap[Yh]
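A small sketch of that inverted index with global indices (the container names follow the snippet above; note that hash collisions can produce false positives, which you would filter out in a verification pass):

#include <algorithm>
#include <functional>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// All stored sets, by global index.
std::vector<std::vector<std::string>> sets;

// Inverted index: element hash -> indices of the sets containing that element.
std::map<size_t, std::vector<size_t>> mymap;

void addSet(const std::vector<std::string>& s) {
    size_t id = sets.size();
    sets.push_back(s);
    for (const auto& elem : s)
        mymap[std::hash<std::string>{}(elem)].push_back(id);   // postings stay sorted by id
}

// Return ids of all stored sets containing every query element
// (intersection of the postings lists).
std::vector<size_t> query(const std::vector<std::string>& q) {
    std::vector<size_t> result;
    for (size_t i = 0; i < q.size(); ++i) {
        const auto& postings = mymap[std::hash<std::string>{}(q[i])];
        if (i == 0) { result = postings; continue; }
        std::vector<size_t> tmp;
        std::set_intersection(result.begin(), result.end(),
                              postings.begin(), postings.end(),
                              std::back_inserter(tmp));
        result.swap(tmp);
    }
    return result;
}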
If I understand your needs correctly, you need a data structure that stores multiple states, with retrieval on combinations of those states.
If the states are binary (as in your examples: has milk/doesn't have milk, has sugar/doesn't have sugar) or can be converted to binary (possibly by adding more states), then you have a lightning-fast algorithm for your purpose: Bitmap Indices
Bitmap indices can do such comparisons in memory, and literally nothing compares to them in speed (ANDing bits is what computers do fastest).
http://en.wikipedia.org/wiki/Bitmap_index
Here's the link to the original work on this simple but amazing data structure: http://www.sciencedirect.com/science/article/pii/0306457385901086
Almost all SQL databases support bitmap indexing, and there are several possible optimizations for it as well (compression, etc.):
MS SQL: http://technet.microsoft.com/en-us/library/bb522541(v=sql.105).aspx
Oracle: http://www.orafaq.com/wiki/Bitmap_index
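A toy illustration of why the AND is so cheap (the fixed capacity and the map of named bitmaps are my own simplifications; a real bitmap index would also be compressed):

#include <bitset>
#include <map>
#include <string>
#include <vector>

constexpr size_t MAX_ROWS = 1000000;            // toy capacity
using Bitmap = std::bitset<MAX_ROWS>;

// One bitmap per binary state: bit r is set iff row r has that state.
std::map<std::string, Bitmap> bitmaps;

void setState(const std::string& state, size_t row) { bitmaps[state].set(row); }

// Rows having ALL the queried states: just AND the bitmaps together.
// (If `states` is empty this returns "all rows", including unused ones.)
Bitmap query(const std::vector<std::string>& states) {
    Bitmap result;
    result.set();                               // start with every bit on
    for (const auto& s : states) result &= bitmaps[s];
    return result;
}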
Edit:
Apparently the original research work on bitmap indices is no longer available for free public access.
Links to recent literature on this subject:
Bitmap Index Design Choices and Their Performance Implications
Bitmap Index Design and Evaluation
Compressing Bitmap Indexes for Faster Search Operations
This problem is known in the literature as subset query. It is equivalent to the "partial match" problem (e.g.: find all words in a dictionary matching A??PL? where ? is a "don't care" character).
One of the earliest results in this area is this paper by Ron Rivest from 1976 [1]. This [2] is a more recent paper from 2002. Hopefully, this will be enough of a starting point to do a more in-depth literature search.
[1] Rivest, Ronald L. "Partial-match retrieval algorithms." SIAM Journal on Computing 5.1 (1976): 19-50.
[2] Charikar, Moses, Piotr Indyk, and Rina Panigrahy. "New algorithms for subset query, partial match, orthogonal range searching, and related problems." Automata, Languages and Programming. Springer Berlin Heidelberg, 2002. 451-462.
This seems like a custom-made problem for a graph database. You make a node for each set or subset, and a node for each element of a set, and then you link the nodes with a relationship Contains. E.g.:
Now you put all the elements A, B, C, D, E in an index/hash table, so you can find a node in the graph in constant time. Typical performance for a query [A,B,C] will be the order of the smallest node, multiplied by the size of a typical set. E.g., to find [A,B,C], I find that the order of A is one, so I look at all the sets A is in (just S1), and then check that it contains both B and C; since the order of S1 is 4, I have to do a total of 4 comparisons.
A prebuilt graph database like Neo4j comes with a query language and will give good performance. I would imagine, provided that the typical orders in your database are not large, that its performance is far superior to the algorithms based on set representations.
Hashing is usually an efficient technique for storage and retrieval of multidimensional data. The problem here is that the number of attributes is variable and potentially very large, right? I googled a bit and found Feature Hashing on Wikipedia. The idea is basically the following:
Construct a hash of fixed length from each data entry (aka feature vector)
The length of the hash must be much smaller than the number of available features. The length is important for the performance.
On the wikipedia page there is an implementation in pseudocode (create hash for each feature contained in entry, then increase feature-vector-hash at this index position (modulo length) by one) and links to other implementations.
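That pseudocode translates to just a few lines; a sketch (the fixed length N and the string feature type are my assumptions):

#include <functional>
#include <string>
#include <vector>

// Hash a variable-length collection of features into a fixed-length vector
// (the "hashing trick"); N is the fixed length, a tuning parameter.
std::vector<int> hashFeatures(const std::vector<std::string>& features, size_t N) {
    std::vector<int> x(N, 0);
    for (const auto& f : features)
        x[std::hash<std::string>{}(f) % N] += 1;   // collisions are accepted by design
    return x;
}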
Also here on SO is a question about feature hashing and amongst others a reference to a scientific paper about Feature Hashing for Large Scale Multitask Learning.
I cannot give a complete solution but you didn't want one. I'm quite convinced this is a good approach. You'll have to play around with the length of the hash as well as with different hashing functions (bloom filter being another keyword) to optimize the speed for your special case. Also there might still be even more efficient approaches if for example retrieval speed is more important than storage (balanced trees maybe?).
