I have a large set of short strings. What are some algorithms and indexing strategies for filtering the list on items that contain a substring? For example, suppose I have a list:
val words = List(
"pick",
"prepick",
"picks",
"picking",
"kingly"
...
)
How could I find strings that contain the substring "king"? I could brute force the problem like so:
words.filter(_.indexOf("king") != -1) // yields List("picking", "kingly")
This is only practical for small sets; Today I need to support 10 million strings, with a future goal in the billions. Obviously I need to build an index. What kind of index?
I have looked at using an ngram index stored in MySQL, but I am not sure if this is the best approach. I'm not sure how to optimally query the index when the search string is longer than the ngram size.
I have also considered using Lucene, but this is optimized around token matching, not substring matching, and does not seem to support the requirement of simple substring matching. Lucene does have a few classes related to ngrams (org.apache.lucene.analysis.ngram.NGramTokenFilter is one example), but these seem to be intended for spell check and autocomplete use cases, not substring matching, and the documentation is thin.
What other algorithms and indexing strategies should I consider? Are there any open source libraries that support this? Can the SQL or Lucene strategies (above) be made to work?
Another way to illustrate the requirement is with SQL:
SELECT word FROM words WHERE word LIKE CONCAT('%', ?, '%');
Where ? is a user provided search string, and the result is a list of words that contain the search string.
How big is the longest word?
if that's about 7-8 char you may find all substrings for each and every string and insert that substrings in trie (the one is used in Aho-Corasik - http://en.wikipedia.org/wiki/Aho-Corasick)
It will take some time to build the tree but then searching for all occurances will be O(length(searched word)).
Postgres has a module which does a trigram index
That seems an interesting idea too- building a trigram index.
About a comment in your question regarding how to break down text searches greater than n-gram length:
Here's one approach which will work:
Say we have a search string as "abcde" , and we have built a trigram index. (You have strings which are of smaller lengths-this could hit a sweet spot for you)
Let the search results of abc= S1, bcd=S2,cde=S3 (where S1,S2,S3 are sets of indexes )
Then the longest common substring of S1,S2,S3 will give the indexes that we want.
We can transform each set of indexes,as a single string separated by a delimiter (say space) before doing LCS.
After we find the LCS,we would have to search the indexes for the complete pattern,since we have broken down the search term. ie we would have to prune results which have "abc-XYZ- bcd-HJI-def"
The LCS of a set of strings can be efficiently found Suffix Arrays. or Suffix trees
Related
Suppose I have a very long string, such as a filepath, and I want to search for something in it. For example, something like the $ find command. It seems like a basic implementation of this would be along the lines of:
if(strstr(sent, word) != NULL) {
return 1;
}
Would there be any performance difference between doing that and something like Boyer Moore? Or does strstr already do something just as efficient?
Basically, I have about a billion very long strings, and I'm looking to do a fast(ish) find on them (without any indexing), based on the most efficient substring implementation. What should I use?
Update: To give a more concrete example, let's say I have a billion filepaths I want to search through:
/archive/1002/myfile.txt
/archive/1002/newer.mov
/user/tom/local_2014version1.mov
And from this I would search either one or more strings. Example samples would be:
"1002" // would return the first two fileds
"mov version tom" // would return the first row
Advanced search algorithms like Boyer-Moore and Aho-Corasick work by precomputing lookup tables from the string(s) to be searched for, which incurs a large start-up time. It's very unlikely that searching something as small as a pathname would be able to make up for that high overhead. You really have to be searching something like multi-page documents before those algorithms show their value.
I have tested all available Analyzers on my Search Index. But none, except Keyword Analyzer, gave me proper sorted results in alphabetical order. But Keyword Analyzer doesn't fit in my filtering requirements. With Keyword Analyzer i couldn't search for a sub-string in a given sentence.
Example: "description": "This is 2 test different Analyzers in a Search Index"
Whitespace Analyzer gives proper search results but it doesn't help me with sorting. Does anyone have pointers on how we can achieve both sorting and searching with Search Index?
The analyzers define how text is broken into words and how those words are truncated (stemmed) into tokens for indexing. For example, the keyword analyzer keeps words intact in their entirety which is handy for tags.
Analyzers don't have much to do with sorting. By default, sorting is by "best match first" i.e. the documents that are the closest match to your input string appear first, which is what you might expect from a search engine.
You can override the default sort by supplying a sort parameter. e.g.
e.g. q=frank+sinatra&sort=date
See https://console.bluemix.net/docs/services/Cloudant/api/search.html#search for further sorting options.
I've got a vocabulary, a, abandon, ... , z.
For some reason, I will use array rather than Trie to store them.
Thus a simple method can be: wordA\0wordB\0wordC\0...word\0
But there are some more economic methods for memory I think.
Since like is a substring of likely, we can only store the first position and length of like instead of the string itself. Thus we generate a "large string" which contains every words in vocabulary and use position[i] and length[i] to get the i-th word.
For example, vocabulary contains three words ab, cd and bc.
I construct abcd as the "large string".
position[0] = 0, length[0] = 2
position[1] = 2, length[1] = 2
position[2] = 1, length[2] = 2
So how to generate the "large string" is the key to this problem, are there any cool suggestions?
I think the problem is similar to TSP problem(Traveling Salesman Problem), which is a NP problem.
The search keyword you're looking for is "dictionary". i.e. data structures that can be used to store a list of words, and test other strings for being present or absent in the dictionary.
Your idea is more compact than storing every word separately, but far less compact than a good data structure like a DAWG. As you note, it isn't obvious how to optimally choose how to overlap your strings. What you're doing is a bit like what a lossless compression scheme (like gzip) would do. If you don't need to check words against your compact dictionary, maybe just use gzip or LZMA to compress a sorted word list. Let their algorithms find the redundancy and represent it compactly.
I looked into dictionaries for a recent SO answer that caught my interest: Memory-constrained external sorting of strings, with duplicates combined&counted, on a critical server (billions of filenames)
For a dictionary that doesn't have to have new words added on the fly, a Directed Acyclic Word Graph is the way to go. You match a string against it by following graph nodes until you either hit a point where there's no edge matching the next character, or you get to the end of your input string and find that the node in the DAWG is marked as being a valid end-of-word. (Rather than simply a substring that is only a prefix of some words). There are algorithms for building these state-machines from a simple array-of-words dictionary in reasonable time.
Your method can only take advantage of redundancy when a whole word is a substring of another word, or end-of-one, start-of-another. A DAWG can take advantage of common substrings everywhere, and is also quite fast to match words against. Probably comparable speed to binary-searching your data structure, esp. if the giant string is too big to fit in the cache. (Once you start exceeding cache size, compactness of the data structure starts to outweigh code complexity for speed.)
Less complex but still efficient is a Trie (or Radix Trie), where common prefixes are merged, but common substrings later in words don't converge again.
If you don't need to modify your DAWG or Trie at all, you can store it efficiently in a single block of memory, rather than dynamically allocating each node. You didn't say why you didn't want to use a Trie, and also didn't acknowledge the existence of the other data structures that do this job better than a plain Trie.
I have a large document that I want to build an index of for word searching. (I hear this type of array is really called a concordances). Currently it takes about 10 minutes. Is there a fast way to do it? Currently I iterate through each paragraph and if I find a word I have not encountered before, I add it too my word array, along with the paragraph number in a subsidiary array, any time I encounter that same word again, I add the paragraph number to the index. :
associativeArray={chocolate:[10,30,35,200,50001],parsnips:[5,500,100403]}
This takes forever, well, 5 minutes or so. I tried converting this array to a string, but it is so large it won't work to include in a program file, even after removing stop words, and would take a while to convert back to an array anyway.
Is there a faster way to build a text index other than linear brute force? I'm not looking for a product that will do the index for me, just the fastest known algorithm. The index should be accurate, not fuzzy, and there will be no need for partial searches.
I think the best idea is to build a trie, adding a word at the time of your text, and having for each leaf a List of location you can find that word.
This would not only save you some space, since storing word with similar prefixes will require way less space, but the search will be faster too. Search time is O(M) where M is the maximum string length, and insert time is O(n) where n is the length of the key you are inserting.
Since the obvious alternative is an hash table, here you can find some more comparison between the two.
I would use a HashMap<String, List<Occurrency>> This way you can check if a word is already in yoz index in about O(1).
At the end, when you have all word collected and want to search them very often, you might try to find a hash-function that has no or nearly no collisions. This way you can guarantee O(1) time for the search (or nearly O(1) if you have still some collisions).
Well, apart from going along with MrSmith42's suggestion of using the built in HashMap, I also wonder how much time you are spending tracking the paragraph number?
Would it be faster to change things to track line numbers instead? (Especially if you are reading the input line-by-line).
There are a few things unclear in your question, like what do you mean in "I tried converting this array to a string, but it is so large it won't work to include in a program file, even after removing stop words, and would take a while to convert back to an array anyway."?! What array, is your input in form of array of paragraphs or do you mean the concordance entries per word, or what.
It is also unclear why your program is so slow, probably there is something inefficient there - i suspect is you check "if I find a word I have not encountered before" - i presume you look up the word in the dictionary and then iterate through the array of occurrences to see if paragraph# is there? That's slow linear search, you will be better served to use a set there (think hash/dictionary where you care only about the keys), kind of
concord = {
'chocolate': {10:1, 30:1, 35:1, 200:1, 50001:1},
'parsnips': {5:1, 500:1, 100403:1}
}
and your check then becomes if paraNum in concord[word]: ... instead of a loop or binary search.
PS. actually assuming you are keeping list of occurrences in array AND scanning the text from 1st to last paragraph, that means arrays will form sorted, so you only need to check the very last element there if word in concord and paraNum == concord[word][-1]:. (Examples are in pseudocode/python but you can translate to your language)
I have created a fulltext catalog that stores the data from some of the columns in a table, but the contents seem to have been split apart by characters that I don't really want to be considered word delimiters. ("/", "-", "_" etc..)
I know that I can set the language for word breaker, and http://msdn.microsoft.com/en-us/library/ms345188.aspx gives som idea on how to install new languages - but I need more direct control than that, because all of those languages still break on the characters I want to not break on.
Is there a way to define my own language to use for finding word breakers?
Full text indexes only consider the characters _ and ` while indexing. All the other characters are ignored and the words get split where these characters occur. This is mainly because full text indexes are designed to index large documents and there only proper words are considered to make it a more refined search.
We faced a similar problem. To solve this we actually had a translation table, where characters like #,-, / were replaced with special sequences like '`at`','`dash`','`slash`' etc. While searching in the full text, u've to again replace ur characters in the search string with these special sequences and search. This should take care of the special characters.
The ability to configure FTS indexing is fairly limited out of the box. I don't think that you can use languages to do this.
If you are up for a challenge, and have access to some C++ knowledge, you can always write a custom IFilter implementation. It's not trivial, but not too difficult. See here for IFilter resources.