solr fuzzy vs wildcard vs stemmer - solr

I have couple of questions here.
I want to search a term jumps
With Fuzzy search, I can do jump~
With wild card search, I can do jump*
With stemmer I can do, jump
My understanding is that, fuzzy search gives pump. Wildcard search gives jumping as well. Stemmer gives "jumper" also.
I totally agree with the results.
What is the performance of thes three?
Wild card is not recommended if it is at the beginning of the term - my understanding as it has to match with all the tokens in the index - But in this case, it would be all the tokens which starts jump
Fuzzy search gives me unpredicted results - It has to do something kind of spellcheck I assume.
Stemmer suits only particular scenarios like it can;t match pumps.
How should I use these things which can give more relevant results?
I probably more confused about all these because of this section. Any suggestions please?

Question 1
Wildcard queries are (generally) not analysed (i.e. they're not tokenized or run through filters), meaning that anything that depend on filters doing their processing of the input/output tokens will give weird results (for example if the input string is broken into multiple strings).
The matching happens on the tokens, so what you've input is almost (lowercasing still works) matched directly against the prefix / postfix of the tokens in the index. Generally you'd want to avoid wildcard queries for general search queries, since they're rather limited for natural search and can give weird results (as shown).
Fuzzy search is based on "edit distance" - i.e. a number that tells Solr how many characters can be removed/inserted/changed to get to the resulting token. This will give your users OK-ish results, but might be hard to decipher in the sense of "why did this give me a hit" when the allowed distance is larger (Lucene/Solr supports up to 2 in edit distance which is also the default if no edit distance is given).
Stemming is usually the way to go, as it's the actual "formal" process of taking a term and reducing it down to its stem - the actual "meaning" (it doesn't really know anything about the meaning as in the natural language processing term, but it does it according to a set of static rules and exceptions for the language configured) of the word . It can be adjusted per language to rules suitable for that language, which neither of the two other options can.
For your downside regarding stemming ("Since it can't match pumps") - that might actually be a good thing. It'll be clearer to your users what the search results are based on, and instead of including pumps in your search result, include it as a spelling correction ("Did you mean pump / pumps instead?"). It'll give a far better experience for any user, where the search results will more closely match what they're searching for.
The requirements might differ based on what your actual use case is; i.e. if it's just for programmatic attempts to find terms that look similar.
Question 2
Present those results you deem more relevant as the first hits - if you're doing wildcard or fuzzy searches you can't do this through scoring alone, so you'll have to make several queries and then present them after each other. I usually suggest making that an explicit action by the user of the search when discussing this in projects.
Instead, as the main search, you can use an NGramFilter in a separate field and use a copyfield instruction to get the same content into both fields - and then score the ngramfilter far lower than hits in the more "exact" field. Usually you want three fields in that case - one for exact hits (non-stemmed), one for stemmed hits and one for ngram hits - and then score them appropriately with the qf parameter to edismax. It usually gives you the quickest and easiest results to a decent search results for your users, but make sure to give them decent ways of either filtering the result set (facets) or change their queries into something more meaningful (did you mean, also see xyz, etc.).
Guessing the user's intent is usually very hard unless you have invested a lot of time and resources into personalisation (think Google), so leave that for later - most users are happy as long as they have a clear and distinct way of solving their own problems, even if you don't get it perfect for the first result.

For question 2 you can go strict to permissive.
Option one: Only give strict search result. If no result found give stemmer results. Continue with fuzzy or wildcard search if no result found previously.
Option two: Give all results but rank them by level (ie. first exact match, then stemmer result, ...)

Related

Using Topic Model, how should we set up a "stop words" list?

There are some standard stop lists, giving words like "a the of not" to be removed from corpus. However, I'm wondering, should the stop list change case by case?
For example, I have 10K of articles from a journal, then because of the structure of an article, basically you will see words like "introduction, review, conclusion, page" in every article. My concern is: should we remove these words from our corpus? (the words that every document has?) Thanks to every comment and suggestion.
I am working on a similar problem, but of text categorization. From my experience, it is good to have a domain specific set of stop word list along with the standard .
list. Otherwise, these words like "introduction","review" etc. will come up in the term frequency matrix, if you have tried out analysing it. It can mislead your models by giving more weights to these domain specific keywords.
Worth to consider is that the stop words might not affect your model as much as you fear. Have you tried not removing them and compared the results?
See also this 2017 paper: "Pulling Out the Stops: Rethinking Stopword Removal for Topic Models." http://www.cs.cornell.edu/~xanda/stopwords2017.pdf
In conclusion they say (paraphrasing) that removing stopwords had no real negative effect on the quality of the LDA model, and if needed they could still be removed afterwards without impacting the model.
Alternatively you can always remove words with a high document frequency automatically, i.e. set a treshold of the amount of documents the word can appear in (e.g. 50%) and just remove all words that are more frequent than those as stopwords.
I don't think this will meaningfully impact the model itself, but I'm sure it'll speed up the computations of the model, by virtue of there being less words to compute.

How to avoid appengine search QueryErrors resulting from syntax errors in the query string?

When I enter "I am searching for an entire stringoni google, it doesn't barf up an error at me. Instead, it tries to guess what I mean. What's the best way to make sure that whatever the user enters as a query, they'll always get something (preferably something usefu) back when their query is passed into search.Index("myIndex").search()?
Well, there's no easy way to do that out of the box.
What Google does is running very complex algorithms to infer not only what you said, but "what you actually meant". This of course is a very complex problem to solve, and certainly not something you can do by simply flipping a switch.
One thing that you can do though, is to start using the ~ character in your searches. This is called "Stemming":
To search for common variations of a word, like plural forms and verb
endings, use the ~ stem operator (the tilde character). This is a
prefix operator which must precede a value with no intervening space.
The value ~cat will match "cat" or "cats," and likewise ~dog matches
"dog" or "dogs." The stemming algorithm is not fool-proof. The value
~care will match "care" and "caring," but not "cares" or "cared."
Stemming is only used when searching text and HTML fields.
There are a bunch other stuff you can do to make your searches "smarter". Check the following link:
https://developers.google.com/appengine/docs/python/search/query_strings

Getting Solr to provide spelling suggestions for "correctly" spelled words

Let's say we have two words with easily confused spellings. Let's say we take the terms:
derailer (a device used to prevent fouling of a rail track)
vs.
Derailleur (a device used for changing gear ratios on a bicycle)
Now for some reason we have both terms in our spelling suggestions. As a result a search for one or the other will never yield spelling suggestions, despite it being likely that you'll get poor (or no) results if you meant the other term.
So the question is how can I convince solr to give me spelling suggestions if you search for one or the other, and what controls do I have to ensure that not every search results in showing spelling suggestions?

how to do fuzzy search in big data

I'm new to that area and I wondering mostly what the state-of-the-art is and where I can read about it.
Let's assume that I just have a key/value store and I have some distance(key1,key2) defined somehow (not sure if it must be a metric, i.e. if the triangle inequality must hold always).
What I want is mostly a search(key) function which returns me all items with keys up to a certain distance to the search-key. Maybe that distance-limit is configureable. Maybe this is also just a lazy iterator. Maybe there can also be a count-limit and an item (key,value) is with some probability P in the returned set where P = 1/distance(key,search-key) or so (i.e., the perfect match would certainly be in the set and close matches at least with high probability).
One example application is fingerprint matching in MusicBrainz. They use the AcoustId fingerprint and have defined this compare function. They use the PostgreSQL GIN Index and I guess (although I haven't fully understood/read the acoustid-server code) the GIN Partial Match Algorithm but I haven't fully understand wether that is what I asked for and how it works.
For text, what I have found so far is to use some phonetic algorithm to simplify words based on their pronunciation. An example is here. This is mostly to break the search-space down to a smaller space. However, that has several limitations, e.g. it must still be a perfect match in the smaller space.
But anyway, I am also searching for a more generic solution, if that exists.
There is no (fast) generic solution, each application will need different approach.
Neither of the two examples actually does traditional nearest neighbor search. AcoustID (I'm the author) is just looking for exact matches, but it searches in a very high number of hashes in hope that some of them will match. The phonetic search example uses metaphone to convert words to their phonetic representation and is also only looking for exact matches.
You will find that if you have a lot of data, exact search using huge hash tables is the only thing you can realistically do. The problem then becomes how to convert your fuzzy matching to exact search.
A common approach is to use locality-sensitive hashing (LSH) with a smart hashing method, but as you can see in your two examples, sometimes you can get away with even simpler approach.
Btw, you are looking specifically for text search, the simplest way you can do it split your input to N-grams and index those. Depending on how your distance function is defined, that might give you the right candidate matches without too much work.
I suggest you take a look at FLANN Fast Approximate Nearest Neighbors. Fuzzy search in big data is also known as approximate nearest neighbors.
This library offers you different metric, e.g Euclidian, Hamming and different methods of clustering: LSH or k-means for instance.
The search is always in 2 phases. First you feed the system with data to train the algorithm, this is potentially time consuming depending on your data.
I successfully clustered 13 millions data in less than a minute though (using LSH).
Then comes the search phase, which is very fast. You can specify a maximum distance and/or the maximum numbers of neighbors.
As Lukas said, there is no good generic solution, each domain will have its tricks to make it faster or find a better way using the inner property of the data your using.
Shazam uses a special technique with geometrical projections to quickly find your song. In computer vision we often use the BOW: Bag of words, which originally appeared in text retrieval.
If you can see your data as a graph, there are other methods for approximate matching using spectral graph theory for instance.
Let us know.
Depends on what your key/values are like, the Levenshtein algorithm (also called Edit-Distance) can help. It calculates the least number of edit operations that are necessary to modify one string to obtain another string.
http://en.wikipedia.org/wiki/Levenshtein_distance
http://www.levenshtein.net/

Solr - Nearest Match - Does this functionality exist?

Can Solr give you a nearest match when comparing "fingerprint" type data stored in the Solr datastore. For example,
eJyFk0uyJSEIBbcEyEeWAwj7X8JzfDvKnuTAJIojWACwGB4QeM
HWCw0vLHlB8IWeF6hf4PNC2QunX3inWvDCO9WsF7heGHrhvYV3qvPEu-
87s9ELLi_8J9VzknReEH1h-BOKRULBwyZiEulgQZZr5a6OS8tqCo00cd
p86ymhoxZrbtQdgUxQvX5sIlF_2gUGQUDbM_ZoC28DDkpKNCHVkKCgpd
OHf-wweX9adQycnWtUoDjABumQwbJOXSZNur08Ew4ra8lxnMNuveIem6
LVLQKsIRLAe4gbj5Uxl96RpdOQ_Noz7f5pObz3_WqvEytYVsa6P707Jz
j4Oa7BVgpbKX5tS_qntcB9G--1tc7ZDU1HamuDI6q07vNpQTFx22avyR
Can it find this record if it was presented with something extremely similar? And can it provide back a confidence score?
one straighforward approach could be to use a fuzzy search, and pick the first hit (by score), then you need to check whether the hit is good a match or not, maybe by testing you could find some good rule of thumbs.
But not sure if perf would be an issue with such long tokens. Use Lucene4.0 where fuzzy perf is much improved.
You may try experimenting with Ngram filter factory. You may pick a min/max gram size that is consistent with a matching/similar finger print.
If you have a tight range of minGramSize and maxGramSize, you can match documents with similar fingerprint without having to iterate over false positives.

Resources