Solr/Lucene fuzzy search too slow - solr

I am trying to implement fuzzy search over locations (cities, regions, countries, objects) using a Solr server. Currently, my index contains about 0.8-1.0 M items. Fuzzy search (~0.7) gives really good results but is too slow for me (very often 0.2-0.6 sec). The tokenizer in use is <tokenizer class="solr.StandardTokenizerFactory"/>. As an alternative I tried <tokenizer class="solr.WhitespaceTokenizerFactory"/>: it is great in terms of performance (about 100x faster), but it does not offer fuzzy search :(
Do you know of a different approach I could use? I would like to keep the fuzzy search feature but make it much faster, if possible.
Thanks a lot!

Your problem is not related to the analyzer that you use. When you search for Califrna~0.7, Lucene iterates over all terms in the index and calculates the (Levenshtein) edit distance between "Califrna" and each term. This is a very expensive operation.
This issue will be solved in Lucene 4.0. Unfortunately, the Lucene version that currently ships with Solr still uses the old brute-force approach.
https://issues.apache.org/jira/browse/LUCENE-2089
http://java.dzone.com/news/lucenes-fuzzyquery-100-times
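To make the cost of the brute-force approach concrete, here is a minimal Python sketch of the idea. It is not Lucene's actual code: the function names and the tiny term list are made up for illustration, and a maximum edit count stands in for the ~0.7 similarity threshold. The point is only that every term in the index is compared against the query.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def brute_force_fuzzy(query: str, terms: list[str], max_edits: int) -> list[str]:
    """Compare the query against EVERY term: one distance computation per term."""
    return [t for t in terms if levenshtein(query, t) <= max_edits]


# Hypothetical term dictionary; a real index would hold hundreds of thousands of terms.
terms = ["california", "carolina", "colorado", "calgary"]
matches = brute_force_fuzzy("califrna", terms, max_edits=2)
print(matches)
```

With roughly a million terms, each query pays for a million distance computations, which is consistent with the response times reported in the question.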
If it is OK for you, I suggest downloading Solr/Lucene from trunk and testing how the new fuzzy query works.
http://wiki.apache.org/solr/NightlyBuilds
Even though trunk is stable, it is not recommended for production use. I can suggest two similar alternatives:
1 - SpellChecker
http://wiki.apache.org/solr/SpellCheckComponent
http://www.lucidimagination.com/blog/2010/08/31/getting-started-spell-checking-with-apache-lucene-and-solr/
SpellChecker builds its own small index of n-grams in order to perform fast lookups. It also uses Levenshtein distance, but instead of iterating over all terms it only calculates the distance for related terms.
You first run the spell checker for "Califrna", and it will suggest "California". Then you can use "California" in the query against your main index, without a fuzzy query.
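The n-gram idea can be sketched in a few lines of Python. This is only an illustration of the principle and not Solr's SpellChecker implementation: the helper names, the trigram size, the "^"/"$" padding and the sample terms are all assumptions. Only terms sharing at least one n-gram with the misspelled word are scored with the edit distance.

```python
from collections import defaultdict


def ngrams(word: str, n: int = 3) -> set[str]:
    """Character n-grams of the word, padded with start/end markers."""
    padded = f"^{word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_ngram_index(terms: list[str], n: int = 3) -> dict[str, set[str]]:
    """Map each n-gram to the set of terms that contain it."""
    index = defaultdict(set)
    for term in terms:
        for gram in ngrams(term, n):
            index[gram].add(term)
    return index


def candidate_terms(word: str, index: dict[str, set[str]], n: int = 3) -> set[str]:
    """Terms sharing at least one n-gram with the word (the 'related terms')."""
    found = set()
    for gram in ngrams(word, n):
        found |= index.get(gram, set())
    return found


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def suggest(word: str, index: dict[str, set[str]], max_edits: int = 2) -> list[str]:
    """Score only the candidates, closest first."""
    scored = sorted((edit_distance(word, c), c) for c in candidate_terms(word, index))
    return [c for d, c in scored if d <= max_edits]


index = build_ngram_index(["california", "carolina", "colorado", "calgary"])
print(suggest("califrna", index))
```

In this toy data, "colorado" shares no trigram with "califrna", so it is never scored at all; on a large dictionary that pruning is where the speedup comes from.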
2 - Auto Suggest
http://wiki.apache.org/solr/Suggester
You can offer the correct spelling as the user types the query, using the Suggester component. This will be a lot faster. It supports fuzzy search with the JaspellLookup class. JaspellLookup needs to be updated in order to enable fuzzy search; the wiki does not say much about what needs to be updated, though. If usePrefix is set to false, it should perform a fuzzy lookup, I guess.

Related

How to disable Solr query analysis?

I'm working with solr5.2 and I'm using termVectors with solrj (but an answer not using solrj would be nice as well).
From a first query, I obtain termVectors, and I'd like to query again my index with some of the terms from these termVectors.
However the terms from termVectors are obviously already stemmed, and I'd like to go directly to the corresponding entry in the index, without going through the query analysis step (otherwise, my stem will be stemmed again, which can lead to a different entry).
A workaround would be to stem all terms at indexing time, and to index them in a separate String field, but I'd like to avoid this ugly solution.
Is there a better way?
You can define separate analysis chains for query and indexing (I read your caveat as having to do it outside of Solr, as you're talking about String fields):
<analyzer type="index">
So you could have one field that does not perform stemming on query, just on indexing. That might not be suitable for your primary field, so add a second one and use copyField to index into that field as well.

lucene Fields vs. DocValues

I'm using and playing with Lucene to index our data and I've come across some strange behaviors concerning DocValues Fields.
So, could anyone please explain the difference between a regular Document field (like StringField, TextField, IntField etc.) and DocValues fields (like IntDocValuesField, SortedDocValuesField etc.; the types seem to have changed in Lucene 5.0)?
First, why can't I access DocValues using document.get(fieldname)? If that is not possible, how can I access them?
Second, I've seen that in Lucene 5.0 some features are changed, for example sorting can only be done on DocValues... why is that?
Third, DocValues can be updated but regular fields cannot (you have to delete and add the whole document)...
Also, and perhaps most important, when should I use DocValues and when regular fields?
Joseph
Most of these questions are quickly answered by either referring to the Solr Wiki or to a web search, but to get the gist of DocValues: they're useful for all the other stuff associated with a modern Search service except for the actual searching. From the Solr Community Wiki:
DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
...
DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
This should also answer why Lucene 5 requires DocValues for sorting - it's a lot more efficient than the previous approach.
The reason is that the storage layout is turned around from the standard inverted index. Previously, to gather data for these operations, the application had to go through the term-to-document index to work out each document's value (building the fieldCache in memory); with DocValues it can look up the value for a given document directly. This is very useful when you already have a list of matching documents and need their values for sorting or faceting.
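As a toy illustration of the "document-to-value mapping" from the quote above, the Python sketch below contrasts a term-to-documents (inverted) layout with a per-document column. It is a simplified model with made-up data and names, not Lucene's on-disk format or API.

```python
# Hypothetical documents: doc id -> value of a single "city" field.
docs = {0: "berlin", 1: "amsterdam", 2: "berlin", 3: "cairo"}

# Inverted index: term -> list of doc ids. Ideal for "which docs contain 'berlin'?"
inverted: dict[str, list[int]] = {}
for doc_id, city in docs.items():
    inverted.setdefault(city, []).append(doc_id)

# DocValues-style column: position = doc id, entry = value. Ideal for
# "what is the value for doc 3?", which sorting and faceting ask for every hit.
doc_values = [docs[i] for i in range(len(docs))]


def value_via_inverted(doc_id: int):
    """Recovering a doc's value from the inverted index means scanning the terms."""
    for term, postings in inverted.items():
        if doc_id in postings:
            return term
    return None


# Sorting a list of hits only needs one direct column lookup per document.
hits = [3, 0, 1]
sorted_hits = sorted(hits, key=lambda d: doc_values[d])
print(sorted_hits)
```

Both structures hold the same information; the difference is which question each one answers without a scan.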
If I remember correctly, updating a DocValue-based field involves yanking the document out from the previous token list, and then re-inserting it into the new location, compared to the previous approach where it would change loads of dependencies (and reindexing was the only viable strategy).
Use DocValues for fields that need any of the properties mentioned above, such as sorting / faceting / etc.

Complex queries with Solr 4

I would like to fire complex queries in Solr 4. If I am using Lucene, I can search using XML Query parser and get the results I need. However, I am not able to see how to use the XML Query Parser in Solr.
I need to be able to execute queries with proximity searches, booleans, wildcards, span or, phrases (although these can be handled by proximity searches).
Guidance on material on how to proceed also welcome.
Regards
Puneet
As far as I know it's still a work in progress. More info can be found at their Jira. You can of course use the normal query language, it's also capable of doing pretty complex things, for example:
"a proximity search"~2 AND *wildcards* OR "a phrase"
As you can see you can search for phrases, boolean operators (AND, OR, ...), span, proximity and wildcards. For more information about the query syntax look at the Lucene documentation. Solr also added some extra features on top of the Lucene query parser and more information about that can be found at the Solr wiki.
Solr 4.8 now has the "complexphrase" query parser built in that can construct all sorts of complex proximity queries (i.e. phrase queries with embedded boolean logic and wildcards).
You can use a query URL like this:
http://xx.xxx.xx.xx:8983/solr/collectionname/select?indent=on&q={!complexphrase%20inOrder=true}"good*"&wt=json&fl=Category,keywords,ImageID
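Characters such as braces, quotes and spaces in the local-params prefix need URL encoding, so it can be easier to build the request programmatically. The following is a minimal sketch using only Python's standard library; the host "localhost" is a placeholder assumption, while the collection name and field list are taken from the URL above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder host; replace with your own Solr server and collection.
base = "http://localhost:8983/solr/collectionname/select"

params = {
    "indent": "on",
    # Local params select the complexphrase parser for this query only.
    "q": '{!complexphrase inOrder=true}"good*"',
    "wt": "json",
    "fl": "Category,keywords,ImageID",
}

# urlencode percent-encodes braces, quotes, '=' and encodes the space as '+'.
url = f"{base}?{urlencode(params)}"
print(url)

# Decoding the query string again shows the parameters survive unchanged.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```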

Apply Solr filter query to only part of the search results

I have a Solr solution working which requires two queries, but I'm looking for a way to do it in a single query. My idea is that if I can figure out a way to do this, I won't have to incur the overhead of twice the load on the Solr cluster.
The details: I'm running a simple query like "q=camera" with a filter query of, say, "fq=type:digital". The second query is identical to the first, but the filter is the inverse, like "fq=-type:digital". I'm imagining that if there's a way to run a single query while applying the first filter to get the first set of topDocs, and then generate a second set with the second filter, the results could be merged and returned (it doesn't matter if sorting re-sorts and mixes the two sets).
I experimented with partitioning the data by marking a specific field during indexing, into two different groups and then using Solr "grouping" queries, but the response time for these wasn't acceptable in my setup.
I'm looking for suggestions on the most Solr-congruent approach to experiment with: tuning to improve the performance of the two-query solution, or investigating a kind of custom Solr post-filter (I read Yonik's 2/2012 blog post).
I have to implement this in Solr 3.5, although if there's a slam dunk solution in 4.0 I'll eventually be able to move to that.
I can think of two alternative approaches:
Instead of filtering the results, use a higher boost so that all the results for type:digital come out on top and the rest of the documents follow. No need for separate queries. The boost can be varied according to the type value.
The other approach is not to display the results for types other than digital. However, you can display facets for the other types, with their counts, so that users know whether other types exist for the search term. You can check the documentation on tagging and excluding filters.
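Both suggestions can be expressed as request parameters. The sketch below only builds the query strings with Python's standard library and does not contact a server. The boost value 10 and the tag name "typeFilter" are arbitrary assumptions; "camera", "type" and "digital" come from the question.

```python
from urllib.parse import parse_qsl, urlencode

# Approach 1: no filter. "camera" is required (+), the type clause is optional
# but boosted, so matching digital cameras score higher than other cameras.
boost_params = [
    ("q", "+camera type:digital^10"),
]

# Approach 2: filter to digital, but tag the filter and exclude it when
# faceting, so facet counts for the other types are still reported.
facet_params = [
    ("q", "camera"),
    ("fq", "{!tag=typeFilter}type:digital"),
    ("facet", "true"),
    ("facet.field", "{!ex=typeFilter}type"),
]

boost_query = urlencode(boost_params)
facet_query = urlencode(facet_params)
print(boost_query)
print(facet_query)
```

Either query string would be appended to the select handler URL of your own Solr instance; whether the boosted ordering is acceptable depends on how the results are sorted and displayed.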
Result grouping might give you what you want. Just group by that parameter and specify sufficient top number of documents in each group.
But I would test whether its performance is any better than two queries, because the documentation mentions performance in its limitations section.

Create a Solr Index using Lucene IndexWriter

I need to index vast amounts of content in extremely short order. I have tried various techniques using SolrNet/Solr with threading and TPL, but the speeds leave a lot to be desired. Hence I am considering a move to the Lucene.NET IndexWriter to create the index (preliminarily I see almost an order of magnitude of speed improvement). Any "gotchas" to be aware of?
I am not too sure if:
1. Trie-based numeric range queries would continue to be available via Solr (I am using NumericFields in Lucene)?
2. Faceting etc. would continue to be available?
Anything else I need to watch out for?
Please see Scaling Lucene and Solr about improving run times.
If you decide to go with Lucene:
You need a unique id field for the index to be a valid Solr index.
The schema must match the Solr schema.
The Lucene version must be the same as in Solr.
I think the range query and faceting will be available, as long as you index the respective fields according to the requirements in Solr, and use the same analyzers.
