I'm working with Solr 5.2 and I'm using termVectors with SolrJ (but an answer not using SolrJ would be nice as well).
From a first query, I obtain termVectors, and I'd like to query my index again with some of the terms from these termVectors.
However, the terms from termVectors are obviously already stemmed, and I'd like to go directly to the corresponding entry in the index without going through the query analysis step (otherwise my stem will be stemmed again, which can lead to a different entry).
A workaround would be to stem all terms at indexing time, and to index them in a separate String field, but I'd like to avoid this ugly solution.
Is there a better way?
You can define separate analysis chains for query and indexing (I read your caveat as having to do it outside of Solr, as you're talking about String fields):
<analyzer type="index">
So you could have one field that does not perform stemming on query, just on indexing. That might not be suitable for your primary field, so add a second one and use copyField to index into that field as well.
Is there any parameter (for edismax, dismax, or any other query parser) that I can set to make stemming work in Solr, or do I need to change Solr's schema.xml to implement stemming?
The problem is that if I change schema.xml, stemming/phonetic matching is applied by default, which I don't want. I am using Solr from a third-party application, and the UI has a checkbox to turn stemming on or off. I pass parameters to Solr and get the data back from Solr, but I can't pass this UI parameter to Solr. So is there a parameter on the Solr side that I can pass to make stemming work?
Please let me know.
Stemming is performed as part of the analysis chain, and is therefore part of how the schema for that particular field is defined.
The reason for this becomes apparent when you consider how stemming works - for stemming to make sense, the term has to be stemmed when it's being indexed, as well as when being queried.
Lucene takes your input string, runs it through your analysis chain and saves the generated tokens to its index. Giving it "what are you asking" will probably end up as the tokens what, are, you, ask after tokenizing by whitespace and applying stemming.
The same operation happens when querying, so if someone searches for asks, the token gets stemmed to ask - and then compared against what's in the index. If stemming hadn't taken place when indexing, you'd end up with asking in the index, and ask when querying - and that isn't a match, since the tokens aren't the same.
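As a rough illustration of why both sides have to be stemmed, here is a minimal Python sketch. The `toy_stem` function is a made-up suffix stripper for this example only; it is not the stemmer Solr or Lucene actually uses, and `analyze` is a hypothetical stand-in for an analysis chain.

```python
def toy_stem(token):
    # Naive suffix stripping, for illustration only (not a real stemmer).
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text, stem=True):
    # Stand-in for an analysis chain: lowercase, split on whitespace, stem.
    tokens = text.lower().split()
    return [toy_stem(t) if stem else t for t in tokens]


# Index the same text with and without stemming.
indexed_with_stemming = set(analyze("what are you asking", stem=True))
indexed_without_stemming = set(analyze("what are you asking", stem=False))

# The query is always stemmed in this scenario.
query_tokens = analyze("asks", stem=True)

matches_stemmed_index = query_tokens[0] in indexed_with_stemming
matches_unstemmed_index = query_tokens[0] in indexed_without_stemming
```

With stemming on both sides the query token is found in the index; with an unstemmed index the same query token has nothing to match against.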
In your third party application the stemming option probably performs stemming inside the application before sending the content to Solr.
You can also use the Schema API to dynamically update and change field type definitions.
We run a legal search engine. Lawyers, being rather particular, generally want synonyms and stemming turned on, but sometimes want to turn them off for certain queries.
For example, we have one user that wants to search for:
judgments
Not:
judgements (with two e's)
Or:
judgment (singular, not plural)
Is there a way to do this? I know it will blow up my index size a bit.
The easiest way would be:
Index this into two fields (use copyField), one with synonyms and one without (whether synonyms are applied at index or query time is orthogonal to this).
When running your queries, match against one field or the other depending on whether you want synonyms used or not.
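On the client side, the two-field approach could look roughly like the Python sketch below. The field names `text_syn` and `text_exact` are hypothetical, and the sketch assumes the field is selected through Solr's `df` (default field) request parameter; querying the field explicitly in `q` would work as well.

```python
from urllib.parse import urlencode


def build_query_params(user_query, use_synonyms):
    # Hypothetical field names: "text_syn" is analyzed with synonyms,
    # "text_exact" is the copyField target analyzed without them.
    field = "text_syn" if use_synonyms else "text_exact"
    return {"q": user_query, "df": field}


# A user who wants exactly "judgments", with no synonym expansion.
params = build_query_params("judgments", use_synonyms=False)
query_string = urlencode(params)
```

The only thing that changes per request is which field the query runs against, so the user-facing switch does not require any schema change at query time.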
I'm using and playing with Lucene to index our data and I've come across some strange behaviors concerning DocValues Fields.
So, could anyone please explain the difference between a regular Document field (like StringField, TextField, IntField etc.) and DocValues fields (like IntDocValuesField, SortedDocValuesField etc.; the types seem to have changed in Lucene 5.0)?
First, why can't I access DocValues using document.get(fieldname)? And if I can't, how can I access them?
Second, I've seen that in Lucene 5.0 some features are changed, for example sorting can only be done on DocValues... why is that?
Third, DocValues can be updated but regular fields cannot (you have to delete and add the whole document)...
Also, and perhaps most important, when should I use DocValues and when regular fields?
Joseph
Most of these questions are quickly answered by either referring to the Solr Wiki or to a web search, but to get the gist of DocValues: they're useful for all the other stuff associated with a modern Search service except for the actual searching. From the Solr Community Wiki:
DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
...
DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
This should also answer why Lucene 5 requires DocValues for sorting - it's a lot more efficient than the previous approach.
The reason for this is that the storage format is turned around from the standard inverted index. The inverted index maps each term to the documents that contain it, so to gather values for these operations the application previously had to un-invert that mapping in memory (the fieldCache). With DocValues it can look up the value for a given document directly, which is very useful when you already have a list of documents (the search hits) and need a value for each of them.
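To make the "document-to-value mapping" from the quoted passage concrete, here is a toy Python sketch. It uses plain dictionaries and lists, not Lucene's actual on-disk structures, and the sample values are invented for the example.

```python
# Toy data: document id -> a single numeric value (say, a price).
docs = {0: 30, 1: 10, 2: 20}

# Inverted index: value -> list of document ids. Good for searching
# ("which documents have value 10?"), awkward for sorting a hit list.
inverted = {}
for doc_id, value in docs.items():
    inverted.setdefault(value, []).append(doc_id)

# Column-style mapping built once at index time: position = document id.
doc_values = [docs[doc_id] for doc_id in sorted(docs)]


def sort_hits(hit_ids):
    # Sorting a result set needs just one direct lookup per hit.
    return sorted(hit_ids, key=lambda doc_id: doc_values[doc_id])


sorted_hits = sort_hits([0, 1, 2])
```

Sorting, faceting and grouping all start from a set of matching document ids, which is why a per-document column is the convenient shape for them.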
If I remember correctly, updating a DocValue-based field involves yanking the document out from the previous token list, and then re-inserting it into the new location, compared to the previous approach where it would change loads of dependencies (and reindexing was the only viable strategy).
Use DocValues for fields that need any of the properties mentioned above, such as sorting / faceting / etc.
I am trying to implement fuzzy search for locations (cities, regions, countries, objects) using a Solr server. Currently, my index contains about 0.8-1.0 M items. It works really well using fuzzy search (~0.7) but is too slow for me (very often 0.2-0.6 sec). The tokenizer that is used is <tokenizer class="solr.StandardTokenizerFactory"/>. As an alternative I tried <tokenizer class="solr.WhitespaceTokenizerFactory"/>, which is great in terms of performance (about 100x faster) but does not offer fuzzy search.
Do you know of a different approach I could use? I would like to benefit from the fuzzy search feature, but in a much faster way, if possible.
Thanks a lot!
Your problem is not related to the analyzer that you use. When you search for Califrna~0.7, Lucene iterates over all terms in the index and calculates the (Levenshtein) edit distance between "Califrna" and each term. This is a very expensive operation.
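The cost of that brute-force scan can be seen in a small Python sketch. This is only an illustration of the idea, not Lucene's actual code, and the term list is invented for the example.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def brute_force_fuzzy(query, terms, max_edits):
    # Every term in the dictionary is compared against the query.
    return [t for t in terms if levenshtein(query, t) <= max_edits]


terms = ["california", "colorado", "carolina", "calgary"]
matches = brute_force_fuzzy("califrna", terms, max_edits=2)
```

Each query does work proportional to the number of distinct terms times the cost of one distance computation, which is why it gets slow on an index with around a million items.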
This issue will be solved with Lucene version 4.0. Unfortunately, the Lucene version that comes with Solr still uses the old brute-force approach.
https://issues.apache.org/jira/browse/LUCENE-2089
http://java.dzone.com/news/lucenes-fuzzyquery-100-times
If it is OK for you, I would suggest downloading Solr/Lucene from trunk and testing how the new fuzzy query works.
http://wiki.apache.org/solr/NightlyBuilds
Even though trunk is stable, it is not recommended for production use. As alternatives, I can suggest two similar methods:
1 - SpellChecker
http://wiki.apache.org/solr/SpellCheckComponent
http://www.lucidimagination.com/blog/2010/08/31/getting-started-spell-checking-with-apache-lucene-and-solr/
SpellChecker builds its own small index of n-grams in order to perform fast lookups. It also uses Levenshtein distance, but instead of iterating over all terms it only calculates the distance for related terms.
You first need to run the spell checker for "Califrna", and it will suggest "California". Then you can use "California" in the query against your main index, without a fuzzy query.
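The idea of narrowing the candidates with n-grams before computing any edit distance can be sketched in Python as follows. This is a simplified illustration and not the SpellCheckComponent's actual implementation; the term list, the trigram size and the helper names are assumptions made for the example.

```python
from collections import defaultdict


def ngrams(term, n=3):
    return {term[i:i + n] for i in range(len(term) - n + 1)}


def build_ngram_index(terms, n=3):
    # n-gram -> set of terms containing it.
    index = defaultdict(set)
    for term in terms:
        for gram in ngrams(term, n):
            index[gram].add(term)
    return index


def candidates(query, index, n=3):
    # Only terms sharing at least one n-gram with the query are considered.
    found = set()
    for gram in ngrams(query, n):
        found |= index.get(gram, set())
    return found


def levenshtein(a, b):
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


terms = ["california", "colorado", "carolina", "calgary"]
index = build_ngram_index(terms)
related = candidates("califrna", index)
suggestion = min(related, key=lambda t: levenshtein("califrna", t))
```

Here the edit distance is computed for two related terms instead of the whole dictionary, and the suggestion can then be used in a plain, non-fuzzy query.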
2 - Auto Suggest
http://wiki.apache.org/solr/Suggester
With the Suggester component you can offer the correct spelling as the user types the query. This will be a lot faster. It supports fuzzy search with the JaspellLookup class, although JaspellLookup needs to be updated in order to enable it. The wiki does not say much about what needs to be updated, though; I guess that if usePrefix is set to false it should perform a fuzzy lookup.
We are planning on using Solr to show the users the "n" most frequent terms from a field and we want to apply stemming so that similar terms get grouped.
Now, we need to show the terms to the users but the stemmed terms are not always human readable. Is there any way to get an example of the original terms that got stemmed so that those could be shown to the user?
The only solution we can think of is querying two different fields, one with stemming and one without, and then doing the matching ourselves. But we think that is going to be expensive (two queries) and may be error-prone (the matching may produce errors).
Is there any other way to implement this on Solr? Thanks in advance.
Stemming is applied at both query time and index time so I don't think there is an easy way to accomplish what you're trying to do. However, it may be possible, depending on the number of results in your database, to do this by employing a combination of faceting and highlighting. The highlighted term will be the entire matching term rather than the stemmed term (so, for example, the stemmed term might be "associ" but the highlighted terms will be "associated", "association", "associations", etc.). Perhaps what you could do is the following:
?q=keyword&facet=true&facet.field=myfield&facet.limit=20&hl=true&hl.fl=myfield&hl.fragsize=0&rows=10
Getting 10 rows and examining the highlighted results (by default, these are highlighted using <em> </em> tags but you can change this by using hl.simple.pre and hl.simple.post -- for example, using &hl.simple.pre=[&hl.simple.post=] would wrap the matching terms in square brackets) should at least give a sample of the "original" matching terms. hl.fragsize=0 returns the entire field along with highlighting.
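A small Python sketch of the client side of this approach: building the query string from the parameters above and pulling the highlighted terms out of a returned snippet. The `sample` text is invented for the example and is not real Solr output, and `highlighted_terms` is a hypothetical helper name.

```python
import re
from urllib.parse import urlencode

# Same parameters as in the request shown above.
params = [
    ("q", "keyword"),
    ("facet", "true"),
    ("facet.field", "myfield"),
    ("facet.limit", "20"),
    ("hl", "true"),
    ("hl.fl", "myfield"),
    ("hl.fragsize", "0"),
    ("rows", "10"),
]
query_string = urlencode(params)


def highlighted_terms(snippet, pre="<em>", post="</em>"):
    # Collect whatever sits between the highlighting markers.
    pattern = re.escape(pre) + r"(.*?)" + re.escape(post)
    return re.findall(pattern, snippet)


# Made-up highlighted field value, for illustration only.
sample = "The <em>associated</em> press and the bar <em>association</em>"
originals = highlighted_terms(sample)
```

If hl.simple.pre and hl.simple.post are changed, pass the same markers as `pre` and `post`, for example `highlighted_terms(text, "[", "]")`.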
Hope this helps. You can read more about highlighting parameters here:
http://wiki.apache.org/solr/HighlightingParameters