Stemming parameter in Solr

Stemming parameter in Solr - solr

Is there any parameter like (edismax or dismax or any other) that i can set for stemming to work in Solr or i need to make changes in schema.xml of Solr to implement the stemming ?
Problem is if i change schema.xml by default stemming/phoentic work which i dont want ? I am using Solr from third party application and in UI we have checkbox for stemming to check/uncheck , i pass these paramaters to Solr and get the data from Solr, i cant pass this UI parameter to SOlr, so if there is any parameter at Solr side i can pass that for stemming to work ?
Please let me know ?

Stemming is performed as part of the analysis chain, and therefor is part of how the schema for that particular field is defined.
The reason for this becomes apparent when you consider how stemming works - for stemming to make sense, the term has to be stemmed when it's being indexed, as well as when being queried.
Lucene takes your input string, runs it through your analysis chain and saves the generated tokens to its index. Giving it what are you asking will probably end up as what, are, you, ask after tokenizing by whitespace and applying stemming.
The same operation happens when querying, so if someone searches for asks, the token gets stemmed to ask - and then compared against what's in the index. If stemming hadn't taken place when indexing, you'd end up with asking in the index, and ask when querying - and that isn't a match, since the tokens aren't the same.
In your third party application the stemming option probably performs stemming inside the application before sending the content to Solr.
You can also use the Schema API to dynamically update and change field type definitions.

Related

How to help my Solr engine to understand related terms?

I have a big list of related terms (not synonyms) that I would like my solr engine to take into account when searching. For example:
Database --> PostgreSQL, Oracle, Derby, MySQL, MSSQL, RabbitMQ, MongoDB
For this kind of list, I would like Solr to take into account that if a user is searching for "postgresql configuration" he might also bring results related to "RabbitMQ" or "Oracle", but not as absolute synonyms. Just to boost results that have these keywords/terms.
What is the best approach to implement such connection? Thanks!

You've already discovered that these are synonyms - and that you want to use that metainformation as a boost (which is a good idea).
The key is then to define a field that does what you want - in addition to your regular field. Most of these cases are implemented by having a second field that does the "less accurate" version of the field, and apply a lower boost to matches in that field compared to the accurate version.
You define both fields - one with synonyms (for example content_synonyms) and one without (content), and then add a copyField instruction from the content field (this means that Solr will take anything submitted to the content field and "copy" it as the source text for the content_synonyms field as well.
Using edismax you can then use qf to query both fields and give a higher weight to the exact content field: qf=content^10 content_synonyms will score hits in content 10x higher than hits in content_synonyms, in effect using the synonym field for boosting content.
The exact weights will have to be adjusted to fit your use case, document profile and query profile.

How to disable Solr query analysis?

I'm working with solr5.2 and I'm using termVectors with solrj (but an answer not using solrj would be nice as well).
From a first query, I obtain termVectors, and I'd like to query again my index with some of the terms from these termVectors.
However the terms from termVectors are obviously already stemmed, and I'd like to go directly to the corresponding entry in the index, without going through the query analysis step (otherwise, my stem will be stemmed again, which can lead to a different entry).
A workaround would be to stem all terms at indexing time, and to index them in a separate String field, but I'd like to avoid this ugly solution.
Is there a better way?

You can define separate analysis chains for query and indexing (I read your caveat as having to do it outside of Solr, as you're talking about String fields):
<analyzer type="index">
So you could have one field that does not perform stemming on query, just on indexing. That might not be suitable for your primary field, so add a second one and use copyField to index into that field as well.

Solr queries stored within Solr field

I have a set of keywords defined by client requirements stored in a SOLR field. I also have a never ending stream of sentences entering the system.
By using the sentence as the query against the keywords I am able to find those sentences that match the keywords. This is working well and I am pleased. What I have essentially done is reverse the way in which SOLR is normally used by storing the query in Solr and passing the text in as the query.
Now I would like to be able to extend the idea of having just a keyword in a field to having a more fully formed SOLR query in a field. Doing so would allow proximity searching etc. But, of course, this is where life becomes awkward. Placing SOLR query operators into a field will not work as they need to be escaped.
Does anyone know if it might be possible to use the SOLR "query" function or perhaps write a java class that would enable such functionality? Or is the idea blowing just a bit too much against the SOLR winds?
Thanks in advance.

ES has percolate for this - for Solr you'll usually index the document as a single document in a memory based core / index and then run the queries against that (which is what ES at least used to do internally, IIRC).

I would check out the percolate api with ElasticSearch. It would sure be easier using this api than having to write your own in Solr.

Fuzzy matching for every query term in Solr

With the Levenshtein implementation of Lucene 4 claiming to be 100 times faster than before (http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html) I would like to do fuzzy matching of all terms in a query. The idea is that a search for 'gren hose' should be able to find the document 'green house' (I don't really care about phrases at this point, the quotes are just here to make this more readable).
I am using Lucene 4 + Solr 4. As I'm doing some pre- and post-processing there is a small wrapper servlet around Solr, the servlet is using SolrJ to eventually access Solr
I'm currently a little lost on what would be the right way to achieve this. My basic approach is to break the search query down into terms and append the tilde / fuzzy operator to each term. Thus 'gren hose' would become 'gren~ hose~' . Now the question is how to properly do this. I can see several ways:
Brute force: Assume that the terms are delimited by whitespace, so just parse the query and append a tilde before each whitespace (ie. after each term)
Two steps: Send the query to Solr with query debugging turned on. This will give me a list of query terms as parsed by Solr. I can then extract the terms from the debug output, append the tilde operator and re-run the query with the added tilde operators
Internally: Hook into the search request handler and append the tilde operator after the query has been parsed into terms
Method 1 stinks a lot, as it circumvents Solr's query parsing entirely, so I would rather not do that. Method 2 sounds quite doable if the cost of parsing the query twice is not too high. Method 3 sound just right, but I have yet to figure out where I have to hook into the processing chain.
Maybe there is a completely different way to achieve what I want to do, or maybe it's just a stupid idea on my part. Anyway, I would really appreciate a few pointers, maybe someone else has already done something like this. Thanks!

I would propose the following methods:
Implement a query handler module in your application where you can build solr query from the input user query. This way nothing changes in the SOLR side and your application has all the control on what goes into SOLR.
Implement your own query parser , you can start from Standard SOLR query parser (org.apache.solr.search.QParser) and make your changes. Your application just needs to select your custom query parser and rest your implementation should take care.
I would prefer method 1 as this makes the system completely agnostic to SOLR upgrades, any new release of Solr will not require me to update the custom qparser and you will not have to update/build and setup your custom qparser in the new version.
If you dont have any control on the app and dont want to go through the qparser route , then you can implement a Servlet filter that transforms the solr query before it is dispatched to solr request filter.

Showing human readable most frequent indexed terms using a stemmed field with Solr faceted search

We are planning on using Solr to show the users the "n" most frequent terms from a field and we want to apply stemming so that similar terms get grouped.
Now, we need to show the terms to the users but the stemmed terms are not always human readable. Is there any way to get an example of the original terms that got stemmed so that those could be shown to the user?
The only solution we can think of is quering two different fields, one with stemming and one without and then do the matching ourselves. But we think that is going to be expensive (two queries) and may be error prone (the matching may produce errors).
Is there any other way to implement this on Solr? Thanks in advance.

Stemming is applied at both query time and index time so I don't think there is an easy way to accomplish what you're trying to do. However, it may be possible, depending on the number of results in your database, to do this by employing a combination of faceting and highlighting. The highlighted term will be the entire matching term rather than the stemmed term (so, for example, the stemmed term might be "associ" but the highlighted terms will be "associated", "association", "associations", etc.). Perhaps what you could do is the following:
?q=keyword&facet=true&facet.field=myfield&&facet.limit=20hl=true&hl.fl=myfield&hl.fragsize=0&rows=10
Getting 10 rows and examining the highlighted results (by default, these are highlighted using <em> </em> tags but you can change this by using hl.simple.pre and hl.simple.post -- for example, using &hl.simple.pre=[&hl.simple.post=] would wrap the matching terms in square brackets) should at least give a sample of the "original" matching terms. hl.fragsize=0 returns the entire field along with highlighting.
Hope this helps. You can read more about highlighting parameters here:
http://wiki.apache.org/solr/HighlightingParameters