Disable synonyms and stemming on a per query basis in Solr - solr

We run a legal search engine. Lawyers, being rather particular generally want synonyms and stemming turned on, but sometimes want to turn them off for certain queries.
For example, we have one user that wants to search for:
judgments
Not:
judgements (with two e's)
Or:
judgment (singular, not plural)
Is there a way to do this? I know it will blow up my index size a bit.

the easiest way would be:
index this into two fields (use copyField), one with synonyms and one without (index or query time, that decision is orthogonal to this).
when running your queries, match against one field or the other depending whether you want synonyms used or not.

Related

Find similar results with Lucene / SOLR index

We have an application for tagging user selections over a large corpus of MS Word documents. We tag these selections with one or more keyword tags, and usually a title tag. We want to add a feature where the selected text is instantly analyzed, and the tagger is presented with a list of most-likely keyword and title tags (based on the existing tagged text selections)
We are using a SOLR index. I have been told that we can simply issue the selected text as the query itself to return similar selections. However, the selected text could be anywhere between 200 and 6000 words long. A 6000 word query may be a problem in terms of memory usage!
I thought we could do some very aggressive stopword removal to significantly reduce the number of words in the queries, leaving only the very meaningful words. We have been working with this corpus for the last 10 years and we are very familiar with the subject matter and the vocabulary used, so this would be easy for us to do. But the problem is that we also use the same index for allowing the normal users to search the index, and if we remove too many common words, then their normal queries may not work properly (especially phrase queries).
We would also like to boost the results that contain the text of the query within a smaller range, rather than just spread arbitrarily throughout the document.
Another issue is that we allow nested selections. The outer selection may be more general in nature and be around 5000 words long, and the inner selections will be shorter and topically more specific. However, since both selections contain the same text, SOLR ranks them both highly, when the outer selection may not be so relevant
I have spent the last few days going through the SOLR query parser documentation, and it looks like this should be doable, but I'm still not sure exactly what I need to do to make this work. Any suggestions would be much appreciated.
Solr have multi-core facility. So if you can have one core for your internal work and you can reveal the other core for public domain, it may solve your issue.
You can refer this section
http://wiki.apache.org/solr/Solr.xml%20(supported%20through%204.x)
or you can refer Solr cores and solr.xml section in solr reference manual.

Apply Solr filter query to only part of the search results

I have a Solr solution working which requires two queries, but I'm looking for a way to do it in a single query. My idea is that if I can figure out a way to do this, I wont have to incur the overhead of twice the load on the Solr cluster.
The details: I'm running a simple query like "q=camera" with a query filter of say "fq=type:digital". The second query is identical to the first, but the filter is the inverse, like "fq=-type:digital" I'm imagining that if there's a way to run a single query while applying the first filter to get the first set of topDocs, then generate a second set with the second filter the results could be merged and returned ( it doesn't matter if sorting resorts and mixes the two sets).
I experimented with partitioning the data by marking a specific field during indexing, into two different groups and then using Solr "grouping" queries, but the response time for these wasn't acceptable in my setup.
I'm looking for suggestions the most Solr congruent approach to experiment with: tuning to improve the two-query solution performance, or investigating a kind of custom Solr post-filter ( I read Yonik's 2/2012 blog post ).
I have to implement this in Solr 3.5, although if there's a slam dunk solution in 4.0 I'll eventually be able to move to that.
I can think of two alternate approaches :-
Instead of filter the results, use a variable higher boost so that all the results for type:digital come on top and rest of the documents would follow. No need for separate queries. The boost can be changes as per the type value.
Other approach is not to display the results for type other then digital. However, you can display the facets for the other types with the counts for the same for users to know if the other types exist for the search term. You can check on tagging and excluding filters
Result grouping might give you what you want. Just group by that parameter and specify sufficient top number of documents in each group.
But I would test whether its performance is any better than two queries. Just because it mentions performance in limitations section.

Can Solr/Lucene do Fuzzy Field Collapsing?

Edit
Can Solr do fuzzy field collapsing? IE collapsing fields that have similar values, rather than identical ones?
I'd assumed that it could, but now I'm not sure, which makes my original question below invalid.
Original Question
For a large given set of values I need to decide which is the most prevalent. The set of all values will change over time, and so I can expect that the output may change over time too.
I gather Solr can do "field collapsing" to group results by a given field, with a tolerance of similarity. Would it be possible, neigh even appropriate, to use Solr solely to collapse fields, to derive the most common value? We use Solr in other parts of the business, and it would be good to leverage existing code rather than home-brewing a custom solution.
No, solr does not support fuzzy collapsing. (at least not based on what is documented on the wiki)
Solr 4.0 supports group.func which allows you to group results based on the result of a FunctionQuery, so it's possible that at some point in time a function could be created to get you approximately what you want, but none of the existing functions will do what you want.
However, Solr does support result clustering, which will maybe work for your use-case. Clustering is done with Carrot2. If you limit the fields used by carrot to a single field, you may get a similar result to "fuzzy clustering", but you have far less control over what carrot does than you do with field collapsing.
For a normal document you might want all your fields analyzed by carrot, e.g.:
carrot.title=my_title&carrot.snippet=my_title,my_description
But if you have, for example, a manufacturer field with slight variations of spelling or punctuation, it might work to only give carrot a single field for both title and snippet:
carrot.title=manufacturer&carrot.snippet=manufacturer

Showing human readable most frequent indexed terms using a stemmed field with Solr faceted search

We are planning on using Solr to show the users the "n" most frequent terms from a field and we want to apply stemming so that similar terms get grouped.
Now, we need to show the terms to the users but the stemmed terms are not always human readable. Is there any way to get an example of the original terms that got stemmed so that those could be shown to the user?
The only solution we can think of is quering two different fields, one with stemming and one without and then do the matching ourselves. But we think that is going to be expensive (two queries) and may be error prone (the matching may produce errors).
Is there any other way to implement this on Solr? Thanks in advance.
Stemming is applied at both query time and index time so I don't think there is an easy way to accomplish what you're trying to do. However, it may be possible, depending on the number of results in your database, to do this by employing a combination of faceting and highlighting. The highlighted term will be the entire matching term rather than the stemmed term (so, for example, the stemmed term might be "associ" but the highlighted terms will be "associated", "association", "associations", etc.). Perhaps what you could do is the following:
?q=keyword&facet=true&facet.field=myfield&&facet.limit=20hl=true&hl.fl=myfield&hl.fragsize=0&rows=10
Getting 10 rows and examining the highlighted results (by default, these are highlighted using <em> </em> tags but you can change this by using hl.simple.pre and hl.simple.post -- for example, using &hl.simple.pre=[&hl.simple.post=] would wrap the matching terms in square brackets) should at least give a sample of the "original" matching terms. hl.fragsize=0 returns the entire field along with highlighting.
Hope this helps. You can read more about highlighting parameters here:
http://wiki.apache.org/solr/HighlightingParameters

Lucene - few or a lot of indexes

Is it better to use
a lot of indexes (eg. for every user as your application allows that)
in Lucene
or just one, having every document in int
... if you think about:
performance
disk space
health
I am using elasticsearch, therefore I am using Lucene.
In Elastic Search, I think based off your information I would use 1 index. My understanding is users are only searching there own documents, and the documents seems to be relatively similar.
Performance - When searching you can use a Filtered Query to filter to only the documents matching the user. The user id filter is cache-able, and fast.
Scalable - In Elasticsearch, you control sharding and replication at index level. Elasticsearch can handle large numbers of indexes, I just think configuring appropriate shards and replications could be valuable for the entire index.
In a single index, you can still easy wipe away data (see delete by query) , and there should be little concern of seeing others data unless you write your queries wrong. A filtered query with that filters results to only those associated with a user id is very simple. Similar in complexity to searching a different index per user.
Your exact needs might fit a different approach better. Based what I have so far, I would do choose one index though.

Resources