solr proximity search while using OR operator - solr

Is it possible to run a proximity query in Solr while using the OR operator? For example:
term1:"Samsung"
term2:"galaxy note" OR "galaxy 3" OR ...
I need to find documents that contain both term1 and term2 within ~n proximity of each other.

This is not supported by default in Solr 4.4, which is currently the latest version.
SOLR-1604 has some relevant information.

The easy approach here would be to just generate a set of longer phrase queries, so rather than looking for "Samsung" and "galaxy note" separately, you concatenate them together, to give you something more like:
"Samsung galaxy note" OR "Samsung galaxy 3" ...
and set each phrase to have the desired amount of slop.
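As a rough sketch of that concatenation step, the helper below builds the OR-ed list of sloppy phrases as a query string. The class and method names are made up for illustration, and it assumes the terms contain no characters that need escaping in the Lucene query syntax. Note that slop on one long phrase applies to the whole phrase, so this only approximates "term1 within n positions of an exact term2".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Solr or Lucene.
final class SloppyPhraseBuilder {

    // Joins the root term with each alternative into one quoted phrase,
    // appends ~slop to it, and ORs the resulting phrases together.
    static String build(String root, List<String> alternatives, int slop) {
        List<String> clauses = new ArrayList<>();
        for (String alternative : alternatives) {
            clauses.add("\"" + root + " " + alternative + "\"~" + slop);
        }
        return String.join(" OR ", clauses);
    }

    public static void main(String[] args) {
        // Prints: "Samsung galaxy note"~3 OR "Samsung galaxy 3"~3
        System.out.println(build("Samsung", Arrays.asList("galaxy note", "galaxy 3"), 3));
    }
}
```

The resulting string can be passed as the q parameter to the standard query parser.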
You can also use ComplexPhraseQueryParser, which handles similar cases, though I'm not sure it handles alternate phrases within a phrase. SurroundQueryParser also supports this sort of thing, but doesn't do any analysis, which means you'll have to analyze your subqueries beforehand (see the Limitations section in the documentation).
If you want to manually generate queries, you can support this more directly using a set of SpanQuerys, which would look something like:
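// SpanMultiTermQueryWrapper only wraps MultiTermQuery types (wildcard, prefix, ...), so each phrase is built from SpanTermQuerys.
// Terms must match the indexed (analyzed) form; field and slop are placeholders.
SpanQuery rootQuery = new SpanTermQuery(new Term(field, "samsung"));
SpanQuery option1Query = new SpanNearQuery(new SpanQuery[] {
new SpanTermQuery(new Term(field, "galaxy")),
new SpanTermQuery(new Term(field, "note"))
}, 0, true);
SpanQuery option2Query = new SpanNearQuery(new SpanQuery[] {
new SpanTermQuery(new Term(field, "galaxy")),
new SpanTermQuery(new Term(field, "3"))
}, 0, true);
SpanQuery option3Query = new SpanNearQuery(new SpanQuery[] {
new SpanTermQuery(new Term(field, "something")),
new SpanTermQuery(new Term(field, "else"))
}, 0, true);
SpanQuery orQuery = new SpanOrQuery(option1Query, option2Query, option3Query);
SpanQuery finalQuery = new SpanNearQuery(new SpanQuery[] {rootQuery, orQuery}, slop, true);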

Related

Solr 7 - How to do Full Text Search w/ Geo Spatial Search

How to do Full Text combined w/ Geo Spatial in Solr 7?
In regards to this: https://lucene.apache.org/solr/guide/7_2/spatial-search.html
I have to do queries that COMBINE full text w/ geo spatial. For example:
box AND full text or spatial sort AND full text.
I was not able to figure out a good query string example that produces the desired result. I would like this as a pure query string rather than some Java method, as I'm consuming this from a technology other than Java. Solr is very deep and confusing, and I know I must read more, but I found no good examples for this anywhere online.
desired query string example
[solr]/select?fq={!bbox sfield=point}&pt=34.04506799999999,-118.260849&d=10000&sort=geodist() asc&{!geofilt}&sfield=point&q=Panini
In that case, the query would sort by distance yet also filter by a full-text value such as "some text".
If this cannot be done in Solr, I believe it is possible in Elastic Search. Both Solr and Elastic Search are built on top of Lucene, so it seems that whatever works in one should work in the other, but feel free to supply an answer for Elastic Search as well.
example returned
{
  "responseHeader": {
    "status": 0,
    "QTime": 2,
    "params": {
      "q": "Panini",
      "pt": "34.04506799999999,-118.260849",
      "d": "10000",
      "{!geofilt}": "",
      "fq": "{!bbox sfield=point}",
      "sort": "geodist() asc",
      "sfield": "point"
    }
  },
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  }
}
The docs do contain the term 'Panini', but none are returned. Could this be due to the default way full text is handled in Solr 7? The query uses the same point as the documents in which the term 'Panini' appears, and the field point is of type org.apache.solr.schema.PointType.
UPDATE
I ended up abandoning Solr for Elastic Search. Solr is just very annoying in its strange ways compared with the very easy to use Elastic Search. Things just work as you expect without having to dig into quirks.
I adapted my answer to the solr 7.2.1 example:
Start solr by: ./bin/solr start -e techproducts
I've also visualized the data in google maps:
https://www.google.com/maps/d/u/0/viewer?ll=42.00542239270033%2C-89.81213734375001&hl=en&hl=en&z=4&mid=16gaLvuWdE9TsnhcbK-BMu5DVYMzR9Vir
You need these query parameters:
Bound by Box Filter:
fq={!bbox}
The geo filter query parser bbox needs further parameters:
Solr field: sfield=store
Point to search/sort from: pt=36.35,-97.51
Distance for filter: d=1200
Sort:
sort=geodist() asc
Fulltext query:
q=some+text
Full example queries for solr example data:
Simple:
http://localhost:8983/solr/techproducts/select?fq={!bbox}&sort=geodist()%20asc&sfield=store&pt=36.35,-97.51&d=1200&q=ipod&fl=name,store
UI:
http://localhost:8983/solr/techproducts/browse?fq={!bbox}&sort=geodist()%20asc&sfield=store&pt=36.35,-97.51&d=1200&q=ipod
The result is as expected:
Apple 60 GB iPod
Belkin Power Cord for iPod
Filtered by distance: iPod & iPod Mini USB 2.0 Cable
Hints
The field store must be of type location.
You may need to URL-encode the special characters:
e.g. fq=%7B%21bbox%20sfield%3DgeoLocation%7D
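If you assemble the request in code, the encoding can be left to a library. The following is a minimal sketch using only the Java standard library (Java 10 or later for this URLEncoder overload). The class and method names are illustrative, and the parameter values are the ones from the techproducts example above. URLEncoder writes spaces as + rather than %20, which is equally valid inside a query string.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Solr or SolrJ.
final class SpatialQueryUrl {

    // Full-text query plus bounding-box filter, sorted by distance from the point.
    // The d parameter is a distance in kilometers.
    static Map<String, String> bboxParams(String text, String field, String point, int distanceKm) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", text);
        params.put("fq", "{!bbox}");
        params.put("sfield", field);
        params.put("pt", point);
        params.put("d", String.valueOf(distanceKm));
        params.put("sort", "geodist() asc");
        return params;
    }

    // URL-encodes every key and value and joins the pairs with '&'.
    static String queryString(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            joiner.add(URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        String query = queryString(bboxParams("ipod", "store", "36.35,-97.51", 1200));
        System.out.println("http://localhost:8983/solr/techproducts/select?" + query);
    }
}
```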
In your case, you have to combine the full-text search scoring with the spatial distance.
So if your query looks like this:
/select?fq={!bbox sfield=point}&pt=34.04506799999999,-118.260849&d=10000&sort=geodist() asc&{!geofilt}&sfield=point&q=Panini
You should change the sort parameter: either remove it or set it to score desc. That way you sort by the score given by the full-text search query.
To take the spatial part into consideration, you need to add a boosting function to your query. In the majority of cases, the closer the document is to the point of interest the better, so you would probably want a boosting function that computes X/distance. X can be as simple as 1, and the function itself can also be more complicated. To do that in a dismax query, use the bf parameter, like bf=div(1,geodist()).
Try that out; it should work, but of course it will need some adjustments.
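To get a feel for how such a boost behaves, the sketch below evaluates two candidate shapes in plain Java: the bare inverse distance (the arithmetic behind div(1,geodist())) and the formula a/(m*x+b) that Solr's recip(x,m,a,b) function computes, as in bf=recip(geodist(),2,200,20). The class name is made up, and this only reproduces the arithmetic; it does not call Solr. In plain floating-point arithmetic the bare inverse grows without bound as the distance approaches zero, whereas the recip form stays bounded, which may be one of the adjustments worth making.

```java
// Hypothetical class that mirrors the arithmetic of the boost functions only.
final class DistanceBoost {

    // Bare inverse distance, the shape of div(1,geodist()).
    static double inverse(double distanceKm) {
        return 1.0 / distanceKm;
    }

    // a / (m*x + b), the formula behind Solr's recip(x,m,a,b).
    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    public static void main(String[] args) {
        double[] distancesKm = {0, 1, 10, 90};
        for (double d : distancesKm) {
            System.out.println(d + " km: inverse=" + inverse(d)
                    + " recip=" + recip(d, 2, 200, 20));
        }
    }
}
```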

How can I tune the Retrieve and Rank ranker with a dictionary/model of domain specific phrases?

We are trying to group phrases together in order to improve results.
For instance, if the user asks a question like "When do I have to change the filter of my air conditioning?" with a domain specific phrase such as “air conditioning”, R&R returns some answers containing the term “air” and no “conditioning” or it returns answers containing other terms like air bag or air filter.
This can be accomplished with a raw Solr instance by putting the phrase between quotes. The Solr query would then look like the following:
...
"debug": {
"rawquerystring": "When do I have to change the filter of my \"air conditioning\" ?",
"querystring": "When do I have to change the filter of my \"air conditioning\" ?",
"parsedquery": "text:when text:do text:i text:have text:to text:change text:the text:filter text:of text:my PhraseQuery(text:\"air conditioning\") text:?",
"parsedquery_toString": "text:when text:do text:i text:have text:to text:change text:the text:filter text:of text:my text:\"air conditioning\" text:?",
...
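For a raw Solr instance, the quoting could be automated from a phrase dictionary. The sketch below is only an illustration under that assumption: the class name and the dictionary are hypothetical, the matching is a naive case-sensitive literal replace, and it has nothing to do with how the R&R ranker works internally.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical pre-processing step applied before the query is sent to Solr.
final class PhraseQuoter {

    // Wraps every known domain phrase found in the question in double quotes,
    // skipping phrases that are already quoted.
    static String quoteKnownPhrases(String question, List<String> phrases) {
        String result = question;
        for (String phrase : phrases) {
            String quoted = "\"" + phrase + "\"";
            if (!result.contains(quoted)) {
                result = result.replace(phrase, quoted);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("air conditioning");
        System.out.println(quoteKnownPhrases(
                "When do I have to change the filter of my air conditioning?", dictionary));
    }
}
```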
However, the R&R guide states:
The syntax is different from standard Solr syntax as follows:
You can search for a single term, or a phrase. You do not need to
surround the phrase with double quotation marks as with Solr, but you
can include phrases in the query and they are accounted for by the
ranker models.
We could not find more details regarding the above statement.
But, as we understand, the ranker is supposed to identify phrases. If that is the case, we were wondering if there is a way where we can set a dictionary of phrases in order to tune the ranker?
Or, could we set our own model of legal phrases? What are the options to accomplish this goal?
Thanks
Currently RnR doesn't support strict phrase querying, though there are features that take term ordering and adjacent terms into consideration. We are working on a new version of the service, in which users will be able to use the full regular Solr query syntax (including specifying phrases) for document retrieval.

In Solr, how can we use terms external to the search query to bias result ordering?

We're working on a plan to identify content tags our users are interested in. So, for instance, we may determine that User X consumes content tagged with "kermit" and "piggy" more often than other tags. These are their "favored tags."
When the users search, we'd like to favor/bias documents that contain these terms.
This means we can't boost the documents at index time, because every user will have different favored tags. Additionally, they may not be searching for the favored tags themselves. They may search for "gonzo," and so we absolutely want to give them documents with "gonzo," but we want to boost documents that also contain "kermit" or "piggy."
These favored tags are not used to actually query the index, but rather are used to bias the result ordering. The favored tags become something of a tie-breaker -- all else being equal, documents containing these terms will rank higher.
This is new/planned development, so we can use whatever version and parser stack is optimal to solve this problem.
Solution in SolrNet
The question was correctly answered below, but here's the code for SolrNet just in case someone else is using it.
var localParams = new LocalParams();
localParams.Add("bq", "kermit^10000"); //numeric value is the degree of boost
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<MySolrDocumentClass>>();
solr.Query(new SolrQuery("whatever") + localParams);
You didn't specify which query parser you're using, but if you are using the Dismax or Extended Dismax query parser, the bq argument should do exactly what you're looking for. bq adds search criteria to a search solely for the purpose of affecting the relevancy, but not to limit the result set.
From the Dismax documentation:
The bq (Boost Query) Parameter
The bq parameter specifies an additional, optional, query clause that
will be added to the user's main query to influence the score. For
example, if you wanted to add a relevancy boost for recent documents:
q=cheese
bq=date:[NOW/DAY-1YEAR TO NOW/DAY]
You can specify multiple bq parameters. If you want your query to be
parsed as separate clauses with separate boosts, use multiple bq
parameters.
In this case, you may want to add &bq=kermit&bq=piggy to the end of your Solr query. If you aren't using one of these query parsers, this need may be exactly the motivation you need to switch.
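A minimal sketch of assembling such a request per user, using only the Java standard library (Java 10 or later): one bq parameter is appended for each favored tag. The class and method names are hypothetical, and it assumes the Extended Dismax parser is selected through defType=edismax.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Solr or SolrJ.
final class BoostQueryParams {

    // Builds the query string: the user's query plus one bq clause per favored tag.
    static String build(String userQuery, List<String> favoredTags) {
        StringBuilder sb = new StringBuilder("defType=edismax&q=")
                .append(URLEncoder.encode(userQuery, StandardCharsets.UTF_8));
        for (String tag : favoredTags) {
            sb.append("&bq=").append(URLEncoder.encode(tag, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: defType=edismax&q=gonzo&bq=kermit&bq=piggy
        System.out.println(build("gonzo", Arrays.asList("kermit", "piggy")));
    }
}
```

Because the favored tags travel with each request, nothing user-specific has to be stored in the index.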

Proximity searching phrases with root expanders in Solr or ElasticSearch (especially websolr or bonsai.io)?

I'm trying to select a search tool for a large project, and I'd be interested to know if this use case was supported by Solr or ElasticSearch.
My customers are interested in conducting relatively sophisticated boolean searching. One search that is a must is the ability to conduct proximity searches on phrases with root expanders.
For example, imagine a user searching for a document with this phrase: "The cute dog was attacked by evil raccoons"
I'd like the user to be able to search for "evil rac*" within 5 words of "dog" and return a document with the above sentence. Ideally, a query would look something like:
("evil rac*" dog)~5
So far, the only search tool I've found that can do what I'm looking for is dtSearch. The query for dtSearch would be "evil rac*" w/5 dog, which is great. I'd rather use an open source tool like Solr or ElasticSearch (and especially a hosted solution such as websolr or bonsai.io). Any advice would be very much appreciated.
It's certainly technically possible to do this with a custom query parser, but the default, dismax, etc parsers in solr don't appear to support this. There's an old and unresolved issue about this: https://issues.apache.org/jira/browse/SOLR-1604.
ElasticSearch would only support this with the JSON query builder, but it appears that the phrase-like query support is only for "span_term"s, which are just simple words.
There's some talk of the default query parsers being more clever in the near future.
Definitely technically possible, but as of yet unsupported in Lucene. There are a few open issues to support "complex phrase" behavior in Lucene, which seems to be targeted at Lucene 4.3:
LUCENE-1486 — An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
I don't see your specific query structure in their examples there, but this is definitely a lot closer than what's available today.
To recap: theoretically feasible, not supported in syntax as of April 2013 and Lucene 4.2.1.
(Hat tip to my business partner, Kyle, for help researching this.)
It is possible but...
1) First, check http://wiki.apache.org/solr/SurroundQueryParser (http://searchhub.org/2009/02/22/exploring-query-parsers/) for surround query parser. It is almost exactly what you want. However, people claim (at least in some places) that it supports phrase queries but that is not true (yet).
2) So you have to implement the phrase proximity. A (nasty) hack is to update DistanceQuery::getSpanNearQuery (Line 78 in solr 4.2.1 in lucene/queryparser/.../DistanceQuery.java)
while (sqi.hasNext()) {
SpanNearClauseFactory sncf = new SpanNearClauseFactory(reader, fieldName, qf);
// HACK starts here
DistanceSubQuery dsq = ((DistanceSubQuery)sqi.next());
try {
if ( ((SrndTermQuery)dsq).getTermText().contains( " " ) ) {
String term_text = ((SrndTermQuery)dsq).getTermText();
String[] tokens = term_text.split("\\s+");
SpanQuery[] span_queries = new SpanQuery[tokens.length];
for ( int i = 0; i < tokens.length; ++i ) {
span_queries[i] = new SpanTermQuery( new Term(fieldName, tokens[i]) );
}
spanClauses[qi] = new SpanNearQuery( span_queries, 0, true);
qi++;
continue;
}
}catch( Exception ex ){
}
// HACK ends here
dsq.addSpanQueries(sncf);
3) Be careful: the query is not preprocessed, so if you use stemming you have to search for the exact stemmed words. For example, select?q={!surround df=text}"we defin" 11w "descend" will match
"""
we define a set of words sorted in descending
"""

Terms Prevalence in SolR searches

Is there a way to specify a set of terms that are more important when performing a search?
For example, in the following question:
"This morning my printer ran out of paper"
Terms such as "printer" or "paper" are far more important than the rest, and I don't know if there is a way to list these terms to indicate that, in the global knowledge, they'd have more weight than the rest of words.
For specific documents you can use QueryElevationComponent, which uses a special XML file in which you list the terms and the doc ids you want returned for them.
Not exactly what you need, I know.
And regarding your comment about users not caring what's underneath, you control the final query. Or, in the worst case, you can modify it after you receive it at Solr server side.
Similar: Lucene term boosting with sunspot-rails
When you build the query, you can define which values matter and how much weight each field or term carries in the search.
This can be done in several ways:
Setting the boost
A boost is set by appending "^" and a number to a term, for example printer^4.
Using the plus operator
If you prefix a term with the + operator, that term becomes required, so only documents that match it appear in the results.
For a better understanding of solr, it is best to get familiar with lucene query syntax. Refer to this link to get more info.
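As a sketch of the boost approach applied to the question's example, the helper below appends a boost to every word that appears in a list of important terms before the query is sent to Solr. The class name, the term list, and the boost value are assumptions for illustration, and punctuation attached to words is not handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical query pre-processor, not part of Solr.
final class TermBooster {

    // Appends ^boost to every whitespace-separated word found in the important set.
    static String boost(String question, Set<String> importantTerms, int boost) {
        StringBuilder sb = new StringBuilder();
        for (String word : question.trim().split("\\s+")) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(word);
            if (importantTerms.contains(word.toLowerCase(Locale.ROOT))) {
                sb.append('^').append(boost);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<String> important = new HashSet<>(Arrays.asList("printer", "paper"));
        // Prints: This morning my printer^5 ran out of paper^5
        System.out.println(boost("This morning my printer ran out of paper", important, 5));
    }
}
```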
