Apache Solr or Lucene proximity search on multiple fields - solr

Is it possible in Solr/Lucene to search across different multivalued fields?
Imagine an XML fragment like this:
<normative>
<ref><aut>State</aut><num>70</num><year>2007</year><article>13</article></ref>
<ref><aut>TreasuryMinistry</aut><num>350</num><year>2011</year><article>21</article></ref>
</normative>
Is it possible to retrieve documents containing for instance:
num:70 AND year:2007
inside the same ref?
That is, this document should not be found for a query like
num:70 AND year:2011.
I could create concatenated fields like
<ref cat='state-0070-2007-0013'/>
<ref cat='TreasuryMinistry-0350-2011-0021'/>
but the user must be able to find by every combination of fields, i.e.
num and year,
year and article,
num and article,
aut and num and year,
on the same ref!
I am not experienced with Solr/Lucene, so I fear that a wildcard search like
cat:'*-0070-2007-*'
might not perform well over our normative document corpus.
Is there a way to make a search based on relative position?
Something like using copyField to a multivalued field with different positionIncrementGap values?

This does not directly answer your proximity question, but can you treat each <ref> as a document? If so, then a search like 'num:70 AND year:2007' should work fine, assuming you create the 'num' and 'year' fields.
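As a rough illustration of the "one document per ref" idea, the sketch below flattens the XML fragment from the question into one flat record per ref. The `id` and `parent_id` field names and the `refs_to_docs` helper are hypothetical, introduced only for this example; how the records are then sent to Solr is left out.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<normative>
<ref><aut>State</aut><num>70</num><year>2007</year><article>13</article></ref>
<ref><aut>TreasuryMinistry</aut><num>350</num><year>2011</year><article>21</article></ref>
</normative>
"""

def refs_to_docs(xml_text, parent_id):
    """Turn every <ref> element into its own flat document (a dict)."""
    root = ET.fromstring(xml_text.strip())
    docs = []
    for i, ref in enumerate(root.iter("ref")):
        # "id" and "parent_id" are assumed field names, not part of the question.
        doc = {"id": f"{parent_id}-ref{i}", "parent_id": parent_id}
        for child in ref:
            doc[child.tag] = child.text
        docs.append(doc)
    return docs

docs = refs_to_docs(FRAGMENT, "doc1")

# Local stand-in for "num:70 AND year:2007" evaluated per ref document.
same_ref = [d for d in docs if d["num"] == "70" and d["year"] == "2007"]
cross_ref = [d for d in docs if d["num"] == "70" and d["year"] == "2011"]
```

Because each ref is its own record, a conjunction of fields can only match values that came from the same ref; `parent_id` is what would let you get back to the enclosing normative document.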

Related

SOLR: Search for a value in multiple fields

I am looking for a way of querying for values in multiple fields. Basically I am building a simple search engine where the user can type e.g. "Java How to XML JSON" and it will search for these values in three different fields: categories, tags and description.
I read on some blog I should query all fields q=*:* and then filter based on those fields for example fq=categories:java,xml,how,to,json description:java,xml,how,to,json tags:java,xml,how,to,json
This works :| But it seems incorrect to just copy paste values like this.
Is there a correct way of doing this? I have been researching this for some time but I haven't found a solution.
Any help is appreciated,
Thank you
You can use defType=edismax to select the extended DisMax (edismax) query parser. It is meant to handle user-typed queries (i.e. what you'd type into a search box). You can then use qf (query fields) to tell edismax which fields you want to search, with an optional weight for each field:
q=Java How to XML JSON&defType=edismax&qf=categories^5 tags description
.. will search each part of the string "Java How to XML JSON" in all the fields, and any hits in the categories field will be weighted five times higher than hits in the other two fields.
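If the query string is assembled in code, something like the following sketch could be used. It only builds and URL-encodes the parameters shown above; the `edismax_query` helper is a name made up for this example, and the host and handler path are deliberately left out.

```python
from urllib.parse import urlencode, parse_qs

def edismax_query(user_text, weighted_fields):
    """Build an encoded query string for an edismax search.

    weighted_fields maps field name -> weight; a weight of 1 is written
    without the ^ suffix, as in the example above.
    """
    qf = " ".join(
        name if weight == 1 else f"{name}^{weight}"
        for name, weight in weighted_fields.items()
    )
    return urlencode({"q": user_text, "defType": "edismax", "qf": qf})

query_string = edismax_query(
    "Java How to XML JSON",
    {"categories": 5, "tags": 1, "description": 1},
)
decoded = parse_qs(query_string)
```

Encoding matters here because the raw form contains spaces and `^`, which should not be pasted into a URL unescaped.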

Sorting of solr documents based on search term in solr

I would like to sort solr documents based on searched term. For example the search term is "stringABC"
Then the order of the results should be
stringABC,
stringABCxxxx,
xxxxstringABCxxxx
The Solr document will contain a lot of fields, e.g. title, description, path, article No, Product code, etc.
The default field will contain more than one field, e.g. title, description and path.
So a Solr document will only be returned when the search term matches one of the fields in the default field.
Use three fields - one with the exact string, one with an EdgeNGramTokenizer and one with an NGramTokenizer. You can then use qf=field1^10 field2^5 field3 to score hits in these fields according to how you want to prioritize them relative to each other.
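To see why three such fields produce the requested order, here is a toy model in plain Python. It is not how Solr computes scores (real scores come from the similarity function and the analysis chain); it only mimics which field a term would hit, reusing the 10/5/1 boosts from the answer, and it takes the best-scoring field in the way dismax does when no tie-breaker is set. All function names are invented for the example.

```python
def edge_ngrams(text, min_len=1):
    """All prefixes of text, roughly what an edge n-gram field would index."""
    return {text[:n] for n in range(min_len, len(text) + 1)}

def ngrams(text, min_len=1):
    """All substrings of text, roughly what an n-gram field would index."""
    return {
        text[i:i + n]
        for n in range(min_len, len(text) + 1)
        for i in range(len(text) - n + 1)
    }

def toy_score(term, value, boosts=(10, 5, 1)):
    """Boost of the best field the term would hit (exact, prefix, infix)."""
    term, value = term.lower(), value.lower()
    exact_boost, edge_boost, ngram_boost = boosts
    hits = [0]
    if term == value:
        hits.append(exact_boost)
    if term in edge_ngrams(value):
        hits.append(edge_boost)
    if term in ngrams(value):
        hits.append(ngram_boost)
    return max(hits)

values = ["xxxxstringABCxxxx", "stringABCxxxx", "stringABC"]
ranked = sorted(values, key=lambda v: toy_score("stringABC", v), reverse=True)
```

An exact value hits all three fields, a prefix match hits the edge n-gram and n-gram fields, and an infix match hits only the n-gram field, which is what separates the three cases.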

Solr - Nested Edismax Query

I am using Solr (with pySolr) to search products in my database, returning products, facets and facet.pivots:
result = solr.search(query_s, **{
    'rows': '24',
    'sort': formatted_sort,
    'facet': 'on',
    'facet.limit': '-1',
    'facet.mincount': '1',
    'facet.field': ['gender', 'material'],
    'facet.pivot': 'brand,series',
    'fq': '-in_stock:(0 OR 99 OR 100 OR 101)'
})
The query_s selects specific fields, for example: brand:Target AND gender:Men's.
I would like to combine the above query with a DisMax query which will allow me to combine the above query with a full text search over specified fields. I found an article which demonstrates nested queries. I have tried to implement something like this:
q: "gender:* AND _query_:"{!edismax qf=brand series}Summer""
For some reason 'Target' will return results for Target brand shirts, but only with correct capitalization. 'Summer' which is a series of Target, won't return any results. Why am I not seeing a list of docs ordered by relevancy?
Am I overcomplicating things by using Dismax altogether?
The dismax parsers are useful for making sense of more "natural" queries, i.e. queries where users simply type what they're looking for, which is how most search engines work.
In your case it sounds like brand:Target AND gender:Men's are filters for which documents should be shown, while the query is the part that the user has typed. Usually you'll want to have the filters in fq as they don't affect score (i.e. they're exact values matching a field value), and the query in q.
I assume that Summer is what the user would have typed into your search box, which would give you:
q=Summer&defType=edismax&qf=series
But this assumes that the series field is defined as a text field that has an analyzer attached, so that the values are lowercased and split appropriately.
If you also have a description field you'd like to search, you can do:
q=Summer&defType=edismax&qf=series^20 description
.. which would search for Summer in both the series and description fields, but give 20 times more weight to a hit in the series field. This is a good way to naturally boost documents that match more exact data in your documents. If you also include the brand field, you'd be able to let your users search for "target summer" and similar queries.
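A sketch of that split, in the pySolr style of the question: the text the user typed goes to q, and the exact-value restrictions go to fq. The `split_query` helper is hypothetical, and the sketch assumes the filter values need no escaping or quoting (values containing spaces or query syntax characters would).

```python
def split_query(user_text, filters, qf="series^20 description"):
    """Return (q, params): free text for q, exact-value restrictions for fq."""
    params = {
        "defType": "edismax",
        "qf": qf,
        # One fq entry per filter; assumes values need no escaping.
        "fq": [f"{field}:{value}" for field, value in filters.items()],
    }
    return user_text, params

q, params = split_query("Summer", {"brand": "Target", "gender": "Men's"})

# With pySolr this could then be passed the same way as in the question,
# e.g. solr.search(q, **params), alongside the facet parameters.
```

Keeping the filters out of q means they restrict the result set without influencing the relevancy order of the text match.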

Solr Custom Boosting if a specific field matches the query

We are trying to implement a very interesting search logic with custom boosting and I am wondering if Solr can support this.
We have the following fields in our index:
Name
Description
Keywords (array)
Each keyword will have an amount (int value) paired with it.
A search is run across the Name, Description and Keywords fields. If a keyword matches the search text, the corresponding index must be boosted based on the amount of the matching keyword only.
I've read about Solr DisMax, and it can only boost a field by a fixed amount.
My scenario will be to boost the result by X amount based on matching keywords only.
Thanks in advance
The only viable solution I see to this problem (assuming, of course, that you do NOT know the number of keywords in advance) would be to run the query as a filter query (to skip the scoring stage), get all matching documents (a bit problematic), then sort them on your side, using the matched term to build a Java Comparator.
Problems may arise when you get a particularly large number of documents, but you could probably sidestep this issue with pagination.
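The client-side sort could look roughly like this, with a Python key function standing in for the Comparator. The document shape (a `keywords` dict mapping keyword to amount) is an assumption made for the example, as are the helper names; ties and non-keyword matches are not handled in any special way.

```python
def matched_amount(doc, search_terms):
    """Highest amount among the document's keywords that match a search term."""
    amounts = [
        amount
        for keyword, amount in doc["keywords"].items()
        if keyword.lower() in search_terms
    ]
    return max(amounts, default=0)

def sort_by_keyword_amount(docs, search_text):
    """Order documents by the amount of their best matching keyword."""
    terms = set(search_text.lower().split())
    return sorted(docs, key=lambda d: matched_amount(d, terms), reverse=True)

# Hypothetical result set returned by the filter query.
docs = [
    {"name": "A", "keywords": {"shoes": 10, "red": 1}},
    {"name": "B", "keywords": {"shoes": 100}},
    {"name": "C", "keywords": {"hat": 50}},
]
ranked = [d["name"] for d in sort_by_keyword_amount(docs, "red shoes")]
```

Only keywords that actually match the search text contribute, so document C's amount of 50 is ignored for the query "red shoes".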
If you don't have too many different amounts, you could try this at index time:
Store the keywords in different fields (dynamic fields -> boost-*) based on their amount:
boost-1 = keyword1,keyword4,keyword6
boost-10 = keyword2
boost-100 = keyword5
You can then search across all your boost fields with (e)dismax, boosting every dynamic field by its amount in your (e)dismax configuration (boost-1^1, boost-10^10, boost-100^100).
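A small sketch of the index-time grouping and the matching qf string follows. It uses the boost-N naming from the answer; the helper names are made up, and the dynamic field definition in the schema is assumed to exist already.

```python
from collections import defaultdict

def to_boost_fields(keyword_amounts):
    """Group keywords into boost-<amount> fields, one list per amount."""
    fields = defaultdict(list)
    for keyword, amount in keyword_amounts.items():
        fields[f"boost-{amount}"].append(keyword)
    return dict(fields)

def qf_for(amounts):
    """Build a qf value that boosts each boost-<amount> field by its amount."""
    return " ".join(f"boost-{a}^{a}" for a in sorted(amounts))

doc_fields = to_boost_fields({
    "keyword1": 1, "keyword2": 10, "keyword4": 1,
    "keyword5": 100, "keyword6": 1,
})
qf = qf_for({1, 10, 100})
```

The qf value has to list every amount that occurs in the index, which is why this approach only stays manageable when the number of distinct amounts is small.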

Is it possible to have SOLR MoreLikeThis use different fields for model and matches?

Let's say I have documents with two fields, A and B.
I'd like to use SOLR's MoreLikeThis, but with a twist: I'm most interested in boosting documents whose A field is like my model document's B field. (That is, extract MLT's 'interesting terms' from the model B field, but only collect MLT results based on the A field.)
I don't see a way to use the mlt.fl fields or mlt.qf boosts to achieve this effect in a single query. (It seems mlt.fl specifies fields used for both discovery of 'interesting terms' and matching to those terms.) Am I missing some option?
Or will I have to extract the 'interesting terms' myself and swap the 'field:term' details?
(Other ideas in this same vein appreciated as well.)
Two options I see are:
Use a copyField - index your original document with a copy of field A named B, and then query using B.
Extend MoreLikeThisHandler and change the fields you query.
The first option costs a bit of programming (mostly configuration changes) and some memory consumption. The second involves more programming but no memory footprint increase. Hope one of them suits your needs.
I now think there are two ways to achieve the desired effect (without customizing the MLT source code).
First option: Do an initial MLT query with the MLT handler, adding the parameter &mlt.interestingTerms=details. This includes the list of terms that were deemed interesting, ranked with their relative boosts. The usual behavior uses those discovered terms against the same mlt.fl fields to find similar documents. For example, the response will include something like:
"interestingTerms":
["field_b:foo",5.0,"field_b:bar",2.9085307,"field_b:baz",1.67070794]
(Since the only thing about this initial query that's interesting is the interestingTerms, throwing in an fq that rules out all docs could help it skip unnecessary scoring work.)
Explicitly re-composing that interestingTerms info into a new OR query field_a:foo^5.0 field_a:bar^2.9085307 field_a:baz^1.67070794 amounts to using the B field example text to find documents that are similar in field A, and may be mimicking exactly the kind of query default MLT does on its usual model field.
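The re-composition step can be sketched as below, taking the flat term/boost list in the shape shown above and swapping the field name. The `retarget_terms` helper is a made-up name, and the sketch assumes the terms contain no characters that would need escaping in the query syntax.

```python
def retarget_terms(interesting_terms, target_field):
    """Rewrite ["field_b:foo", 5.0, ...] into 'field_a:foo^5.0 ...'."""
    # The list alternates between "field:term" strings and their boosts.
    pairs = zip(interesting_terms[0::2], interesting_terms[1::2])
    clauses = []
    for field_term, boost in pairs:
        _, term = field_term.split(":", 1)
        clauses.append(f"{target_field}:{term}^{boost}")
    # Space-separated clauses act as an OR query under the default operator.
    return " ".join(clauses)

terms = ["field_b:foo", 5.0, "field_b:bar", 2.9085307, "field_b:baz", 1.67070794]
query = retarget_terms(terms, "field_a")
```

The resulting string would be sent as the q of a second, ordinary query, so this option costs two round trips per model document.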
Second option: Grab the model document's actual field B text, and feed it directly as a ContentStream body, to be used in lieu of a query, for specifying the model document. Then target mlt.fl at field A for the sake of collecting similar results. For example, a fragment of the parameters might be …&stream.body=foo bar baz&mlt.fl=field_a&…. Again, the net effect being that model text originally from field_b is finding documents similar only in field_a.
