Can I combine result sets in Solr - solr

I want to do the following:
Let A be the set of documents, each with the field important:true, and with a date beginning with this year, or previous year. The result set should be ordered by date. In pseudo code:
Result set A:
q="testquery" +important:true AND +(date:2015* OR date:2016*)
sort=date desc
Then, let B be the remaining set of documents, i.e. those with important:true and a date preceding year 2015, and also all documents with important:false. This set should also be ordered by date. Again in very sloppy pseudo code:
Result set B:
q="testquery" -(date:2015* OR date:2016*)
sort=date desc
Now, I would like to return A followed by B, and still be able to use the paging features etc. I am very new to Solr (< 10 hours of trying out different queries) and I can't figure out how to accomplish this behavior. I guess I cannot use bq since we don't sort by score, right?
An example of the desired outcome:
<result name="response" numFound="2089" start="0">
<doc>
<bool name="important">true</bool>
<str name="date">2016-03-01 00:00:00</str>
</doc>
<doc>
<bool name="important">true</bool>
<str name="date">2015-12-01 00:00:00</str>
</doc>
<doc>
<bool name="important">true</bool>
<str name="date">2015-04-01 00:00:00</str>
</doc>
<doc>
<bool name="important">true</bool>
<str name="date">2015-01-01 00:00:00</str>
</doc>
<doc>
<bool name="important">false</bool>
<str name="date">2016-10-01 00:00:00</str>
</doc>
<doc>
<bool name="important">false</bool>
<str name="date">2015-03-01 00:00:00</str>
</doc>
<doc>
<bool name="important">false</bool>
<str name="date">2014-02-01 00:00:00</str>
</doc>
<doc>
<bool name="important">true</bool>
<str name="date">2014-09-01 00:00:00</str>
</doc>
<doc>
<bool name="important">false</bool>
<str name="date">2013-05-01 00:00:00</str>
</doc>
<doc>
<str name="date">2012-09-01 00:00:00</str>
</doc>
</result>
</response>
Notice in the example above that for documents older than 2015, the documents marked important are no more important than any other; they appear in strict chronological order.
Any help is appreciated, but I would especially love examples using SolrNet syntax :)
EDIT:
I can not make any changes to index or schema...

Query: ((important: true AND (date:2016* OR date:2015*))^1001 OR (important: false AND (date:2016* OR date:2015*))^1000 OR date:*) AND something:"foo"
Sort: score desc, date desc
This will show recent important items first, then recent non-important items, and finally all items, and everything sorted by date in their 'sections'.
something:"foo" at the end of the clause refers to any extra clauses you might have.

The main challenge here, I feel, is sorting by date. Without that, you could easily boost your special privilege query to be at the front. But sorting by date afterwards would reset this and you would be back where you started.
It is possible however to sort by several fields. So, if your special condition could be encoded as a field value during indexing, you could sort by that first, then by date.
If that's not possible to do during the indexing, you may need to add a second trick. It is possible to sort by a function query instead of a field. So, you would need to build a function query expression (probably using if and ms at least) that represents your boost condition.
You may have some challenges representing your 2015/2016 as a condition. If it is a date, you may be able to use date math to create a consistent round-down to a year (NOW/YEAR).
I would start by doing a simpler problem of just pushing the important item to the top, still sorted by date. Just to test that my logic here works. If/once that works with functions and sort and paging, the special dates can be added into the calculation.
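The function-query sort described above could look roughly like the sketch below. It is only an illustration, not a tested recipe: the parameter name pri is made up here, and it assumes that Solr's exists(), query() and if() functions behave as documented for your version and that date is a plain string field, as in the question. The resulting sort string can be passed through any client that lets you set raw parameters.

```python
from urllib.parse import urlencode, parse_qs


def build_tiered_params(user_query, priority_query, start=0, rows=10):
    """Build request parameters that put 'priority' docs first, then sort by date.

    'pri' is a hypothetical parameter name; it is only dereferenced
    by the $pri reference inside the sort function.
    """
    return {
        "q": user_query,
        "pri": priority_query,
        # 1 for docs matching the priority query, 0 for everything else;
        # within each tier the date sort applies.
        "sort": "if(exists(query($pri)),1,0) desc, date desc",
        "start": start,
        "rows": rows,
    }


params = build_tiered_params(
    "testquery",
    "important:true AND (date:2015* OR date:2016*)",
)
query_string = urlencode(params)
print("/select?" + query_string)
```

Because the tiering lives entirely in the sort parameter, start and rows keep working for paging without any index or schema change.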

Related

Sorting and boosting

Is it possible to sorting a boosting query in Solr?
I have the following situation:
<doc id="A">
<str name="PUB_DATE">2017-04-19T11:08:30Z</str>
<str name="TIPOLOGY">TWO</str>
</doc>
<doc id="B">
<str name="PUB_DATE">2017-04-19T11:08:30Z</str>
<str name="TIPOLOGY">ONE</str>
</doc>
<doc id="C">
<str name="PUB_DATE">2017-04-19T11:08:30Z</str>
<str name="TIPOLOGY">THREE</str>
</doc>
<doc id="D">
<str name="PUB_DATE">2017-04-20T11:08:30Z</str>
<str name="TIPOLOGY">ONE</str>
</doc>
The idea is:
First of all, sort by PUB_DATE desc.
In case of the same PUB_DATE, boost by the TIPOLOGY field: ONE, TWO, THREE.
So for the above example, the Solr query should return D --> B --> A --> C
I tried the following query but it doesn't work:
/select?defType=edismax&q=XXXXXXXX&sort=PUB_DATE+desc&bq=TIPOLOGY:ONE^100+TIPOLOGY:B^10++TIPOLOGY:C^1
Your sort param asks for the results to be sorted only by PUB_DATE, while the bq param only affects the score of each doc.
What you need to do is ask for them to be sorted by PUB_DATE first, and then by score, like this:
/select?defType=edismax&q=XXXXXXXX&sort=PUB_DATE desc, score desc&bq=TIPOLOGY:ONE^100+TIPOLOGY:TWO^10+TIPOLOGY:THREE^1
(Note that the bq clauses have to match the TIPOLOGY values ONE, TWO and THREE; B and C in your query are document ids.)
This works as long as your bq boosts are large enough to put the docs' scores in the right order, since the score is also influenced by the q=XXXXXX part.
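The intended ordering (date first, boost only as a tie-breaker) can be illustrated outside Solr with a plain two-key sort. This is just a sketch of the ordering logic: the BOOSTS table stands in for the bq weights and is not how Solr computes scores.

```python
# The four documents from the question.
docs = [
    {"id": "A", "PUB_DATE": "2017-04-19T11:08:30Z", "TIPOLOGY": "TWO"},
    {"id": "B", "PUB_DATE": "2017-04-19T11:08:30Z", "TIPOLOGY": "ONE"},
    {"id": "C", "PUB_DATE": "2017-04-19T11:08:30Z", "TIPOLOGY": "THREE"},
    {"id": "D", "PUB_DATE": "2017-04-20T11:08:30Z", "TIPOLOGY": "ONE"},
]

# Stand-in for the bq boosts (assumption: the score is dominated by bq).
BOOSTS = {"ONE": 100, "TWO": 10, "THREE": 1}


def sort_key(doc):
    # ISO 8601 timestamps in the same zone sort correctly as plain strings.
    return (doc["PUB_DATE"], BOOSTS[doc["TIPOLOGY"]])


# reverse=True gives "PUB_DATE desc, score desc".
order = [d["id"] for d in sorted(docs, key=sort_key, reverse=True)]
print(order)
```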

How to boost repeated values in a multiValue field on Solr

I have some repeated (same strings) data in a multiValue field in my Solr index and I want to boost documents by the number of matches in that field. For example:
doc1 : { locales : ['en_US', 'de_DE', 'fr_FR', 'en_US'] }
doc2 : { locales : ['en_US'] }
When I run the query q=locales:en_US I would like to see doc1 at the top because it has two "en_US" values. What is the proper way to boost this kind of data?
Should I use a special tokenizer?
Solr version is: 4.5
Disclaimer
In order to use either of the following solutions you will need to make either one of the following changes:
Create a copyField for locales:
<field name="locales" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- No need to store(stored="false") locales_text as it will only be used for searching/sorting/boosting -->
<field name="locales_text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="locales" dest="locales_text"/>
Change the type of locales to "text_general" (the type is provided in the standard solr collection1)
First solution (Ordering):
Results can be ordered by some function. So we can order by number of occurrences (termfreq function) in field:
If copyField is used, then sort query will be: termfreq(locales_text,'en_US') DESC
If locales is of text_general type, then sort query will be: termfreq(locales,'en_US') DESC
Example response for copyField option (the result is the same for text_general type):
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="fl">*,score</str>
<str name="sort">termfreq(locales_text,'en_US') DESC</str>
<str name="indent">true</str>
<str name="q">locales:en_US</str>
<str name="_">1383598933337</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="0.5945348">
<doc>
<arr name="locales">
<str>en_US</str>
<str>de_DE</str>
<str>fr_FR</str>
<str>en_US</str>
</arr>
<str name="id">4f9f71f6-7811-4c22-b5d6-c62887983d08</str>
<long name="_version_">1450808563062538240</long>
<float name="score">0.4203996</float></doc>
<doc>
<arr name="locales">
<str>en_US</str>
</arr>
<str name="id">7f93e620-cf7b-4b90-b741-f6edc9db77c9</str>
<long name="_version_">1450808391856291840</long>
<float name="score">0.5945348</float></doc>
</result>
</response>
You can also use fl=*,termfreq(locales_text,'en_US') to see the number of matches.
One thing to keep in mind: it is an order function, not a boost function. If you would rather boost the score based on multiple matches, you will probably be more interested in the second solution.
I included the score in the results to demonstrate what @arun was talking about. You can see that the score is different (probably due to field length)... Quite unexpected (for me) that for a multivalued string field it is the same.
Second solution (Boosting):
If copyField is used, then the query will be : {!boost b=termfreq(locales_text,'en_US')}locales:en_US
If locales is of text_general type, then the query will be: {!boost b=termfreq(locales,'en_US')}locales:en_US
Example response for copyField option (the result is the same for text_general type):
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="lowercaseOperators">true</str>
<str name="fl">*,score</str>
<str name="indent">true</str>
<str name="q">{!boost b=termfreq(locales_text,'en_US')}locales:en_US</str>
<str name="_">1383599910386</str>
<str name="stopwords">true</str>
<str name="wt">xml</str>
<str name="defType">edismax</str>
</lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="1.1890696">
<doc>
<arr name="locales">
<str>en_US</str>
<str>de_DE</str>
<str>fr_FR</str>
<str>en_US</str>
</arr>
<str name="id">4f9f71f6-7811-4c22-b5d6-c62887983d08</str>
<long name="_version_">1450808563062538240</long>
<float name="score">1.1890696</float></doc>
<doc>
<arr name="locales">
<str>en_US</str>
</arr>
<str name="id">7f93e620-cf7b-4b90-b741-f6edc9db77c9</str>
<long name="_version_">1450808391856291840</long>
<float name="score">0.5945348</float></doc>
</result>
</response>
You can see that the score changed significantly. The first document scores twice as high as the second (because there were two matches, each scored as 0.5945348).
Third solution (omitNorms=true)
Based on the answer from @arun I figured that there is also a third option.
If you convert your field to (for example) text_general AND set omitNorms=true for that field, it should have the same result.
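As a small sketch of the boosting variant: the helper below (a hypothetical name, not part of any Solr client) assembles the {!boost} query string used above, and the arithmetic checks that multiplying the single-match score by a term frequency of 2 reproduces the boosted score from the example response.

```python
def boost_by_termfreq(field, term, base_query):
    """Wrap base_query so that its score is multiplied by termfreq(field, term)."""
    return f"{{!boost b=termfreq({field},'{term}')}}{base_query}"


q = boost_by_termfreq("locales_text", "en_US", "locales:en_US")
print(q)

# Scores taken from the example response above.
single_match_score = 0.5945348
boosted_score = 2 * single_match_score  # termfreq is 2 for the first document
print(boosted_score)
```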
The default standard request handler in Solr does not use only the term frequency to compute the scores. Along with term frequency, it also uses the length of the field. See the lucene scoring algorithm, where it says:
lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score.
Since doc2 has a shorter field it might have scored higher. Check the score for the results with fl=*,score in your query. To know how Solr arrived at the score, use fl=*,score&wt=xml&debugQuery=on (then right click on your browser and view page-source to see a properly indented score calculation). I believe you will see the lengthNorm contributing to a lower score for doc1.
To have length of the field not contribute to the score, you need to disable it. Set omitNorms=true for that field. (Ref: http://wiki.apache.org/solr/SchemaXml) Then see what the scores are.
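To make the effect of the length norm concrete, here is a rough sketch assuming Lucene's classic TF-IDF similarity (tf = sqrt(freq), lengthNorm = 1/sqrt(number of tokens)). It ignores idf and query normalization, which are identical for both documents, and it ignores the lossy encoding of stored norms. Under those assumptions the ratio comes out at about 0.7071, which agrees with the scores 0.4203996 and 0.5945348 in the first example response.

```python
import math


def classic_tf(freq):
    # Term-frequency factor of the classic similarity.
    return math.sqrt(freq)


def classic_length_norm(num_tokens):
    # Shorter fields get a larger norm, so they contribute more to the score.
    return 1.0 / math.sqrt(num_tokens)


# doc1: 'en_US' occurs twice among 4 values; doc2: once among 1 value.
doc1_factor = classic_tf(2) * classic_length_norm(4)
doc2_factor = classic_tf(1) * classic_length_norm(1)

ratio_with_norms = doc1_factor / doc2_factor          # doc1 ranks below doc2
ratio_without_norms = classic_tf(2) / classic_tf(1)   # omitNorms=true: doc1 ranks above

print(ratio_with_norms, ratio_without_norms)
```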

Why is Solr returning results that are negated?

Consider the following Solr query:
text:linux -img_src:jpg -img_src:jpeg -img_src:youtube
Solr is still returning results that have the negated terms in the img_src field:
<doc>
<str name="img_src">
http://lh3.ggpht.com/-zek96i2kouM/R9HZD2U-d9I/AAAAAAAAC7Q/Zf_QHmiL10w/Stress.jpg
</str>
</doc>
<doc>
<str name="img_src">
http://lh3.ggpht.com/-zek96i2kouM/R9HZD2U-d9I/AAAAAAAAC7Q/Zf_QHmiL10w/Stress.jpg
</str>
</doc>
<doc>
<str name="img_src">
http://farm9.staticflickr.com/8436/7787223734_5962d16624.jpg
</str>
</doc>
<doc>
<str name="img_src">
http://farm8.staticflickr.com/7246/7787084482_8ee833cc45.jpg
</str>
</doc>
Obviously I'm doing something wrong. What might that be? Thanks.
I ran into similar problems. I believe I used filter queries to achieve the desired results, e.g.:
?q=text:linux&fq=-img_src:jpg&fq=-img_src:jpeg&fq=-img_src:youtube
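If you build this request in code, the repeated fq parameters need a list of pairs rather than a dict. A minimal sketch using only the Python standard library (the /select path is the only Solr-specific part):

```python
from urllib.parse import urlencode, parse_qs

excluded = ["jpg", "jpeg", "youtube"]

# A list of tuples keeps every fq entry; a dict would keep only the last one.
pairs = [("q", "text:linux")] + [("fq", f"-img_src:{term}") for term in excluded]
query_string = urlencode(pairs)
print("/select?" + query_string)
```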

How do I detect "ERROR:SCHEMA-INDEX-MISMATCH" in Solr?

How do I find documents in my index that have a SCHEMA-INDEX-MISMATCH? I have a number of these, and so far I am finding them by trial and error. I want to query for them.
The results that I get have "ERROR:SCHEMA-INDEX-MISMATCH" in a field. An example:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<result name="response" numFound="1" start="0" maxScore="12.993319">
<doc>
<float name="score">12.993319</float>
<str name="articleId">ERROR:SCHEMA-INDEX-MISMATCH,stringValue=555</str>
<str name="articleType">Knowledge Base</str>
<str name="description">Moving to another drive Question: How can I ....</str>
<str name="id">article:555</str>
<str name="title">Moving to another drive</str>
<str name="type">article</str>
</doc>
</result>
</response>
If it matters, my query is along the lines of http://server/solr/select?q=id:%22article:555%22
What is the "type" of articleId?
I had issues with a date field: due to a defect in the indexing program, I had "ERROR:SCHEMA-INDEX-MISMATCH" values. Since these values are outside the bounds of a normal date, I was able to find them with the query "NOT myDateFieldType:[0001-01-01T00:00:00Z TO NOW]".
If you are able to craft this type of query, depending on your data type, you should be able to find these values.

Can I restrict the search to a specific date range?

I want to get all results AFTER a given date, can you do this with solr?
(http://lucene.apache.org/solr/)
Right now the search covers the entire result set; I want to filter for anything after a given date.
Update
This isn't working for me yet.
My returned doc:
trying:
http://www.example.com:8085/solr/select/?q=test&version=2.2&start=0&rows=10&indent=on&indexed_at:2009-08-27T13%3A15%3A27.73Z
<doc>
<str name="apptype">Forum</str>
<str name="collapse">forum:334</str>
<str name="content"> testing </str>
<str name="contentid">357</str>
<str name="createdby">some_user</str>
<str name="date">20090819</str>
<str name="dummy_id">1</str>
<int name="group">5</int>
<date name="indexed_at">2009-08-25T16:48:45.121Z</date>
<str name="rating">000.0</str>
<str name="rawcontent"><p>testing</p></str>
<arr name="roles">
<str>1</str>
<str>2</str>
<str>3</str>
<str>4</str>
<str>14</str>
<str>15</str>
<str>16</str>
</arr>
<int name="section">79</int>
<int name="thread">334</int>
<str name="title">testing</str>
<str name="titlesort">testing</str>
<str name="type">forum</str>
<str name="unique_id">
BLAHBLAH|357
</str>
<str name="url">/blahey/f/79/p/334/357.aspx#357</str>
<str name="user">21625</str>
<str name="username">some_user</str>
</doc>
Yes, you can. I assume you have a field with the date value you want to filter on. Then you do
yourdatefield:[2008-08-27T23:59:59.999Z TO *]
a sample url would be localhost:8983/solr/select?q=yourdatefield:[2008-08-27T23:59:59.999Z TO *]
You want to submit the date part as part of the query, i.e. in the value of q, like this:
localhost:8983/solr/select?q=(text:test+AND+indexed_at:[2009-08-27T13:15:27.73Z TO *])
So the entire query is contained within the q query string parameter.
the format of the date is ISO 8601.
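A minimal sketch of building such a range clause in code, assuming the field is a Solr date field named indexed_at as in the question; the helper name is made up for illustration. The URL encoding is left to the standard library so that colons, brackets and spaces do not get mangled by hand.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode, parse_qs


def after_date_clause(field, moment):
    """Return an open-ended range clause matching docs at or after 'moment'."""
    # Solr expects ISO 8601 in UTC with a trailing Z.
    stamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{field}:[{stamp} TO *]"


cutoff = datetime(2009, 8, 27, 13, 15, 27, tzinfo=timezone.utc)
clause = after_date_clause("indexed_at", cutoff)
q = f"(text:test AND {clause})"
query_string = urlencode({"q": q})
print("/select?" + query_string)
```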
You can add a automatic timestamp to the documents as they are indexed using:
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
in the schema.xml. The default schema has this commented out so if you copied the default, you just need to uncomment it.
You could add that and use olle's suggested search pattern to find the documents indexed after a certain date. (You'd have to replace yourdatefield with timestamp or whatever you name the field in the XML.)
You will need to create a query that compares dates, here is the syntax for queries:
http://wiki.apache.org/solr/SolrQuerySyntax
And here is how you can make date comparisons in the query:
http://lucene.apache.org/solr/api/org/apache/solr/util/DateMathParser.html
