Difference between 2 solr queries - solr

What is the difference between running the following two solr queries. They seem to be giving me a different number of results.
fq=field1:value1&fq=(field2:value21 OR field2:value22)
versus
fq=field1:value1&fq=field2:value21 OR field2:value22
The first one gives me a larger result set whereas the second one gives me a smaller results set. Does the parentheses have any effect in this case? If so, what is it?

check this manual, for using Boolean operators in SOLR
http://robotlibrarian.billdueber.com/solr-and-boolean-operators/

Related

Sort by constant number

I need to randomize Solr (6.6.2) search results, but the order needs to be consistent given a specific seed. This is for a paginated search that returns a limited result set from a much larger one, so I must do the ordering at the query level and not at the application level once the data has been fetched.
Initially I tried this:
https://localhost:8984/solr/some_index/select?q=*:*&sort=random_999+ASC
Where 999 is a constant that is fed in when constructing the query prior to sending it to Solr. The constant value changes for each new search.
This solution works. However, when I run the query a few times, or run it on different Solr instances, the ordering is different.
After doing some reading, random_ generates a number via:
fieldName.hashCode() + context.docBase + (int)top.getVersion()
This means that when the random number is generated, it takes the index version into account. This becomes problematic when using a distributed architecture or when indexes are updated, as is well explained here.
There are various recommended solutions online, but I am trying to avoid writing a custom random override. Is there some type of trick where I can feed in some type of function or equation to the sort param?
For example:
min(999,random_999)
Though this always results in the same order, even when either of the values change.
This question is somewhat similar to this other question, but not quite.
I searched for answers on SO containing solr.RandomSortField, and while they point out what the issue is, none of them have a solution. It seems the best way would be to override the solr.RandomSortField logic, but it's not clear how.
Prior Research
https://lucene.472066.n3.nabble.com/Random-sorting-and-result-consistency-across-successive-calls-based-on-seed-td4170508.html
Solr: Random sort order after index version change
https://mail-archives.apache.org/mod_mbox/lucene-dev/201811.mbox/%3CJIRA.13196983.1541639245000.300557.1541639520069#Atlassian.JIRA%3E
Solr - Return random results (Sort by Random)
https://realize.be/blog/random-results-apache-solr-and-drupal
https://lucene.472066.n3.nabble.com/Sorting-with-customized-function-of-score-td3987281.html
Even after implementing a custom random sort field, the results still differed across instances of Solr.
I ended up adding a new field that is populated at index time which is a 32 bit hash of an ID field that already existed in the document.
I then built a "stateless" linear congruential generator to produce a set of acceptably random numbers to use for sorting:
?sort=mod(product(hash_int_id,{seedConstant},982451653), 104395301) asc
Since this function technically passes a new seed for each row, and because it does not store state (like rand.Next() would), this solution is admittedly inferior and it is not a true PRNG; however, it does seem to get me most of the way there. Note that you will have to tune your values depending on the size of your data set and the size of the values in your hash_int_id equivalent field.

SOLR numeric range query versus multiple OR

Suppose there are several docs having one of the fields clientID, values from ranging 1 to 100.
Query 1:
FQ: **clientID:1 OR clientID:2 OR clientID:3 or clientID:5 or clientID:7 or client ID:8**
Query 2:
FQ: **clientID:[1 TO 3] or clientID:5 or clientID:[7 TO 8]**
Question:
Will there be a big performance difference between these two queries? If yes, how?
Doesn't SOLR do the preprocessing of translating such range values if given in multiple ORs?
There might be - depending on cached entries, etc. The second query will be two range queries and a regular query combined into three boolean clauses, while the first one will be six different boolean clauses.
Speed probably won't differ too much for your example, but as the number of clauses grow, the latter will keep the number of sets to be intersected lower than the first one. To get exact data - try it out - your core will be different from other people's cores.
And no, Solr won't preprocess anything. That's handed over to Lucene to do as it pleases, but a range query can be resolved in a different way than a exact field query. There can be entries between the terms given in your pure boolean query, so you can't translate it into a range query and expect the same result, and you can't do it the other way around either - since the field may not be integer (and even integer types differ in how they're being indexed).
The important part is usually that the fq will be cached separately, so it's usually more important to keep it re-usable across queries.
If you use the default numeric types, Solr index more than one precision for each number, (look for trieIntField and IntPointField in Solr field types
so, when when you index a 15, it index it as 15 and as 10, and when you index a 9 it index it as a 9 and as 0. When you search for a 8 - 21 range, it converts the search to a number[8] or number[9] or number[10] or number[20] or number[21]
(with binary ranges instead of decimal, but I hope you get the idea). So I suggest you use the range queries and let Solr manage the optimizations.
PointField types are the replacement for TrieFields, functionally are similar but use another data structures to store the information. So if you have a legacy index you can use the triefields, but if you are making new ones the PointFields are recommended.

Custom SOLR-sorting that is aware of its neighbours

For a SOLR search, I want to treat some results differently (where the field "is_promoted" is set to "1") to give them a better ranking. After the "normal" query is performed, the order of the results should be rearranged so that approximately 30 % of the results in a given range (say, the first 100 results) should be "promoted results". The ordering of the results should otherwise be preserved.
I thought it would be a good idea to solve this by making a custom SOLR plugin. So I tried writing a SearchComponent, but it seems like you can't change the ordering of search results after it has passed through the QueryComponent (since they are cached)?
One could have written some kind of custom sort function (or a function query?) but the challenge is that the algorithm needs to know about the score/ordering of the other surrounding results. A simple increase in the score won't do the trick.
Any suggestions on how this should be implemented?
Just answered this question on the Solr users list. The RankQuery feature in Solr 4.9 is designed to solve this type of problem. You can read about RankQueries here: http://heliosearch.org/solrs-new-rankquery-feature/

Double negation in Lucene query

Lets say i have a binary field checked
Lets also assume that 3 documents out of 10 has checked:1 others checked:0
When I search in lucene
checked:1 - returns correct result (3)
checked:0 - returns correct result (7)
-checked:1 - returns correct result (7)
-checked:0 - returns correct result (3)
BUT
-(-(checked:1)) - suddenly returns wrong result (10, i.e. entire data set).
Any idea why lucene query parse acts so weird
Each Lucene query has to contain at least one positive term (either MUST/+ or SHOULD) so it matches at least one document. So your queries -checked:1 and -checked:0 are invalid, and I am surprised you are getting any results.
These queries should (most likely) look like this:
+*:* -checked:1
+*:* -checked:0
Getting back to your problem: double negation makes no sense in Lucene. Why would you have double negation, what are you trying to query?
Generally speaking, don't look at Lucene query operators (! & |) as Boolean operators, they aren't exactly what you think they are.
After some research and trial and error and building up on answer from midas, I have came up with the method to resolve this inconsistency. When I say inconsistency, I mean from a common sense view for a user. From information retrieval prospective, midas has linked an interesting article, which explains why such a query makes no sense.
So, the trick is to keep each negative expression with MatchAllDocsQueryNode class, namely the rewritten query has to look like this:
-(-(checked:1 *:*) *:*)
Then the query will produce the expected result. I have accomplished it by writing my own nodeprocessor class, which performs necessary operations.

How can I find the offset of a particular document in a SOLR Query without iteration?

I have a solr query that looks like this ?q=:&fq=section:10&sort=modified_date+desc&start=0&rows=50
and for example it has 1000000 results. I also know for a fact that a document exists in the overall result set. What I don't know is where it exists, for example is it number 325000 out of 1000000 results
Is there a way to determine what location the document exists inside the given query without iterating through the very large result set?
You certainly know some other info about that document right, that is also in some other field indexed in solr, can't you add more conditions (q or fq) to narrow down the result set?
Otherwise I don't think there is a way to help you to find a given doc among all hits without iterating.

Resources