Solr non sorted results order - solr

I was wondering if I can count on the results order of Solr queries if the queries were not sorted.
For example:
Let's assume there are 100 documents and I want to provide paging by running 10 queries of 10 docs each, incrementing the start position each time.
If I run *:* 10 times while incrementing the start position by 10 each time, can I assume I'll get all 100 docs? Or, since there is no sorting, will I get a different random 10 documents each time?
I know that in SQL databases it won't work; I was wondering if Solr is different.

Solr is different. If you do not specify a sort order, it sorts by score. If the scores are equal, it sorts by the order in which the documents were indexed.
If you query for *:* (MatchAllDocsQuery), all documents have the same score and are returned in the order they were indexed, as described in the SO question How are results ordered in solr in a "match all docs" query
When would the order change? When one of the documents gets updated. It then falls behind its older brothers.
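The paging scheme from the question can be sketched as follows. This is only an illustration of the request parameters (q, start and rows are standard Solr parameters; the helper name page_params is made up here), and it assumes the index is not modified between requests:

```python
from urllib.parse import urlencode

def page_params(page, rows=10):
    """Return the query parameters for one page of a match-all query."""
    return {"q": "*:*", "start": page * rows, "rows": rows}

# Ten pages of ten documents cover the start offsets 0, 10, ..., 90.
pages = [page_params(p) for p in range(10)]

# URL-encoded query strings, ready to append to a select handler URL.
query_strings = [urlencode(p) for p in pages]
```

If documents may be updated while you page, adding an explicit sort on a unique field (for example sort=id asc, where id is an assumed field name) makes the order independent of indexing order.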
If you actually want a random order, Solr has a RandomSortField for this matter:
Utility Field used for random sorting. It should not be passed a value.
This random sorting implementation uses the dynamic field name to set the random 'seed'. To get random sorting order, you need to use a random dynamic field name.
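A minimal sketch of how a client could build such a sort parameter. It assumes the schema defines a dynamic field pattern random_* backed by RandomSortField; the prefix and the helper name random_sort_param are assumptions for illustration, not something taken from the answer:

```python
import random

def random_sort_param(seed=None, prefix="random_"):
    """Build a sort parameter whose dynamic field name carries the seed.

    The prefix must match a dynamicField of type RandomSortField in the
    schema (assumed here to be "random_*").
    """
    if seed is None:
        seed = random.randrange(1_000_000)
    return f"{prefix}{seed} desc"

# A fixed seed gives the same field name, and hence the same sort
# parameter, on every request; a new seed gives a new shuffle.
params = {"q": "*:*", "sort": random_sort_param(seed=1234)}
```

Reusing one seed for all pages of a result set should keep the shuffled order consistent while paging, as long as the index does not change in between.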

Related

Apache Solr's bizarre search relevancy rankings

I'm using Apache Solr to run search queries against some of my computer's internal documents (stored in a database). I'm getting really bizarre results for search queries ordered by descending relevancy. For example, I have 5 words in my search query. The most relevant of the 4 results is a document containing only 2 of those words, multiple times. The only document containing all the words is dead last. If I change the words around in just the right way, I see a better ranking order, with the right article as the most relevant. How do I go about fixing this? In my view, the document containing all 5 words should rank higher than a document that has only two of them (stated more frequently).
What Solr did is correct: it applies a scoring algorithm called TF-IDF.
So, in your case, the order can be explained by that formula: a document that repeats a few of the query terms many times can outscore a document that contains every term only once.
One possible solution is to ignore the TF-IDF score and count each matching term as one, so that a document with 5 matches gets a score of 5, a document with 4 matches gets 4, etc. A constant score query could do the trick:
Constant score queries are created with ^=, which
sets the entire clause to the specified score for any documents
matching that clause. This is desirable when you only care about
matches for a particular clause and don't want other relevancy factors
such as term frequency (the number of times the term appears in the
field) or inverse document frequency (a measure across the whole index
for how rare a term is in a field).
Possible example of the query:
text:Julian^=1 text:Cribb^=1 text:EPA^=1 text:peak^=1 text:oil^=1
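The query above can be generated for any list of terms. This is a small sketch; the function name constant_score_query is made up, and the field name text is taken from the example query:

```python
def constant_score_query(terms, field="text", score=1):
    """Join one constant-score clause (field:term^=score) per term."""
    return " ".join(f"{field}:{term}^={score}" for term in terms)

query = constant_score_query(["Julian", "Cribb", "EPA", "peak", "oil"])
```

With every clause fixed at 1, a document's score is simply the number of query terms it matches.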
Another solution, which requires some scripting, goes like this. First, run a query that requires all 5 terms, e.g. +Julian +Cribb +EPA +peak +oil. Then do the same for every combination of 4 terms out of 5 (if I'm not mistaken, that takes 5 additional queries), and so on, down to queries with a single mandatory clause. Then you will have the full results, and you only need to normalise them, or just concatenate them if you decide that 5-match docs are always better than 4-match docs.
Cons of this solution: a lot of queries, they need to be run programmatically (a script would help), and the normalisation isn't obvious.
Pros: you keep both TF-IDF and the idea of counting matched terms.
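The scripting part could look roughly like this. It is a sketch under assumptions: the function names are invented, running the queries against Solr is left out, and the merge step implements only the simple "concatenate by tier" variant, not any score normalisation:

```python
from itertools import combinations

def mandatory_clause_queries(terms):
    """Yield (k, query) pairs, from all terms mandatory down to one."""
    queries = []
    for k in range(len(terms), 0, -1):
        for combo in combinations(terms, k):
            queries.append((k, " ".join(f"+{term}" for term in combo)))
    return queries

def merge_by_tier(results_per_query):
    """Concatenate results, keeping each doc at the highest tier it reached.

    results_per_query is a list of (k, [doc ids]) in the order produced by
    mandatory_clause_queries, so higher tiers are seen first.
    """
    seen, merged = set(), []
    for k, doc_ids in results_per_query:
        for doc_id in doc_ids:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append((k, doc_id))
    return merged

queries = mandatory_clause_queries(["Julian", "Cribb", "EPA", "peak", "oil"])
```

For 5 terms this produces 31 queries in total (1 + 5 + 10 + 10 + 5), which illustrates the "a lot of queries" drawback.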

Solr: Use a count or weighting number in source documents to weight search results

I have a Solr Index that is largely composed of repeated terms. I'm trying to return results scores weighted by the number of occurrences of each term. The problem with this is that the index is enormous, as a result. I'd like a way to shrink this down. Something like Solr understanding "myterm:500" means that there are 500 instances of myterm for this record. I've also run into an upper limit for multivariate text_general fields by simply repeating terms.
Is there a way to do this? Can I shrink my indexes by several orders of magnitude?

How can I sort facets by their tf-idf score, rather than popularity?

For a specific facet field of our Solr documents, it would make way more sense to be able to sort facets by their relative "interesting-ness" i.e. their tf-idf score, rather than by popularity. This would make it easy to automatically get rid of unwanted common English words, as both their TF and DF would be high.
When a query is made, TF should be calculated using all the documents that participate in the results list.
I assume the only problem with this approach would be when no query is made, i.e., when one searches for *:*. Then no term will prevail over the others in terms of interestingness. Please correct me if I am wrong here.
Anyway, is this possible? What other relative measurements of "interesting-ness" would you suggest?
facet.sort
This param determines the ordering of the facet field constraints.
count - sort the constraints by count (highest count first)
index - return the constraints sorted in their index order (lexicographic by indexed term). For terms in the ASCII range, this will be alphabetically sorted.
The default is count if facet.limit is greater than 0, index otherwise.
Prior to Solr 1.4, one needed to use true instead of count and false instead of index.
This parameter can be specified on a per field basis.
It looks like you can't do it out of the box without some serious changes on the client side or in Solr.
This is a very interesting idea and I have been searching around for some time to find a solution. Anything new in this area?
I assume that for facets with a limited number of possible values, an interestingness score can be computed on the client side. For a given result set based on a filter, we can exclude this filter for the facet using the local params syntax (!tag & !ex; see Local Params). On the client side, we can then compute the counts relative to the complete index (or to another subset defined by a filter). This would probably not work for result sets built by a query parameter.
However, for an indexed text-field with many potential values, such as a fulltext-field, one would have to retrieve df-counts for all terms. I imagine this could be done efficiently using the terms component and probably should be cached on the client-side / in memory to increase efficiency. This appears to be a cumbersome method, however, and doesn't give the flexibility to exclude only certain filters.
For these cases, it would probably be better to implement this within solr as a new option for facet.sort, because the information needed is easily available at the time facet counts are computed.
There has been a discussion about this way back in 2009.
Currently, with the greater flexibility of the JSON Facet API (json.facet), e.g. sorting on stats facets (such as avg(price)) of another field, I guess this could be implemented as an additional sort option. At least for facets of type terms, the result count (the df for the current result set) only needs to be divided by the df of that term for the whole index (docfreq). If the current result set is the complete index, facets should be sorted by count.
I will probably implement a workaround in the client for fields with a fixed and rather small vocabulary, e.g. based on a second, cached query on the complete index. However, for term fields and similar this might not scale.
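The client-side workaround could be sketched like this: divide each facet count in the current result set by the count of the same term in the complete index, then sort by that ratio. The function name and the sample counts are made up for illustration; fetching the two sets of facet counts from Solr is not shown:

```python
def sort_by_interestingness(result_counts, index_counts):
    """Rank facet terms by (count in result set) / (count in whole index)."""
    scored = []
    for term, count in result_counts.items():
        df = index_counts.get(term, 0)
        if df > 0:  # skip terms with no index-wide count to avoid dividing by zero
            scored.append((term, count / df))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Invented numbers: "the" is the most frequent facet value in the results,
# but it is common everywhere, so it ranks last by relative frequency.
result_counts = {"the": 90, "solr": 30, "lucene": 10}
index_counts = {"the": 1000, "solr": 40, "lucene": 50}
ranked = sort_by_interestingness(result_counts, index_counts)
```

The index-wide counts change only when the index changes, so they are a natural candidate for the cached second query mentioned above.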

solr function query demoting based on a field value range?

In my Solr ranking function, I want to demote large documents. So for example, if a document is bigger than 1000 characters or less than 100 characters, I want to demote the score by half. Is there any readily available function query, or should I build a new one?
You can always add an indexed field with the length of the doc, and then use that field for boosting accordingly.
It is quite easy to do that with an UpdateRequestProcessor or just a ScriptUpdateProcessor.
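Once such a length field exists, one way to express the demotion is Solr's map(x,min,max,target,default) function used as a multiplicative boost. This is a sketch under assumptions: the field name doc_length is hypothetical, and the boost parameter shown here is the one provided by the edismax query parser:

```python
def length_boost(field="doc_length", low=100, high=1000, penalty=0.5):
    """Function query: 1 for lengths within [low, high], penalty otherwise."""
    return f"map({field},{low},{high},1,{penalty})"

# "some query" is a placeholder for the user's actual search terms.
params = {
    "defType": "edismax",
    "q": "some query",
    "boost": length_boost(),  # multiplied into each document's score
}
```

Documents between 100 and 1000 characters keep their score, while shorter or longer ones have it halved.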

Is the Solr Query possible with function value comparison?

I have been working with Solr for 3-4 months. I want to know if it is possible to query Solr with the following requirements.
return all the documents where,
fieldName1 = queryTerm1 &
strdist(queryTerm2, fieldName2, JW) > 5 (or some constant)
If this is possible, what will be the query?
I guess you can get close.
Sort the results on string distance (split across lines for readability):
localhost:8983/solr/select/?fl=id
&q=fieldName1:queryTerm1
&sort=strdist("queryTerm2",fieldName2, JW) desc
which will order the results, highest string distance downwards.
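Because the sort expression contains quotes, commas and spaces, it has to be URL-encoded when sent for real. A minimal sketch of building the request URL from the parameters above (the helper name strdist_sort_url is made up; host, path, field names and terms are the placeholders from the example):

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def strdist_sort_url(base="http://localhost:8983/solr/select/"):
    """Build the select URL with a properly encoded strdist sort."""
    params = {
        "fl": "id",
        "q": "fieldName1:queryTerm1",
        "sort": 'strdist("queryTerm2",fieldName2, JW) desc',
    }
    return base + "?" + urlencode(params)

url = strdist_sort_url()

# Round trip: decoding the query string gives back the original parameters.
decoded = parse_qs(urlsplit(url).query)
```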
Note that you cannot directly get the string distance. There is a pseudo-field score, retrieved by:
fl=id,score
but it means nothing in an absolute sense.
You can also boost results based on the string distance, instead of simply sorting them. In this case, it will look at the relevancy of the document as well as the string distance.
Once you have a sorted list (hope it's not too large!), you can determine client-side the elements which have 'string distance < 5'.
I made this up from the links below.
http://yonik.wordpress.com/2011/03/10/solr-relevancy-function-queries/
http://wiki.apache.org/solr/FunctionQuery#strdist
As far as I know, it's not possible.
