Deep paging on facet results - solr

According to https://cwiki.apache.org/confluence/display/solr/Faceting I can use facet.offset and facet.limit to paginate.
I think these are analogous to start and rows for normal query results.
However, wouldn't this be very slow if I have too many facet results? According to https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results:
When you wish to fetch a very large number of sorted results from Solr
to feed into an external system, using very large values for the start
or rows parameters can be very inefficient. Pagination using start
and rows not only require Solr to compute (and sort) in memory all of
the matching documents that should be fetched for the current page,
but also all of the documents that would have appeared on previous
pages.
So for deep paging on normal queries, I'd use a cursorMark instead.
So
1) Am I right that deep paging on facet results using facet.offset has the same performance conerns as the quote above?
2) Is there something like cursorMark or other more efficient deep paging for facet results instead of facet.offset?

Yes, if you will take a look into one of the FacetCollector implementation, you will see something like this:
#Override
public boolean collect(BytesRef term, int count) {
if (count > min) {
// NOTE: we use c>min rather than c>=min as an optimization because we are going in
// index order, so we already know that the keys are ordered. This can be very
// important if a lot of the counts are repeated (like zero counts would be).
spare.copyUTF8Bytes(term);
queue.add(new SimpleFacets.CountPair<>(spare.toString(), count));
if (queue.size()>=maxsize) min=queue.last().val;
}
return false;
}
and a little bit above:
maxsize = limit>0 ? offset+limit : Integer.MAX_VALUE-1;
which basically leads to the same problem as for deep paging. The code will create a huge BoundedTreeSet (cause maxsize is determined by sum of offset and limit), and complexity will be around the same as in deep paging scenario.
However, most of the time, I do not expect anybody to have array of facet values larger than 10_000 (got it from the top of my head, probably even less), which shouldn't cause any troubles (until you get millions of facet values).
Usually facets are coming from fields with limited semantics (brand, color, state, department, etc.) and usually these values are limited.
As a summary: algorithm is the same as in collecting matched documents, but the nature of the facet values should save us from the problem.

Related

Sort by constant number

I need to randomize Solr (6.6.2) search results, but the order needs to be consistent given a specific seed. This is for a paginated search that returns a limited result set from a much larger one, so I must do the ordering at the query level and not at the application level once the data has been fetched.
Initially I tried this:
https://localhost:8984/solr/some_index/select?q=*:*&sort=random_999+ASC
Where 999 is a constant that is fed in when constructing the query prior to sending it to Solr. The constant value changes for each new search.
This solution works. However, when I run the query a few times, or run it on different Solr instances, the ordering is different.
After doing some reading, random_ generates a number via:
fieldName.hashCode() + context.docBase + (int)top.getVersion()
This means that when the random number is generated, it takes the index version into account. This becomes problematic when using a distributed architecture or when indexes are updated, as is well explained here.
There are various recommended solutions online, but I am trying to avoid writing a custom random override. Is there some type of trick where I can feed in some type of function or equation to the sort param?
For example:
min(999,random_999)
Though this always results in the same order, even when either of the values change.
This question is somewhat similar to this other question, but not quite.
I searched for answers on SO containing solr.RandomSortField, and while they point out what the issue is, none of them have a solution. It seems the best way would be to override the solr.RandomSortField logic, but it's not clear how.
Prior Research
https://lucene.472066.n3.nabble.com/Random-sorting-and-result-consistency-across-successive-calls-based-on-seed-td4170508.html
Solr: Random sort order after index version change
https://mail-archives.apache.org/mod_mbox/lucene-dev/201811.mbox/%3CJIRA.13196983.1541639245000.300557.1541639520069#Atlassian.JIRA%3E
Solr - Return random results (Sort by Random)
https://realize.be/blog/random-results-apache-solr-and-drupal
https://lucene.472066.n3.nabble.com/Sorting-with-customized-function-of-score-td3987281.html
Even after implementing a custom random sort field, the results still differed across instances of Solr.
I ended up adding a new field that is populated at index time which is a 32 bit hash of an ID field that already existed in the document.
I then built a "stateless" linear congruential generator to produce a set of acceptably random numbers to use for sorting:
?sort=mod(product(hash_int_id,{seedConstant},982451653), 104395301) asc
Since this function technically passes a new seed for each row, and because it does not store state (like rand.Next() would), this solution is admittedly inferior and it is not a true PRNG; however, it does seem to get me most of the way there. Note that you will have to tune your values depending on the size of your data set and the size of the values in your hash_int_id equivalent field.

How can I sort facets by their tf-idf score, rather than popularity?

For a specific facet field of our Solr documents, it would make way more sense to be able to sort facets by their relative "interesting-ness" i.e. their tf-idf score, rather than by popularity. This would make it easy to automatically get rid of unwanted common English words, as both their TF and DF would be high.
When a query is made, TF should be calculated, using all the documents that participate in teh results list.
I assume that the only problem with this approach would be when no query is made, resp., when one searches for ":". Then, no term will prevail over the others in terms of interestingness. Please, correct me if I am wrong here.
Anyway,is this possible? What other relative measurements of "interesting-ness" would you suggest?
facet.sort
This param determines the ordering of the facet field constraints.
count - sort the constraints by count (highest count first) index - to
return the constraints sorted in their index order (lexicographic by
indexed term). For terms in the ascii range, this will be
alphabetically sorted. The default is count if facet.limit is greater
than 0, index otherwise.
Prior to Solr1.4, one needed to use true instead of count and false
instead of index.
This parameter can be specified on a per field basis.
It looks like you couldn't do it out of the box without some serious changes on client side or in Solr.
This is a very interesting idea and I have been searching around for some time to find a solution. Anything new in this area?
I assume that for facets with a limited number of possible values, an interestingness-score can be computed on the client side: For a given result set based on a filter, we can exclude this filter for the facet using the local params-syntax (!tag & !ex) Local Params - On the client side, we can than compute relative compared to the complete index (or another subpart of a filter). This would probably not work for result sets build by a query-parameter.
However, for an indexed text-field with many potential values, such as a fulltext-field, one would have to retrieve df-counts for all terms. I imagine this could be done efficiently using the terms component and probably should be cached on the client-side / in memory to increase efficiency. This appears to be a cumbersome method, however, and doesn't give the flexibility to exclude only certain filters.
For these cases, it would probably be better to implement this within solr as a new option for facet.sort, because the information needed is easily available at the time facet counts are computed.
There has been a discussion about this way back in 2009.
Currently, with the larger flexibility of facet.json, e.g. sorting on stats-facets (e.g. avg(price)) of another field, I guess this could be implemented as an additional sort-option. At least for facets of type term, the result-count (df for current result-set) only needs to be divided by the df of that term for the index (docfreq). If the current result-set is the complete index, facets should be sorted by count.
I will probably implement a workaround in the client for fields with a fixed and rather small vocabulary, e.g. based on a second, cashed query on the complete index. However, for term-fields and similar this might not scale.

Solr Filter Query - String vs. Int

Say I'm trying to query a bunch of documents that have categories and I want to limit the queries to a specified category (as I understand it this would just be using the fq parameter (filter query).
I was wondering if there is a performance improvement for having the parameter be an integer instead of a string or something as is usually the case with data? I would just err on the right side but I thought I'd double check in case it didn't matter very much and Solr performed some sort of optimization under the hood?
It would be much more convenient if I could just filter on string matches but..
Thanks for any tips folks
Unless you need to perform range queries (numeric fields have special support for this) or sorting (the int field cache is more memory-efficient than the String field cache), they should be roughly equivalent.

Search API Number Found Accuracy

When using the Search API we have the option of paging results by setting a limit and offset for the results. We also have the option to return the total number of search results by using the 'number_found' property. In the docs it clearly states that this number is an approximation and not an exact count.
My issue is, in our production application this approximation is wildly inaccurate (e.g. it reports 1200 matches when only 200 records exist). For now we could simply set the 'number_found_accuracy' to the 'MAXIMUM_NUMBER_FOUND_ACCURACY' constant in our query options, however this will only suffice for a month or two until the number of indexed documents exceeds that constant.
For our purposes the 'number_found' count doesn't have to be exact but we would prefer it be close. All that being said, what are my options for ensuring that as the number of documents stored and indexed grow that the document counts remain relatively accurate?

Getting facet count 0 in solr

I am using solr search with faceting in my application. My use case is in such a way that the index files in the datadir keeps on changing.
The problem is, when I facet based on a particular field. I get the value from the indices that where previously in the data dir (and are not present currently). However they are returned with a value of 0. I don't understand where the values from the previous indices are persisted and are returned during a totally newer search?
Though I can simply skip the facets with count 0, I understand that this can seriously eat over my scalability. Any pointers to not include the facets from previous searchers?
[Edit 1] : The current workaround I am using is add a facet.mincount=1 in my URL. But still, I guess this can eat over my performance.
I couldnt find a comment option & I dont have enough reputation to vote-up!
I have the same exact problem.
We are using atomic updates with solr 4.2.
I found some explanation here: http://collab.sakaiproject.org/pipermail/oae-dev/2011-November/000693.html
Excerpt:
To efficiently handle facets for multi-valued fields (like tags), Solr
builds an "uninverted index" (which you think would just be called an
"index", but I suppose that's even more confusing), which maps
internal document IDs to the list of terms they contain. Calculating
facets from this data structure just requires walking over every
document in the result set, looking up the terms it contains in the
uninverted index, and adding them to the tally for all documents.
However, there's a sneaky optimisation here that causes the zero
counts we're seeing. For terms that appear in more than 5% of
documents, Solr doesn't include them in the uninverted index (leaving
them out helps to keep the size in memory down, I guess), and instead
gets the count for these terms using a regular query against the
Lucene index. Since the set of "common" terms isn't specific to your
result set, and since any given result set won't necessarily contain
all of these terms, you can get back counts of zero.
It may not be from old index values but just terms that exist in more than 5% of documents?
I think facet.mincount=n is not a workaround, you should use it to get only the non-negative facet count.
solrQuery.setQuery("*:*");
solrQuery.addFacetField("foobar");
solrQuery.setFacetMinCount(1);

Resources