Getting facet count 0 in solr - solr

I am using solr search with faceting in my application. My use case is in such a way that the index files in the datadir keeps on changing.
The problem is, when I facet based on a particular field. I get the value from the indices that where previously in the data dir (and are not present currently). However they are returned with a value of 0. I don't understand where the values from the previous indices are persisted and are returned during a totally newer search?
Though I can simply skip the facets with count 0, I understand that this can seriously eat over my scalability. Any pointers to not include the facets from previous searchers?
[Edit 1] : The current workaround I am using is add a facet.mincount=1 in my URL. But still, I guess this can eat over my performance.

I couldnt find a comment option & I dont have enough reputation to vote-up!
I have the same exact problem.
We are using atomic updates with solr 4.2.
I found some explanation here: http://collab.sakaiproject.org/pipermail/oae-dev/2011-November/000693.html
Excerpt:
To efficiently handle facets for multi-valued fields (like tags), Solr
builds an "uninverted index" (which you think would just be called an
"index", but I suppose that's even more confusing), which maps
internal document IDs to the list of terms they contain. Calculating
facets from this data structure just requires walking over every
document in the result set, looking up the terms it contains in the
uninverted index, and adding them to the tally for all documents.
However, there's a sneaky optimisation here that causes the zero
counts we're seeing. For terms that appear in more than 5% of
documents, Solr doesn't include them in the uninverted index (leaving
them out helps to keep the size in memory down, I guess), and instead
gets the count for these terms using a regular query against the
Lucene index. Since the set of "common" terms isn't specific to your
result set, and since any given result set won't necessarily contain
all of these terms, you can get back counts of zero.
It may not be from old index values but just terms that exist in more than 5% of documents?

I think facet.mincount=n is not a workaround, you should use it to get only the non-negative facet count.
solrQuery.setQuery("*:*");
solrQuery.addFacetField("foobar");
solrQuery.setFacetMinCount(1);

Related

Sort by constant number

I need to randomize Solr (6.6.2) search results, but the order needs to be consistent given a specific seed. This is for a paginated search that returns a limited result set from a much larger one, so I must do the ordering at the query level and not at the application level once the data has been fetched.
Initially I tried this:
https://localhost:8984/solr/some_index/select?q=*:*&sort=random_999+ASC
Where 999 is a constant that is fed in when constructing the query prior to sending it to Solr. The constant value changes for each new search.
This solution works. However, when I run the query a few times, or run it on different Solr instances, the ordering is different.
After doing some reading, random_ generates a number via:
fieldName.hashCode() + context.docBase + (int)top.getVersion()
This means that when the random number is generated, it takes the index version into account. This becomes problematic when using a distributed architecture or when indexes are updated, as is well explained here.
There are various recommended solutions online, but I am trying to avoid writing a custom random override. Is there some type of trick where I can feed in some type of function or equation to the sort param?
For example:
min(999,random_999)
Though this always results in the same order, even when either of the values change.
This question is somewhat similar to this other question, but not quite.
I searched for answers on SO containing solr.RandomSortField, and while they point out what the issue is, none of them have a solution. It seems the best way would be to override the solr.RandomSortField logic, but it's not clear how.
Prior Research
https://lucene.472066.n3.nabble.com/Random-sorting-and-result-consistency-across-successive-calls-based-on-seed-td4170508.html
Solr: Random sort order after index version change
https://mail-archives.apache.org/mod_mbox/lucene-dev/201811.mbox/%3CJIRA.13196983.1541639245000.300557.1541639520069#Atlassian.JIRA%3E
Solr - Return random results (Sort by Random)
https://realize.be/blog/random-results-apache-solr-and-drupal
https://lucene.472066.n3.nabble.com/Sorting-with-customized-function-of-score-td3987281.html
Even after implementing a custom random sort field, the results still differed across instances of Solr.
I ended up adding a new field that is populated at index time which is a 32 bit hash of an ID field that already existed in the document.
I then built a "stateless" linear congruential generator to produce a set of acceptably random numbers to use for sorting:
?sort=mod(product(hash_int_id,{seedConstant},982451653), 104395301) asc
Since this function technically passes a new seed for each row, and because it does not store state (like rand.Next() would), this solution is admittedly inferior and it is not a true PRNG; however, it does seem to get me most of the way there. Note that you will have to tune your values depending on the size of your data set and the size of the values in your hash_int_id equivalent field.

SOLR: size of term dictionary and how to prune it

Two related questions:
Q1. I would like to find out the term dictionary size (in number of terms) of a core.
One thing I do know how to do is to list the file size of *.tim. For example:
> du -ch *.tim | tail -1
1,3G total
But how can I convert this to number of terms? Even a rough estimate would suffice.
Q2. A typical technique in search is to "prune" the index by removing all rare (very low frequency) terms. The objective is not to prune the size of the index, but the size of the actual term dictionary. What would be the simpler way to do this in SOLR, or programatically in SOLRj?
More exactly: I wish to eliminate these terms (tokens) from an existing index (term dictionary and all the other places in the index). The result should be similar to 1) adding the terms to a stop word list, 2) re-indexing an entire collection, 3) removing the terms from the stop word list.
You can get information in the Schema Browser page and click in "Load Term info", in the luke admin handler https://wiki.apache.org/solr/LukeRequestHandler and also, in then stats component https://cwiki.apache.org/confluence/display/solr/The+Stats+Component.
To prune the index, you could do it by do a facet of the field, and get the terms with low frecuency. Then, get the docs and update the document without this term (this could be difficult because it's depends the analyzers and tokenizers of your field). Also, you can use the lucene libraries to open the index and do it programmatically.
You can check the number and distribution of your terms with the AdminUI under the collection's Schema Browser screen. You need to Load Term Info:
Or you can use Luke which allows you to look inside the Lucene index.
It is not clear what you mean to 'remove'. You can add them to the stopwords in the analyzer chain for example if you want to avoid indexing them.

How can I sort facets by their tf-idf score, rather than popularity?

For a specific facet field of our Solr documents, it would make way more sense to be able to sort facets by their relative "interesting-ness" i.e. their tf-idf score, rather than by popularity. This would make it easy to automatically get rid of unwanted common English words, as both their TF and DF would be high.
When a query is made, TF should be calculated, using all the documents that participate in teh results list.
I assume that the only problem with this approach would be when no query is made, resp., when one searches for ":". Then, no term will prevail over the others in terms of interestingness. Please, correct me if I am wrong here.
Anyway,is this possible? What other relative measurements of "interesting-ness" would you suggest?
facet.sort
This param determines the ordering of the facet field constraints.
count - sort the constraints by count (highest count first) index - to
return the constraints sorted in their index order (lexicographic by
indexed term). For terms in the ascii range, this will be
alphabetically sorted. The default is count if facet.limit is greater
than 0, index otherwise.
Prior to Solr1.4, one needed to use true instead of count and false
instead of index.
This parameter can be specified on a per field basis.
It looks like you couldn't do it out of the box without some serious changes on client side or in Solr.
This is a very interesting idea and I have been searching around for some time to find a solution. Anything new in this area?
I assume that for facets with a limited number of possible values, an interestingness-score can be computed on the client side: For a given result set based on a filter, we can exclude this filter for the facet using the local params-syntax (!tag & !ex) Local Params - On the client side, we can than compute relative compared to the complete index (or another subpart of a filter). This would probably not work for result sets build by a query-parameter.
However, for an indexed text-field with many potential values, such as a fulltext-field, one would have to retrieve df-counts for all terms. I imagine this could be done efficiently using the terms component and probably should be cached on the client-side / in memory to increase efficiency. This appears to be a cumbersome method, however, and doesn't give the flexibility to exclude only certain filters.
For these cases, it would probably be better to implement this within solr as a new option for facet.sort, because the information needed is easily available at the time facet counts are computed.
There has been a discussion about this way back in 2009.
Currently, with the larger flexibility of facet.json, e.g. sorting on stats-facets (e.g. avg(price)) of another field, I guess this could be implemented as an additional sort-option. At least for facets of type term, the result-count (df for current result-set) only needs to be divided by the df of that term for the index (docfreq). If the current result-set is the complete index, facets should be sorted by count.
I will probably implement a workaround in the client for fields with a fixed and rather small vocabulary, e.g. based on a second, cashed query on the complete index. However, for term-fields and similar this might not scale.

Custom SOLR-sorting that is aware of its neighbours

For a SOLR search, I want to treat some results differently (where the field "is_promoted" is set to "1") to give them a better ranking. After the "normal" query is performed, the order of the results should be rearranged so that approximately 30 % of the results in a given range (say, the first 100 results) should be "promoted results". The ordering of the results should otherwise be preserved.
I thought it would be a good idea to solve this by making a custom SOLR plugin. So I tried writing a SearchComponent, but it seems like you can't change the ordering of search results after it has passed through the QueryComponent (since they are cached)?
One could have written some kind of custom sort function (or a function query?) but the challenge is that the algorithm needs to know about the score/ordering of the other surrounding results. A simple increase in the score won't do the trick.
Any suggestions on how this should be implemented?
Just answered this question on the Solr users list. The RankQuery feature in Solr 4.9 is designed to solve this type of problem. You can read about RankQueries here: http://heliosearch.org/solrs-new-rankquery-feature/

How can I find the offset of a particular document in a SOLR Query without iteration?

I have a solr query that looks like this ?q=:&fq=section:10&sort=modified_date+desc&start=0&rows=50
and for example it has 1000000 results. I also know for a fact that a document exists in the overall result set. What I don't know is where it exists, for example is it number 325000 out of 1000000 results
Is there a way to determine what location the document exists inside the given query without iterating through the very large result set?
You certainly know some other info about that document right, that is also in some other field indexed in solr, can't you add more conditions (q or fq) to narrow down the result set?
Otherwise I don't think there is a way to help you to find a given doc among all hits without iterating.

Resources