Solr Filter Query - String vs. Int - solr

Say I'm trying to query a bunch of documents that have categories, and I want to limit the queries to a specified category (as I understand it, this would just mean using the fq (filter query) parameter).
I was wondering whether there is a performance improvement from making the filtered field an integer instead of a string, as is usually the case with data. I would just err on the safe side, but I thought I'd double-check in case it doesn't matter much and Solr performs some sort of optimization under the hood.
It would be much more convenient if I could just filter on string matches, but...
Thanks for any tips folks

Unless you need to perform range queries (numeric fields have special support for this) or sorting (the int field cache is more memory-efficient than the String field cache), they should be roughly equivalent.
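As a minimal sketch of what the two variants look like on the wire, the snippet below builds the request parameters for a string filter and an integer filter. The field names category_s and category_id_i and the query text are hypothetical; only the standard q and fq parameters are assumed.

```python
from urllib.parse import urlencode

def build_params(user_query, field, value):
    """Build Solr request parameters with the category restriction in fq."""
    if isinstance(value, str):
        # Quote string values so spaces or special characters stay in one term.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        clause = f'{field}:"{escaped}"'
    else:
        clause = f"{field}:{value}"
    return {"q": user_query, "fq": clause}

# Hypothetical field names, for illustration only.
string_params = build_params("laptop", "category_s", "Home Office")
int_params = build_params("laptop", "category_id_i", 42)
print(urlencode(string_params))
print(urlencode(int_params))
```

Either way the filter is a single exact-match clause, which is why the two are expected to behave about the same unless ranges or sorting come into play.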

Related

SOLR numeric range query versus multiple OR

Suppose there are several docs with a clientID field whose values range from 1 to 100.
Query 1:
FQ: **clientID:1 OR clientID:2 OR clientID:3 OR clientID:5 OR clientID:7 OR clientID:8**
Query 2:
FQ: **clientID:[1 TO 3] OR clientID:5 OR clientID:[7 TO 8]**
Question:
Will there be a big performance difference between these two queries? If yes, how?
Doesn't SOLR do the preprocessing of translating such range values if given in multiple ORs?
There might be - depending on cached entries, etc. The second query will be two range queries and a regular query combined into three boolean clauses, while the first one will be six different boolean clauses.
Speed probably won't differ much for your example, but as the number of clauses grows, the second query will keep the number of sets to be combined lower than the first one. To get exact data, try it out: your core will be different from other people's cores.
And no, Solr won't preprocess anything. That's handed over to Lucene to do as it pleases, but a range query can be resolved in a different way than an exact field query. There can be entries between the terms given in your pure boolean query, so you can't translate it into a range query and expect the same result, and you can't do it the other way around either, since the field may not be an integer field (and even integer types differ in how they're indexed).
The important part is usually that the fq will be cached separately, so it's usually more important to keep it re-usable across queries.
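Since Solr will not rewrite the clauses itself, the collapsing has to happen in the application. The following is a rough sketch of that, assuming the field is known to hold integers only (so nothing can sit between 1 and 3 except 2); the helper name ids_to_fq is made up for this example.

```python
def ids_to_fq(field, ids):
    """Collapse integer ids into range and single-value clauses joined by OR."""
    clauses = []
    ordered = sorted(set(ids))
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the run while the next id is exactly one higher.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        if start == end:
            clauses.append(f"{field}:{start}")
        else:
            clauses.append(f"{field}:[{start} TO {end}]")
        i += 1
    return " OR ".join(clauses)

fq = ids_to_fq("clientID", [1, 2, 3, 5, 7, 8])
print(fq)
```

Sorting and de-duplicating the ids first also makes the resulting fq string stable, which helps it get re-used from the filter cache.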
If you use the default numeric types, Solr indexes more than one precision for each number (look for TrieIntField and IntPointField in the Solr field types). So when you index a 15, it is indexed as 15 and as 10, and when you index a 9, it is indexed as 9 and as 0. When you search for the range 8 to 21, the search is converted to number[8] or number[9] or number[10] or number[20] or number[21] (with binary ranges instead of decimal, but I hope you get the idea). So I suggest you use range queries and let Solr manage the optimizations.
PointField types are the replacement for TrieFields; they are functionally similar but use different data structures to store the information. So if you have a legacy index you can keep using the TrieFields, but for new indexes the PointFields are recommended.
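The decimal analogy above can be played through with a small toy model. This is only an illustration of the idea of covering a range with coarse and fine buckets, assuming non-negative integers and base 10; it is not how Lucene actually encodes Trie or Point fields, and the function name split_range is made up.

```python
def split_range(lo, hi, base=10):
    """Cover [lo, hi] with aligned buckets, returned as (start, width) pairs."""
    terms = []
    while lo <= hi:
        width = 1
        # Widen the bucket while it stays aligned and fully inside the range.
        while lo % (width * base) == 0 and lo + width * base - 1 <= hi:
            width *= base
        terms.append((lo, width))
        lo += width
    return terms

buckets = split_range(8, 21)
print(buckets)
```

For the range 8 to 21 this yields the single values 8 and 9, one bucket of width 10 starting at 10, and the single values 20 and 21, which matches the five clauses listed above.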

If possible, what is the Solr query syntax to filter by doc size?

Solr 4.3.0
I want to find the larger size documents.
I'm trying to build some test data for testing memory usage, but I keep getting the smaller sized documents. So, if I could add a doc size clause to my query it would help me find more suitable documents.
I'm not aware of this possibility, most likely there is no support for it.
I can see one possible approach: you could store the size of the document in a separate field during indexing, and later filter on that field.
Another option is to use the TermVectorComponent, which can return term vectors for the matched documents and so give some idea of how big each document is. It is neither easy nor simple, though.
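A rough sketch of the first approach, done on the client before the document is sent to Solr. The field name doc_size_i is hypothetical and would have to exist in the schema (or match a dynamic field rule), and measuring the size as the UTF-8 length of the JSON serialization is just one possible definition of "size".

```python
import json

def with_size_field(doc, size_field="doc_size_i"):
    """Return a copy of doc with its serialized size in bytes added."""
    size = len(json.dumps(doc, ensure_ascii=False).encode("utf-8"))
    enriched = dict(doc)
    enriched[size_field] = size
    return enriched

def min_size_filter(min_bytes, size_field="doc_size_i"):
    """Build a filter query matching documents of at least min_bytes."""
    return f"{size_field}:[{min_bytes} TO *]"

original = {"id": "1", "body": "x" * 5000}
doc = with_size_field(original)
size_fq = min_size_filter(4096)
print(doc["doc_size_i"], size_fq)
```

With such a field in place, the large documents can be selected with an ordinary range filter, or sorted to the top with a sort on the same field.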
Example of the possibly useful output:
A third possible option (kudos to MatsLindh for the idea) is to sort by the norm() function for a specific field. There are some limitations:
You need to use a classic similarity
The field you're sorting on should contain norms
Example of the sorting function: sort=norm(field_name) desc

Deep paging on facet results

According to https://cwiki.apache.org/confluence/display/solr/Faceting I can use facet.offset and facet.limit to paginate.
I think these are analogous to start and rows for normal query results.
However, wouldn't this be very slow if I have too many facet results? According to https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results:
When you wish to fetch a very large number of sorted results from Solr
to feed into an external system, using very large values for the start
or rows parameters can be very inefficient. Pagination using start
and rows not only require Solr to compute (and sort) in memory all of
the matching documents that should be fetched for the current page,
but also all of the documents that would have appeared on previous
pages.
So for deep paging on normal queries, I'd use a cursorMark instead.
So
1) Am I right that deep paging on facet results using facet.offset has the same performance concerns as the quote above?
2) Is there something like cursorMark or other more efficient deep paging for facet results instead of facet.offset?
Yes. If you take a look at one of the FacetCollector implementations, you will see something like this:
    @Override
    public boolean collect(BytesRef term, int count) {
      if (count > min) {
        // NOTE: we use c>min rather than c>=min as an optimization because we are going in
        // index order, so we already know that the keys are ordered. This can be very
        // important if a lot of the counts are repeated (like zero counts would be).
        spare.copyUTF8Bytes(term);
        queue.add(new SimpleFacets.CountPair<>(spare.toString(), count));
        if (queue.size()>=maxsize) min=queue.last().val;
      }
      return false;
    }
and a little bit above:
maxsize = limit>0 ? offset+limit : Integer.MAX_VALUE-1;
which basically leads to the same problem as deep paging. The code will create a huge BoundedTreeSet (because maxsize is the sum of offset and limit), and the complexity will be about the same as in the deep paging scenario.
However, most of the time I would not expect anybody to have more than 10,000 facet values (a figure off the top of my head, probably even fewer), which shouldn't cause any trouble (until you get into millions of facet values).
Usually facets come from fields with limited semantics (brand, color, state, department, etc.), so the number of distinct values is limited.
In summary: the algorithm is the same as for collecting matched documents, but the nature of facet values should save us from the problem.
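To make the cost visible, here is a simplified model of that collection step. It is not Solr's code: it swaps the BoundedTreeSet for a heap and ignores tie-breaking by index order, but it keeps the same offset + limit entries in memory before the requested page can be cut out.

```python
import heapq

def facet_page(counts, offset, limit):
    """Return one page of (term, count) pairs, highest count first."""
    maxsize = offset + limit  # entries that must be kept to serve this page
    if maxsize <= 0:
        return []
    heap = []  # min-heap holding the best maxsize entries seen so far
    for term, count in counts.items():
        entry = (count, term)
        if len(heap) < maxsize:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # Replace the current smallest entry with the better one.
            heapq.heapreplace(heap, entry)
    ordered = sorted(heap, reverse=True)
    return [(term, count) for count, term in ordered[offset:offset + limit]]

counts = {"red": 50, "blue": 40, "green": 30, "black": 20, "white": 10}
page = facet_page(counts, 2, 2)
print(page)
```

The page itself only has limit entries, but the structure grows with offset, which is exactly the deep paging cost described above.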

How can I sort facets by their tf-idf score, rather than popularity?

For a specific facet field of our Solr documents, it would make way more sense to be able to sort facets by their relative "interesting-ness" i.e. their tf-idf score, rather than by popularity. This would make it easy to automatically get rid of unwanted common English words, as both their TF and DF would be high.
When a query is made, TF should be calculated using all the documents that participate in the results list.
I assume that the only problem with this approach would be when no query is made, that is, when one searches for *:*. Then no term will prevail over the others in terms of interestingness. Please correct me if I am wrong here.
Anyway, is this possible? What other relative measurements of "interesting-ness" would you suggest?
facet.sort
This param determines the ordering of the facet field constraints.
count - sort the constraints by count (highest count first)
index - return the constraints sorted in their index order (lexicographic by indexed term). For terms in the ASCII range, this will be alphabetically sorted.
The default is count if facet.limit is greater than 0, index otherwise.
Prior to Solr 1.4, one needed to use true instead of count and false instead of index.
This parameter can be specified on a per field basis.
It looks like you couldn't do it out of the box without some serious changes on client side or in Solr.
This is a very interesting idea and I have been searching around for some time to find a solution. Anything new in this area?
I assume that for facets with a limited number of possible values, an interestingness score can be computed on the client side. For a given result set based on a filter, we can exclude this filter for the facet using the local params syntax ({!tag} and {!ex}). On the client side, we can then compute the counts relative to the complete index (or to another subset defined by a filter). This would probably not work for result sets built by a query parameter.
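For reference, a small sketch of how tagging a filter and excluding it from a facet looks as request parameters. The field brand, the value acme, and the tag name are made up for the example; the {!tag=...} and {!ex=...} local params are the standard Solr syntax for this.

```python
def facet_params(query, field, selected):
    """Build parameters that filter on a field but facet on it unfiltered."""
    tag = f"{field}_tag"  # arbitrary label linking the fq to the exclusion
    return [
        ("q", query),
        ("fq", f'{{!tag={tag}}}{field}:"{selected}"'),
        ("facet", "true"),
        ("facet.field", f"{{!ex={tag}}}{field}"),
    ]

params = facet_params("*:*", "brand", "acme")
for name, value in params:
    print(name, "=", value)
```

The facet counts returned for the excluded filter then describe the result set as if that filter were absent, which gives the baseline to compare the filtered counts against.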
However, for an indexed text-field with many potential values, such as a fulltext-field, one would have to retrieve df-counts for all terms. I imagine this could be done efficiently using the terms component and probably should be cached on the client-side / in memory to increase efficiency. This appears to be a cumbersome method, however, and doesn't give the flexibility to exclude only certain filters.
For these cases, it would probably be better to implement this within Solr as a new option for facet.sort, because the information needed is readily available at the time facet counts are computed.
There has been a discussion about this way back in 2009.
Currently, with the greater flexibility of the JSON Facet API (json.facet), e.g. sorting on stats facets (such as avg(price)) of another field, I guess this could be implemented as an additional sort option. At least for facets of type terms, the result count (the df for the current result set) only needs to be divided by the df of that term for the whole index (docfreq). If the current result set is the complete index, facets should be sorted by count.
I will probably implement a workaround in the client for fields with a fixed and rather small vocabulary, e.g. based on a second, cached query on the complete index. However, for term fields and similar this might not scale.
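A minimal sketch of that client-side workaround, assuming the facet counts for the current result set and for the complete index have already been fetched into two dictionaries. The function name and the sample numbers are invented for illustration.

```python
def rank_by_interestingness(result_counts, index_counts):
    """Sort facet terms by result-set count divided by index-wide docfreq."""
    scored = []
    for term, count in result_counts.items():
        docfreq = index_counts.get(term, 0)
        if docfreq > 0:  # skip terms with no baseline to compare against
            scored.append((count / docfreq, term))
    # Highest ratio first; ties fall back to alphabetical order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored]

# Invented sample counts: current result set vs. complete index.
result_counts = {"the": 90, "solr": 40, "facet": 12}
index_counts = {"the": 9000, "solr": 50, "facet": 20}
ranking = rank_by_interestingness(result_counts, index_counts)
print(ranking)
```

A term that is frequent everywhere gets a low ratio and sinks to the bottom, while a term concentrated in the current result set rises, even if its raw count is smaller.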

Custom SOLR-sorting that is aware of its neighbours

For a SOLR search, I want to treat some results differently (where the field "is_promoted" is set to "1") to give them a better ranking. After the "normal" query is performed, the order of the results should be rearranged so that approximately 30 % of the results in a given range (say, the first 100 results) should be "promoted results". The ordering of the results should otherwise be preserved.
I thought it would be a good idea to solve this by making a custom SOLR plugin. So I tried writing a SearchComponent, but it seems like you can't change the ordering of search results after it has passed through the QueryComponent (since they are cached)?
One could have written some kind of custom sort function (or a function query?) but the challenge is that the algorithm needs to know about the score/ordering of the other surrounding results. A simple increase in the score won't do the trick.
Any suggestions on how this should be implemented?
Just answered this question on the Solr users list. The RankQuery feature in Solr 4.9 is designed to solve this type of problem. You can read about RankQueries here: http://heliosearch.org/solrs-new-rankquery-feature/
