Suppose there are several docs having one of the fields clientID, values from ranging 1 to 100.
Query 1:
FQ: **clientID:1 OR clientID:2 OR clientID:3 or clientID:5 or clientID:7 or client ID:8**
Query 2:
FQ: **clientID:[1 TO 3] or clientID:5 or clientID:[7 TO 8]**
Question:
Will there be a big performance difference between these two queries? If yes, how?
Doesn't SOLR do the preprocessing of translating such range values if given in multiple ORs?
There might be - depending on cached entries, etc. The second query will be two range queries and a regular query combined into three boolean clauses, while the first one will be six different boolean clauses.
Speed probably won't differ too much for your example, but as the number of clauses grow, the latter will keep the number of sets to be intersected lower than the first one. To get exact data - try it out - your core will be different from other people's cores.
And no, Solr won't preprocess anything. That's handed over to Lucene to do as it pleases, but a range query can be resolved in a different way than a exact field query. There can be entries between the terms given in your pure boolean query, so you can't translate it into a range query and expect the same result, and you can't do it the other way around either - since the field may not be integer (and even integer types differ in how they're being indexed).
The important part is usually that the fq will be cached separately, so it's usually more important to keep it re-usable across queries.
If you use the default numeric types, Solr index more than one precision for each number, (look for trieIntField and IntPointField in Solr field types
so, when when you index a 15, it index it as 15 and as 10, and when you index a 9 it index it as a 9 and as 0. When you search for a 8 - 21 range, it converts the search to a number[8] or number[9] or number[10] or number[20] or number[21]
(with binary ranges instead of decimal, but I hope you get the idea). So I suggest you use the range queries and let Solr manage the optimizations.
PointField types are the replacement for TrieFields, functionally are similar but use another data structures to store the information. So if you have a legacy index you can use the triefields, but if you are making new ones the PointFields are recommended.
Related
I'm using Apache Solr for conducting search queries on some of my computer's internal documents (stored in a database). I'm getting really bizarre results for search queries ordered by descending relevancy. For example, I have 5 words in my search query. The most relevant of 4 results, is a document containing only 2 of those words multiple times. The only document containing all the words is dead last. If I change the words around in just the right way, then I see a better ranking order with the right article as the most relevant. How do I go about fixing this? In my view, the document containing all 5 of the words, should rank higher than a document that has only two of those words (stated more frequently).
What Solr did is a correct algorithm called TF-IDF.
So, in your case, order could be explained by this formula.
One of the possible solutions is to ignore TF-IDF score and count one hit in the document as one, than simply document with 5 matches will get score 5, 4 matches will get 4, etc. Constant Score query could do the trick:
Constant score queries are created with ^=, which
sets the entire clause to the specified score for any documents
matching that clause. This is desirable when you only care about
matches for a particular clause and don't want other relevancy factors
such as term frequency (the number of times the term appears in the
field) or inverse document frequency (a measure across the whole index
for how rare a term is in a field).
Possible example of the query:
text:Julian^=1 text:Cribb^=1 text:EPA^=1 text:peak^=1 text:oil^=1
Another solution which will require some scripting will be something like this, at first you need a query where you will ask everything contains exactly 5 elements, e.g. +Julian +Cribb +EPA +peak +oil, then you will do the same for combination of 4 elements out of 5, if I'm not mistaken it will require additional 5 queries and back forth, until you check everything till 1 mandatory clause. Then you will have full results, and you only need to normalise results or just concatenate them, if you decided that 5-matched docs always better than 4-matched docs. Cons of this solution - a lot of queries, need to run them programmatically, some script would help, normalisation isn't obvious. Pros - you will keep both TF-IDF and the idea of matched terms.
For a specific facet field of our Solr documents, it would make way more sense to be able to sort facets by their relative "interesting-ness" i.e. their tf-idf score, rather than by popularity. This would make it easy to automatically get rid of unwanted common English words, as both their TF and DF would be high.
When a query is made, TF should be calculated, using all the documents that participate in teh results list.
I assume that the only problem with this approach would be when no query is made, resp., when one searches for ":". Then, no term will prevail over the others in terms of interestingness. Please, correct me if I am wrong here.
Anyway,is this possible? What other relative measurements of "interesting-ness" would you suggest?
facet.sort
This param determines the ordering of the facet field constraints.
count - sort the constraints by count (highest count first) index - to
return the constraints sorted in their index order (lexicographic by
indexed term). For terms in the ascii range, this will be
alphabetically sorted. The default is count if facet.limit is greater
than 0, index otherwise.
Prior to Solr1.4, one needed to use true instead of count and false
instead of index.
This parameter can be specified on a per field basis.
It looks like you couldn't do it out of the box without some serious changes on client side or in Solr.
This is a very interesting idea and I have been searching around for some time to find a solution. Anything new in this area?
I assume that for facets with a limited number of possible values, an interestingness-score can be computed on the client side: For a given result set based on a filter, we can exclude this filter for the facet using the local params-syntax (!tag & !ex) Local Params - On the client side, we can than compute relative compared to the complete index (or another subpart of a filter). This would probably not work for result sets build by a query-parameter.
However, for an indexed text-field with many potential values, such as a fulltext-field, one would have to retrieve df-counts for all terms. I imagine this could be done efficiently using the terms component and probably should be cached on the client-side / in memory to increase efficiency. This appears to be a cumbersome method, however, and doesn't give the flexibility to exclude only certain filters.
For these cases, it would probably be better to implement this within solr as a new option for facet.sort, because the information needed is easily available at the time facet counts are computed.
There has been a discussion about this way back in 2009.
Currently, with the larger flexibility of facet.json, e.g. sorting on stats-facets (e.g. avg(price)) of another field, I guess this could be implemented as an additional sort-option. At least for facets of type term, the result-count (df for current result-set) only needs to be divided by the df of that term for the index (docfreq). If the current result-set is the complete index, facets should be sorted by count.
I will probably implement a workaround in the client for fields with a fixed and rather small vocabulary, e.g. based on a second, cashed query on the complete index. However, for term-fields and similar this might not scale.
Say I'm trying to query a bunch of documents that have categories and I want to limit the queries to a specified category (as I understand it this would just be using the fq parameter (filter query).
I was wondering if there is a performance improvement for having the parameter be an integer instead of a string or something as is usually the case with data? I would just err on the right side but I thought I'd double check in case it didn't matter very much and Solr performed some sort of optimization under the hood?
It would be much more convenient if I could just filter on string matches but..
Thanks for any tips folks
Unless you need to perform range queries (numeric fields have special support for this) or sorting (the int field cache is more memory-efficient than the String field cache), they should be roughly equivalent.
I am using solr search with faceting in my application. My use case is in such a way that the index files in the datadir keeps on changing.
The problem is, when I facet based on a particular field. I get the value from the indices that where previously in the data dir (and are not present currently). However they are returned with a value of 0. I don't understand where the values from the previous indices are persisted and are returned during a totally newer search?
Though I can simply skip the facets with count 0, I understand that this can seriously eat over my scalability. Any pointers to not include the facets from previous searchers?
[Edit 1] : The current workaround I am using is add a facet.mincount=1 in my URL. But still, I guess this can eat over my performance.
I couldnt find a comment option & I dont have enough reputation to vote-up!
I have the same exact problem.
We are using atomic updates with solr 4.2.
I found some explanation here: http://collab.sakaiproject.org/pipermail/oae-dev/2011-November/000693.html
Excerpt:
To efficiently handle facets for multi-valued fields (like tags), Solr
builds an "uninverted index" (which you think would just be called an
"index", but I suppose that's even more confusing), which maps
internal document IDs to the list of terms they contain. Calculating
facets from this data structure just requires walking over every
document in the result set, looking up the terms it contains in the
uninverted index, and adding them to the tally for all documents.
However, there's a sneaky optimisation here that causes the zero
counts we're seeing. For terms that appear in more than 5% of
documents, Solr doesn't include them in the uninverted index (leaving
them out helps to keep the size in memory down, I guess), and instead
gets the count for these terms using a regular query against the
Lucene index. Since the set of "common" terms isn't specific to your
result set, and since any given result set won't necessarily contain
all of these terms, you can get back counts of zero.
It may not be from old index values but just terms that exist in more than 5% of documents?
I think facet.mincount=n is not a workaround, you should use it to get only the non-negative facet count.
solrQuery.setQuery("*:*");
solrQuery.addFacetField("foobar");
solrQuery.setFacetMinCount(1);
In boolean retrieval model query consist of terms which are combined together using different operators. Conjunction is most obvious choice at first glance, but when query length growth bad things happened. Recall dropped significantly when using conjunction and precision dropped when using disjunction (for example, stanford OR university).
As for now we use conjunction is our search system (and boolean retrieval model). And we have a problem if user enter some very rare word or long sequence of word. For example, if user enters toyota corolla 4wd automatic 1995, we probably doesn't have one. But if we delete at least one word from a query, we have such documents. As far as I understand in Vector Space Model this problem solved automatically. We does not filter documents on the fact of term presence, we rank documents using presence of terms.
So I'm interested in more advanced ways of combining terms in boolean retrieval model and methods of rare term elimination in boolean retrieval model.
It seems like the sky's the limit in terms of defining a ranking function here. You could define a vector where the wi are: 0 if the ith search term doesn't appear in the file, 1 if it does; the number of times search term i appears in the file; etc. Then, rank pages based on e.g. Manhattan distance, Euclidean distance, etc. and sort in descending order, possibly culling results with distance below a specified match tolerance.
If you want to handle more complex queries, you can put the query into CNF - e.g. (term1 or term2 or ... termn) AND (item1 or item2 or ... itemk) AND ... and then redefine the weights wi accordingly. You could list with each result the terms that failed to match in the file... so that the users would at least know how good a match it is.
I guess what I'm really trying to say is that to really get an answer that works for you, you have to define exactly what you are willing to accept as a valid search result. Under the strict interpretation, a query that is looking for A1 and A2 and ... Am should fail if any of the terms is missing...