I have multiple collections with different fields in their schemas. I would like to search across all of them and apply the default ranking to the combined results.
Example: I have a document with the word ‘mustang’ occurring 3 times in collection A and another with 2 occurrences in collection B. I would like the results to show both documents, with the document from collection A first and the document from collection B second.
Scoring doesn't take only the number of occurrences into account; by default it also depends on how many documents in the collection contain the term. If we're talking about a single term, you can sort by the tf function or something like that. For more complex queries, using collection-wide term frequencies may be the only option (but may be costly).
To create one common collection that queries both, use the CREATEALIAS command in the Collections API. The collections parameter takes a comma separated list of collections that is represented by the alias, allowing you to query both A and B through the alias C.
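As a rough illustration of the alias step, here is a minimal Python sketch that only builds the request URL for CREATEALIAS. The base URL `http://localhost:8983/solr` and the helper name `create_alias_url` are assumptions for the example; adjust them to your installation.

```python
from urllib.parse import urlencode

def create_alias_url(base_url, alias, collections):
    """Build a Collections API CREATEALIAS URL (hypothetical helper)."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        # The alias is defined by a comma-separated list of collections.
        "collections": ",".join(collections),
    }
    return base_url + "/admin/collections?" + urlencode(params)

# Assumed local Solr base URL; alias C covers collections A and B.
url = create_alias_url("http://localhost:8983/solr", "C", ["A", "B"])
print(url)
```

Once the alias exists, queries sent to C are run against both A and B.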
We have a Solr index with multiple collections, i.e. collection_data_sales and collection_data_marketing. When the user performs a search, both collections are queried through a collection alias. Both collections have the same Solr schema.
Is there a way to boost the results from a specific collection?
For example, suppose the user specifies the sales collection. The search should still run on both collection_data_sales and collection_data_marketing, but documents from collection_data_sales should be boosted.
It is enough if you can tell the two collections apart using data in the documents. Let's imagine that the schema has a field named type, so that documents in collection_data_marketing have type:marketing and documents in collection_data_sales have type:sales.
The only thing you now have to do is use a boost function, for example:
bf=sum(product(query($q1),10), product(query($q2),3))&q1=type:sales&q2=type:marketing
In this example sales will have weight 10 and marketing will have weight 3.
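A small Python sketch of how such a boost function could be assembled from a weight table. The helper name `collection_boost_params` is made up for illustration, and the field name `type` is the assumption from the answer above. Note that bf is a dismax/edismax parameter.

```python
def collection_boost_params(weights, field="type"):
    """Build bf plus the q1, q2, ... parameters it dereferences (hypothetical helper)."""
    params = {}
    terms = []
    for i, (value, weight) in enumerate(weights.items(), start=1):
        name = "q" + str(i)
        # Each qN is a plain query that matches one collection's documents.
        params[name] = field + ":" + value
        # query($qN) yields that query's score, which is scaled by the weight.
        terms.append("product(query($" + name + ")," + str(weight) + ")")
    params["bf"] = "sum(" + ",".join(terms) + ")"
    return params

params = collection_boost_params({"sales": 10, "marketing": 3})
print(params["bf"])
```

Swapping the weights at request time lets the application decide which collection to favor without reindexing.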
So I read this: http://wiki.apache.org/solr/SolrCaching#filterCache
and specifically
The filter cache stores the results of any filter queries ("fq"
parameters) that Solr is explicitly asked to execute. (Each filter is
executed and cached separately. When it's time to use them to limit
the number of results returned by a query, this is done using set
intersections.)
So my question is this. Let's say my app filters on a set of different format IDs. The format IDs are numeric, say 1, 2, 3, 4, 5, and many permutations of them are sent in queries as fq parameters.
if I wrote a warming query like this...
...
<str name="fq">format:(1)+OR+format:(2)+OR+format:(3)+OR+format:(4)+OR+format:(5)</str>
...
Would that warm things up and help all my queries trying to filter by various permutations of those formats OR... only folks searching for that permutation?
Should I instead create 5 separate warming queries (1 for each format) to take advantage of "set intersection"?
Or will that query create the sets for each format?
Example queries
...fq=format:(1)+OR+format:(2)...
...fq=format:(1)+OR+format:(3)...
...fq=format:(2)+OR+format:(3)...
...fq=format:(2)+OR+format:(5)...
etc...
So I believe none of those will use the filter cache entry created by the warming query listed above.
See https://wiki.apache.org/solr/CommonQueryParameters#fq. It says:
The document sets from each filter query are cached independently.
Thus, concerning the previous examples: use a single fq containing two
mandatory clauses if those clauses appear together often, and use two
separate fq params if they are relatively independent.
It is one cache entry per fq param specified in your query.
You are not doing set intersection with OR; you are doing set union. But if you were doing set intersection like:
fq=format:(1 AND 2 AND 3 AND 4 AND 5)
(assuming format is a multi-valued field here) and have different subsets of those 5 values like
fq=format:(1 AND 2)
fq=format:(3 AND 4 AND 5)
then issuing separate filter queries like:
fq=format:1&fq=format:2&fq=format:3&fq=format:4&fq=format:5
will help all the subset queries. Here you will have 5 entries in the filter cache and they are intersected for all the subsets.
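To make the "one cache entry per fq" idea concrete, here is a minimal Python sketch that emits one fq parameter per value instead of a single combined clause. The helper name `separate_filters` is hypothetical, and the intersection semantics assume a multi-valued `format` field as stated above.

```python
from urllib.parse import urlencode

def separate_filters(field, values):
    """Return one ("fq", "field:value") pair per value (hypothetical helper)."""
    return [("fq", field + ":" + str(v)) for v in values]

# Each fq is cached on its own, so any subset can reuse the same entries.
params = [("q", "*:*")] + separate_filters("format", [1, 2])
query_string = urlencode(params)
print(query_string)
```

A later request for formats 1 and 3 would then reuse the cached entry for format:1 and only add one for format:3.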
Regarding permutations, i.e. the order in which the values appear in the filter query: I believe the cache uses hashing on the fq param, so you are better off sorting the values first and then forming your filter query.
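A minimal sketch of that normalization in Python, assuming the application builds its own fq strings. The helper name `canonical_or_filter` is hypothetical; it only guarantees that the same set of values always produces the same string.

```python
def canonical_or_filter(field, values):
    """Build an OR filter with de-duplicated, sorted values (hypothetical helper)."""
    unique_sorted = sorted(set(values))
    clause = " OR ".join(str(v) for v in unique_sorted)
    return field + ":(" + clause + ")"

# The same values in a different order yield an identical fq string.
fq_a = canonical_or_filter("format", [3, 1])
fq_b = canonical_or_filter("format", [1, 3, 3])
print(fq_a)
```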
I'd like to submit a query to SOLR/Lucene, plus a list of document IDs. From the query, I'd like the usual top-N scored results, but I'd also like to get the scores for the named documents... no matter how low they are.
Can anyone think of an easy/supported way to do this in a single index scan, where the scores for the 'added' (non-ranking/pinned-for-inclusion) docs are comparable/same-scaled as those for the top-N results? (Patching SOLR with specialized classes would be OK; I figure that's what I may have to do if there's no existing support.)
Or failing that, could it be simulated with a followup query, ideally in a way that the named-document scores could be scaled to be roughly comparable to the top-N for the reference query?
Alternatively -- and perhaps as good or better for my intended use -- could I make a single request against a SOLR/Lucene index which includes M (with M=2 or more) distinct queries, and return the results that are in the top-N for any of the M queries, and for every result include its score against all M of the distinct queries?
(Even in my above formulation, the list of documents that I want scored along with a new query will typically have been the results from a prior query.)
Solutions or even just fragments of possible approaches appreciated!
I am not sure I understand exactly what you want to achieve, but wouldn't a simple
q: (somequery) OR id:(1 OR 2 OR 4)
be enough?
If you want both parts to be boosted on the same scale (I am not sure whether this is already Solr's default behaviour), you would use dismax or edismax and your query would change to something like:
q: (somequery)^10 OR id:(1 OR 2 OR 4)^10
Both the documents named by ID and the regular query results would then be scored the same way.
To self-answer, reporting what I've found since posting...
One clumsy option is the explainOther parameter, which takes another query. (This query could be an OR list of interesting document IDs.) The response will then include a full scoring explanation for documents which match this other query. explainOther only has an effect when combined with the also-required debugQuery parameter.
All that debug/explain information is overkill for the need, but may be useful, or the code paths that implement it might provide a guide to making a hypothetical new more narrowly-focused 'scoreOther' option.
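For illustration, a short Python sketch of the request parameters for the explainOther approach. The helper name `explain_other_params` and the unique-key field name `id` are assumptions.

```python
def explain_other_params(query, doc_ids, id_field="id"):
    """Build params asking Solr to explain named docs against query (hypothetical helper)."""
    id_clause = " OR ".join(str(d) for d in doc_ids)
    return {
        "q": query,
        # explainOther is ignored unless debugQuery is also enabled.
        "debugQuery": "true",
        "explainOther": id_field + ":(" + id_clause + ")",
    }

params = explain_other_params("mustang", [1, 2, 4])
print(params["explainOther"])
```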
Another option would be to make use of a pseudo-field calculated using the query() function to report how any set of results scores on some other query or queries. For example, if the original document set was the top-N from query_A, and those are the exact documents that you also want to score against query_B, you would execute query_A again with a reporting field …&fl=bscore:query({!dismax v="query_B"})&…. The documents' scores against query_B would then be included in the output (as bscore).
Finally, the result-grouping functionality can be used to collect both the top-N for one query and the scores for lesser documents intersecting with other queries in one go. For example, if querying for query_B and adding …&group=true&group.query=query_B&group.query=query_A&…, you'll get back groups that satisfy query_B (ranked by query_B), and that satisfy both query_B and query_A (but again ranked by query_B). This could be mixed with the functional field above to get the scores by another query (like query_A) as well.
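A minimal Python sketch of building that reporting field, following the fl syntax quoted above. The helper name `score_against` is hypothetical, and query_A / query_B stand in for real query strings.

```python
def score_against(alias, other_query, parser="dismax"):
    """Build an fl pseudo-field reporting the score against other_query (hypothetical helper)."""
    return alias + ':query({!' + parser + ' v="' + other_query + '"})'

# Re-run query_A and ask for each hit's score against query_B as "bscore".
params = {
    "q": "query_A",
    "fl": ",".join(["*", "score", score_against("bscore", "query_B")]),
}
print(params["fl"])
```

Each returned document then carries both its normal score (against query_A) and bscore (against query_B).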
However, all groups will share the same sort order (from either the master query or something specified by a group.sort parameter), so it's not currently possible (SOLR-4.0.0-beta) to get several top-N results according to different scorings, just the top-Ns according to one scoring, limited by certain groups. (There's a comment in the source code suggesting alternate sorts per group may be envisioned as a future capability.)
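The grouping request described above could be assembled roughly as follows. This is a sketch only; the helper name `grouped_query_params` is made up, and query_A / query_B are placeholders.

```python
from urllib.parse import urlencode

def grouped_query_params(main_query, group_queries):
    """Build params with one group.query per entry (hypothetical helper)."""
    params = [("q", main_query), ("group", "true")]
    # group.query may be repeated; each one yields its own group in the response.
    params += [("group.query", gq) for gq in group_queries]
    return params

params = grouped_query_params("query_B", ["query_B", "query_A"])
print(urlencode(params))
```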
Is it possible in solr to index key-value pairs for a single document, like:
Document ID: 100
2011-05-01,20
2011-08-23,200
2011-08-30,1000
Document ID: 200
2011-04-23,10
2011-04-24,100
and then querying for documents with a specific value aggregation in a specific time range, i.e. "give me documents with sum(value) > 0 between 2011-08-01 and 2011-09-01" would return the document with id 100 in the example data above.
Here is a post from the Solr User Mailing List where a couple of approaches for dealing with fields as key/value pairs are discussed.
1) encode the "id" and the "label" in the field value; facet on it;
require clients to know how to decode. This works really well for simple
things where the id=>label mappings don't ever change, and are
easy to encode (i.e. "01234:Chris Hostetter"). This is a horrible approach
when id=>label mappings do change with any frequency.
2) have a separate type of "metadata" document, one per "thing" that you
are faceting on, containing fields for id and the label (and probably a
doc_type field so you can tell it apart from your main docs). Then once
you've done your main query and gotten the results back faceted on id,
you can query for those ids to get the corresponding labels. This works
really well if the labels ever change (just reindex the corresponding
metadata document) and has the added bonus that you can store additional
metadata in each of those docs, and in many use cases for presenting an
initial "browse" interface, you can sometimes get away with a cheap
search for all metadata docs (or all metadata docs meeting a certain
criteria) instead of an expensive facet query across all of your main
documents.
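Approach 1 can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes the id itself never contains a colon, since the first colon is used as the separator.

```python
def encode_facet_value(item_id, label):
    """Pack id and label into one field value, e.g. "01234:Chris Hostetter"."""
    return item_id + ":" + label

def decode_facet_value(value):
    """Split on the first colon only, so labels may contain colons."""
    item_id, label = value.split(":", 1)
    return item_id, label

encoded = encode_facet_value("01234", "Chris Hostetter")
decoded = decode_facet_value(encoded)
print(encoded, decoded)
```

Clients that read the facet values need the same decode step, which is why the quoted post warns against this approach when labels change often.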
I want to attach additional information to each indexed document at index time,
and access this information in the same analyzer at query time to compare against it.
So, theoretically, it would be great to write this value into some field of the document and also search that field at query time.
For example, I have an animals database and want to find all documents that contain the word 'dog' 3 times (just an example). For my "animals" field I can set up a custom BaseTokenFilterFactory that produces a custom TokenFilter, which just counts all 'dog' words and stores this number somewhere. So where can I store this value so that I can access it at search time?
Your example sounds like something better handled by a custom Similarity or a function query in Solr than by a custom analyzer.
For example, if you are using Solr 4.0 you can use the function termfreq(field,term) to order by the number of times dog appears, or you can use it as a filter like so:
fq={!frange l=3 u=100000}termfreq(animals,"dog")
This will filter out all documents whose animals field doesn't have at least 3 occurrences of the word dog.
The advantage of this method is that you don't affect the scoring of the documents; you only filter them.
The ability to filter by function has existed since Solr 1.4, so even if you are using a version earlier than 4.0 (but at least 1.4) you can easily write the "termfreq" function query yourself.
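A minimal Python sketch for generating such a filter from application code. The helper name `min_termfreq_filter` is hypothetical. It omits the upper bound `u`, which leaves the range open at the top, whereas the example above uses a large explicit value.

```python
def min_termfreq_filter(field, term, minimum):
    """Build an frange filter keeping docs where term occurs at least minimum times."""
    return '{!frange l=' + str(minimum) + '}termfreq(' + field + ',"' + term + '")'

# Keep only documents whose animals field contains "dog" 3 or more times.
fq = min_termfreq_filter("animals", "dog", 3)
print(fq)
```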