Solr function query filter improvements - solr

I'm trying to filter documents based on all values in a field.
For this I use a function query in a filter query.
To explain:
I have exclusion rules on regions and on countries.
Each document contains the values for which it is excluded.
If exclusion rules exist for the region, do nothing (the region filter is a separate filter query).
If exclusion rules don't exist for the region, use the country.
For this requirement I have the filter queries below.
fq="!{!df=excluded_region v=$user.region}"
fq={!frange l=0 u=0}and(not(docfreq(excluded_region,$user.region)),termfreq(excluded_country,$user.country))
It works fine except when a region is deleted from the index entirely (none of the documents still have that value).
In that case the document frequency does not change, because docfreq still counts deleted documents until their segments are merged.
I know I could resolve this by segment merging, but this is not possible due to the size of the index.
It would also be possible to add filter statements dynamically, but I'd prefer to keep these blocking rules in the appends section of the request handlers.
Is there a better way to write this function query?
Is it possible to do a subquery across all documents to check whether a region exists?
Example(s) of how the data is supposed to work:
DocId | excluded_region | excluded_country
Doc A | A1 | BE
Doc B | A2 | BE
Doc C | A3,A1 | BE
Doc D | A3,A1,A4 | BE
If, for example, the user has country BE and region A5 (which does not exist in any document), nothing is returned.
If he has region A1, document B is the only returned document.
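The intended behaviour can be pinned down with a small in-memory model. This is only a sketch of the rules as described in the question, not a Solr query; the function name `visible_docs` and the dictionary layout are assumptions made for illustration.

```python
# In-memory model of the intended exclusion rules (not a Solr query).
# The data mirrors the example table; the layout is hypothetical.
DOCS = {
    "Doc A": {"excluded_region": {"A1"}, "excluded_country": {"BE"}},
    "Doc B": {"excluded_region": {"A2"}, "excluded_country": {"BE"}},
    "Doc C": {"excluded_region": {"A3", "A1"}, "excluded_country": {"BE"}},
    "Doc D": {"excluded_region": {"A3", "A1", "A4"}, "excluded_country": {"BE"}},
}


def visible_docs(docs, user_region, user_country):
    """Return the ids of the documents the user may see, sorted by id."""
    # Does any live document still carry an exclusion rule for this region?
    region_rule_exists = any(
        user_region in doc["excluded_region"] for doc in docs.values()
    )
    visible = []
    for doc_id, doc in docs.items():
        if user_region in doc["excluded_region"]:
            continue  # region filter: always applied
        if not region_rule_exists and user_country in doc["excluded_country"]:
            continue  # country filter: only when no region rule exists
        visible.append(doc_id)
    return sorted(visible)


print(visible_docs(DOCS, "A5", "BE"))  # region A5 is unknown, so the country rule applies
print(visible_docs(DOCS, "A1", "BE"))  # region A1 has rules, so the country rule is skipped
```

The key point is that `region_rule_exists` is computed over live documents only, which is exactly what the docfreq-based filter fails to do after deletions.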

Related

Accepting and managing identical documents in SOLR

I have a tree structure with documents I'm indexing with Solr. Many documents exist in multiple places with identical content, but some metadata differs. I'd like to keep the duplicates in the index, so it is not de-duplication I'm looking for (or at least I think so). What strategies are available to me if I want to get single hits for the documents that are duplicated, while still keeping the individual documents available?
Folder A |
  Folder A1 |
    Document 1 | Category 1
    Document 2 | Category 1
  Folder A2 |
    Document 1 | Category 2
    Document 2 | Category 2
Document 1 is the same and exists in both Folder A1 and A2. When searching for something in Document 1, I want to be able to find it if I filter out Category 1 (or 2), but without filter, I'd like to get one hit, indicating that it matches multiple categories.
Is it better to approach this when populating the index, or when querying? What options are available?
This is a good case for using Collapse and Expand.
You collapse the result set based on the Document ID of the document, allowing you to only get one result back for each distinct document. You're still able to get all variants of the unique document back (i.e. the different sets of metadata with their categories) by using the Expand functionality.
q=foo&fq={!collapse field=DocumentID}&expand=true
The expand=true parameter turns on the ExpandComponent. The ExpandComponent adds a new section to the search output labeled expanded.
Inside the expanded section there is a map with each group head pointing to the expanded documents that are within the group. As applications iterate the main collapsed result set, they can access the expanded map to retrieve the expanded groups.
You also have the option of using Result Grouping, but if you can make Collapse and Expand work, that's the recommended solution.
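On the client side, walking the collapsed results together with the expanded map could look roughly like this. The sample response is hand-written to follow the shape described above, and the `id` and `category` field names are assumptions; only `DocumentID` comes from the example query.

```python
# Hand-written sample shaped like a collapsed + expanded response.
# All values are made up; "id" and "category" are assumed field names.
sample_response = {
    "response": {"docs": [
        {"id": "a1-doc1", "DocumentID": "doc1", "category": "Category 1"},
        {"id": "a1-doc2", "DocumentID": "doc2", "category": "Category 1"},
    ]},
    "expanded": {
        "doc1": {"docs": [
            {"id": "a2-doc1", "DocumentID": "doc1", "category": "Category 2"},
        ]},
        "doc2": {"docs": [
            {"id": "a2-doc2", "DocumentID": "doc2", "category": "Category 2"},
        ]},
    },
}


def categories_per_hit(response, collapse_field="DocumentID"):
    """Map each collapsed hit to the categories of all of its variants."""
    expanded = response.get("expanded", {})
    result = {}
    for head in response["response"]["docs"]:
        key = head[collapse_field]
        # The group head is in the main result; the other variants are in "expanded".
        members = [head] + expanded.get(key, {}).get("docs", [])
        result[key] = [doc["category"] for doc in members]
    return result


print(categories_per_hit(sample_response))
```

This gives one hit per distinct document while still showing that it matches multiple categories.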

Mongo DB Query to check if document array field element present in more than one document

I have been searching through the MongoDB query syntax with various combinations of terms to see if I can find the right syntax for the type of query I want to create.
We have a collection containing documents with an array field. This array field contains ids of items associated with the document.
I want to be able to check if an item has been associated more than once. If it has then more than one document will have the id element present in its array field.
I don't know in advance the id(s) to check for as I don't know which items are associated more than once. I am trying to detect this. It would be comparatively straightforward to query for all documents with a specific value in their array field.
What I need is some query that can return all the documents where one of the elements of its array field is also present in the array field of a different document.
I don't know how to do this. In SQL it might have been possible with subqueries. In Mongo Query Language I don't know how to do this or even if it can be done.
In MongoDB 3.6 you can use $lookup to self-join the collection and output a document when there is a match, then use $project with exclusion to drop the joined field.
The $group stage collects the joined results with $push, and the $match stage keeps only documents whose joined array is not equal to [].
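db.col.aggregate([
  // One output document per array element
  {"$unwind": "$array"},
  // Self-join: find documents whose array contains the same element
  {"$lookup": {
    "from": "col",
    "localField": "array",
    "foreignField": "array",
    "as": "jarray"
  }},
  // Reassemble one document per _id
  {"$group": {
    "_id": "$_id",
    "fieldOne": {"$first": "$fieldOne"},
    // ... other fields
    "jarray": {"$push": "$jarray"}
  }},
  {"$match": {"jarray": {"$ne": []}}},
  // Drop the helper field from the output
  {"$project": {"jarray": 0}}
])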
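To make the goal concrete, here is a plain-Python model of what the query is meant to return. It is not a MongoDB query, and the function name and sample data are hypothetical. Note that it counts distinct owning documents per item, so a document's match with itself (which a self-join always produces) is not treated as a duplicate.

```python
from collections import defaultdict


def docs_sharing_items(docs, array_field="array"):
    """Return the _ids of documents that share at least one array element
    with a different document."""
    # Map each item id to the set of documents that reference it.
    owners = defaultdict(set)
    for doc in docs:
        for item in doc.get(array_field, []):
            owners[item].add(doc["_id"])
    # Items referenced by more than one distinct document.
    shared = {item for item, ids in owners.items() if len(ids) > 1}
    return [
        doc["_id"] for doc in docs
        if shared.intersection(doc.get(array_field, []))
    ]


sample = [
    {"_id": 1, "array": ["x", "y"]},
    {"_id": 2, "array": ["y", "z"]},
    {"_id": 3, "array": ["w"]},
]
print(docs_sharing_items(sample))  # documents 1 and 2 share "y"
```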

Filter on fields only if present on a document

Is it possible to filter a document by the value provided only if the document has the field?
For context,
I have document types A,B,C that have the field.
I also have document types D and E that don't.
I could define a query such that the filter only applies to the first subset, but I might later add a new document type to the first set which will invalidate this filter.
You'll have to combine the query with a match against all documents except those that have a value in the field:
myfield:foobar OR (*:* NOT myfield:*)
... should do what you want. That being said, I'd probably wait to introduce this additional clause until I actually see that it's needed, as it makes each query more expensive and may never turn out to be necessary, but that's up to your judgement.
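The logic of that query can be written as a simple predicate. This is a sketch of the semantics only, evaluated over plain dictionaries rather than a Solr index; the `type` field and sample documents are made up for illustration.

```python
def matches(doc, field, value):
    """Mirror of: field:value OR (*:* NOT field:*)."""
    # Documents without the field pass; documents with it must match the value.
    return field not in doc or doc[field] == value


docs = [
    {"type": "A", "myfield": "foobar"},   # has the field, matches
    {"type": "B", "myfield": "other"},    # has the field, does not match
    {"type": "D"},                        # lacks the field, passes through
]
print([d["type"] for d in docs if matches(d, "myfield", "foobar")])
```

Because the predicate keys on whether the field is present rather than on the document type, a new document type that carries the field is filtered automatically.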

Can SOLR/Lucene report calculated score of extra named documents, even if they're not in top N results?

I'd like to submit a query to SOLR/Lucene, plus a list of document IDs. From the query, I'd like the usual top-N scored results, but I'd also like to get the scores for the named documents... no matter how low they are.
Can anyone think of an easy/supported way to do this in a single index scan, where the scores for the 'added' (non-ranking/pinned-for-inclusion) docs are comparable/same-scaled as those for the top-N results? (Patching SOLR with specialized classes would be OK; I figure that's what I may have to do if there's no existing support.)
Or failing that, could it be simulated with a followup query, ideally in a way that the named-document scores could be scaled to be roughly comparable to the top-N for the reference query?
Alternatively -- and perhaps as good or better for my intended use -- could I make a single request against a SOLR/Lucene index which includes M (with M=2 or more) distinct queries, and return the results that are in the top-N for any of the M queries, and for every result include its score against all M of the distinct queries?
(Even in my above formulation, the list of documents that I want scored along with a new query will typically have been the results from a prior query.)
Solutions or even just fragments of possible approaches appreciated!
I am not sure if I understand properly what you want to achieve but wouldn't a simple
q: (somequery) OR id: (1 OR 2 OR 4)
be enough?
If you would want both parts to be boosted by the same scale (I am not sure if this isn't the default behaviour of Solr) you would want to use dismax or edismax and your query would change to something like:
q: (somequery)^10 OR id: (1 OR 2 OR 4)^10
You would then have both the elements defined by the IDs and the query results scored the same way.
To self-answer, reporting what I've found since posting...
One clumsy option is the explainOther parameter, which takes another query. (This query could be an OR list of interesting document IDs.) The response will then include a full scoring explanation for documents which match this other query. explainOther only has an effect when combined with the also-required debugQuery parameter.
All that debug/explain information is overkill for the need, but may be useful, or the code paths that implement it might provide a guide to making a hypothetical new more narrowly-focused 'scoreOther' option.
Another option would be to make use of a pseudo-field calculated using the query() function to report how any set of results scores on some other query or queries. So if, for example, the original document set was the top-N from query_A, and those are the exact documents that you also want to score against query_B, you would execute query_A again with a reporting field …&fl=bscore:query({!dismax v="query_B"})&…. Then the documents' scores against query_B would be included in the output (as bscore).
Finally, the result-grouping functionality can be used both to collect the top-N for one query and scores for lesser documents intersecting with other queries in one go. For example, if querying for query_B and adding …&group=true&group.query=query_B&group.query=query_A&…, you'll get back groups that satisfy query_B (ranked by query_B), and that satisfy both query_B and query_A (but again ranked by query_B). This could be mixed with the functional field above to get the scores by another query (like query_A) as well.
However, all groups will share the same sort order (from either the master query or something specified by a group.sort parameter), so it's not currently possible (SOLR-4.0.0-beta) to get several top-N results according to different scorings, just the top-Ns according to one scoring, limited by certain groups. (There's a comment in the source code suggesting alternate sorts per group may be envisioned as a future capability.)
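As a rough illustration of the query() pseudo-field option, the request parameters could be assembled like this. The helper name `build_params` is hypothetical, and including `id` and `score` in `fl` assumes the unique key field is named `id`; only the `bscore:query({!dismax v="..."})` part comes from the answer above.

```python
from urllib.parse import parse_qs, urlencode


def build_params(main_query, other_query, alias="bscore", rows=10):
    """Build a URL-encoded parameter string that scores the results of
    main_query against other_query via a query() pseudo-field."""
    # Assumption: other_query contains no double quotes; escaping is not handled.
    pseudo_field = f'{alias}:query({{!dismax v="{other_query}"}})'
    params = {
        "q": main_query,
        "fl": f"id,score,{pseudo_field}",  # "id" is an assumed unique key name
        "rows": rows,
    }
    return urlencode(params)


encoded = build_params("query_A", "query_B")
print(encoded)
print(parse_qs(encoded)["fl"][0])  # the decoded fl value, as Solr would see it
```

URL-encoding matters here because the local-params syntax contains braces, quotes, and spaces that would otherwise be mangled in a raw query string.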

SOLR: Is it possible to index multiple timestamp:value pairs per document?

Is it possible in solr to index key-value pairs for a single document, like:
Document ID: 100
2011-05-01,20
2011-08-23,200
2011-08-30,1000
Document ID: 200
2011-04-23,10
2011-04-24,100
and then query for documents with a specific value aggregation in a specific time range? For example, "give me documents with sum(value) > 0 between 2011-08-01 and 2011-09-01" would return the document with id 100 in the example data above.
Here is a post from the Solr User Mailing List where a couple of approaches for dealing with fields as key/value pairs are discussed.
1) encode the "id" and the "label" in the field value; facet on it;
require clients to know how to decode. This works really well for simple
things where the id=>label mappings don't ever change, and are
easy to encode (ie "01234:Chris Hostetter"). This is a horrible approach
when id=>label mappings do change with any frequency.
2) have a separate type of "metadata" document, one per "thing" that you
are faceting on containing fields for id and the label (and probably a
doc_type field so you can tell it apart from your main docs) then once
you've done your main query and gotten the results back faceted on id,
you can query for those ids to get the corresponding labels. this works
really well if the labels ever change (just reindex the corresponding
metadata document) and has the added bonus that you can store additional
metadata in each of those docs, and in many use cases for presenting an
initial "browse" interface, you can sometimes get away with a cheap
search for all metadata docs (or all metadata docs meeting a certain
criteria) instead of an expensive facet query across all of your main
documents.
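Applied to the question's data, approach 1 might look like the sketch below: each pair is stored as one encoded string in a multi-valued field, and the client decodes and aggregates. Carrying the approach over to date,value pairs is an assumption, the function names are hypothetical, and the aggregation here runs on the client rather than inside Solr.

```python
from datetime import date

# Encoded "date,value" strings per document id, taken from the question.
DOCS = {
    100: ["2011-05-01,20", "2011-08-23,200", "2011-08-30,1000"],
    200: ["2011-04-23,10", "2011-04-24,100"],
}


def decode(pair):
    """Split an encoded 'YYYY-MM-DD,value' string into (date, int)."""
    day, value = pair.split(",")
    return date.fromisoformat(day), int(value)


def docs_with_sum_above(docs, start, end, threshold=0):
    """Return ids whose values inside [start, end] sum to more than threshold."""
    hits = []
    for doc_id, pairs in docs.items():
        total = sum(
            value for day, value in map(decode, pairs) if start <= day <= end
        )
        if total > threshold:
            hits.append(doc_id)
    return hits


print(docs_with_sum_above(DOCS, date(2011, 8, 1), date(2011, 9, 1)))
```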
