Accepting and managing identical documents in SOLR - solr

I have a tree structure with documents I'm indexing with Solr. Many documents exist in multiple places with identical content, but some metadata differs. I'd like to keep the duplicates in the index, so it is not de-duplication I'm looking for (or at least I think so). What strategies are available if I want to get a single hit for each duplicated document while still keeping the individual documents available?
Folder A |
    Folder A1 |
        Document 1 | Category 1
        Document 2 | Category 1
    Folder A2 |
        Document 1 | Category 2
        Document 2 | Category 2
Document 1 is the same and exists in both Folder A1 and A2. When searching for something in Document 1, I want to be able to find it if I filter out Category 1 (or 2), but without filter, I'd like to get one hit, indicating that it matches multiple categories.
Is it better to approach this when populating the index, or when querying? What options are available?

This is a good case for using Collapse and Expand.
You collapse the result set based on the Document ID of the document, allowing you to only get one result back for each distinct document. You're still able to get all variants of the unique document back (i.e. the different sets of metadata with their categories) by using the Expand functionality.
q=foo&fq={!collapse field=DocumentID}&expand=true
The expand=true parameter turns on the ExpandComponent. The ExpandComponent adds a new section to the search output labeled expanded.
Inside the expanded section there is a map with each group head pointing to the expanded documents that are within the group. As applications iterate the main collapsed result set, they can access the expanded map to retrieve the expanded groups.
You also have the option of using Result Grouping, but if you can make Collapse and Expand (C&E) work, that's the recommended solution.
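As a rough illustration of how an application might walk the collapsed result set and look up each group in the expanded map, here is a minimal Python sketch. It assumes the JSON response writer, single-valued fields named DocumentID and Category, and a hand-written sample response; the helper names build_collapse_params and merge_expanded are hypothetical and not part of Solr.

```python
from urllib.parse import urlencode


def build_collapse_params(query, collapse_field="DocumentID"):
    """Return a query string for a collapsed search with expansion enabled."""
    return urlencode({
        "q": query,
        "fq": "{!collapse field=%s}" % collapse_field,
        "expand": "true",
        "wt": "json",
    })


def merge_expanded(solr_response, collapse_field="DocumentID",
                   category_field="Category"):
    """Attach the categories of all group members to each group head."""
    expanded = solr_response.get("expanded", {})
    hits = []
    for head in solr_response["response"]["docs"]:
        key = str(head[collapse_field])
        # The expanded section holds the other members of the group.
        members = expanded.get(key, {}).get("docs", [])
        categories = sorted(
            {head[category_field]} | {m[category_field] for m in members}
        )
        hits.append({"id": head[collapse_field], "categories": categories})
    return hits


# Hand-written sample shaped like a Solr JSON response (not real data).
sample = {
    "response": {"numFound": 2, "docs": [
        {"DocumentID": "doc1", "Category": "Category 1"},
        {"DocumentID": "doc2", "Category": "Category 1"},
    ]},
    "expanded": {
        "doc1": {"numFound": 1, "start": 0, "docs": [
            {"DocumentID": "doc1", "Category": "Category 2"}]},
        "doc2": {"numFound": 1, "start": 0, "docs": [
            {"DocumentID": "doc2", "Category": "Category 2"}]},
    },
}

for hit in merge_expanded(sample):
    print(hit["id"], hit["categories"])
```

When the user filters on a single category, the same collapsed query with an extra fq on Category would return only that variant, so no special handling should be needed on the client.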

Related

Solr function query filter improvements

I'm trying to filter documents based on all values in a field.
For this I use a function query in a filter query.
To explain:
I have exclusion rules on regions and on countries.
Each document contains the values for which it is excluded.
If exclusion rules exist on regions, do nothing (the region filter query is a separate one).
If exclusion rules don't exist for region, use the country.
For this requirement I had the filter query below.
fq="!{!df=excluded_region v=$user.region}"
fq={!frange l=0 u=0}and(not(docfreq(excluded_region,$user.region)),termfreq(excluded_country,$user.country))
It works fine except when a region is deleted from the index entirely (no remaining document has that value).
In that case the document frequency returned by docfreq does not change.
I know I could resolve this by segment merging, but this is not possible due to the size of the index.
It would also be possible to add filter statements dynamically, but I'd prefer to keep these blocking rules in the appends section of the request handlers.
Is there a better way to write this function query?
Is it possible to do a subquery across all documents to check whether a region exists?
Example(s) of how the data is supposed to work:
DocId | excluded_region | excluded_country
Doc A | A1              | BE
Doc B | A2              | BE
Doc C | A3,A1           | BE
Doc D | A3,A1,A4        | BE
If, for example, the user has country BE and region A5 (not existing in any document), nothing is returned.
If he has region A1, document B is the only returned document.

Solr dynamic sorting

We have a website on which you can search through a large number of products from different shops. Say we show 5 products per result page and the 10 best matches for a search all have the same score. 8 of the products are from one shop (A), and the other two are from two other shops (B, C).
What we often get is this (each letter indicating a product of that shop):
A
A
A
A
A
---- second result page ----
A
B
A
C
A
but what we want to get is something like this:
A
C
B
A
A
---- second result page ----
A
A
A
A
A
Writing a custom function query seems to be one option:
http://www.solrtutorial.com/custom-solr-functionquery.html
What is the best way to achieve this?
You could group the results by shop using Field Collapsing and display the result either as a group or flattened list (depending on how you want it).
Another trick that I've seen in use to help users see results from multiple groups is to use Facets. You could have a sidebar (or something similar) that does two things:
By default it lets the user know that there are other filter criteria (e.g. shops) in the result. This helps a lot when the result is paginated.
With facets present, it is up to the user to choose whatever criteria she/he wishes to apply, thus relieving you of implementing heavy scenario-based logic.
Read more about faceting here.
Edit:
If you have to use custom sort logic, you could write it down using Functions and use it in the sort when querying Solr. Here is the reference from the docs.
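For the "flattened list" option, one possible approach is to fetch the results grouped by shop and interleave the groups round-robin on the client before paginating. The sketch below is only an illustration of that idea under stated assumptions: the groups are already sorted by score, the product names are made up, and the helper interleave is hypothetical rather than anything Solr provides.

```python
from itertools import zip_longest


def interleave(groups):
    """Flatten per-shop result lists round-robin, keeping order within a shop."""
    missing = object()  # sentinel for shops that have run out of products
    flat = []
    for round_items in zip_longest(*groups, fillvalue=missing):
        flat.extend(item for item in round_items if item is not missing)
    return flat


# Made-up data: shop A has 8 equally scored products, B and C one each.
shop_a = ["A%d" % n for n in range(1, 9)]
shop_b = ["B1"]
shop_c = ["C1"]

flat = interleave([shop_a, shop_b, shop_c])
page_size = 5
pages = [flat[i:i + page_size] for i in range(0, len(flat), page_size)]
for number, page in enumerate(pages, start=1):
    print(number, page)
```

With this data the first page contains products from all three shops and the second page contains only shop A, which is close to the ordering asked for in the question. The trade-off is that deep pagination needs enough documents per group to be fetched up front.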

How does appengine's data store query and index multi-value properties?

Let's say I have a Photo class containing a multi-valued property for tags and a date field.
I would like to allow the user to perform a query based on tags (using only an AND operator for more than one tag).
For example, let's say a user searches for a rainy day.
Select * from Photo where tag='clouds' AND tag='rainy'
How does the zig-zag merge work? I know that two scans are performed, and if the keys from both searches point to the same Photo, then it's returned. Does this happen in parallel, however? For example: while Search 1 finds a photo that contains the tag 'clouds', Search 2 is finding the first photo that contains the tag 'rainy'. When both searches are done, it becomes synchronous. Search 1 then continues its scan until it hits the same key as Search 2. Then, while the keys for each search are the same, the photo is returned, and the "cursor" is moved along one step for each search?
Secondly, does defining multiple indexes speed up this sort of query? For example, if I wanted to allow up to 4 tags, would I need to define indexes such as:
Index(Photo)
Index(Photo, tag)
Index(Photo, tag,tag)
Index(Photo, tag,tag,tag)
Index(Photo, tag,tag,tag,tag)
Then, will performing the same query above be quicker?
Also, using our original query, let's say we have millions of photos tagged 'clouds', but only two tagged 'rainy'. Does this mean zig-zag will perform relatively slowly, since one of the searches will keep trying to find a match? Even worse, what if we have one million photos tagged 'rainy' and one million tagged 'clouds', yet no single photo has both tags? Will defining the above indexes fix this issue?
Lastly, let's say a photo has 100 tags. Does that mean all the indexes above have to include EVERY combination of the 100 tags?
I know there are gotchas (such as an entity can only be indexed 5000 times, and a single multi-valued property can only be indexed 1000 times).
How does the zig-zag merge work?
You can check out the Google I/O video from 2009 on Building Scalable, Complex Apps on App Engine. Brett Slatkin explains how zig-zag merge works starting at 27 minutes. As he says, "I can't really explain it without showing how it works."
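As a rough intuition for the general idea, a zig-zag merge join can be sketched as two sorted key scans that leapfrog each other: whichever scan is behind seeks forward to the other scan's current key. The Python below is a simplified, sequential illustration over in-memory lists with made-up integer keys; it is an assumption-laden sketch and not a description of how the datastore implements it internally.

```python
from bisect import bisect_left


def zigzag_merge(scan_a, scan_b):
    """Intersect two sorted key lists by letting each scan jump past the other."""
    matches = []
    i = j = 0
    while i < len(scan_a) and j < len(scan_b):
        if scan_a[i] == scan_b[j]:
            # Both scans point at the same entity key: it matches both filters.
            matches.append(scan_a[i])
            i += 1
            j += 1
        elif scan_a[i] < scan_b[j]:
            # Scan A is behind: seek to the first key >= scan B's current key.
            i = bisect_left(scan_a, scan_b[j], i)
        else:
            # Scan B is behind: seek to the first key >= scan A's current key.
            j = bisect_left(scan_b, scan_a[i], j)
    return matches


# Made-up entity keys, sorted as they would be within each index scan.
clouds = [1, 2, 3, 5, 8, 13, 21]
rainy = [5, 21, 34]
print(zigzag_merge(clouds, rainy))  # [5, 21]
```

Because each step seeks rather than steps one key at a time, a scan with millions of keys can skip large ranges when the other scan is sparse. The two-disjoint-large-sets case from the question is the harder one, since little can be skipped.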

Solr: Query and return x number of types

I have a large index of files. One of the fields I have is "content_type". This field stores the basic type for a file (i.e. pdf, image, video, document, spreadsheet, etc).
I'm running a search on file names (my "title" field). How can I structure the query so that it returns only a certain number of each type?
For example, say I have 1000 files with the word "work" in the title. I want to search for "work" in the title, but I want 5 results from each "content_type" returned first. (assuming that each specific content_type has 5 or more items). So on my search results page I can say:
1,000 items were found for "work"
Then I start listing the items, 5 for each type.
Can anyone help me build a query that will do this? I'm pretty new to Solr, but I'm hoping this can be done.
It seems you basically want to limit and group the results per content type.
Check out the Solr Field Collapsing and grouping feature.
This will help you group the results per content type using group.field=content_type
The number of results in a group can be limited with group.limit=5
For the complete list of options, refer to the link above.
You can use the normal query parameters to search the results, e.g. q=work
This feature is only available from Solr 3.3 onward.
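To make the parameters above concrete, here is a minimal Python sketch that builds the grouped query string and reads a grouped JSON response. It assumes group=true is sent to enable grouping, the JSON response writer, and a hand-written sample response; the helper names build_group_params and top_per_group are hypothetical.

```python
from urllib.parse import urlencode


def build_group_params(query, field="content_type", per_group=5):
    """Return a query string asking for at most per_group docs per field value."""
    return urlencode({
        "q": query,
        "group": "true",
        "group.field": field,
        "group.limit": per_group,
        "wt": "json",
    })


def top_per_group(solr_response, field="content_type"):
    """Return the total match count and the titles listed under each group."""
    grouped = solr_response["grouped"][field]
    by_type = {
        group["groupValue"]: [doc["title"] for doc in group["doclist"]["docs"]]
        for group in grouped["groups"]
    }
    return grouped["matches"], by_type


# Hand-written sample shaped like a grouped Solr JSON response (not real data).
sample = {
    "grouped": {"content_type": {"matches": 1000, "groups": [
        {"groupValue": "pdf", "doclist": {"numFound": 400, "start": 0, "docs": [
            {"title": "work plan"}, {"title": "work report"}]}},
        {"groupValue": "image", "doclist": {"numFound": 600, "start": 0, "docs": [
            {"title": "work photo"}]}},
    ]}}
}

total, by_type = top_per_group(sample)
print("%d items were found" % total)
for content_type, titles in by_type.items():
    print(content_type, titles)
```

The matches value gives the overall count for the "1,000 items were found" line, while each group's doclist carries the limited set of documents for that content type.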

SOLR: Is it possible to index multiple timestamp:value pairs per document?

Is it possible in solr to index key-value pairs for a single document, like:
Document ID: 100
2011-05-01,20
2011-08-23,200
2011-08-30,1000
Document ID: 200
2011-04-23,10
2011-04-24,100
and then querying for documents with a specific value aggregation in a specific time range, i.e. "give me documents with sum(value) > 0 between 2011-08-01 and 2011-09-01" would return the document with id 100 in the example data above.
Here is a post from the Solr User Mailing List where a couple of approaches for dealing with fields as key/value pairs are discussed.
1) encode the "id" and the "label" in the field value; facet on it;
require clients to know how to decode. This works really well for simple
things where the id=>label mappings don't ever change, and are
easy to encode (ie "01234:Chris Hostetter"). This is a horrible approach
when id=>label mappings do change with any frequency.
2) have a separate type of "metadata" document, one per "thing" that you
are faceting on containing fields for id and the label (and probably a
doc_type field so you can tell it apart from your main docs) then once
you've done your main query and gotten the results back faceted on id,
you can query for those ids to get the corresponding labels. this works
really well if the labels ever change (just reindex the corresponding
metadata document) and has the added bonus that you can store additional
metadata in each of those docs, and in many use cases for presenting an
initial "browse" interface, you can sometimes get away with a cheap
search for all metadata docs (or all metadata docs meeting a certain
criteria) instead of an expensive facet query across all of your main
documents.
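The first approach from the quoted post can be sketched on the client side as follows. This is a minimal Python illustration, assuming the "id:label" encoding from the example above and a facet result in Solr's default flat JSON layout (value, count, value, count, ...); the helper names and the sample counts are made up.

```python
def encode_pair(ident, label):
    """Encode an id and its label into one field value, e.g. '01234:Chris Hostetter'."""
    return "%s:%s" % (ident, label)


def decode_pair(encoded):
    """Split on the first colon only, so labels may themselves contain colons."""
    ident, label = encoded.split(":", 1)
    return ident, label


def decode_facets(flat_facets):
    """Turn a flat [value, count, value, count, ...] list into {id: (label, count)}."""
    decoded = {}
    for value, count in zip(flat_facets[0::2], flat_facets[1::2]):
        ident, label = decode_pair(value)
        decoded[ident] = (label, count)
    return decoded


# Made-up facet counts in the flat layout.
facets = [encode_pair("01234", "Chris Hostetter"), 12, "05678:Some: Label", 3]
print(decode_facets(facets))
```

As the post notes, this only holds up while the id to label mapping stays stable, because every document carrying the encoded value would need reindexing when a label changes.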
