Apache solr dedupe for existing document - Deduplication - solr

I already have an index that was built with the unique key "oldfield" in a cloud environment with multiple shards. The unique key was then changed to a different field, "newfield". Now, when I update a document, it is added instead of being updated, because the new document is routed to a different shard.
This causes duplicates: the index now holds both the old and the new version of the document, stored on different shards.
Dedupe won't work, as it will try to add a new key to the existing document's unique key field "newField".
This is the update processor that I am using for dedupe:
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">newField</str>
<bool name="overwriteDupes">true</bool>
<str name="fields">newField</str>
<str name="signatureClass">solr.processor.TextProfileSignature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
How do I make sure duplicates are not generated for existing documents?

Related

Can I combine result sets in Solr

I want to do the following:
Let A be the set of documents, each with the field important:true, and with a date beginning with this year, or previous year. The result set should be ordered by date. In pseudo code:
Result set A:
q="testquery" +important:true AND +(date:2015* OR date:2016*)
sort=date desc
Then, let B be the remaining set of documents, i.e. those with important:true and a date preceding year 2015, and also all documents with important:false. This set should also be ordered by date. Again in very sloppy pseudo code:
Result set B:
q="testquery" -(date:2015* OR date:2016*)
sort=date desc
Now, I would like to return A followed by B, and be able to use the paging features etc. I am very new to Solr (< 10 hrs of trying out different queries) and I can't figure out how to accomplish this behavior. I guess I cannot use bq since we don't sort by score, right?
An example of the desired outcome:
<result name="response" numFound="2089" start="0">
<doc>
<bool name="important">true</bool>
<str name="date">2016-03-01 00:00:00</str>
</doc>
<doc>
<bool name="important">true</bool>
<str name="date">2015-12-01 00:00:00</str>
</doc>
<doc>
<bool name="important">true</bool>
<str name="date">2015-04-01 00:00:00</str>
</doc>
<doc>
<bool name="important">true</bool>
<str name="date">2015-01-01 00:00:00</str>
</doc>
<doc>
<bool name="important">false</bool>
<str name="date">2016-10-01 00:00:00</str>
</doc>
<doc>
<bool name="important">false</bool>
<str name="date">2015-03-01 00:00:00</str>
</doc>
<doc>
<bool name="important">false</bool>
<str name="date">2014-02-01 00:00:00</str>
</doc>
<doc>
<bool name="important">true</bool>
<str name="date">2014-09-01 00:00:00</str>
</doc>
<doc>
<bool name="important">false</bool>
<str name="date">2013-05-01 00:00:00</str>
</doc>
<doc>
<str name="date">2012-09-01 00:00:00</str>
</doc>
</result>
</response>
Notice in the example above that, for documents older than 2015, the documents marked important are no more important than any others; they appear in strict chronological order.
Any help is appreciated, but I would especially love examples using SolrNet syntax :)
EDIT:
I cannot make any changes to the index or schema...
Try a query like this:
((important: true AND (date:2016* OR date:2015*))^1001 OR (important: false AND (date:2016* OR date:2015*))^1000 OR date:*) AND something:"foo"
with the sort score desc, date desc.
This will show recent important items first, then recent non-important items, and finally all items, and everything sorted by date in their 'sections'.
something:"foo" at the end of the clause refers to any extra clauses you might have.
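As a rough illustration of how the boosted query above could be combined with paging, here is a minimal Python sketch that only builds the request parameters. The function name boosted_params and its page and page_size arguments are hypothetical; start and rows are the standard Solr paging parameters. Note that the date tie-break in the sort only applies among documents whose scores are exactly equal, which this sketch does not verify.

```python
from urllib.parse import urlencode


def boosted_params(extra_clause, page=0, page_size=10, recent_years=("2016", "2015")):
    """Build select parameters for the boosted query (hypothetical helper)."""
    # Prefix clauses such as date:2016* OR date:2015*, as in the query above.
    recent = " OR ".join(f"date:{year}*" for year in recent_years)
    q = (
        f"((important:true AND ({recent}))^1001"
        f" OR (important:false AND ({recent}))^1000"
        f" OR date:*) AND {extra_clause}"
    )
    return {
        "q": q,
        "sort": "score desc, date desc",
        "start": page * page_size,  # offset of the first document on this page
        "rows": page_size,
    }


if __name__ == "__main__":
    # Second page of ten results, URL-encoded for a /select request.
    print(urlencode(boosted_params('something:"foo"', page=1)))
```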
The main challenge here, I feel, is sorting by date. Without that, you could easily boost your special privilege query to be at the front. But sorting by date afterwards would reset this and you would be back where you started.
It is possible however to sort by several fields. So, if your special condition could be encoded as a field value during indexing, you could sort by that first, then by date.
If that's not possible to do during the indexing, you may need to add a second trick. It is possible to sort by a function query instead of a field. So, you would need to build a function query expression (probably using if and ms at least) that represents your boost condition.
You may have some challenges representing your 2015/2016 as a condition. If it is a date, you may be able to use date math to create a consistent round-down to a year (NOW/YEAR).
I would start by solving a simpler problem: just pushing the important items to the top, still sorted by date, to test that the logic here works. If and once that works with functions, sort and paging, the special dates can be added into the calculation.
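The function-query idea can be sketched as follows, again only building request parameters in Python. This is one possible expression, not necessarily the one the answer had in mind: it uses the Solr functions if, exists and query rather than ms, because the date field in the question appears to be stored as a string. The parameter name section and the helper name are hypothetical, and whether the $section dereference works inside sort on your Solr version is an assumption to test.

```python
from urllib.parse import urlencode


def function_sort_params(user_query, section, page=0, page_size=10):
    """Sort by a 0/1 flag computed at query time, then by date (hypothetical helper)."""
    return {
        "q": user_query,
        # Sub-query describing the privileged documents; referenced as $section.
        "section": section,
        # Flag is 1 for documents matching $section, 0 otherwise.
        "sort": "if(exists(query($section)),1,0) desc, date desc",
        "start": page * page_size,
        "rows": page_size,
    }


if __name__ == "__main__":
    # Simpler problem first: all important items on top, still sorted by date.
    simple = function_sort_params("testquery", "important:true")
    # Then add the special dates to the condition.
    full = function_sort_params(
        "testquery", "important:true AND (date:2015* OR date:2016*)"
    )
    print(urlencode(simple))
    print(urlencode(full))
```

Because the flag takes only two values, every document inside a section ties on the first sort key, so the date ordering within each section does not depend on relevance scores.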

Solr: Facet one field with two outputs

I'm using Solr for indexing products and organising them into several categories. Each document has a taxon_names multi value field, where the categories are stored as human readable strings for a product.
Now I want to fetch all the categories from Solr and display them with clickable links to the user, without hitting the database again. At index time, I get the permalinks for every category from the MySQL database, which is stored as a multi value field taxon_permalinks. For generating the links to the products, I need the human readable format of the category and its permalink (otherwise you would have such ugly URLs in your browser, when just using the plain human readable name of the category, e.g. %20 for space).
When I do a facet search with http://localhost:8982/solr/default/select?q=*%3A*&rows=0&wt=xml&facet=true&facet.field=taxon_names, I get a list of human readable taxons with their counts. Based on this list, I want to create the links, so that I don't have to hit the database again.
So, is it possible to retrieve the matching permalinks from Solr for the different categories? For example, I get an XML like this:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<result name="response" numFound="6580" start="0"/>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="taxon_names">
<int name="Books">2831</int>
<int name="Music">984</int>
...
</lst>
</lst>
</lst>
</response>
And inside the taxon_names array I would need the name of the permalink.
Maybe it's possible by defining a custom field type in the config XMLs. But for this, I don't have enough experience with Solr.
From your description, it appears that you store the permalinks in the taxon_permalink field, and that the values in that field correspond to the category names in the taxon_names field. Solr allows you to facet on multiple fields, so you can facet on both fields and walk the two facet results, taking the display name from the taxon_names facet values and the permalink from the taxon_permalink facet values.
Query:
http://localhost:8982/solr/default/select?q=*%3A*&rows=0&wt=xml
&facet=true&facet.field=taxon_names&facet.field=taxon_permalink
Your output should then look similar to the following:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<result name="response" numFound="6580" start="0"/>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="taxon_names">
<int name="Books">2831</int>
<int name="Music">984</int>
...
</lst>
<lst name="taxon_permalink">
<int name="permalink1">2831</int>
<int name="permalink2">984</int>
...
</lst>
</lst>
</lst>
</response>
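Walking the two facet results could look roughly like the Python sketch below. It assumes the request was made with wt=json, where each entry of facet_fields is by default a flat list alternating value and count. The helper name pair_facets is hypothetical, and the sample values are the placeholders from the output above. Pairing by position assumes both facets come back in the same order; categories with equal counts may be ordered differently in the two lists, so the count check here is only a partial safeguard.

```python
def pair_facets(facet_fields, name_field="taxon_names", link_field="taxon_permalink"):
    """Pair display names with permalinks by position (hypothetical helper)."""

    def as_pairs(flat):
        # Turn [value, count, value, count, ...] into [(value, count), ...].
        return list(zip(flat[0::2], flat[1::2]))

    names = as_pairs(facet_fields[name_field])
    links = as_pairs(facet_fields[link_field])
    if len(names) != len(links):
        raise ValueError("facet lists differ in length")

    paired = []
    for (name, name_count), (link, link_count) in zip(names, links):
        if name_count != link_count:
            # Differing counts mean the two lists are not aligned.
            raise ValueError(f"count mismatch for {name!r} and {link!r}")
        paired.append({"name": name, "permalink": link, "count": name_count})
    return paired


if __name__ == "__main__":
    sample = {
        "taxon_names": ["Books", 2831, "Music", 984],
        "taxon_permalink": ["permalink1", 2831, "permalink2", 984],
    }
    for entry in pair_facets(sample):
        print(entry)
```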

Solr Facet Search-Spell check

I'm using Solr facet search on a column of a database. It successfully returns the data:
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="tags">
<int name="lol">58</int>
<int name="scienc">58</int>
<int name="photo">34</int>
<int name="axiom">27</int>
<int name="geniu">14</int>
</lst>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
</lst>
I want to make sure that only complete words are counted. In the above example you can see counts for 'scienc' and 'geniu' that should be for 'science' and 'genius'. How can I achieve this? Can I incorporate a spell-checking feature?
This probably has to do with the underlying fieldType that you have associated with your tags field. The field value is most likely being stemmed or having other analyzers associated with it. I would suggest one of two things:
Remove the stemming and/or other processing to prevent the words from appearing as partial.
(Recommended) Create a separate field tags_facet with fieldType="string" in your schema.xml and use a copyField directive to copy the values fed into your original tags field. Then facet on this new tags_facet field.
Use the copyField feature of Solr to copy the original field to one with a string fieldType. If the values are a set of words, instead of string, you could use a whitespace tokenised fieldtype (without ngrams of course.)

Solr More Like This (MLT) using a different unique identifier than the default one id

I'm trying to use MLT, but my unique identifier is doc_id instead of id. If I do this:
http://localhost:8983/solr/mlt/?q=doc_id:question#11
I get no results, whereas if I do this:
http://localhost:8983/solr/mlt/?q=id:11
I do get results. My request handler is configured as follows:
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
<lst name="defaults">
<str name="mlt.fl">title,text</str>
<str name="mlt.mintf">1</str>
<str name="mlt.mindf">2</str>
<str name="mlt.minwl">2</str>
<str name="mlt.boost">true</str>
<int name="rows">5</int>
<str name="fl">id,doc_id,title,content_type,user_id,topic_id,score</str>
</lst>
</requestHandler>
How can I use MLT with doc_id as my unique identifier?
What you have looks fine. MLT just uses the query to find a document and, if one is found, uses it as the source document. Are you sure a document is returned for the query "doc_id:question#11"? Put the value in quotes and see if that gets you the document back, e.g. doc_id:"question#11". What is the data type of doc_id?
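One thing worth checking alongside the quoting suggestion: in a URL, an unencoded # starts the fragment, which clients normally do not send to the server. The Python sketch below shows how the URL from the question is split, and how the quoted value could be percent-encoded. Whether this is what actually causes the empty result is an assumption; the helper name mlt_url is hypothetical.

```python
from urllib.parse import urlencode, urlsplit

# The URL from the question: everything after '#' is parsed as a fragment.
raw_url = "http://localhost:8983/solr/mlt/?q=doc_id:question#11"
parts = urlsplit(raw_url)
print(parts.query)     # q=doc_id:question
print(parts.fragment)  # 11


def mlt_url(doc_id, base="http://localhost:8983/solr/mlt/"):
    """Build an MLT URL with the doc_id value quoted and percent-encoded."""
    query = urlencode({"q": f'doc_id:"{doc_id}"'})
    return f"{base}?{query}"


print(mlt_url("question#11"))
# http://localhost:8983/solr/mlt/?q=doc_id%3A%22question%2311%22
```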

How do I detect "ERROR:SCHEMA-INDEX-MISMATCH" in Solr?

How do I find documents in my index that have a SCHEMA-INDEX-MISMATCH? I have a number of these, and so far I am finding them by trial and error. I want to query for them.
The results that I get have "ERROR:SCHEMA-INDEX-MISMATCH" in a field. An example:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<result name="response" numFound="1" start="0" maxScore="12.993319">
<doc>
<float name="score">12.993319</float>
<str name="articleId">ERROR:SCHEMA-INDEX-MISMATCH,stringValue=555</str>
<str name="articleType">Knowledge Base</str>
<str name="description">Moving to another drive Question: How can I ....</str>
<str name="id">article:555</str>
<str name="title">Moving to another drive</str>
<str name="type">article</str>
</doc>
</result>
</response>
If it matters, my query is along the lines of http://server/solr/select?q=id:%22article:555%22
What is the "type" of articleId?
I had issues with a date field: due to a defect in the indexing program, some values came back as "ERROR:SCHEMA-INDEX-MISMATCH". Since these values are outside the bounds of a normal date, I was able to find them with the query "NOT myDateFieldType:[0001-01-01T00:00:00Z TO NOW]".
If you are able to craft this type of query, depending on your data type, you should be able to find these values.
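A minimal Python sketch of crafting such a query for an arbitrary field follows. The helper name and its default bounds are hypothetical; the bounds are taken from the date example above and would need adjusting for other data types. The leading *:* is added so the negation has a set of documents to subtract from, which also means documents that lack the field entirely will match.

```python
from urllib.parse import urlencode


def out_of_range_query(field, low="0001-01-01T00:00:00Z", high="NOW"):
    """Match documents whose value for `field` is outside [low TO high]."""
    return f"*:* NOT {field}:[{low} TO {high}]"


if __name__ == "__main__":
    q = out_of_range_query("myDateFieldType")
    # Ask only for the id field to keep the response small.
    print(urlencode({"q": q, "fl": "id", "rows": 100}))
```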
