I have set up Solr deduplication with the following settings in solrconfig.xml:
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">signature</str>
<bool name="overwriteDupes">true</bool>
<str name="fields">description</str>
<str name="signatureClass">solr.processor.TextProfileSignature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
and in schema.xml
<field name="signature" type="string" stored="true" indexed="true" multiValued="false" />
My objective is to find documents with duplicate descriptions (I used TextProfileSignature for near-duplicate detection), keep one entry, and remove the other duplicates.
For example:
doc1
description: Websol – Candidate should be good in communication and computer skills
must be willing to relocate
We have good vacancies for Back Office in international call centers
doc2
description: Websol – Candidate should be good in communication and computer skills
must be willing to relocate
We have good vacancies for Back Office in international call centers...
From these two docs, only one should be deleted, not both, but with Solr dedupe both entries get deleted.
Let me know if I am missing anything in the settings, or if I need to follow another approach to achieve this.
You could be suffering from a known issue.
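One workaround worth trying (a sketch only, and it assumes a full re-index is possible): instead of relying on overwriteDupes, write the signature into the uniqueKey field itself, so that a near-duplicate simply overwrites the earlier copy on add rather than triggering dupe deletion. Here "id" as the uniqueKey is an assumption:

```xml
<processor class="solr.processor.SignatureUpdateProcessorFactory">
  <bool name="enabled">true</bool>
  <!-- write the signature into the uniqueKey field ("id" is assumed to be the uniqueKey) -->
  <str name="signatureField">id</str>
  <bool name="overwriteDupes">false</bool>
  <str name="fields">description</str>
  <str name="signatureClass">solr.processor.TextProfileSignature</str>
</processor>
```

With this arrangement, two documents whose descriptions hash to the same signature share a uniqueKey, so the later add replaces the earlier one and exactly one copy survives.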
I'm experiencing a problem when I try to use context filtering with the auto-suggester. What I want is to filter the suggestions based on the url field.
Here is my searchComponent:
<lst name="suggester">
<str name="name">AnalyzingInfixSuggester</str>
<str name="lookupImpl">AnalyzingInfixLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">main_title</str>
<str name="weightField">main_title</str>
<str name="contextField">url</str>
<str name="suggestAnalyzerFieldType">text_general</str>
</lst>
Here are the fields in my schema:
<field name="main_title" type="string" indexed="true" stored="true"/>
<field name="url" type="string" indexed="true" stored="true"/>
Example:
I'm searching for "aacsb" and I get two results, which is correct. One is in English and one is in German. I want to filter out the English result and show only the German one.
My URLs look like this:
https://www.myWebsite.com/aacsb-dog-lion?german
https://www.myWebsite.com/aacsb-dog-lion?english
Here are my queries:
http://localhost:8983/solr/myCore/suggest?&q=aacsb&suggest.dictionary=AnalyzingInfixSuggester&suggest.cfq=-url:english
http://localhost:8983/solr/myCore/suggest?&q=aacsb&suggest.dictionary=AnalyzingInfixSuggester&suggest.cfq=-english
With these I receive both results; it doesn't matter whether the field name is included or not.
When I tried these
http://localhost:8983/solr/myCore/suggest?&q=aacsb&suggest.dictionary=AnalyzingInfixSuggester&suggest.cfq=url:english
http://localhost:8983/solr/myCore/suggest?&q=aacsb&suggest.dictionary=AnalyzingInfixSuggester&suggest.cfq=english
I don't receive any results.
I read the documentation several times (LINK), but I still can't make it work.
Any help is welcomed.
Thanks!
EDIT:
I pasted the wrong queries; these are the correct ones:
http://localhost:8983/solr/myCore/suggest?&q=aacsb&suggest.dictionary=AnalyzingInfixSuggester&suggest.cfq=url:\*english\*
http://localhost:8983/solr/myCore/suggest?&q=aacsb&suggest.dictionary=AnalyzingInfixSuggester&suggest.cfq=\*english\*
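One possible explanation (a sketch, not verified against this setup): with url defined as a string field, each context value is the entire URL as a single term, so neither english nor a wildcard on it will match the way a normal text query would. A workaround is to index a dedicated context field holding just the language; the lang field name below is hypothetical:

```xml
<!-- schema.xml: a dedicated context field (hypothetical name "lang",
     populated with values like "english" or "german" at index time) -->
<field name="lang" type="string" indexed="true" stored="true"/>
```

and in the suggester's searchComponent:

```xml
<str name="contextField">lang</str>
```

After rebuilding the suggester, a filter like suggest.cfq=german would then match the context value exactly.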
I have encountered a very strange behaviour: to test the classification function in Solr, I have defined the following processor chain:
<updateRequestProcessorChain name="classification">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">id</str>
</processor>
<processor class="solr.ClassificationUpdateProcessorFactory">
<str name="inputFields">content</str>
<str name="classField">cat_knn</str>
<str name="predictedClassField.maxCount">2</str>
<str name="algorithm">knn</str>
<str name="knn.k">10</str>
<str name="knn.minTf">1</str>
<str name="knn.minDf">1</str>
</processor>
<processor class="solr.ClassificationUpdateProcessorFactory">
<str name="inputFields">content</str>
<str name="classField">cat_bayes</str>
<str name="predictedClassField.maxCount">2</str>
<str name="algorithm">bayes</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
As a test set I am using news categories, such as "business", "entertainment" etc.
The relevant fields are defined as follows:
<field name="cat_knn" type="text_en" indexed="true" stored="true" multiValued="true"/>
<field name="cat_bayes" type="text_en" indexed="true" stored="true" multiValued="true"/>
For the training set cat_knn and cat_bayes contain exactly the same category labels.
However, if I use the above chain to classify new documents, the cat_knn fields for new documents are labeled with the full label, i.e. "business" or "entertainment", whereas for the Bayes algorithm the labels are truncated and come out as "busi" or "entertain". At the same time, a label like "sport" is properly recorded as "sport".
Any idea what might be going on here?
What you are seeing are the stemmed tokens for the field. The SolrClassification wiki page specifies that:
The field that contains the class of the document. It must appear in the indexed documents. If knn algorithm it must be stored. If bayes algorithm it must be indexed and ideally not heavily analysed.
This indicates that bayes uses the actual tokens, while knn uses the stored text for the field when outputting the class.
Change the field type to string or strings (single-valued vs. multivalued), or to a text field with minimal analysis (perhaps a KeywordTokenizer with only a LowerCaseFilter, or similar).
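A minimal-analysis field type along those lines might look like this (a sketch; the type name class_text is ours):

```xml
<!-- a text type that keeps each value as a single token, only lowercased,
     so stemming can no longer truncate the class labels -->
<fieldType name="class_text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="cat_knn" type="class_text" indexed="true" stored="true" multiValued="true"/>
<field name="cat_bayes" type="class_text" indexed="true" stored="true" multiValued="true"/>
```

A re-index is needed after the change, since the stemmed tokens are already in the index.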
I have integrated Nutch 2.3.1 with Solr 6.5; with this I could push data to Solr and have it indexed. Now I want to remove duplicate elements, and for this I made the following modifications in schema.xml and solrconfig.xml:
<field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" />
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">id</str>
<bool name="overwriteDupes">false</bool>
<str name="fields">id,content,date,url</str> <!-- changing to id <str name="fields">name,features,cat</str>-->
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update" class="solr.UpdateRequestHandler" >
<lst name="defaults">
<str name="update.chain">dedupe</str>
</lst>
</requestHandler>
But after running the indexing command bin/nutch solrindex http://localhost:8983/solr/testcore -all I get an error.
Please help me sort out this issue.
Thanks in advance :)
This issue might be related to the schema update: if some data already exists in Solr and you update the schema while that data is still in the core, Nutch will treat it as a schema mismatch. The best way to fix this is to re-crawl the webpage with the updated schema, and keep in mind that any update to the schema will probably cause issues with your existing index.
Since this post is already old, this is for future reference for people who might hit the same issue.
Best :)
I am using Apache Nutch 2.3 (the latest version). I have crawled about 49,000 documents with Nutch. From MIME analysis, the crawled data contains about 45,000 text/html documents. But when I look at the indexed documents in Solr (4.10.3), only about 14,000 documents are indexed. Why this huge difference (45000 - 14000 = 31000)? Even if I assume that Nutch only indexes text/html documents, at least 45,000 documents should be indexed.
What is the problem, and how do I solve it?
In my case this problem was due to missing Solr indexer information in nutch-site.xml. When I updated the config, the problem was resolved. Please check your crawler log at the indexing step; in my case it reported that no Solr indexer plugin was found.
The following property was added to nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|protocol-http|indexer-solr|urlfilter-regex|parse-(html|tika)|index-(basic|more)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
<description>plugin details here </description>
</property>
You should look at your Solr logs to see if there's anything about "duplicate" documents, or just look in the solrconfig.xml file for the core into which you are pushing documents. There is likely a "dedupe" chain being invoked on the update handler, and the fields it uses may be causing documents that match on those fields to be dropped as duplicates. You'll see something like this:
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="update.chain">dedupe</str> <!-- change "dedupe" to "uuid", or comment out this line -->
<str name="config">dih-config.xml</str>
</lst>
</requestHandler>
and later in the file the definition of the dedupe update.chain,
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">id</str>
<bool name="overwriteDupes">true</bool>
<str name="fields">url,date,rawline</str> <!-- these fields determine uniqueness -->
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
The "fields" element is what will select which input data is used to determine the uniqueness of the record. Of course, if you know there's no duplication in your input data, this is not the issue. But the above configuration will throw out any records which are duplicate on the fields shown.
You may not be using the dataimport requestHandler but rather the "update" requestHandler; I'm not sure which one Nutch uses. Either way, you can simply comment out the update.chain, change it to a different processor chain such as "uuid", or add more fields to the "fields" declaration.
I have a problem with automatic UUID generation in Solr. I want Solr to automatically generate UUIDs for the data imported by the DataImportHandler.
Here's what I did:
In schema.xml
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
<field name="id" type="uuid" indexed="true" stored="true" required="true" multiValued="false" />
In solrconfig.xml
I added:
<updateRequestProcessorChain name="uuid">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">id</str>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
I modified:
<requestHandler name="/update" class="solr.UpdateRequestHandler">
<!-- See below for information on defining
updateRequestProcessorChains that can be used by name
on each Update Request
-->
<!--
<lst name="defaults">
<str name="update.chain">dedupe</str>
</lst>
-->
<lst name="defaults">
<str name="update.chain">uuid</str>
</lst>
Also, I did not comment out or remove the uniqueKey, and I removed everything about QueryElevation.
But I keep getting this error, and I have no idea where it comes from:
org.apache.solr.common.SolrException: Invalid UUID String: '1'
at org.apache.solr.schema.UUIDField.toInternal(UUIDField.java:89)
at org.apache.solr.schema.FieldType.readableToIndexed(FieldType.java:393)
at org.apache.solr.schema.FieldType.readableToIndexed(FieldType.java:398)
at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:98)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:717)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:557)
at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:71)
at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:235)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:512)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:416)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:331)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:239)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:464)
BTW, I am using Solr 4.8. Thanks very much for any reply; I really appreciate your help!
My guess is that a field with that name is coming in from DIH, and the UUID URP does not override a value if one is already present.
Try adding IgnoreFieldUpdateProcessorFactory in front and see if the problem goes away. If it does, you can start figuring out where DIH is picking the value up from. For example, if you are getting data from a database and use SELECT *, DIH will automatically try to map any fields whose names are identical to those in your schema.
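The suggestion above could be sketched like this (assuming your chain and field names from the question):

```xml
<updateRequestProcessorChain name="uuid">
  <!-- drop any incoming "id" value (e.g. the "1" coming from DIH)
       so the UUID processor below always generates a fresh UUID -->
  <processor class="solr.IgnoreFieldUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

If the error disappears with this in place, the DIH query or field mapping is supplying an id of its own and can be adjusted at the source.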