Solr deduplication error while indexing nutch data - solr

I had integrated nutch 2.3.1 with solr 6.5, with this I could push data to solr and get indexed. Now I want to remove duplicate elements and for this I made the modifications in schema.xml and solrconfig.xml
<field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" />
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">id</str>
<bool name="overwriteDupes">false</bool>
<str name="fields">id,content,date,url</str> <!-- changing to id <str name="fields">name,features,cat</str>-->
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update" class="solr.UpdateRequestHandler" >
<lst name="defaults">
<str name="update.chain">dedupe</str>
</lst>
</requestHandler>
but after indexing bin/nutch solrindex http://localhost:8983/solr/testcore -all
error !!
please help me to sort out this issue
thanking you in advance :)

This issue might be related to the schema updated, if you have some data existing in Solr and you updated the schema while that data exist in the core, Nutch will take it as a mismatch Schema, best way to fix this issue is re-crawling the webpage with the schema updated and keep in mind that any update to the schema will/could probably cause issues with you existing index.
Since post is already old, for future reference for people that could have the same issue.
Best :)

Related

SolrCloud Deduplication Overwrite isn't working

I've been struggling to get Deduplication to work in SolrCloud (version 8.6). My solrconfig.xml contains:
<updateRequestProcessorChain name="dedupeOn">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">dedupeId</str>
<bool name="overwriteDupes">true</bool>
<str name="fields">journal_doi,internal_pmid</str>
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
and
<requestHandler name="/update" class="solr.UpdateRequestHandler" >
<lst name="defaults">
<str name="update.chain">dedupeOn</str>
</lst>
</requestHandler>
my managedschema contains:
<field name="dedupeId" type="string" indexed="true" stored="true" multiValued="false" />
In my test, I add 1000 documents, and commit manually. I see the "dedupeId" is created with the hash.
I then add 10 more documents that I know are duplicates, and again commit manually. These 10 rows are added, and the original document with the matching dedupeId is not overwritten. For example:
"response":{"numFound":2,"start":0,"maxScore":2.1554677,"numFoundExact":true,"docs":[
{
"internal_pmid":"13367837",
"dedupeId":"7f0306ecd909a68e",
"journal_doi":"10.1097/00005053-195603000-00006"},
{
"internal_pmid":"13367837",
"dedupeId":"7f0306ecd909a68e",
"journal_doi":"10.1097/00005053-195603000-00006"}]
}}
I'm not sure if its significant, but in the solr logs, I see some "add" entries that contain, in part:
webapp=/solr path=/update params={update.distrib=TOLEADER&update.chain=dedupeOn&distrib.from=*(shard path)*/&wt=javabin&version=2}{add=[00001hLxMb (1690871781072568320)]} 0 2
but other add entries do not contain the update.chain property e.g.
webapp=/solr path=/update params={wt=javabin&version=2}{add=[00000sta0n (1690871780667817984)]} 0 2
Any help would be greatly appreciated.

Solr 7.x "/export" response handler not working with Streaming Expressions

I am doing a Solr streaming expression, and I try to use the /export handler to fetch all of the results. Consider the following query:
search(main, q=*:*, fl="SSRN",qt="/export",sort="SSRN asc")
I configured my schema.xml for the SSRN field as follows:
<field name="SSRN" type="int" indexed="true" stored="true" required="false" multiValued="false" docValues="true" />
Since the SSRN field is a docValue, it should work. The results are just the standard 10 documents. This is running in a SolrCloud environment with just one node and one shard.
Thanks in advance!
I fixed the issue. It seems that in SOLR-8426: Enable /export, /stream and /sql handlers by default and remove them from example configs, they removed the need to add /export handler to the solrconfig.xml. If you do add it, then it doesn't work. The solution is just to remove this code (from solrconfig.xml):
<requestHandler name="/export" class="solr.SearchHandler">
<lst name="invariants">
<str name="rq">{!xport}</str>
<str name="wt">xsort</str>
<str name="distrib">false</str>
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="wt">json</str>
<str name="indent">true</str>
</lst>
</requestHandler>

Solr keeps giving errors in automatic uuid generation

I got a problem with automatic uuid generation in Solr. I want Solr to generate automatically uuids for the data imported by DataImportHandler.
Here's what i did:
In schema.xml
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
<field name="id" type="uuid" indexed="true" stored="true" required="true" multiValued="false" />
In solrconfig.xml
I added:
<updateRequestProcessorChain name="uuid">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">id</str>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
I modified:
<requestHandler name="/update" class="solr.UpdateRequestHandler">
<!-- See below for information on defining
updateRequestProcessorChains that can be used by name
on each Update Request
-->
<!--
<lst name="defaults">
<str name="update.chain">dedupe</str>
</lst>
-->
<lst name="defaults">
<str name="update.chain">uuid</str>
</lst>
Also I did not comment or remove the UniqueKey and removed everything about QueryElevation.
But I just keep getting this error, which I totally have no idea where it comes out.
org.apache.solr.common.SolrException: Invalid UUID String: '1'
at org.apache.solr.schema.UUIDField.toInternal(UUIDField.java:89)
at org.apache.solr.schema.FieldType.readableToIndexed(FieldType.java:393)
at org.apache.solr.schema.FieldType.readableToIndexed(FieldType.java:398)
at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:98)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:717)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:557)
at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:71)
at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:235)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:512)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:416)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:331)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:239)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:464)
BTW, I am using Solr 4.8. Thanks very much for the reply and I really appreciate your help !!!
My guess is that you are getting field with that name coming from DIH and the UUID URP does not override one if it is present.
Try adding IgnoreFieldUpdateProcessorFactory in front and see if the problem goes away. If it does, you can start figuring out where DIH is picking it up from. For example, if you are getting data from the database and use select *, DIH will automatically try to map any fields with the identical names to what you have in schema.

Basic UIMA with SOLR

I am trying to connect UIMA with Solr. I have downloaded the Solr 3.5 dist and have it successfully running with nutch and tika on windows 7 using solrcell and curl via cygwin.
To begin, I copied the 6 jars from solr/contrib/uima/lib to the working /lib in solr.
Next, I read the readme.txt file in solr/contrib/uima/lib and edited both my solrconfig.xml and schema.xml to no avail.
I then found this link which seemed a bit more applicable since I didnt care to use Alchemy or OpenCalais: http://code.google.com/a/apache-extras.org/p/rondhuit-uima/?redir=1
Still- when I run a curl command that imports a pdf via solrcell I do not get the additional UIMA fields nor do I get anything on my logs. The test.pdf is parsed though and I see the pdf in Solr using:
curl 'http://localhost:8080/solr/update/extract?fmap.content=content&literal.id=doc1&commit=true' -F "file=#test.pdf"
SolrConfig.XML
<updateRequestProcessorChain name="uima">
<processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
<lst name="uimaConfig">
<lst name="runtimeParameters">
<str name="host">http://localhost</str>
<str name="port">8080</str>
</lst>
<str name="analysisEngine">C:\uima\desc\com\rondhuit\uima\desc\NextAnnotatorDescriptor.xml</str>
<bool name="ignoreErrors">true</bool>
<str name="logField">id</str>
<lst name="analyzeFields">
<bool name="merge">false</bool>
<arr name="fields">
<str>content</str>
</arr>
</lst>
<lst name="fieldMappings">
<lst name="type">
<str name="name">com.rondhuit.uima.next.NamedEntity</str>
<lst name="mapping">
<str name="feature">entity</str>
<str name="fieldNameFeature">uname</str>
<str name="dynamicField">*_sm</str>
</lst>
</lst>
</lst>
</lst>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update/uima" class="solr.XmlUpdateRequestHandler">
<lst name="defaults">
<str name="update.chain">uima</str>
</lst>
</requestHandler>
AND I ALSO ADJUSTED MY requestHander:
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
<lst name="defaults">
<str name="update.processor">uima</str>
</lst>
</requestHandler>
Schema.XML
<!-- fields for UIMA -->
<field name="uname" type="string" indexed="true" stored="true" multiValued="true" required="false"/>
<dynamicField name="*_sm" type="string" indexed="true" stored="true"/>
All I am trying to do is have UIMA pull out names from text (just to start as a demo) and cannot figure out what I am doing wrong.
Thank you in advance for reading this.
Not sure if this ever got addressed, but in case someone else is looking, I had this same problem yesterday. Figured out that I was calling /update/extract to use solrcell, which doesn't use uima because it's integrated into /update.

Mapping fields in SOLR for faceting

I'm indexing rich text documents into SOLR 3.4 using ExtractingRequestHandler and I'm having trouble getting it to behave like I want it to.
I would like to store creation date as a field to use for faceted search later and have defined the following in schema.xml:
<field name="creation_date" type="date" indexed="true" stored="true"/>
I index like this:
curl -s "http://localhost:8983/solr/update/extract?literal.id=myid&resource.name=myfile.xls&commit=true" -F myfile=#/path/to/myfile.xls
I get the dynamic field attr_creation_date (that other rules make sure), but I don't get it as creation_date. I have also unsuccessfully tried to use copyField like so:
<copyField source="attr_creation_date" dest="creation_date"/>
Yet another try was putting this in solrconfig.xml, but no luck:
<str name="fmap.Creation-Date">creation_date</str>
I'm pretty sure I'm missing something basic here. Any help is most appreciated!
Settings for ExtractingRequestHandler in solrconfig.xml:
<requestHandler name="/update/extract" startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="fmap.content">text</str>
<str name="fmap.Last-Save-Date">last_save_date</str>
<str name="fmap.Creation-Date">creation_date</str>
<str name="fmap.Content-Type">content_type</str>
<str name="lowernames">true</str>
<str name="uprefix">attr_</str>
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
</lst>
</requestHandler>
My schema.xml file (lots of default stuff): https://gist.github.com/1358002

Resources