Apache Nutch not indexing all documents to Apache Solr - solr

I am using Apache Nutch 2.3 (the latest version). I have crawled about 49,000 documents with Nutch. According to a MIME-type analysis, the crawled data contains about 45,000 text/html documents. But when I look at the indexed documents in Solr (4.10.3), only about 14,000 are indexed. Why is there such a huge difference (45,000 - 14,000 = 31,000)? If I assume that Nutch indexes only text/html documents, then at least 45,000 documents should be indexed.
What is the problem, and how can I solve it?

In my case the problem was due to missing Solr indexer information in nutch-site.xml. Once I updated the config, the problem was resolved. Please check your crawler log at the indexing step. In my case it reported that no Solr indexer plugin was found.
The following property was added to nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|protocol-http|indexer-solr|urlfilter-regex|parse-(html|tika)|index-(basic|more)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
<description>plugin details here </description>
</property>
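If you want to double-check that the file you edited really lists the indexer plugin, a small script can parse it. This is only a sketch: the helper name `has_solr_indexer` is made up, and the sample assumes the property sits inside the usual `<configuration>` root element.

```python
import xml.etree.ElementTree as ET

# Sample nutch-site.xml content; the <configuration> wrapper is an assumption.
SAMPLE = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|protocol-http|indexer-solr|urlfilter-regex|parse-(html|tika)|index-(basic|more)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>
</configuration>"""


def has_solr_indexer(xml_text):
    """Return True if the plugin.includes value mentions indexer-solr.

    plugin.includes is a regular expression, so this plain substring
    check is only an approximation of what Nutch itself evaluates.
    """
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == "plugin.includes":
            value = prop.findtext("value") or ""
            return "indexer-solr" in value
    return False


print(has_solr_indexer(SAMPLE))
```

To check a real file, read its text and pass it to the same function.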

You should look at your Solr logs to see if there's anything about "duplicate" documents, or look in the solrconfig.xml file for the core into which you are pushing the documents. There is likely a "dedupe" call being made on the update handler, and the fields it uses may be causing documents that look like duplicates (based on a few fields) to be dropped. You'll see something like this:
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<!-- change dedupe to uuid, or comment out the next line -->
<str name="update.chain">dedupe</str>
<str name="config">dih-config.xml</str>
</lst>
</requestHandler>
and, later in the file, the definition of the dedupe update.chain:
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">id</str>
<bool name="overwriteDupes">true</bool>
<!-- the fields below decide which documents count as duplicates -->
<str name="fields">url,date,rawline</str>
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
The "fields" element selects which input data is used to determine the uniqueness of a record. Of course, if you know there's no duplication in your input data, this is not the issue. But the above configuration will throw out any records that are duplicates on the fields shown.
You may not be using the dataimport requestHandler, but rather the "update" requestHandler. I'm not sure which one Nutch uses. Either way, you can simply comment out the update.chain, change it to a different processor chain such as "uuid", or add more fields to the "fields" declaration.
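To see why a narrow "fields" list can shrink the index, here is a rough simulation of signature-based dedupe with overwriteDupes enabled. It is a sketch, not Solr's implementation: it hashes with MD5 from Python's hashlib instead of Lookup3Signature, and the sample documents and the `title` field are invented for illustration.

```python
import hashlib


def signature(doc, fields):
    """Hash the values of the chosen fields into one signature string."""
    joined = "\x1f".join(str(doc.get(f, "")) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def index_with_dedupe(docs, fields):
    """Simulate overwriteDupes=true: a later document replaces an
    earlier one that produced the same signature."""
    index = {}
    for doc in docs:
        index[signature(doc, fields)] = doc
    return list(index.values())


# Invented sample data: the first two documents differ only in "title".
docs = [
    {"url": "http://example.com/a", "date": "2015-01-01", "rawline": "x", "title": "first"},
    {"url": "http://example.com/a", "date": "2015-01-01", "rawline": "x", "title": "second"},
    {"url": "http://example.com/b", "date": "2015-01-01", "rawline": "x", "title": "third"},
]

kept = index_with_dedupe(docs, ["url", "date", "rawline"])
print(len(docs), "sent,", len(kept), "kept")
```

Adding `title` to the field list in this toy example keeps all three documents, which mirrors the "add more fields" suggestion.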

Related

Solr More Like This Says "numFound" doesn't equal number of docs in match

I have a Solr More Like This Handler, configured as follows:
Request Handler Configuration
<requestHandler name="/themlturl" class="solr.MoreLikeThisHandler">
<lst name="defaults">
<str name="wt">json</str>
<int name="rows">5</int>
<str name="mlt.fl">name, category_stack</str>
<str name="mlt.qf">name^3 category_stack^5</str>
<str name="fl">id, name</str>
<str name="mlt">true</str>
<str name="mlt.mintf">1</str>
</lst>
</requestHandler>
Simple Query
Queries that match a single document work fine
results in
Query With More Than One Document
I am trying to get documents similar to more than one document using an OR in the q field.
This results in the following response
It is clear that Solr found the three documents, since match > numFound is 3, but match > docs contains only one document, and the results in the response are documents similar to that one document.
Does the MLT handler support multiple documents? If not, is there a solution other than querying the handler once for each document?
What I am trying to build is a simple content-based recommendation engine which is supposed to show documents similar to the ones a user saves.

Default operator AND using SOLR on Coldfusion

I just want the default operator to be AND rather than OR for every basic search. For a particular collection, I set the defaultOperator to AND in the schema.xml and solrconfig.xml files (it makes no difference), set mm to 100%, and restarted the CF Add-on Server services, but searches still behave the same. I am on ColdFusion 2018.
<cfsearch
name='qHearings'
collection='hearings_collection'
criteria='conflicts of interest'
/>
This returns documents containing the words 'conflicts' OR 'interest'. If I change it to:
<cfsearch
name='qHearings'
collection='hearings_collection'
criteria='conflicts AND of AND interest'
/>
This returns documents containing the words 'conflicts' AND 'interest'. That is good, but my users don't like being told to use AND, and I hear endless comments asking why it can't be like Google search :(
I have been reading up on Solr and it seems many people have the same problem, but whichever suggestion I try, I always get an OR search result.
Anyone got basic SOLR search to default to AND?
Thank you @MatsLindh, your comments led me down the right path! I was setting
<solrQueryParser q.op="AND"/>
in schema.xml, thinking that was where I was supposed to do it (of course, it made no difference and I still got an OR search result).
I couldn't find a Solr log for ColdFusion, but I played around with the solrconfig.xml file for one particular collection. After re-reading your comments I added
<str name="q.op">AND</str>
to the "standard" handler and it worked! I am somewhat embarrassed because it wasn't obvious to me to do it that way and for all my googling I didn't see examples of it being done that way (I only saw it as being passed in a query parameter).
So my standard handler looks like this:
<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
<!-- default values for query parameters -->
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="hl.fl">summary title </str>
<str name="df">contents</str>
<str name="q.op">AND</str>
<str name="mm">100%</str>
<!-- omp = Only More Popular -->
<str name="spellcheck.onlyMorePopular">false</str>
<!-- exr = Extended Results -->
<str name="spellcheck.extendedResults">false</str>
<!-- The number of suggestions to return -->
<str name="spellcheck.count">1</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
Super embarrassing for me that the solution was so simple.
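For comparison, the per-request form that most examples show passes q.op as a query parameter instead of a handler default. A minimal sketch of building such a URL follows; the host, port, and the `/select` path are assumptions about a typical Solr setup, and `build_select_url` is a made-up helper name.

```python
from urllib.parse import urlencode


def build_select_url(base_url, criteria, default_operator="AND"):
    """Build a select URL that sets q.op for this one request."""
    params = {"q": criteria, "q.op": default_operator}
    return base_url + "/select?" + urlencode(params)


# Assumed host and port; the collection name comes from the cfsearch tag above.
url = build_select_url(
    "http://localhost:8983/solr/hearings_collection",
    "conflicts of interest",
)
print(url)
```

Putting q.op in the handler defaults has the same effect without touching every request, which is why it suits cfsearch, where the request is built for you.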

Using atomic update in Solr get an error

I'm getting the following error in 5.2.1:
RunUpdateProcessor has received an AddUpdateCommand containing a document that appears to still contain Atomic document update operations, most likely because DistributedUpdateProcessorFactory was explicitly disabled from this updateRequestProcessorChain
I tried both cloud and single-node mode. I guess it must be something in my solrconfig.xml. Can someone please post an example of a file that works?
In solrconfig.xml I have the following, but I have also tried other variants.
<initParams path="/update/**">
<lst name="defaults">
<str name="update.chain">add-unknown-fields-to-the-schema</str>
</lst>
</initParams>
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">id</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
When trying the example in http://yonik.com/solr/atomic-updates/ it works fine but that is using dynamic fields.
BTW, I got the same error when trying SolrJ and also a curl command (with XML in a file).
Thanks.
It turned out that the following was missing from my schema.xml. Strange that I didn't read anything about it being a requirement.
<uniqueKey>id</uniqueKey>
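An atomic update names the target document by its unique key, which is presumably why the missing declaration mattered. As a rough sketch, the JSON body of such a request can be built like this; the document id `book1` and the `price` field are hypothetical.

```python
import json


def atomic_set(doc_id, field, value):
    """Build one atomic-update document that sets a single field.

    "id" must match the uniqueKey field declared in the schema.
    """
    return {"id": doc_id, field: {"set": value}}


# Hypothetical document id and field name, for illustration only.
payload = json.dumps([atomic_set("book1", "price", 12.5)])
print(payload)
```

The resulting string is what would be posted to the update handler with a JSON content type.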

Why does Solr 6.1 turn JSON single values into arrays?

I'm in the process of upgrading from 4.7 to 6.1. I was specifying fields in solrconfig.xml previously but wanted to move to the managed schema way so I can add JSON with new fields whenever I want to.
The problem is that the 6.1 managed schema is turning string values, numbers, etc. into arrays. This breaks sorting, since Solr cannot sort on array values, and it's turning my single-value dates into arrays with a single value.
The 6.1 solrconfig.xml has this:
<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
<str name="defaultFieldType">strings</str>
<lst name="typeMapping">
<str name="valueClass">java.lang.Boolean</str>
<str name="fieldType">booleans</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.util.Date</str>
<str name="fieldType">tdates</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Long</str>
<str name="valueClass">java.lang.Integer</str>
<str name="fieldType">tlongs</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Number</str>
<str name="fieldType">tdoubles</str>
</lst>
</processor>
I tried making the data types singular such as strings -> string but that didn't work.
Thanks!
Fields already created are the issue
(sorry to answer my own question but I found out the answer before anyone else did)
Changing the above snippet to singular data types works BUT...
If you already created fields dynamically under a different solrconfig.xml and then reload with singular field types, the new defaults work as expected for new fields, but the existing fields keep the definitions they were created with.
To remedy this, I unloaded the core, deleted it, recreated it, changed solrconfig.xml to the desired settings, and then added the docs again.
It worked fine after that.
UPDATE
I recommend editing the managed-schema file found in /var/solr/data/CORE_NAME/conf and predefining the fields you want, leaving the default behavior in place. You can also do this through the admin interface by adding fields.
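Fields can also be predefined programmatically through Solr's Schema API, which accepts an `add-field` command as JSON. The sketch below only builds the request body; the field name `published_at` is hypothetical, and `tdate` is assumed to be the single-valued date type present in your schema.

```python
import json


def add_field_command(name, field_type, multi_valued=False):
    """Build a Schema API add-field command for a single-valued field."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": True,
            "multiValued": multi_valued,
        }
    }


# Hypothetical field name; "tdate" is assumed to exist in the schema.
body = json.dumps(add_field_command("published_at", "tdate"))
print(body)
```

The body would be posted to the core's schema endpoint before any documents containing that field are indexed, so the field is never guessed as multi-valued.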

Batch analysing documents with solr (extracting tf idf information)

Hi, I want to extract the tf-idf values for terms in documents. After a bit of searching I found a request handler in the example configuration that can do that: http://localhost:8983/solr/tvrh/?q=id:documentid&qt=tvrh&tv=true&tv.all=true
What I want to do is batch-analyse documents. This is what I do:
1. Send a new document to the Solr update handler with commit=true
2. Query Solr for the term vectors using the above URL
The problem is that inserting a document with commit=true takes about 600ms, which is not really acceptable for my use case.
I then found http://wiki.apache.org/solr/RealTimeGet and tried to combine that with the term vector request handler:
<requestHandler name="/tvrh" class="solr.RealTimeGetHandler" startup="lazy">
<lst name="defaults">
<str name="df">text</str>
<bool name="tv">true</bool>
</lst>
<arr name="last-components">
<str>tvComponent</str>
</arr>
</requestHandler>
But then I get this response when I try to query the handler: http://pastebin.com/KtB7DBSv I suppose combining those two is not possible?
How can I improve the performance anyway? Any suggestions? Is there another approach to get the tf-idf values?
I did not find a solution to the specific problem in the question, but I found that using softCommit=true is much faster.
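The change amounts to swapping one request parameter on the update call. A minimal sketch of the two URL variants follows; the base URL is taken from the query URL in the question, and the `/update` path and the helper name `update_url` are assumptions.

```python
from urllib.parse import urlencode


def update_url(core_url, soft=True):
    """Return the update URL with either softCommit=true or commit=true."""
    flag = "softCommit" if soft else "commit"
    return core_url + "/update?" + urlencode({flag: "true"})


# Base URL assumed from the term vector URL in the question.
print(update_url("http://localhost:8983/solr"))
print(update_url("http://localhost:8983/solr", soft=False))
```

A soft commit makes the new document visible to searches without flushing the index to disk, so a hard commit is still needed at some point for durability.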
