Getting clear content (without markup) with Nutch 1.9 - Solr

Using Nutch 1.9, how do I get the clear content (without HTML markup) of crawled pages and save the .content in a readable form? Is Solr the way to do that, or can it be done without it, and how?
And a sub-question: how do I control the crawl depth with the bin/crawl script? There was an option for that (and for topN) in the bin/nutch crawl command, but that command is deprecated now and won't execute.
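(On the depth sub-question, hedged because the script changed between releases: the 1.x bin/crawl script has no -depth flag; depth is controlled by the number-of-rounds argument. Assuming the 1.9 usage of crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>, a call like
bin/crawl urls crawl http://localhost:8983/solr/ 3
runs three generate/fetch/parse/update rounds, which plays the role of -depth 3; there is no topN option, that value is set by editing the script itself.)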

Add this to nutch-site.xml:
<!-- tika properties to use BoilerPipe, according to Markus Jelsma -->
<property>
  <name>tika.use_boilerpipe</name>
  <value>true</value>
</property>
<property>
  <name>tika.boilerpipe.extractor</name>
  <value>ArticleExtractor</value>
</property>
// This is for Nutch 1.7; I'm not sure about 1.9.
Use jsoup to get plain text.
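A minimal sketch of the jsoup route (jsoup is a third-party library, so it must be on your classpath; the HTML literal below is a stand-in for whatever you read out of the crawled segment content):

import org.jsoup.Jsoup;

public class PlainTextExtractor {
    public static void main(String[] args) {
        // Stand-in for HTML pulled from a crawled page.
        String html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        // Jsoup.parse(...).text() drops all markup and returns the visible text.
        String text = Jsoup.parse(html).text();
        System.out.println(text); // prints: Title Some bold text.
    }
}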

Related

datastore-indexes.xml doesn't work, creates no index after deploy

I have created a kind called User in the Google App Engine datastore, and I am trying to add an index for this kind.
First, I followed https://cloud.google.com/appengine/docs/standard/java/config/indexconfig to create the index by adding a datastore-indexes.xml inside war/WEB-INF, but it doesn't work: no index is created after I deploy to App Engine.
The code in my datastore-indexes.xml:
<?xml version="1.0" encoding="utf-8"?>
<datastore-indexes autoGenerate="false">
  <datastore-index kind="User" ancestor="false" source="manual">
    <property name="area" direction="asc"/>
    <property name="coins_balance" direction="asc"/>
  </datastore-index>
</datastore-indexes>
Then I followed https://cloud.google.com/appengine/docs/standard/python/config/indexref: I created an index.yaml and ran gcloud app deploy index.yaml, and this time the index was actually created.
Can anyone help me understand why datastore-indexes.xml doesn't work in my case? Thanks.
As documented in the Java index config page and noted in the comments, datastore-indexes.xml is only supported through appcfg.sh at this time. To use gcloud, you'll need to configure your indexes as a YAML file.
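For reference, a sketch of the index.yaml equivalent of the datastore-indexes.xml above (asc is the default direction, so those lines could be omitted):

indexes:
- kind: User
  properties:
  - name: area
    direction: asc
  - name: coins_balance
    direction: asc

Deployed, as in the question, with gcloud app deploy index.yaml.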

Nutch 2.3.1 crawling the deep web

I followed the tutorial from the Nutch wiki, "SetupNutchAndTor" (https://wiki.apache.org/nutch/SetupNutchAndTor), and set up nutch-site.xml:
<property>
  <name>http.proxy.host</name>
  <value>127.0.0.1</value>
  <description>The proxy hostname. If empty, no proxy is used.</description>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8118</value>
  <description>The proxy port.</description>
</property>
but it still crawls nothing from the .onion link, and nothing is indexed into Solr. Does anyone know what the problem is?
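One sanity check worth running before digging into Nutch, assuming the Tor-plus-Privoxy setup from that wiki page (Privoxy as an HTTP proxy on port 8118 in front of Tor): confirm the proxy chain itself can reach an onion service; the hostname below is a placeholder:
curl -x 127.0.0.1:8118 http://someonionaddress.onion/
If that hangs or fails, the problem is in the Tor/Privoxy layer rather than in the Nutch configuration.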
Anything in the logs?
FYI, with StormCrawler you can use a SOCKS proxy directly, thanks to this commit.
You'd need to use OKHTTP for the protocol implementation and configure it like this:
http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
http.proxy.host: localhost
http.proxy.port: 9050
http.proxy.type: "SOCKS"

Integrating Grobid with Tika and Solr

I'm using Solr to index journal articles. With the out-of-the-box configuration it indexed the text of the documents, but I'm looking to use Grobid to pull out the authors, title, affiliations, etc. I got Grobid up and running as a service.
I added
<str name="tika.config">/path/to/tika-config.xml</str>
to the requestHandler for /update/extract in solrconfig.xml
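For context, a sketch of where that line sits (the handler name and class are the ones shipped in the Solr example solrconfig.xml; the fmap.content default shown is illustrative, not required):

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">text</str>
  </lst>
  <str name="tika.config">/path/to/tika-config.xml</str>
</requestHandler>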
The tika-config looks like:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
<parsers>
<parser class="org.apache.tika.parser.journal.JournalParser">
<mime>application/pdf</mime>
</parser>
</parsers>
</properties>
I'm getting a ClassNotFound exception when I try to import a document, but I can't figure out where to set the classpath to fix it.
As mentioned on the Solr users list, the latest version of Solr (6.0.0) is using a version of Tika (1.7) that predates the addition of Grobid (which came in Tika 1.11). To follow the upgrade to Tika 1.13, see SOLR-8981.

Nutch message "No IndexWriters activated" while loading to Solr

I have run the Nutch crawler as per the Nutch tutorial (http://wiki.apache.org/nutch/NutchTutorial), but when I started loading into Solr I got the message "No IndexWriters activated - check your configuration":
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -dir crawl/segments/
Indexer: starting at 2013-07-15 08:09:13
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
No IndexWriters activated - check your configuration
Indexer: finished at 2013-07-15 08:09:21, elapsed: 00:00:07
Make sure that the plugin indexer-solr is included. Go to the file conf/nutch-site.xml, and in the property plugin.includes add the plugin, for instance:
protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)
After adding the plugin, the "No IndexWriters activated - check your configuration" warning disappeared in my case.
Check this thread: http://lucene.472066.n3.nabble.com/a-plugin-extending-IndexWriter-td4074353.html
#Tryskele's and #Scott101's suggestions worked for me:
add the plugin.includes property to both /conf/nutch-site.xml and runtime/local/conf/nutch-site.xml:
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
Don't know if this is still an issue, but I was having this problem and then realized that my src/plugin/build.xml was missing the indexer-solr plugin. Adding the following and then recompiling Nutch fixed it for me:
<ant dir="indexer-solr" target="deploy"/>
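(In a Nutch 1.x source checkout, recompiling means running ant from the top-level directory, e.g.
ant runtime
which rebuilds the plugins and regenerates runtime/local.)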

I'm following the Nutch tutorial, and getting a "No URLs to fetch" error

I'm following the Apache Nutch tutorial. As indicated in the tutorial, I've set the last line of my regex-urlfilter.txt to:
+^http://([a-z0-9]*\.)*nutch.apache.org/
My nutch-site.xml file contains only the lines
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
And my seed.txt file is:
http://nutch.apache.org/
However, when I crawl with
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
I get a "No URLs to fetch" error. Anyone know why?
The configuration looks fine to me. You have made these changes in the runtime/local folder, right?
seed.txt should be in the NUTCH_HOME/runtime/local/urls folder, and
regex-urlfilter.txt and nutch-site.xml should be in the NUTCH_HOME/runtime/local/conf folder,
where NUTCH_HOME is the installation directory.
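To rule out the URL filter itself, you can also feed the seed URL through Nutch's filter checker; a sketch, assuming the 1.x class name (it reads URLs from stdin and echoes them back prefixed with + for accepted or - for rejected):
echo "http://nutch.apache.org/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined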
