Nutch "Unable to successfully parse content" - Solr

I'm trying to crawl using Nutch 1.4, but I'm running into a parsing error. This is the log file:
2012-01-09 09:12:02,696 INFO parse.ParseSegment - ParseSegment: starting at 2012-01-09 09:12:02
2012-01-09 09:12:02,697 INFO parse.ParseSegment - ParseSegment: segment: crawl/segments/20120109091153
2012-01-09 09:12:03,416 WARN parse.ParseUtil - Unable to successfully parse content http://sujitpal.blogspot.com/ of type application/xhtml+xml
2012-01-09 09:12:03,417 INFO parse.ParseSegment - Parsing: http://sujitpal.blogspot.com/
2012-01-09 09:12:03,418 WARN parse.ParseSegment - Error parsing: http://sujitpal.blogspot.com/: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
2012-01-09 09:12:03,419 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
By checking conf/nutch-site.xml I found that html|text|xhtml|xml are included in the plugin.includes property:
<property>
<name>plugin.includes</name>
<value>myplugins|protocol-httpclient|query-(basic|site|url)|summary-
basic|urlfilter-
regex|parse-(xml|xhtml|html|tika|text|js)|index-(basic|anchor)|scoring-
opic|urlnormalizer-(pass|regex|basic)|query-(basic|site|url)|response-(json|xml)
</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
Why can't it parse application/xhtml+xml or even text/xml?

Which plugins have you configured? If you are using Tika, note that Tika maps MIME types such as application/xhtml+xml to a parser. If there is no entry for the type in its config file, nothing happens.
You could disable Tika and use only the parse-html plugin.
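Applied to the plugin.includes value above, that would mean dropping tika from the parse group; a sketch (keep whatever other plugins you actually need):
<property>
<name>plugin.includes</name>
<value>myplugins|protocol-httpclient|urlfilter-regex|parse-(html|text|js)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|query-(basic|site|url)|response-(json|xml)</value>
</property>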
I tested your site with our default plugin config:
protocol-http|urlfilter-regex|parse-(html)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
And got your page parsed.
Parsed (32ms):http://sujitpal.blogspot.com/
Greetings
JPee

Related

How to prevent crawling external links with Apache Nutch?

I want to crawl only specific domains with Nutch. To do this I set db.ignore.external.links to true, as suggested in this FAQ link.
The problem is that Nutch then crawls only the links in the seed list. For example, if I put "nutch.apache.org" into seed.txt, it only finds that same URL (nutch.apache.org).
I ran the crawl script with a depth of 200, but it finished after one cycle and generated the output below.
How can I solve this problem?
I'm using Apache Nutch 1.11.
Generator: starting at 2016-04-05 22:36:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
Best Regards
You want to fetch only pages from a specific domain.
You already tried db.ignore.external.links, but that restricts the crawl to nothing beyond the seed.txt URLs.
You should instead try conf/regex-urlfilter.txt, as in the example from the Nutch 1.x tutorial:
+^http://([a-z0-9]*\.)*your.specific.domain.org/
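For the nutch.apache.org example, a complete conf/regex-urlfilter.txt might look like this (a sketch; the skip rules are abridged from the stock file, and the final "-." rejects everything that does not match the domain rule):
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept only the target domain
+^http://([a-z0-9]*\.)*nutch\.apache\.org/
# reject everything else
-.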
Are you using the crawl script? If so, make sure you give a depth greater than 1. If you run something like "bin/crawl seedfoldername crawlDb http://solrIP:solrPort/solr 1", it will crawl only the URLs listed in seed.txt.
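For example, the same command with a larger number of rounds (same placeholder paths and Solr URL as above):
bin/crawl seedfoldername crawlDb http://solrIP:solrPort/solr 5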
And to crawl only a specific domain you can use the regex-urlfilter.txt file.
Add the following property to nutch-site.xml:
<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>If true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters. </description>
</property>

Apache CXF 2.7.11 on WebSphere 8.5

I have an application that exposes web services for clients via CXF. This side of things works perfectly.
The application also needs to act as a client itself and contact other servers; this is where I am running into problems.
With "Parent First" classloading I get this:
Caused by: javax.xml.ws.WebServiceException: Error: Maintain Session is enabled but none of the session properties (Cookies, Over-written URL) are returned.
at org.apache.axis2.jaxws.ExceptionFactory.createWebServiceException(ExceptionFactory.java:173) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.ExceptionFactory.makeWebServiceException(ExceptionFactory.java:70) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.ExceptionFactory.makeWebServiceException(ExceptionFactory.java:118) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.BindingProvider.setupSessionContext(BindingProvider.java:355) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.BindingProvider.checkMaintainSessionState(BindingProvider.java:322) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.invokeSEIMethod(JAXWSProxyHandler.java:393) ~[org.apache.axis2.jar:na]
at ...
With "Parent last" classloading the application can't even expose its own services:
[23/06/15 15:33:12:985 BST] 000002d3 servlet E com.ibm.ws.webcontainer.servlet.ServletWrapper service Uncaught service() exception thrown by servlet cxf: java.lang.VerifyError: JVMVRFY013 class loading constraint violated; class=org/apache/cxf/jaxb/attachment/JAXBAttachmentUnmarshaller, method=getAttachmentAsDataHandler(Ljava/lang/String;)Ljavax/activation/DataHandler;, pc=0
at java.lang.J9VMInternals.verifyImpl(Native Method)
at java.lang.J9VMInternals.verify(J9VMInternals.java:85)
at java.lang.J9VMInternals.initialize(J9VMInternals.java:162)
I have tried disabling WebSphere's own JAX-WS engine via the WAR's MANIFEST.MF, and no matter what I try with "Parent Last" classloading I always get some error like the above: a different class depending on which JAR I have moved or replaced, but always a VerifyError.
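For reference, the manifest entry for disabling the engine (as documented by IBM) has this form:
Manifest-Version: 1.0
DisableIBMJAXWSEngine: true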
I have also gone through the official Apache documentation, various IBM guides, and countless blog and forum posts, to no avail. I am at my wit's end with this.
The same WAR runs perfectly on Tomcat, JBoss and WebLogic.
This is a complete list of all third-party JAR files:
activation-1.1.jar
antisamy-1.4.3.jar
aopalliance-1.0.jar
asm-3.3.1.jar
batik-css-1.7.jar
batik-ext-1.7.jar
batik-util-1.7.jar
bcprov-jdk15-1.46.jar
bsh-core-2.0b4.jar
commons-beanutils-1.7.0.jar
commons-beanutils-core-1.7.0.jar
commons-codec-1.3.jar
commons-collections-3.2.jar
commons-configuration-1.5.jar
commons-dbutils-1.6.jar
commons-digester-1.8.jar
commons-fileupload-1.3.1.jar
commons-httpclient-3.1.jar
commons-io-2.4.jar
commons-jexl-2.1.1.jar
commons-lang-2.4.jar
commons-logging-1.1.1.jar
cxf-api-2.7.11.jar
cxf-rt-bindings-soap-2.7.11.jar
cxf-rt-bindings-xml-2.7.11.jar
cxf-rt-core-2.7.11.jar
cxf-rt-databinding-jaxb-2.7.11.jar
cxf-rt-frontend-jaxws-2.7.11.jar
cxf-rt-frontend-simple-2.7.11.jar
cxf-rt-transports-http-2.7.11.jar
cxf-rt-ws-addr-2.7.11.jar
cxf-rt-ws-policy-2.7.11.jar
dom4j-1.6.1.jar
esapi-2.0.1.jar
FastInfoset-1.0.2.jar
geronimo-javamail_1.4_spec-1.7.1.jar
hamcrest-all-1.3.jar
hsqldb-1.8.0.10.jar
httpclient-4.3.6.jar
httpcore-4.3.3.jar
jaxen-1.1-beta-8.jar
jaxrpc-api-1.1.jar
jaxrpc-impl-1.1.3_01.jar
jaxrpc-spi-1.1.3_01.jar
joda-time-2.2.jar
js-1.7R2.jar
log4j-1.2.16.jar
logback-classic-0.9.21.jar
logback-core-0.9.21.jar
mail-1.4.7.jar
mailapi-1.4.3.jar
nekohtml-1.9.12.jar
not-yet-commons-ssl-0.3.9.jar
opensaml-2.6.1.jar
openws-1.5.1.jar
quartz-1.8.6.jar
saaj-api-1.3.5.jar
saaj-impl-1.3.jar
serializer-2.7.1.jar
slf4j-api-1.6.0.jar
slf4j-log4j12-1.6.0.jar
spring-aop-3.2.6.RELEASE.jar
spring-beans-3.2.6.RELEASE.jar
spring-context-3.2.6.RELEASE.jar
spring-core-3.2.6.RELEASE.jar
spring-expression-3.2.6.RELEASE.jar
spring-web-3.2.6.RELEASE.jar
stax2-api-3.1.4.jar
velocity-1.7.jar
vuelinkcore-20.2.3.jar
vueservlet-20.2.3.jar
woodstox-core-asl-4.2.1.jar
wsdl4j-1.6.3.jar
xml-apis-ext-1.3.04.jar
xml-resolver-1.2.jar
xmlsec-1.5.6.jar
xmltooling-1.4.1.jar
xom-1.1.jar
Does anyone know how to get Apache CXF 2.7.11 on WebSphere 8.5 to be able to act as a server and as a client?
We had the same problem using WAS 8.5 (JDK 1.7, 64-bit), CXF, JAXB and XMLBeans:
JAXB is the default XML/Java binding used by CXF. WAS 8.5 uses an endorsed JAXB API definition, version 2.2.2 (in <WebSphere-dir>\AppServer\endorsed_apis\jaxb-api.jar), and the standard implementation (in the JRE's rt.jar).
XMLBeans 2.4.x bundles org.w3c.* classes that are already present in WAS (in <WebSphere-dir>\AppServer\java_1.7_64\jre\lib\xml.jar).
In the end we solved it as follows:
First, we followed the instructions here:
http://www.ibm.com/developerworks/websphere/library/techarticles/1001_thaker/1001_thaker.html
Then we deleted the following JARs from our deployment:
activation-*,
stax-api-* (but not stax2-api!),
jaxb-api-*,
jaxb-impl-*,
xercesImpl-*,
xml-apis-*
Finally, we deleted all org.w3c.* classes inside xmlbeans-2.x.jar.
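One way to strip those classes (a sketch using the zip tool; the repackaged JAR corresponds to the xmlbeans-2.3.0-now3c.jar in the list below):
# delete the bundled org.w3c.* classes from the XMLBeans JAR
zip -d xmlbeans-2.3.0.jar 'org/w3c/*'
# rename to mark the repackaged artifact
mv xmlbeans-2.3.0.jar xmlbeans-2.3.0-now3c.jar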
This is the complete list of third-party JAR files we are successfully using:
cxf-*-2.7.11.jar
dom4j-1.6.1.jar
ehcache-2.8.2.jar
ehcache-core-2.5.1.jar
jettison-1.1.jar
neethi-3.0.3.jar
ognl-3.0.6.jar
opensaml-2.6.1.jar
openws-1.5.1.jar
spring-*-3.2.13.RELEASE.jar
stax2-api-3.1.1.jar
woodstox-core-asl-4.2.1.jar
wsdl4j-1.6.3.jar
wss4j-1.6.10.jar
xml-resolver-1.2.jar
xmlbeans-2.3.0-now3c.jar
xmlpull-1.1.3.1.jar
xmlschema-core-2.1.0.jar
xmlsec-1.5.4.jar
xmltooling-1.4.1.jar
xpp3_min-1.1.4c.jar
xstream-1.4.7.jar
We hope this is helpful.
PARENT_LAST:
Maybe you have a third-party library in your deployment that contains the javax.activation.DataHandler class. Try removing activation-1.1.jar from your deployment.
This post may be useful for you: LinkageError whilst trying to invoke CXF/SOAP webservice

Solr: where to find the Luke request handler

I'm trying to get a list of all the fields, both static and dynamic, in my Solr index. Another SO answer suggested using the Luke Request Handler for this.
It suggests finding the handler at this url:
http://solr:8983/solr/admin/luke?numTerms=0
When I try this url on my server, however, I get a 404 error.
The admin page for my core is at http://solr:8983/solr/#/mycore, so I also tried http://solr:8983/solr/#/mycore/admin/luke. That also gave a 404.
Does anyone know what I'm doing wrong? Which url should I be using?
First of all, you have to enable the Luke Request Handler. Note that if you started from the example solrconfig.xml, you probably don't need to enable it explicitly, because
<requestHandler name="/admin/" class="solr.admin.AdminHandlers" />
does it for you.
Then, if you need to access the data programmatically, make an HTTP GET request to http://solr:8983/solr/mycore/admin/luke (no hash mark!). The response is in XML by default, but by specifying the wt parameter you can obtain other formats (e.g. http://solr:8983/solr/mycore/admin/luke?wt=json).
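For example, to fetch just the field list as JSON (substitute your own host and core name):
curl "http://solr:8983/solr/mycore/admin/luke?numTerms=0&wt=json"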
If you only want to see the fields in the Solr web interface, select your core from the drop-down menu and then click "Schema Browser".
In Solr 6, solr.admin.AdminHandlers has been removed. If your solrconfig.xml has the line <requestHandler name="/admin/" class="solr.admin.AdminHandlers" />, it will fail to load, and you will see errors in the log telling you it failed to load the class org.apache.solr.handler.admin.AdminHandlers.
You must instead include in your solrconfig.xml the line
<requestHandler name="/admin/luke" class="org.apache.solr.handler.admin.LukeRequestHandler" />
Note that the URL is core-specific, i.e. http://your_server.com:8983/solr/your_core_name/admin/luke
You can specify the parameters fl, numTerms, id, and docId as follows:
/admin/luke
/admin/luke?fl=cat
/admin/luke?fl=id&numTerms=50
/admin/luke?id=SOLR1000
/admin/luke?docId=2
You can use the standalone Luke tool, which lets you explore a Lucene index directly.
You can also use the Solr admin page:
http://localhost:8983/solr/#/core/schema-browser

Map static field between nutch and solr

I use Nutch 1.4 and I would like to map a static field to Solr.
I know there is the index-static plugin. I configured it in nutch-site.xml like this:
<property>
<name>index-static</name>
<value>field:value</value>
</property>
However, the value is not sent to Solr.
Does anyone have a solution ?
It looks like the entry in nutch-default.xml is wrong.
According to the plugin source, "index.static" (not "index-static") is the correct name for the property:
String fieldsString = conf.get("index.static", null);
After using that in my nutch-site.xml I was able to send multiple fields to my Solr server.
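For example, an entry of this form (a sketch; the field names and values are placeholders, written as comma-separated field:value pairs as the plugin expects):
<property>
<name>index.static</name>
<value>source:mycrawl,collection:blogs</value>
</property>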
Also make sure that the plugin is added to list of included plugins in the "plugin.includes" property.

how to configure XsltUpdateRequestHandler

I'm looking for a simple example of configuring the XsltUpdateRequestHandler.
The Solr config in 3.4.0 is minimal:
<!-- XSLT Update Request Handler
Transforms incoming XML with stylesheet identified by tr=
-->
<requestHandler name="/update/xslt"
startup="lazy"
class="solr.XsltUpdateRequestHandler" />
And the Solr wiki page is virtually non-existent ("TODO: Write a better documentation"):
http://wiki.apache.org/solr/XsltUpdateRequestHandler
I guess I really want to know how to point to a specific XSL file, because I find the line "Transforms incoming XML with stylesheet identified by tr=" a bit cryptic.
You can simply add an XSL stylesheet to the solr/conf/xslt directory.
You can then use the XsltUpdateRequestHandler and specify that stylesheet when you index the document.
For example:
curl "http://localhost:8983/solr/update/xslt?commit=true&tr=rss2solr.xsl" -H "Content-Type: text/xml" --data-binary #blogrss.xml
For more details, there is a nice article that explains it in depth (free to download).
