Nutch 2.3.1 crawling the deep web with Solr

I followed the tutorial from the
Nutch wiki, "SetupNutchAndTor" (https://wiki.apache.org/nutch/SetupNutchAndTor),
and set up nutch-site.xml:
<property>
<name>http.proxy.host</name>
<value>127.0.0.1</value>
<description>The proxy hostname. If empty, no proxy is used.
</description>
</property>
<property>
<name>http.proxy.port</name>
<value>8118</value>
<description>The proxy port.</description>
</property>
but it still crawls nothing from the .onion links, and nothing is indexed into Solr. Does anyone know what the problem is?
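One thing worth checking: the wiki setup relies on Privoxy (listening on 8118) forwarding to Tor's SOCKS port (9050), since Nutch's http.proxy.* settings only speak HTTP. A common cause of empty fetches is a missing forward rule in Privoxy's configuration; it should contain a line like the following (the trailing dot is required — this is a sketch, your Privoxy config path may differ):

```
# /etc/privoxy/config — forward all requests, including .onion, to Tor's SOCKS port
forward-socks4a / 127.0.0.1:9050 .
```

If a plain HTTP request through 127.0.0.1:8118 to the .onion URL also fails, the problem is in the proxy chain rather than in Nutch.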

Anything in the logs?
FYI with StormCrawler you can use a SOCKS proxy directly thanks to this commit
You'd need to use OkHttp for the protocol implementation and configure it like this:
http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
http.proxy.host: localhost
http.proxy.port: 9050
http.proxy.type: "SOCKS"

Related

Log soap request and response using apache-cxf

I am working on a Maven project in IntelliJ. I have generated Java from the WSDL using cxf-codegen-plugin. I have created a client and a tester.java to test the client. I have to log the SOAP request and response. I have a cxf.xml, a config.properties and a client.java file. I am not sure where to configure the logging of the SOAP messages, and I know little about web services. I have also copied log4j.xml to my META-INF.
I have tried all the possible scenarios on Stack Overflow. I am not sure what is going wrong.
Assuming you have the latest version of CXF (or fairly recent), the easiest way is to enable the logging feature on the CXF bus in the cxf.xml:
...
<cxf:bus>
<cxf:features>
<cxf:logging/>
</cxf:features>
</cxf:bus>
...
or only on your jaxws endpoint:
<jaxws:endpoint...>
<jaxws:features>
<bean class="org.apache.cxf.feature.LoggingFeature"/>
</jaxws:features>
</jaxws:endpoint>
Make sure you have cxf-rt-features-logging-XXX.jar on your classpath (XXX = your version of CXF).
And configure logging as described here:
http://cxf.apache.org/docs/general-cxf-logging.html
You need to be at the INFO level at least.
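Since the question mentions a log4j.xml, a minimal sketch of what that means there is raising the org.apache.cxf loggers to INFO (this assumes log4j 1.x and an existing appender configuration):

```xml
<!-- log4j.xml: raise org.apache.cxf to INFO so the CXF logging feature prints SOAP messages -->
<category name="org.apache.cxf">
  <priority value="INFO"/>
</category>
```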

Getting clear content (without markup) with Nutch 1.9

Using Nutch 1.9, how do I get the clear content (without HTML markup) of crawled pages and save the content in a readable form? Is Solr the way to do that, or can it be done without it, and how?
And a sub-question: how do I control the crawl depth with the bin/crawl script? There were options for that (depth and topN) in the bin/nutch crawl command, but that command is deprecated now and won't execute.
Add this in nutch-site.xml:
<!-- tika properties to use BoilerPipe, according to Marcus Jelsma -->
<property>
<name>tika.use_boilerpipe</name>
<value>true</value>
</property>
<property>
<name>tika.boilerpipe.extractor</name>
<value>ArticleExtractor</value>
</property>
(This is for Nutch 1.7; I'm not sure about 1.9.)
Use jsoup to get plain text.
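On the depth sub-question: with the bin/crawl script there is no explicit depth flag; the number of rounds plays that role. A sketch of the invocation, if I remember the 1.9 script correctly (paths and the Solr URL are placeholders):

```
# Usage: bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
# <numberOfRounds> is effectively the crawl depth; topN is set inside the
# script itself (a sizeFetchlist variable), so edit bin/crawl to change it.
bin/crawl urls/ crawl/ http://localhost:8983/solr/ 3
```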

Want to deploy Solr 4.0.0 on JBoss 6.0.1

I am new to installing Solr on JBoss. Can anyone help me deploy Solr 4.0.0 on JBoss 6.0.1?
Thanks in advance!
You can find the instructions to install Solr on JBoss in the "official" Solr documentation (http://wiki.apache.org/solr/SolrJBoss). In general, installing Solr in any container/application server consists of deploying the war file in the deployment directory (in JBoss, the "deployments" directory of the installation) and then defining the system properties "solr.solr.home" and "solr.data.dir", which should point to the directory where you extracted/placed your Solr distribution. You can define these in the JBOSS_HOME/standalone/configuration/standalone.xml file, like this:
<system-properties>
<property name="solr.solr.home" value="/usr/local/jboss/solr/solr-4.2.1"/>
<property name="solr.data.dir" value="/usr/local/jboss/solr/data"/>
<property name="org.apache.catalina.connector.URI_ENCODING" value="UTF-8"/>
<property name="org.apache.catalina.connector.USE_BODY_ENCODING_FOR_QUERY_STRING" value="true"/>
</system-properties>
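The deployment step itself can be sketched like this (paths are illustrative and assume a standalone-mode JBoss layout; adjust to your installation):

```
# copy the Solr war into JBoss's deployment scanner directory
cp solr-4.0.0/dist/solr-4.0.0.war $JBOSS_HOME/standalone/deployments/solr.war
# JBoss picks it up automatically; watch standalone/log/server.log for errors
```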

CXF Web service client not encrypting the SOAP Request XML message

I am learning web service security and I am using the CXF framework for that. I have developed a test service that just doubles whatever value we send. Based on this tutorial,
I have added the WS-Policy for XML encryption and signature.
Then I developed the web service client for this service as an Eclipse project using CXF.
The following is my client configuration file:
<jaxws:client id="doubleItClient" serviceClass="com.DoubleIt" address="http://localhost:8080/myencws/services/DoubleItPort?wsdl">
<jaxws:features>
<bean class="org.apache.cxf.feature.LoggingFeature" />
</jaxws:features>
<jaxws:properties>
<entry key="ws-security.callback-handler" value="com.ClientKeystorePasswordCallback"/>
<entry key="ws-security.encryption.properties" value="com/clientKeystore.properties"/>
<entry key="ws-security.signature.properties" value="com/clientKeystore.properties"/>
<entry key="ws-security.encryption.username" value="myservicekey"/>
</jaxws:properties>
</jaxws:client>
I have generated all the keystore files, and I created the clientKeystore.properties file and placed it in the src directory of my project.
But whenever I run this client, the SOAP request message is not encrypted, so on the server side I get an exception like:
These policy alternatives can not be satisfied:
{http://docs.oasis-open.org/ws-sx/ws-securitypolicy/200702}EncryptedParts
{http://docs.oasis-open.org/ws-sx/ws-securitypolicy/200702}SignedParts
The following is my SOAP request:
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <ns2:doubleValue xmlns:ns2="http://com/">
      <arg0>5</arg0>
    </ns2:doubleValue>
  </soap:Body>
</soap:Envelope>
I am using CXF 2.7.3. I don't know what's wrong; please help me.
I had a similar issue with my code before; what was missing were the jar dependencies that do the actual encryption once the security policies are read by your client from the WSDL.
My fix was to add certain Maven dependencies to the POM to enable encryption. Check this URL: http://cxf.apache.org/docs/using-cxf-with-maven.html
Also read the "Enabling WS-SecurityPolicy" section at http://cxf.apache.org/docs/ws-securitypolicy.html
I hope this helps.
Make sure you are using the correct libraries. Try including only the CXF bundle and removing the other CXF dependencies.
If you are using Maven, something like this:
<dependency>
<groupId>org.apache.cxf</groupId>
<artifactId>cxf-bundle</artifactId>
<version>2.7.18</version>
</dependency>
I ran into the same issue, and after much experimentation the following guidelines have helped every single time:
Structure your CXF client config XML to import the META-INF cxf.xml.
Define the CXF bus features (for logging).
Define the HTTP conduits (if needed for the TLS handshake etc.).
Add a jaxws:client bean with a name attribute of {targetNamespaceWSDL}/PortName, with createdFromAPI=true and abstract=true.
Make the client tag contain the jaxws features. Remember to use the latest "security.*" property names and not "ws-security.*".
In your Java client class, use the SpringBus to load the CXF client config XML (see the SVN link for a SpringBus client config example).
Make sure all the required dependencies for WS-Policy processing are present on the classpath, such as cxf-rt-ws-policy and cxf-rt-ws-security, plus the BouncyCastle providers if needed.
Note:
security.signature.properties and security.encryption.properties can be externalized as well and referred to directly with an absolute path in the XML value.
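As a concrete sketch for the cxf-rt-ws-policy and cxf-rt-ws-security jars mentioned above, these are the Maven coordinates (the version must match your CXF version; 2.7.18 here is just an example):

```xml
<dependency>
  <groupId>org.apache.cxf</groupId>
  <artifactId>cxf-rt-ws-policy</artifactId>
  <version>2.7.18</version>
</dependency>
<dependency>
  <groupId>org.apache.cxf</groupId>
  <artifactId>cxf-rt-ws-security</artifactId>
  <version>2.7.18</version>
</dependency>
```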

How can I crawl PDF files served on the internet with Nutch 1.0 using the http protocol

I want to know how I can crawl PDF files that are served on the internet using Nutch 1.0 over the http protocol.
I am able to do it on local file systems using the file:// protocol, but not over http.
Add this property in the nutch-site.xml file and you will be able to crawl PDF files:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|text|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Plugins to include: HTTP fetching, regex URL filtering, HTML/text/PDF parsing, basic indexing, and OPIC scoring.</description>
</property>
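Also check conf/regex-urlfilter.txt: if the suffix-exclusion rule there lists pdf, PDF URLs are filtered out before they are ever fetched. A sketch of what to look for (the exact default rule varies between Nutch versions):

```
# In regex-urlfilter.txt, make sure 'pdf' is NOT in the skip-suffixes rule, e.g.
# -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|ZIP|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$
# If your version of this line includes |pdf, remove it so PDF URLs pass the filter.
```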
