How to use number of rounds in Nutch 2.x with Solr

I have the same problem. I use just this command for the whole process:
crawl urls/ucuzcumSeed.txt ucuzcum http://localhost:8983/solr/ucuzcum/ 10
crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
By the way, I'm using Nutch 2.3.1 and Solr 5.2.1. The problem is that I cannot fetch a whole web site with this single command, and I suspect the numberOfRounds parameter doesn't work. On the first run Nutch finds just one URL, then generates, fetches, and parses it; only in the second iteration can it get more URLs. In other words, Nutch stops at the end of the first iteration, although according to my command it should continue. What should I do to crawl a whole website with Nutch?
nutch-site.xml:
<property>
<name>http.agent.name</name>
<value>MerveCrawler</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-rege$
</property>
<property>
<name>http.content.limit</name>
<value>-1</value><!-- No limit -->
<description>The length limit for downloaded content using the http://
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>
<property>
<name>fetcher.verbose</name>
<value>true</value>
<description>If true, fetcher will log more verbosely.</description>
</property>
<property>
<name>db.max.outlinks.per.page</name>
<value>100000000000000000000000000000000000000000000</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>
<property>
<name>db.ignore.external.links</name>
<value>false</value>
<description>If true, outlinks leading from a page to external hosts
will be ignored. This is an effective way to limit the crawl to include
only initially injected hosts, without creating complex URLFilters.
</description>
</property>
<property>
<name>db.ignore.internal.links</name>
<value>false</value>
<description>If true, when adding new links to a page, links from
the same host are ignored. This is an effective way to limit the
size of the link database, keeping only the highest quality
links.
</description>
</property>
<property>
<name>fetcher.server.delay</name>
<value>10</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server. Note that this might get
overriden by a Crawl-Delay from a robots.txt and is used ONLY if
fetcher.threads.per.queue is set to 1.
</description>
</property>
<property>
<name>file.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content using the file
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the http.content.limit setting.
</description>
</property>
<property>
<name>http.timeout</name>
<value>100000000000000000000000000000000000</value>
<description>The default network timeout, in milliseconds.</description>
</property>
<property>
<name>generate.max.count</name>
<value>100000000</value>
<description>The maximum number of urls in a single
fetchlist. -1 if unlimited. The urls are counted according
to the value of the parameter generator.count.mode.
</description>
</property>

There are several reasons why the crawl might not get any further, e.g. robots.txt directives. Look at the logs and/or the content of the crawl table to get a better idea of what the problem might be.
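For example, here is a minimal sketch of how to inspect the crawl state, assuming Nutch 2.x with the HBase store (option names and the table name may differ slightly in your setup):
bin/nutch readdb -crawlId ucuzcum -stats          # per-status counts (fetched, unfetched, gone, ...)
bin/nutch readdb -crawlId ucuzcum -dump crawldump # dump rows to inspect individual URLs
# Nutch 2.x with HBaseStore names the crawl table <crawlId>_webpage:
echo "scan 'ucuzcum_webpage'" | hbase shell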

Related

How do I specify that a component within an OpenCPI application xml is within an HDL assembly?

Components within an OpenCPI application need to be specified such that their location can be identified. My XML is below. The file_read and file_write components seem to be found OK, but ocpirun reports that no acceptable implementation can be found for the other components, which are within an HDL assembly/container.
I have tried many variations along the lines of local."component", local."binary file name"."component", and many others.
<Application done='file_write'>
<Instance component='ocpi.core.file_read' name='file_read' connect='fft_1024'>
<property name='filename' value='react_jammer_rx.input'/>
<property name='granularity' value='4'/>
<property name='messageSize' value='1024'/>
</Instance>
<Instance component='fft_1024_xs' name='fft_1024' connect='peak_detector'/>
<Instance component='peak_detector_xs_us' name='peak_detector' connect='file_write'/>
<Instance component='ocpi.core.file_write' name='file_write'>
<property name='filename' value='react_jammer_rx.output'/>
</Instance>
</Application>
You need to add the project package-ID to the start of your fft_1024_xs and peak_detector_xs_us component names. If you have registered your project (using ocpidev register project in your project root directory), you can find the package-ID using ocpidev show registry; it will appear in the list.
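For example, if the package-ID turned out to be local.assets (a hypothetical value; substitute whatever ocpidev show registry reports for your project), the instances would become:
<Instance component='local.assets.fft_1024_xs' name='fft_1024' connect='peak_detector'/>
<Instance component='local.assets.peak_detector_xs_us' name='peak_detector' connect='file_write'/>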

How to get the file upload time in a WSO2 ESB sequence

I am using a WSO2 inbound endpoint to fetch a file from an FTP server, and I know how to get the file name back. Now my question is how to get the file upload time (or the last modified time)?
This is the code to get the file name.
<property expression="get-property('transport', 'FILE_NAME')" name="ftp.var.filename"
xmlns:ns="http://org.apache.synapse/xsd"
xmlns:ns2="http://org.apache.synapse/xsd"/>
I think there should be a similar code to get the timestamp of the file.
With the following property, you will be able to get the last modified time of the file polled from the inbound endpoint.
`<property expression="get-property('transport', 'LAST_MODIFIED')" name="ftp.var.last.modified.time" xmlns:ns="http://org.apache.synapse/xsd"/>`
Add this to the relevant sequence for further processing. The following is a sample sequence in which the file name and the last modified time are logged.
<?xml version="1.0" encoding="UTF-8"?>
<sequence name="fileSequence" onError="fault" xmlns="http://ws.apache.org/ns/synapse">
<log level="custom">
<property expression="get-property('transport', 'FILE_NAME')"
name="ftp.var.filename" xmlns:ns="http://org.apache.synapse/xsd"/>
<property
expression="get-property('transport', 'LAST_MODIFIED')"
name="ftp.var.last.modified.time" xmlns:ns="http://org.apache.synapse/xsd"/>
</log>
</sequence>
Please check whether this meets your requirement, and refer to [1] for further clarification.
[1]-https://github.com/wso2/wso2-synapse/blob/master/modules/transports/core/vfs/src/main/java/org/apache/synapse/transport/vfs/VFSTransportListener.java#L767

Add subscription to Plesk control panel

I add a Plesk subscription with the following XML code, but it adds the subscription without any hosting type. I want the hosting type to be "website". Please help me.
<packet>
<webspace>
<add>
<gen_setup>
<name>ggg.com</name>
<owner-login>mmm</owner-login>
<ip_address>111.111.111.111</ip_address>
<status>0</status>
</gen_setup>
<plan-name>1m</plan-name>
</add>
</webspace>
</packet>
The correct code is below; the <hosting> element with a vrt_hst (virtual hosting) section is what gives the subscription the "website" hosting type:
<packet>
<webspace>
<add>
<gen_setup>
<name>{domainName}</name>
<owner-login>{username}</owner-login>
<ip_address>111.111.111.111</ip_address>
</gen_setup>
<hosting>
<vrt_hst>
<property>
<name>ftp_login</name>
<value>ftp_{username}</value>
</property>
<property>
<name>ftp_password</name>
<value>{pass}</value>
</property>
<ip_address>111.111.111.111</ip_address>
</vrt_hst>
</hosting>
<plan-name>{plan}</plan-name>
</add>
</webspace>
</packet>

gzip compression not working in Solr 5.1

I'm trying to apply gzip compression in Solr 5.1. I understand that running Solr on Tomcat is no longer supported as of Solr 5.0, so I've tried to implement it in Solr itself.
I've downloaded jetty-servlets-9.3.0.RC0.jar and placed it in my webapp\WEB-INF folder, and have added the following in webapp\WEB-INF\web.xml:
<filter>
<filter-name>GzipFilter</filter-name>
<filter-class>org.eclipse.jetty.servlets.GzipFilter</filter-class>
<init-param>
<param-name>methods</param-name>
<param-value>GET,POST</param-value>
</init-param>
<init-param>
<param-name>mimeTypes</param-name>
<param-value>text/html,text/plain,text/xml,text/json,text/javascript,text/css,application/xhtml+xml,application/javascript,image/svg+xml,application/json,application/xml; charset=UTF-8</param-value>
</init-param>
</filter>
<filter-mapping>
<filter-name>GzipFilter</filter-name>
<url-pattern>/*</url-pattern>
</filter-mapping>
However, when I start Solr and check the browser, there's no gzip compression; I only get the following in the Response Headers output:
Content-Type:text/plain;charset=UTF-8
Transfer-Encoding:chunked
Is there anything I have configured wrongly or might have missed? I'm also running ZooKeeper 3.4.6.
Download a kosher version of Jetty that closely matches what is currently in Solr: http://www.eclipse.org/jetty/download.html
Extract that .zip to a location you will discard.
Copy these files into the Jetty bundled with Solr (located in path-to-solr/server/):
modules/gzip.mod
etc/gzip.xml
Modify modules/gzip.mod:
#
# GZIP module
# Applies GzipHandler to entire server
#
[depend]
server
[xml]
etc/gzip.xml
[ini-template]
## Minimum content length after which gzip is enabled
jetty.gzip.minGzipSize=2048
## Check whether a file with *.gz extension exists
jetty.gzip.checkGzExists=false
## Gzip compression level (-1 for default)
jetty.gzip.compressionLevel=-1
## User agents for which gzip is disabled
jetty.gzip.excludedUserAgent=.*MSIE.6\.0.*
Modify etc/gzip.xml:
<?xml version="1.0"?>
<!DOCTYPE Configure PUBLIC "-//Jetty//Configure//EN" "http://www.eclipse.org/jetty/configure_9_3.dtd">
<!-- =============================================================== -->
<!-- Mixin the GZIP Handler -->
<!-- This applies the GZIP Handler to the entire server -->
<!-- If a GZIP handler is required for an individual context, then -->
<!-- use a context XML (see test.xml example in distribution) -->
<!-- =============================================================== -->
<Configure id="Server" class="org.eclipse.jetty.server.Server">
<Call name="insertHandler">
<Arg>
<New id="GzipHandler" class="org.eclipse.jetty.server.handler.gzip.GzipHandler">
<Set name="minGzipSize">
<Property name="jetty.gzip.minGzipSize" deprecated="gzip.minGzipSize" default="2048"/>
</Set>
<Set name="checkGzExists">
<Property name="jetty.gzip.checkGzExists" deprecated="gzip.checkGzExists" default="false"/>
</Set>
<Set name="compressionLevel">
<Property name="jetty.gzip.compressionLevel" deprecated="gzip.compressionLevel" default="-1"/>
</Set>
<Set name="excludedAgentPatterns">
<Array type="String">
<Item>
<Property name="jetty.gzip.excludedUserAgent" deprecated="gzip.excludedUserAgent" default=".*MSIE.6\.0.*"/>
</Item>
</Array>
</Set>
<Set name="includedMethods">
<Array type="String">
<Item>GET</Item>
<Item>POST</Item>
</Array>
</Set>
<Set name="includedPaths"><Array type="String"><Item>/*</Item></Array></Set>
<Set name="excludedPaths"><Array type="String"><Item>*.gz</Item></Array></Set>
<Call name="addIncludedMimeTypes"><Arg><Array type="String">
<Item>text/html</Item>
<Item>text/plain</Item>
<Item>text/xml</Item>
<Item>application/xml</Item><!-- IMPORTANT - DO NOT FORGET THIS LINE -->
<Item>application/xhtml+xml</Item>
<Item>text/css</Item>
<Item>application/javascript</Item>
<Item>image/svg+xml</Item>
</Array></Arg></Call>
<!--
<Call name="addExcludedMimeTypes"><Arg><Array type="String"><Item>some/type</Item></Array></Arg></Call>
-->
</New>
</Arg>
</Call>
</Configure>
Here's the part that should make you cringe a bit.
Modify bin\solr.cmd
...
set "SOLR_JETTY_CONFIG=--module=http,gzip"
...
set "SOLR_JETTY_CONFIG=--module=https,gzip"
...
Notice that --module=http was already there. Just add ",gzip" so it matches the lines above.
I would prefer a better way of specifying the gzip module to load, but I don't know of one. If you do, please reply to this answer and tell me, because I hate modifying a script that ships with a product--it's a maintenance nightmare and, well, I think you get the picture.
After this, restart the Solr server and gzip should now be enabled--at least for &wt=xml, which is sent back as Content-Type: application/xml.
You can add whatever else you need to etc/gzip.xml and restart the Solr server for it to pick up your changes.
I tested with and without compression of 1000 documents. For me, it was the difference between 3.8 MB and 637 KB.
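As a quick check (a sketch; substitute your own core name and query for the hypothetical mycore), compare the response headers with and without an Accept-Encoding request header and look for Content-Encoding: gzip:
# Headers without compression:
curl -s -D - -o /dev/null "http://localhost:8983/solr/mycore/select?q=*:*&wt=xml"
# Headers with compression requested; expect Content-Encoding: gzip:
curl -s -D - -o /dev/null -H "Accept-Encoding: gzip" "http://localhost:8983/solr/mycore/select?q=*:*&wt=xml"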

JPA does not write to table

I have the following JPA code, with all the values checked (ticket contains a valid bean, the method ends without exception, etc.). It executes and does not throw any exceptions, yet in the end no data is written to the table.
I also tried retrieving a bean from the table; that also "works" (no data is returned, since the table is empty).
The setup is:
JBoss 6.1 Final
SQL Server 2008 Express (Microsoft SQL Server JDBC Driver 3.0)
The persistence code:
public String saveTicket() {
System.out.println("Controller saveTicket() ");
EntityManagerFactory factory = Persistence.createEntityManagerFactory("GesMan"); /* I know it would be better to share a single instance of the factory; this is just for testing */
EntityManager entityMan = factory.createEntityManager();
entityMan.persist(this.ticket);
entityMan.close();
return null; /* navigation outcome omitted */
}
The persistence unit is
<?xml version="1.0" encoding="UTF-8"?>
<persistence version="2.0" xmlns="http://java.sun.com/xml/ns/persistence" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://java.sun.com/xml/ns/persistence http://java.sun.com/xml/ns/persistence/persistence_2_0.xsd">
<persistence-unit name="GesMan" transaction-type="JTA">
<provider>org.hibernate.ejb.HibernatePersistence</provider>
<jta-data-source>java:/GesManDS</jta-data-source>
<class>es.caib.gesma.gesman.Ticket</class>
<properties>
<property name="hibernate.dialect" value="org.hibernate.dialect.SQLServerDialect"/>
<property name="hibernate.transaction.manager_lookup_class"
value="org.hibernate.transaction.JBossTransactionManagerLookup"/>
<property name="hibernate.show_sql" value="true"/>
</properties>
</persistence-unit>
</persistence>
The datasource
<datasources>
<local-tx-datasource>
<jndi-name>GesManDS</jndi-name>
<connection-url>jdbc:sqlserver://spsigeswnt14.caib.es:1433;DatabaseName=TEST_GESMAN</connection-url>
<driver-class>com.microsoft.sqlserver.jdbc.SQLServerDriver</driver-class>
<user-name>thisis</user-name>
<password>notthepassword</password>
<check-valid-connection-sql>SELECT * FROM dbo.Ticket</check-valid-connection-sql>
<metadata>
<type-mapping>MS SQLSERVER</type-mapping>
</metadata>
</local-tx-datasource>
</datasources>
Call entityMan.flush() or transaction.commit() before closing it; otherwise all changes queued on close will be discarded.
In the end it looks like I was using the wrong approach... In JBoss you can't (better said, I could not manage to) access JPA directly as you would in Java SE.
I ended up creating an EJB (with transactions) and moving all the JPA logic there.
PS: Of course, if I am wrong please tell me (by now it is more of an academic issue, but I still want to know).
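A minimal sketch of that EJB approach, assuming the GesMan persistence unit above (the bean name and method signature are illustrative): the container injects the EntityManager and runs saveTicket inside a JTA transaction, so the persist is flushed and committed automatically when the method returns.
import javax.ejb.Stateless;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import es.caib.gesma.gesman.Ticket;

@Stateless
public class TicketService {

/* Container-managed EntityManager bound to the JTA persistence unit. */
@PersistenceContext(unitName = "GesMan")
private EntityManager entityMan;

/* Runs in a container-managed transaction (REQUIRED by default),
so the persist is committed when the method returns. */
public void saveTicket(Ticket ticket) {
entityMan.persist(ticket);
}
}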
