Web crawling with Nutch - Solr

I am using an older version of both Nutch (1.4) and Solr (3.4.0), as I had installation issues with later versions. After installing, I ran a crawl and now I wish to dump the crawled URLs into a text file. These are the available options in Nutch 1.4:
Abhijeet#Abhijeet /home/apache-nutch-1.4-bin/runtime/local/bin
$ ./nutch
Usage: nutch [-core] COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
Expert: -core option is for developers only. It avoids building the job jar,
instead it simply includes classes compiled with ant compile-core.
NOTE: this works only for jobs executed in 'local' mode
There are two options - readdb and readlinkdb. Which one of these two do I need to run? The usage of the two commands is as follows:
For readdb
Abhijeet#Abhijeet /home/apache-nutch-1.4-bin/runtime/local/bin
$ ./nutch readdb
cygpath: can't convert empty path
Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
<crawldb> directory name where crawldb is located
-stats [-sort] print overall statistics to System.out
[-sort] list status sorted by host
-dump <out_dir> [-format normal|csv ] dump the whole db to a text file in <out_dir>
[-format csv] dump in Csv format
[-format normal] dump in standard format (default option)
-url <url> print information on <url> to System.out
-topN <nnnn> <out_dir> [<min>] dump top <nnnn> urls sorted by score to <out_dir>
[<min>] skip records with scores below this value.
This can significantly improve performance.
For readlinkdb
Abhijeet#Abhijeet /home/apache-nutch-1.4-bin/runtime/local/bin
$ ./nutch readlinkdb
cygpath: can't convert empty path
Usage: LinkDbReader <linkdb> (-dump <out_dir> | -url <url>)
-dump <out_dir> dump whole link db to a text file in <out_dir>
-url <url> print information about <url> to System.out
I am confused as to how to use these two commands correctly. An example would be of great help.
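For reference, concrete invocations of the two readers might look like the sketch below (the output directory names are only illustrative; myCrawl2 is the crawl directory used later in this question):
./nutch readdb myCrawl2/crawldb -dump myCrawl2/CrawlDump
./nutch readdb myCrawl2/crawldb -dump myCrawl2/CrawlDumpCsv -format csv
./nutch readlinkdb myCrawl2/linkdb -dump myCrawl2/LinkDump
Since the goal is to dump crawled URLs to a text file, readdb -dump on the crawl db is the more direct fit: in local mode it writes plain-text part files (typically part-00000) under the chosen output directory, one record per URL. readlinkdb dumps the inlink data instead.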
Edit:
So I have successfully run the readdb option and obtained the following result:
http://www.espncricinfo.com/ Version: 7
Status: 2 (db_fetched)
Fetch time: Sat Apr 15 20:40:38 IST 2017
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0042857
Signature: b7324a43f084e5b291ec56ccfb552a2a
Metadata: _pst_: success(1), lastModified=0
http://www.espncricinfo.com/afghanistan-v-ireland-2016-17/content/series/1040469.html Version: 7
Status: 2 (db_fetched)
Fetch time: Sat Apr 15 20:43:03 IST 2017
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.080714285
Signature: f3bf66dc7c6cd440ee01819b29149140
Metadata: _pst_: success(1), lastModified=0
http://www.espncricinfo.com/afghanistan-v-ireland-2016-17/engine/match/1040485.html Version: 7
Status: 1 (db_unfetched)
Fetch time: Thu Mar 16 20:43:51 IST 2017
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0014285715
Signature: null
Metadata:
But on the other hand, running the readlinkdb option dumps an empty file. Any ideas on what could be going wrong?
This is my readlinkdb command: ./nutch readlinkdb myCrawl2/linkdb -dump myCrawl2/LinkDump
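One common cause of an empty linkdb dump (a guess, not something confirmed in this post) is that the linkdb has little or nothing in it: it is populated from parsed segments by the invertlinks command, and Nutch's db.ignore.internal.links property defaults to true in 1.4, so links between pages of the same host are discarded and a single-site crawl can legitimately yield an empty linkdb. A sketch of rebuilding it, assuming the segments live under myCrawl2/segments:
./nutch invertlinks myCrawl2/linkdb -dir myCrawl2/segments
./nutch readlinkdb myCrawl2/linkdb -dump myCrawl2/LinkDump
If the dump is still empty after that, setting db.ignore.internal.links to false in conf/nutch-site.xml and re-running the crawl and invertlinks should make same-host inlinks show up.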

Related

Mining the Solr query logfile to calculate top (most frequent) searches per month

I need to retrieve Solr query stats which show, for a given time period (say one month), how many times each query term was searched.
I believe the first step is to enable logging, which produces entries in the log file.
Has anyone solved this? Is there some basic log file crunching code to spit out a list like:
eat 3000
food 2020
bread 1900
I wanted to come back and answer my own question.
I found a good GitHub project that parses the Solr log:
https://github.com/dfdeshom/solr-loganalyzer
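If all you need is a rough count from the standard request logs, a quick shell pipeline can also do it. This is only a sketch: it assumes the log file is solr.log and that each request line carries the query as a q=... parameter inside params={...}:
grep -o '[{&]q=[^&}]*' solr.log | sed 's/^[{&]q=//' | sort | uniq -c | sort -rn | head -20
Each output line is a count followed by a query string, most frequent first; URL-decoding and per-month filtering would still have to be added on top.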

Why does the log always say "No Data Available" while the cube is being built?

In the sample case from the official Kylin website, when I build the cube, the first step, Create Intermediate Flat Hive Table, always shows "No Data Available" in the log and its status stays at running.
The cube build has now been running for more than three hours.
I checked the Hive database table kylin_sales and there is data in the table.
I also found that the intermediate flat Hive table kylin_intermediate_kylin_sales_cube_402e3eaa_dfb2_7e3e_04f3_07248c04c10c
has been created successfully in Hive, but there is no data in it.
hive> show tables;
OK
...
kylin_intermediate_kylin_sales_cube_402e3eaa_dfb2_7e3e_04f3_07248c04c10c
kylin_sales
...
Time taken: 9.816 seconds, Fetched: 10000 row(s)
hive> select * from kylin_sales;
OK
...
8992 2012-04-17 ABIN 15687 0 13 95.5336 17 10000975 10000507 ADMIN Shanghai
8993 2013-02-02 FP-non GTC 67698 0 13 85.7528 6 10000856 10004882 MODELER Hongkong
...
Time taken: 3.759 seconds, Fetched: 10000 row(s)
The deploy environment is as follows:
 
zookeeper-3.4.14
hadoop-3.2.0
hbase-1.4.9
apache-hive-2.3.4-bin
apache-kylin-2.6.1-bin-hbase1x
openssh5.3
jdk1.8.0_144
I deployed the cluster with Docker and created three containers: one master and two slaves.
The Create Intermediate Flat Hive Table step is still running.
"No Data Available" means this step's log has not been captured by Kylin yet. The log is usually recorded only once the step has exited (whether it succeeded or failed); only then will you see the data.
In a case like this it usually means the job is pending in Hive, which can happen for many reasons. The simplest way to debug it is to watch Kylin's log: you will see the Hive command that Kylin executes, and you can then run that command manually in a console to reproduce the problem. Also check whether your Hive/Hadoop cluster has enough resources (CPU, memory) to execute such a query.
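A minimal sketch of that debugging loop, assuming a default install where Kylin writes its log to $KYLIN_HOME/logs/kylin.log:
# watch for the Hive statement Kylin submits for the flat-table step
tail -f $KYLIN_HOME/logs/kylin.log | grep -i "hive"
# copy the reported statement and run it by hand to see where it hangs, e.g.
hive -e "<the INSERT OVERWRITE ... statement from the log>"
If the manual run also sits there doing nothing, the problem is on the Hive/YARN side (typically no free containers or memory) rather than in Kylin itself.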

How to split a single field into multiple fields in LogParser

I am very new to Log Parser. I tried using different formats and delimiters, but that does not work for me. My log file looks like the one below:
# Version xx
# Feilds: date time c-ip
# Software : Weblogic
# Startdate : 2013-08-15 17:39:09
date value time ipaddress
When I applied the following command:
logparser.exe -o:DATAGRID "select * from abc.log_tmp"
(where abc.log_tmp is the log file that contains the data), it returns the information in the following way:
logfilename          index  content
C:xyx\abc.log_tmp    3      date time
C:xyx\abc.log_tmp    4      date time
when it should actually look like:
date time c-ip
xxx xxx xxx
xxx xxx xxx
From this I can tell that it is treating the date, time and c-ip values as a single field, when it should treat them as separate fields.
You should explicitly tell Log Parser which input format to expect (when you don't, Log Parser tries to figure it out from the filename or from the first few lines of the log file); in this case, it should work as you expect if you add "-i:W3C" to the command line.
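A sketch of what the corrected invocation might look like, assuming the log's field directive lists date, time and c-ip as in the sample above:
logparser.exe -i:W3C -o:DATAGRID "SELECT date, time, c-ip FROM abc.log_tmp"
With the W3C input format, Log Parser reads the field header and exposes date, time and c-ip as separate columns instead of one concatenated text field.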

How to query google app engine blobstore by creation date and order the results

I need to retrieve the latest set of files from the GAE blobstore. Currently my code is:
nDays = 10 #this is set at run time
gqlQuery = blobstore.BlobInfo.gql("WHERE filename = :1 ORDER BY creation DESC",<filename>)
cursor = gqlQuery.fetch(nDays)
When I iterate and print out the data by calling cursor[i].creation, it doesn't give me the last nDays starting from today. For example, today is August 20; I expect it to give me data from Aug 11 - Aug 20 (I have a file for each day). Instead it gives me data from Aug 13 back a few days.
If I remove the ORDER BY from the GQL query, it correctly returns all the results (unsorted).
If I iterate over the gqlQuery directly, with something like
for filename in gqlQuery:
    print filename.creation
it only prints from August 13 back a few days (about 8 days). I know for a fact there is data up until today; I can view it in the GAE console. Also, the creation date is stamped automatically by Google when the file is uploaded to the blobstore.
Anyone know what I'm missing?
I may also be missing something, but what is the purpose of "filename = :1" in your query?
This ran correctly against my blobstore:
gqlQuery = blobstore.BlobInfo.gql("ORDER BY creation DESC")
blobs = gqlQuery.fetch(5)
self.response.headers['Content-Type'] = 'text/html'
self.response.out.write("Lasts blobs<br>")
for blob in blobs:
    self.response.out.write(blob.filename + "<br>")
Florent

Apache2: server-status reported value for "requests/sec" is wrong. What am I doing wrong?

I am running Apache2 on Linux (Ubuntu 9.10).
I am trying to monitor the load on my server using mod_status.
There are 2 things that puzzle me (see cut-and-paste below):
1) The CPU load is reported as a ridiculously small number, whereas "uptime" reports a number between 0.05 and 0.15 at the same time.
2) The "requests/sec" value is also ridiculously low (0.06), when I know there are at least 10 requests coming in per second right now. (You can see there are close to a quarter million "accesses" - this sounds right.)
I am wondering whether this is a bug (if so, is there a fix/workaround) or maybe a configuration error (but I can't imagine how).
Any insights would be appreciated.
-- David Jones
- - - - -
Current Time: Friday, 07-Jan-2011 13:48:09 PST
Restart Time: Thursday, 25-Nov-2010 14:50:59 PST
Parent Server Generation: 0
Server uptime: 42 days 22 hours 57 minutes 10 seconds
Total accesses: 238015 - Total Traffic: 91.5 MB
CPU Usage: u2.15 s1.54 cu0 cs0 - 9.94e-5% CPU load
.0641 requests/sec - 25 B/second - 402 B/request
11 requests currently being processed, 2 idle workers
- - - - -
After I restarted my Apache server, I realized what is going on. The "requests/sec" value is calculated over the lifetime of the server, so if your Apache server has been running for 3 months, it tells you nothing at all about the current load on your server. Instead, it reports the total number of requests divided by the total number of seconds of uptime.
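This checks out against the status page above: 42 days 22 hours 57 minutes 10 seconds of uptime is 3,711,430 seconds, and 238015 / 3711430 ≈ 0.0641, exactly the reported .0641 requests/sec.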
It would be nice if there was a way to see the current load on your server. Any ideas?
Anyway, ... answered my own question.
-- David Jones
The Apache status value "Total Accesses" is the total access count since the server started; its per-second delta is what we actually mean by "requests per second".
Here is one way to get it:
1) Use an Apache monitor script for Zabbix:
https://github.com/lorf/zapache/blob/master/zapache
2) Install and configure the Zabbix agent, adding a UserParameter such as:
UserParameter=apache.status[*],/bin/bash /path/apache_status.sh $1 $2
3) In Zabbix, create an Apache template and add a monitored item:
Key: apache.status[{$APACHE_STATUS_URL}, TotalAccesses]
Type: Numeric(float)
Update interval: 20
Store value: Delta (speed per second) -- this is the key option
Zabbix will then calculate the increment of the Apache access counter and store the per-second delta, which is exactly the "requests per second" figure.
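If you just want a quick look at the current rate without setting up Zabbix, you can sample mod_status twice yourself. A minimal sketch, assuming mod_status is enabled and its machine-readable output is reachable at http://localhost/server-status?auto:
# take two samples of the Total Accesses counter, 10 seconds apart
A=$(curl -s "http://localhost/server-status?auto" | awk -F': ' '/^Total Accesses/ {print $2}')
sleep 10
B=$(curl -s "http://localhost/server-status?auto" | awk -F': ' '/^Total Accesses/ {print $2}')
echo "scale=2; ($B - $A) / 10" | bc
This prints the average requests per second over the 10-second sampling window instead of over the whole server lifetime.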
