How to index an NFS mount using Nutch?

I am trying to build a search tool hosted on a CentOS 7 machine that should index and search the directories of a mounted NFS export. I found that Nutch+Solr is the best bet for this, but I have had a hard time configuring the seed URL, since this crawl will not touch any http locations.
The mount is located on /mnt
So my seed.txt looks like this:
[root@sauron bin]# cat /root/Desktop/apache-nutch-1.13/urls/seed.txt
file:///mnt
and my regex-urlfilter.txt accepts the same location while allowing the file protocol:
# skip http: https: ftp: and mailto: urls
-^(http|https|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+^file:///mnt
However, when I try to bootstrap from the initial seed list, no URLs get injected:
[root@sauron apache-nutch-1.13]# bin/nutch inject crawl/crawldb urls
Injector: starting at 2017-06-12 00:07:49
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 1
Injector: Total urls injected after normalization and filtering: 0
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 0
Injector: finished at 2017-06-12 00:10:27, elapsed: 00:02:38
I have also tried changing the seed.txt to the following, with no luck:
file:/mnt
file:////<IP>:<export_path>
Please let me know if I am doing something wrong here.

From the URI point of view, a file system is not really that different for Nutch: you just need to enable the protocol-file plugin and configure regex-urlfilter.txt like:
+^file:///mnt/directory/
-.
This way you prevent Nutch from indexing the parent directories of the one you've specified.
Keep in mind that since you already have the NFS share mounted locally, it behaves as a normal local file system. More information can be found at https://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F.
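For the plugin part: protocol-file is enabled by overriding plugin.includes in conf/nutch-site.xml. A minimal sketch, assuming the stock Nutch 1.13 plugin list with protocol-http swapped out for protocol-file (adjust to whichever plugins you actually use):
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
You can also sanity-check the filter rules before injecting with bin/nutch filterchecker, which reads URLs from stdin and prints + or - for each one.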

Related

AngularJS Error using manifest.appcache

Case 1: After the 'cached' event coming from window.applicationCache.addEventListener, I put my phone in OFFLINE MODE (airplane mode, I mean), then I open my web app from the HomeScreen icon and everything works fine except AngularJS, which doesn't get loaded. No function or variable is recognized.
Case 2: After the 'cached' event coming from window.applicationCache.addEventListener, I open my web app from the HomeScreen icon in ONLINE MODE, and at this moment Angular works fine because it's on the correct view. BUT if I don't interact with the web app (log in, for example), when I go into offline mode, AngularJS gets an error and not even the correct view is displayed anymore.
EDIT: I realized that if, after the cached event, I close the web app and then open it again without interacting, it works fine in offline mode... How is that possible?
All the calls to Angular scripts are made from the same page where I'm using <html manifest="manifest.appcache" type="text/cache-manifest">, so why won't it work right after I send it to the home screen?
I have already tried a couple of workarounds... none of them work!
Any help is appreciated
Below is my manifest code:
CACHE MANIFEST
# Learn more:
# https://developer.mozilla.org/en-US/docs/Web/HTML/Using_the_application_cache
# -----------------------------------------------------------------------------
# It's necessary to tell web browsers to reconsider this manifest any time the
# website is updated, and you do so by changing *anything* inside the manifest.
# A common way to do this is by simply updating a commented-out string, like a
# date or a version number, or both:
# 2016-09-10:v1.122
# -----------------------------------------------------------------------------
# This is where you define all of the resources to be cached. Add new and/or
# remove old resources as needed, keeping each one on its own line. Learn more:
# https://developer.mozilla.org/en-US/docs/Web/HTML/Using_the_application_cache#Explicit_entries
CACHE:
assets\js\jquery.min.js
assets\js\main.js
assets\js\skel.min.js
assets\js\util.js
assets\js\ie\html5shiv.js
assets\js\ie\respond.min.js
js\angular.min.js
js\angular-cookies.min.js
js\app\gmApp-controller.js
js\app\gmApp-factory.js
LICENSE.txt
assets\css\ie8.css
assets\css\ie9.css
assets\css\main.css
assets\fonts\FontAwesome.otf
assets\fonts\fontawesome-webfont.eot
assets\fonts\fontawesome-webfont.svg
assets\fonts\fontawesome-webfont.ttf
assets\fonts\fontawesome-webfont.woff
assets\fonts\fontawesome-webfont.woff2
assets\css\font-awesome.min.css
images\02.png
images\avatar.jpg
images\bg_login.jpg
images\favicon.png
images\fundo-bar.jpg
images\fundo-home.jpg
images\hotel-1.png
images\logo.jpg
images\pic01.jpg
images\pic02.jpg
images\pic03.jpg
images\pic04.jpg
images\pic05.jpg
images\pic06.jpg
images\pic07.jpg
images\pic08.jpg
images\pic09.jpg
images\pic10.jpg
images\pic11.jpg
images\pic12.jpg
images\cancun\chichen-tza.jpg
images\cancun\cirque-soleil.jpg
images\cancun\coco-bongo.jpg
images\cancun\haceienta-mortero.jpg
images\cancun\isla-mujeres.jpg
images\cancun\la-isla-shopping.jpg
images\cancun\navio-pirata.jpg
images\cancun\shopong-plaza.jpg
images\cancun\tulum.jpg
images\cancun\xcaret.jpg
images\cancun\xel-ha.jpg
images\dicas\dica1.jpg
images\dicas\dica2.png
images\dicas\dica3.jpg
images\dicas\dica3a.jpg
images\dicas\dica3b.jpg
images\dicas\dica3c.jpg
images\dicas\dica3d.jpg
images\dicas\dica3e.jpg
images\dicas\dica4.jpg
images\dicas\dica5.jpg
images\dicas\dica6.jpg
images\dicas\dica6a.jpg
images\dicas\dica6b.jpg
images\dicas\dica6c.jpg
images\dicas\dica7.jpg
images\dicas\dica7a.jpg
images\dicas\dica7b.jpg
images\dicas\dica7c.jpg
images\dicas\dica9.jpg
images\dicas\rodape.jpg
images\mini-menu\Cancun-42.jpg
images\mini-menu\chichen-tza.jpg
images\mini-menu\cirque-soleil.jpg
images\mini-menu\coco-bongo.jpg
images\mini-menu\haceienta-mortero.jpg
images\mini-menu\isla-mujeres.jpg
images\mini-menu\la-isla-shopping.jpg
images\mini-menu\navio-pirata.jpg
images\mini-menu\shopong-plaza.jpg
images\mini-menu\tulum.jpg
images\mini-menu\xcaret.jpg
images\mini-menu\xel-ha.jpg
index.html
# -----------------------------------------------------------------------------
# Resources that must be retrieved from the network. The wild card ensures any
# resource not listed in the cache above will instead be downloaded from the
# network. Learn more:
# https://developer.mozilla.org/en-US/docs/Web/HTML/Using_the_application_cache#Network_entries
NETWORK:
*
# -----------------------------------------------------------------------------
# Fallbacks. In each row, if the first resource isn't available, the second
# resource is requested. Uncomment and update as needed. Learn more:
# https://developer.mozilla.org/en-US/docs/Web/HTML/Using_the_application_cache#Fallback_entries
FALLBACK:
assets\fonts\FontAwesome.otf assets\fonts\FontAwesome.otf
assets\fonts\fontawesome-webfont.eot assets\fonts\fontawesome-webfont.eot
assets\fonts\fontawesome-webfont.svg assets\fonts\fontawesome-webfont.svg
assets\fonts\fontawesome-webfont.ttf assets\fonts\fontawesome-webfont.ttf
assets\fonts\fontawesome-webfont.woff assets\fonts\fontawesome-webfont.woff
assets\fonts\fontawesome-webfont.woff2 assets\fonts\fontawesome-webfont.woff2
js\angular.min.js js\angular.min.js
js\angular-cookies.min.js js\angular-cookies.min.js
Detail: all these fallback entries are already workarounds; I'm trying to get everything working after the 'Add to home screen' event.
After a while I realized that the problem was resource references inside the .css files. After adding those references to the manifest, everything works like a charm.
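To illustrate the kind of reference that bites here (a hypothetical snippet, not taken verbatim from the project): a url() inside a stylesheet is fetched while the CSS is parsed, so its target must also be listed in the CACHE section.
/* main.css -- illustration only */
@font-face {
  font-family: 'FontAwesome';
  /* this font file must also appear under CACHE: in manifest.appcache */
  src: url('../fonts/fontawesome-webfont.woff2') format('woff2');
}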

How to prevent crawling external links with Apache Nutch?

I want to crawl only specific domains with Nutch. To do this I set db.ignore.external.links to true, as suggested in this FAQ link.
The problem is that Nutch then crawls only the links in the seed list. For example, if I put "nutch.apache.org" into seed.txt, it only finds that same URL (nutch.apache.org).
I get this result by running the crawl script with a depth of 200. It finishes after one cycle and generates the output below.
How can I solve this problem ?
I'm using apache nutch 1.11
Generator: starting at 2016-04-05 22:36:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
Best Regards
You want to fetch only pages from a specific domain.
You already tried db.ignore.external.links, but this restricts the crawl to nothing beyond the seed.txt URLs.
You should try conf/regex-urlfilter.txt, as in the example from the Nutch 1.x tutorial:
+^http://([a-z0-9]*\.)*your.specific.domain.org/
Are you using the "crawl" script? If yes, make sure you give it a depth greater than 1. If you run something like "bin/crawl seedfoldername crawlDb http://solrIP:solrPort/solr 1", it will crawl only the URLs listed in seed.txt.
To crawl a specific domain you can use the regex-urlfilter.txt file, as shown above.
Also add the following property in nutch-site.xml:
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters.</description>
</property>

Set default settings to 'no-cache' on Google Cloud Storage

Is there a way to set all public links to have 'no-cache' in Google Cloud Storage?
I've seen solutions to use gsutil to set the "Cache-Control" upon file-upload, but I'm looking for a more permanent solution.
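(For reference, the per-upload approach mentioned above looks roughly like the sketch below; bucket and object names are placeholders. gsutil setmeta covers objects that already exist.)
# set Cache-Control at upload time
gsutil -h "Cache-Control:no-cache" cp local-file gs://your-bucket/
# or change it on an existing object
gsutil setmeta -h "Cache-Control:no-cache" gs://your-bucket/some-object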
There was a conversation about providing a cache invalidation feature but I didn't quite follow the reasoning. Any explanations would be greatly appreciated!
it would be difficult to provide a cache invalidation feature, because once an object has been served with a non-zero cache TTL, any cache on the Internet (not just those under Google's control) is allowed (per the HTTP spec) to cache the data
Thanks!
For a more permanent one-time-effort solution, with the current offerings on GCP, you can do this with Cloud Functions.
Create a new Function, set the Event type to "On (finalizing/creating) file in the selected bucket" (google.storage.object.finalize), and make sure to select the bucket you want this on. In the body of the function, set the cacheControl / Cache-Control attribute for the blob. The attribute name depends on the language; here's my version in Python, using cache_control:
main.py:
# match the function name below to the Entry point
from google.cloud import storage

def set_file_uncached(event, context):
    file = event  # auto-generated
    print(f"Processing file: {file=}")  # logging, if you want it
    storage_client = storage.Client()
    # we expect just one blob with that name
    blob = storage_client.bucket(file["bucket"]).get_blob(file["name"])
    if not blob:
        # in case the blob is deleted before this executes
        print("blob not found")
        return None
    blob.cache_control = "public, max-age=0"  # or whatever you need
    blob.patch()
requirements.txt:
google-cloud-storage
From the logs: Function execution took 1712 ms, finished with status: 'ok'. This could have been faster, but I've set the minimum to 0 instances, so it needs to spin up for each upload. Depending on your usage and cost constraints, you can set it to 1 or something higher.
Other settings:
Retry on failure: No/False
Region: [wherever your bucket is]
Memory allocated: 128 MB (smallest available currently)
Timeout: 5 seconds (smallest available currently, function shouldn't take longer)
Minimum instances: 0
Maximum instances: 1
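If you prefer the CLI over the console, a roughly equivalent gen-1 deployment looks like the sketch below (the function name matches the Python above; YOUR_BUCKET is a placeholder):
gcloud functions deploy set_file_uncached \
  --runtime python39 \
  --trigger-event google.storage.object.finalize \
  --trigger-resource YOUR_BUCKET \
  --memory 128MB \
  --timeout 5s \
  --max-instances 1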

How can I get AppEngine to log info level only for my app?

So I've tried configuring AppEngine logging according to this guide, making sure logging.properties is referenced from web.xml. I've configured logging.properties the following way:
.level = WARNING
nilsnett.chinese.backend.level = INFO
The package name of my logging wrapper is nilsnett.chinese.backend. The problem is that even with this configuration, info-level log output from my app is filtered (screenshot evidence omitted here).
I've also tried the following config, which yielded the same result (including the logger class name at the end of the package name):
.level = WARNING
nilsnett.chinese.backend.JavaUtilLogger.level = INFO
To demonstrate that the logging.properties file is actually read, and that I really do write info-level logging data to AppEngine in this service call, the info-level lines do appear when I set .level=INFO (screenshot omitted).
So my desired result is to get INFO and higher-level log output from my packages, while other packages, like org.datanucleus, only show output at WARNING or more severe. In the example above, I want only the two lines marked with the purple star. Am I doing anything wrong?
Change your config to:
.level = WARNING
# Set the default logging level for the datanucleus loggers
DataNucleus.JDO.level=WARNING
DataNucleus.Persistence.level=WARNING
DataNucleus.Cache.level=WARNING
DataNucleus.MetaData.level=WARNING
DataNucleus.General.level=WARNING
DataNucleus.Utility.level=WARNING
DataNucleus.Transaction.level=WARNING
DataNucleus.Datastore.level=WARNING
DataNucleus.ClassLoading.level=WARNING
DataNucleus.Plugin.level=WARNING
DataNucleus.ValueGeneration.level=WARNING
DataNucleus.Enhancer.level=WARNING
DataNucleus.SchemaTool.level=WARNING
# FinalizableReferenceQueue tries to spin up a thread and fails. This
# is inconsequential, so don't scare the user.
com.google.common.base.FinalizableReferenceQueue.level=WARNING
com.google.appengine.repackaged.com.google.common.base.FinalizableReferenceQueue.level=WARNING
These entries come from the logging config template, so to set DataNucleus to WARNING you have to do it as in this template:
https://developers.google.com/appengine/docs/java/#Logging
Then just add your own logging config:
nilsnett.chinese.backend.level = INFO
This should solve it.
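One more thing worth double-checking: on App Engine Java, logging.properties is usually wired up via a system property in appengine-web.xml. A sketch of the usual stanza, assuming the file sits at war/WEB-INF/logging.properties:
<appengine-web-app xmlns="http://appspot.com/ns/1.0">
  <!-- ... application, version, etc. ... -->
  <system-properties>
    <property name="java.util.logging.config.file" value="WEB-INF/logging.properties"/>
  </system-properties>
</appengine-web-app>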

Nutch didn't crawl all URLs from the seed.txt

I am new to Nutch and Solr. Currently I would like to crawl a website whose content is generated by ASP. Since the content is not static, I created a seed.txt that contains all the URLs I would like to crawl. For example:
http://us.abc.com/product/10001
http://us.abc.com/product/10002
http://jp.abc.com/product/10001
http://jp.abc.com/product/10002
...
The regex-urlfilter.txt has this filter:
# accept anything else
#+.
+^http://([a-z0-9]*\.)*abc.com/
I used this command to start the crawling:
/bin/nutch crawl urls -solr http://abc.com:8983/solr/ -dir crawl -depth 10 -topN 10
The seed.txt contains 40,000+ URLs. However, I found that the content of many of the URLs cannot be found by Solr.
Questions:
Is this approach workable for a large seed.txt?
How can I check whether a URL was crawled?
Does seed.txt have a size limitation?
Thank you!
Check out the property db.max.outlinks.per.page in the Nutch configuration files.
The default value for this property is 100, so only 100 URLs will be picked up from the seed.txt and the rest will be skipped.
Change this value to a higher number to have all the URLs scanned and indexed.
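A sketch of that override in conf/nutch-site.xml (per nutch-default.xml, -1 means no limit; any positive number raises the cap):
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>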
topN indicates how many of the generated links should be fetched in each round. You could have 100 links generated, but if you set topN to 12, then only 12 of those links will be fetched, parsed, and indexed.
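As for checking whether a particular URL was crawled (the second question above), Nutch's readdb tool can print the CrawlDb entry for a single URL; a sketch using the crawl directory from the command above:
bin/nutch readdb crawl/crawldb -url http://us.abc.com/product/10001
If the URL was fetched, the output shows its status (e.g. db_fetched) along with fetch time and metadata.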
