How to make Nutch crawl the file system?

Not via HTTP (e.g. http://localhost:81 and so on), but by directly crawling a certain directory on the local file system. Is there any way to do this?

From the Nutch Wiki:
How do I index my local file system?
http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6
1) crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it'll jump off your disk onto web sites.
Change this line:
-^(file|ftp|mailto|https):
to this:
-^(http|ftp|mailto|https):
2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:
# accept anything else
+.*
3) I changed my nutch.xml to include the following:
<Parameter override="false" name="plugin.includes" value="protocol-file|protocol-http|urlfilter-regex|parse-(msword|pdf|text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"/>
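Once file: URLs are allowed and the plugins are in place, Nutch still needs a seed list pointing at the directory to crawl. As a hedged illustration (the directory and the urls/seed.txt layout are assumptions, not from the FAQ), a small Python helper can generate the file: URLs:

import os
from pathlib import Path

def write_seed_list(root, out_path="urls/seed.txt"):
    """Write one file: URL per regular file under `root` for Nutch to seed from."""
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    with open(out_path, "w") as out:
        for path in sorted(Path(root).resolve().rglob("*")):
            if path.is_file():
                out.write(path.as_uri() + "\n")  # e.g. file:///data/docs/a.pdf

write_seed_list("/data/docs")  # placeholder directory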

Nutch also has intranet crawling available; you can read the details here.

Related

How to configure Nginx autoindex to edit files?

I have configured Nginx with the autoindex module to enable directory listing, but I want to extend this feature to allow editing and saving files as well.
The thing is, I have some private IPs which need to be monitored. I have added those IPs to a file and made a script that takes the IPs from the file and monitors them by pinging. These IPs sometimes change due to DHCP, and apart from the system admins no one is very proficient with the terminal, so I wanted to provide a web UI through which the people concerned can change an IP whenever needed. I know this is possible with code, but since I am not a developer, I was trying to find a way here. Is it possible?
No, it's not possible using nginx alone.
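For what "using code" would involve: below is a minimal sketch of the kind of small app you would run behind Nginx, assuming Flask. The file path is a placeholder and there is no authentication, so treat it as an illustration rather than something to deploy.

from flask import Flask, request, render_template_string

IP_FILE = "/etc/monitoring/ips.txt"  # placeholder path
app = Flask(__name__)

PAGE = """
<form method="post">
  <textarea name="ips" rows="20" cols="40">{{ ips }}</textarea><br>
  <button type="submit">Save</button>
</form>
"""

@app.route("/", methods=["GET", "POST"])
def edit_ips():
    # On POST, overwrite the IP file with the edited contents
    if request.method == "POST":
        with open(IP_FILE, "w") as f:
            f.write(request.form["ips"])
    # Always render the current file contents in the textarea
    with open(IP_FILE) as f:
        return render_template_string(PAGE, ips=f.read())

app.run()  # serves on 127.0.0.1:5000; proxy to it from Nginx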

Use Apache to Rewrite URLs with Database Parameters as Nice URLs

For years our database driven websites have had URLs that look like:
https://www.example.com/product?id=30
But nowadays, and especially for SEO purposes, we want our URLs to be "nice" and look like:
https://www.example.com/30/myproduct
We use Zope 2.13.x running on Debian and using Apache 2.4 as the front-end webserver. I know that not too many people use Zope, but utilizing Apache's mod_rewrite we should be able to proxy the rewrite and have nice URLs that still pass the database arguments necessary in order to properly serve the pages to the end users.
There used to be a Zope Cookbook where I wrote a bunch of really detailed tutorials on Zope functionality but that no longer seems to exist and I wanted to share this with the SE community.
The awesome thing is that this is not specific to Zope, but will/should work with any rewrite of a parameter based URL into a nice URL and it's super easy once it's all working.
For complete transparency, I am going to answer my own question so that it's documented for everyone.
Using the rewrite engine in Apache, decide how you want your URLs to look to the end user in their web browser.
For example, if you are querying a database and have a URL that looks like
https://www.example.com/products?id=30&product_name=myproduct
but you want that URL to look like
https://www.example.com/products/30/myproduct
you would use a rewrite rule as follows:
RewriteRule ^/products/(.*)/(.*) /products?id=$1&product_name=$2 [L,P,NE,QSA]
To explain that further:
^/products/(.*)/(.*) is saying that anytime domain.com/products is accessed, look for two variables in the next directory names, i.e. /(.*)/(.*)
If you only wanted one variable you would do ^/products/(.*)
Likewise if you wanted three variables you would do ^/products/(.*)/(.*)/(.*)
From there we need to tell Apache how to interpret that URL in order to rewrite and still allow Zope (or whatever db you may be using) to pass the correct URL parameters. That part is:
/products?id=$1&product_name=$2
Apache will now take the first (.*) and treat that as $1. It will take the second (.*) and treat that as $2 and so on.
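To see how the capture groups map to $1 and $2, here is a quick sketch using Python's re module (an illustration only; Apache uses its own regex engine, but the capturing behaviour for this pattern is the same):

import re

# Same pattern as in the RewriteRule above
pattern = re.compile(r"^/products/(.*)/(.*)")
match = pattern.match("/products/30/myproduct")
if match:
    # group(1) and group(2) are what Apache exposes as $1 and $2
    rewritten = "/products?id=%s&product_name=%s" % (match.group(1), match.group(2))
    print(rewritten)  # prints: /products?id=30&product_name=myproduct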
The part in the brackets is extremely important:
L = This makes Apache stop processing the rewrite ruleset if the rule matches. This is important because you don't want Apache to get confused and start trying other rewrites.
P = Proxy the request. This makes sure that the browser does not display a different URL than https://www.example.com/products/30/myproduct (i.e. we do not want the end user seeing the rewritten URL https://www.example.com/products?id=30&product_name=myproduct).
NE = No Escaping of URL characters. You need this to ensure that the rewrite does not escape special characters such as $, = and &, as these are significant in URL parameters.
QSA = Query String Append. This appends the query string of the incoming request to the rewritten URL, so any additional URL parameters are preserved.
Please Note: It is very important to consider how you want your URLs to look (the nice URLs) because that is what you want to submit to the search engines. If you change your URL structure, those nice URLs will no longer work and your search engine rankings may decrease.

Import Wikipedia articles using wget or curl (on Windows)

I have a folder of Wikipedia articles (XML format).
I want to import the files through the web interface (Special:Import). Currently I do it with iMacros, but this often hangs, needs a lot of resources (memory), and can only process one file at a time. So I am looking for a better solution.
I have figured out that I have to log in to get an edit token, which is needed to upload the file.
I already read this, but got stuck.
To get this running I need two wget/curl command lines:
1. log in and get the edit token (push user and password to the form, get the edit token)
2. push the file to the form (push the edit token and content to the form)
I can build the loop to process more than one file on my own.
First of all, let's be clear: the web interface is not the right way to do this. MediaWiki installation requirements include shell access to the server, which would allow you to use importDump.php as needed for heavier imports.
Second, if you want to import a Wikipedia article from the web interface then you shouldn't be downloading the XML directly: Special:Import can do that for you. Set
$wgImportSources = array( 'wikipedia' );
or whatever (see manual), visit Special:Import, select Wikipedia from the dropdown, enter the title to import, confirm.
Third, if you want to use the commandline then why not use the MediaWiki web API, also available with plenty of clients. Most clients handle tokens for you.
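As a hedged sketch of that API route using Python's requests library (the endpoint URL, credentials and file name below are placeholders, and parameter details vary by MediaWiki version; check your api.php):

import requests

API = "https://wiki.example.org/w/api.php"  # placeholder endpoint
s = requests.Session()

# 1) Fetch a login token, then log in (a bot password is recommended)
login_token = s.get(API, params={
    "action": "query", "meta": "tokens", "type": "login", "format": "json",
}).json()["query"]["tokens"]["logintoken"]
s.post(API, data={
    "action": "login", "lgname": "your username", "lgpassword": "your password",
    "lgtoken": login_token, "format": "json",
})

# 2) Fetch a CSRF token and upload the XML dump via action=import
csrf_token = s.get(API, params={
    "action": "query", "meta": "tokens", "format": "json",
}).json()["query"]["tokens"]["csrftoken"]
with open("filename.xml", "rb") as xml_file:
    r = s.post(API, data={
        "action": "import", "format": "json", "token": csrf_token,
        # recent MediaWiki versions require an interwiki prefix for uploaded dumps
        "interwikiprefix": "wikipedia",
    }, files={"xml": xml_file})
print(r.json())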
Finally, if you really insist on using wget/curl over index.php, you can get a token by visiting api.php?action=query&meta=tokens in your browser (check api.php for the exact instructions for your MediaWiki version) and then do something like
curl -d "&action=submit&...#filename.xml" .../index.php?title=Special:Import
(intentionally partial code so that you don't run it without knowing what you're doing).

solr/browse gives a "page not found" error.

How do I make the browse page load? I have added the handler as described on this page:
https://wiki.apache.org/solr/VelocityResponseWriter
It is still not working. Can anyone brief me on this? Thanks in advance.
Couple of things to check:
Have you restarted Solr?
Is the core you are trying to 'browse' the default core? If not, you need to include the core name in the URL, e.g. /solr/collection1/browse (see the quick check after this list).
Are your library statements in solrconfig.xml pointing at the right velocity jar? Use absolute path unless you are very sure that you know what your base directory is for the relative paths
Are you getting any errors in the server logs?
If all else fails, start comparing what you have with the collection1 example in the Solr distribution. It works there, so you can compare the relevant entries nearly line by line, and even experiment with collection1 to make it more like your failing setup.
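For the second point, a quick hedged check (the host, port and core names below are assumptions) is to request /browse on each core and see which ones actually have the handler:

import requests

SOLR = "http://localhost:8983/solr"  # placeholder host/port
for core in ["collection1", "mycore"]:  # placeholder core names
    r = requests.get("%s/%s/browse" % (SOLR, core))
    print(core, r.status_code)  # 200 means the handler is wired up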

Full and detailed list of files at a domain?

I want something that creates a full list of all files/paths at a domain (mine), including size and modification date. I want the list to begin all the way at the root, not just below /public_html. I'd want to run this from my Win7 64-bit PC and have the list saved on my PC.
I do NOT want to download all the files!
Is there a Win7 64-bit tool I can use to accomplish this?
When you say files/paths "at a domain", there is generally a misunderstanding: a domain is basically a name that points to a resource; see here.
If this sounds kind of vague, it's because it is. Multiple computers can host a domain (i.e. serve up resources for the same domain), and the resources they serve up don't have to be files at all. You can point your browser at http://somesite/somefile.html, and that "somefile.html" may not exist at all (yet the site could still return a webpage).
You can't (in general) list all the files/paths at a "domain", but if you have access, you can certainly do that for one or more computers. Certain websites may provide a way to get a directory listing, but even then it would just be from the "DocumentRoot" (in Apache terms) of the website (not from root).
EDIT: If your domain is hosted on a single computer and you have full access through FTP, you could use something like the Python script in the answer here to get a remote directory listing (of that computer). You probably need to change the line that says this:
ftp.login()
to this:
ftp.login(user='your username', passwd='your password')
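For reference, here is a hedged, self-contained sketch of the same idea: walk the remote tree over FTP and print each file with its size and modification date. The host and credentials are placeholders, and MLSD support depends on the FTP server (fall back to parsing ftp.dir() output if it is unavailable).

from ftplib import FTP

def list_tree(ftp, path="/"):
    """Recursively print files under `path` with size and modification date."""
    for name, facts in ftp.mlsd(path):
        if facts.get("type") in ("cdir", "pdir"):
            continue  # skip "." and ".."
        full = path.rstrip("/") + "/" + name
        if facts.get("type") == "dir":
            list_tree(ftp, full)
        else:
            print(full, facts.get("size", "?"), facts.get("modify", "?"))

ftp = FTP("ftp.example.com")  # placeholder host
ftp.login(user="your username", passwd="your password")
list_tree(ftp, "/")
ftp.quit()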
While it may seem like the same thing, what you're really asking for is a remote directory listing of a computer, not a domain (even if a DNS lookup resolves your domain to a computer).
