Solr and Nutch - How to take control over Facets?

Sorry if this question might be too general. I'd be happy with good links to documentation, if there are any. Google won't help me find them.
I need to understand how facets can be extracted from a web site crawled by Nutch then indexed by Solr. On the web site, pages have meta tags, like <meta name="price" content="123.45"/> or <meta name="categories" content="category1, category2"/>. Can I tell Nutch to extract those and Solr to treat them as facets?
In the example above, I want to specify manually that the meta name "categories" should be treated as a facet, while its content is used dynamically as the facet values.
Does it make sense? Is it possible to do with Nutch and Solr, or should I rethink my way of using it?

I haven't used Nutch (I use Heritrix), but at the end of the day, Nutch needs to extract the "meta" tag values and index them in Solr (using SolrJ, for example) as separate Solr fields: "price", "categories", etc.
Then you do
http://localhost:8080/solr/myrep/select?q=mobile&facet=true&facet.limit=10&facet.field=categories
to get facet counts per category. Here is an overview page on Solr faceting:
http://wiki.apache.org/solr/SolrFacetingOverview
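For illustration, here is a minimal sketch (in Python, assuming Solr's JSON response format) of what the facet section of such a response looks like and how to read it. The field name "categories" and the counts are made up from the question above, not a fixed schema.

```python
import json

# Illustrative sample of a Solr JSON response with faceting enabled.
# Field names and counts are assumptions based on the question above.
sample_response = json.loads("""
{
  "responseHeader": {"status": 0},
  "response": {"numFound": 42, "docs": []},
  "facet_counts": {
    "facet_fields": {
      "categories": ["category1", 25, "category2", 17]
    }
  }
}
""")

def facet_counts(response, field):
    """Solr returns each facet field as a flat [value, count, value, count, ...]
    list; pair the entries up into a {value: count} dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))

print(facet_counts(sample_response, "categories"))
# → {'category1': 25, 'category2': 17}
```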

One option is to use Nutch with the urlmeta (metadata) plugin.
Although it is provided as an example, it is included with the distribution.
Assuming you are familiar with the rest of the process of configuring Nutch and crawling data, you need to configure Nutch to use the metadata plugin before indexing, like this.
Edit conf/nutch-site.xml
<property>
<name>plugin.includes</name>
<value>urlmeta|(rest of the plugins)</value>
</property>
The metadata tags that need to be indexed, such as price, can be supplied as another property:
<property>
<name>urlmeta.tags</name>
<value>price</value>
</property>
Now you can run the nutch crawl command. After crawling and indexing with Solr, you should see a price field in the index. Facet search can then be used by adding facet.field to your query.
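As a small sketch, the facet query from the first answer can be built programmatically; the host, core name ("myrep") and field values below are placeholders taken from this thread, not fixed names.

```python
from urllib.parse import urlencode

# Build the facet query from the earlier answer. "price" is the urlmeta
# tag configured in nutch-site.xml above; host and core are placeholders.
params = {
    "q": "mobile",
    "facet": "true",
    "facet.limit": 10,
    "facet.field": "price",
}
url = "http://localhost:8080/solr/myrep/select?" + urlencode(params)
print(url)
```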
Here are some links of interest:
Using Solr to index Nutch data: Link
Help on Solr faceting queries: Link

Related

How to index the custom extension content with solr search indexer in typo3?

I have developed a couple of custom extensions in TYPO3 and, per my requirements, I have to integrate Apache Solr into my website.
How can I index the custom module content/records for Solr? Is there a way to add the content to the Solr indexer?
Here is an example from the Solr extension manual for tt_news record indexing:
https://forge.typo3.org/projects/extension-solr/wiki/Tx_solrindex#queue
Edit: a gist with the configuration, in case the Solr documentation is moved: https://gist.github.com/jozefspisiak/a5234d61429321c756ae2e2f9ab2de75

How do you configure Apache Nutch 2.3 to honour robots metatag?

I have Nutch 2.3 set up with HBase as the backend, and I run a crawl which includes indexing to Solr and Solr deduplication.
I have recently noticed that the Solr index contains unwanted webpages.
In order to get Nutch to ignore these webpages I set the following metatag:
<meta name="robots" content="noindex,follow">
I have visited the apache nutch official website and it explains the following:
If you do not have permission to edit the /robots.txt file on your server, you can still tell robots not to index your pages or follow your links. The standard mechanism for this is the robots META tag
Searching the web for answers, I found recommendations to set Protocol.CHECK_ROBOTS or to set protocol.plugin.check.robots as a property in nutch-site.xml. Neither of these appears to work.
At present, Nutch 2.3 ignores the noindex rule and therefore indexes the content to the external datastore, i.e. Solr.
The question is how do I configure Nutch 2.3 to honour robots metatags?
Also, if Nutch 2.3 was previously configured to ignore the robots metatag and indexed a webpage during an earlier crawl cycle, will the page be removed from the Solr index in future crawls, provided the rules for the robots metatag are correct?
I've created a plugin to overcome the problem of Apache Nutch 2.3 NOT honouring the robots metatag rule noindex. The metarobots plugin forces Nutch to discard qualifying documents at indexing time. This prevents the qualifying documents from being indexed into your external datastore, i.e. Solr.
Please note: this plugin prevents the indexing of documents that contain the robots metatag rule noindex; it does NOT remove any documents that were previously indexed into your external datastore.
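The decision such a plugin makes can be sketched as follows. This is an illustrative simplification in Python, not the plugin's actual code, and the regex-based tag extraction is an assumption for the sketch (a real Nutch plugin would read the parse metadata rather than raw HTML).

```python
import re

# Simplified sketch of the metarobots decision: discard a document at
# indexing time when its robots meta tag contains the "noindex" directive.
ROBOTS_META = re.compile(
    r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']*)["\']',
    re.IGNORECASE,
)

def should_index(html):
    """Return False when the page opts out of indexing via noindex."""
    match = ROBOTS_META.search(html)
    if match is None:
        return True  # no robots meta tag: index as usual
    directives = {d.strip().lower() for d in match.group(1).split(",")}
    return "noindex" not in directives

print(should_index('<meta name="robots" content="noindex,follow">'))  # False
print(should_index('<meta name="robots" content="index,follow">'))    # True
```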
Visit this link for instructions

solr index multiple urls to the same page

I'm using Apache Nutch and Solr to build my search engine.
I found that multiple URLs point to the same page; these URLs are indexed in Solr as different results.
EX:
http://www.adab.com/modules.php?name=Sh3er&doWhat=shqas&qid=83067&r=&rc=13
http://www.adab.com/modules.php?name=Sh3er&doWhat=shqas&qid=83067&r=&rc=15
How to avoid this duplication in my search engine?
You could set up deduplication so that duplicates are discarded.
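One way to set this up on the Solr side, sketched below, is an update processor chain using SignatureUpdateProcessorFactory in solrconfig.xml. The field list and signature class here are assumptions you would tune for your schema, and the signatureField must also be declared in your schema:

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <!-- Fields used to compute the signature; adjust for your schema. -->
    <str name="fields">content</str>
    <!-- TextProfileSignature catches near-duplicates, not just exact copies. -->
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With overwriteDupes set to true, a new document whose signature matches an existing one overwrites it instead of appearing as a second result. Nutch also provides its own dedup step, which is another option.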

How do I tell Nutch to crawl *through* a url without storing it?

Let's say I have a Confluence instance, and I want to crawl it and store the results in Solr as part of an intranet search engine.
Now let's say I only want to store a subset of the pages (matching a regex) on the Confluence instance as part of the search engine.
But, I do want Nutch to crawl all the other pages, looking for links to pages that match—I just don't want Nutch to store them (or at least I don't want Solr to return them in the results).
What's the normal or least painful way to set Nutch->Solr up to work like this?
It looks like the only way to do this is to write your own IndexingFilter plugin (or find someone's to copy from).
[Will add my sample plugin code here when it's working properly]
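In the meantime, here is an illustrative sketch (in Python, mirroring the shape of Nutch's IndexingFilter.filter(), where returning null drops the document from indexing) of the filtering logic such a plugin would implement. The whitelist pattern is hypothetical.

```python
import re

# Sketch: crawl everything (so links are still followed), but drop
# documents at indexing time unless the URL matches a whitelist regex.
# The Confluence-style pattern below is a made-up example.
INDEX_WHITELIST = re.compile(r"^https?://confluence\.example\.com/display/PUBLIC/")

def filter_document(url, doc):
    """Mirror of IndexingFilter.filter(): returning None tells Nutch to
    discard the document instead of sending it to Solr."""
    if INDEX_WHITELIST.match(url):
        return doc
    return None

print(filter_document("https://confluence.example.com/display/PUBLIC/Home",
                      {"title": "Home"}))      # indexed
print(filter_document("https://confluence.example.com/display/PRIVATE/Secret",
                      {"title": "Secret"}))    # dropped (None)
```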
References:
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
http://florianhartl.com/nutch-plugin-tutorial.html
How to filter URLs in Nutch 2.1 solrindex command

solrindex way of mapping nutch schema to solr

We have several custom Nutch fields that the crawler picks up and indexes. Transferring these to Solr via solrindex (using the mapping file) works fine. The logs show everything is fine; however, the index in the Solr environment does not reflect this.
Any help will be much appreciated,
Thanks,
Ashok
What I would do is use a tool like tcpmon to monitor exactly what Nutch is sending to Solr. By examining the XML payload, you can determine whether Nutch is correctly sending those custom fields to Solr. If Nutch is sending them correctly, something is going wrong on the Solr side; otherwise, re-check your Nutch code.