Solr indexes multiple URLs that point to the same page - solr

I'm using Apache Nutch and Solr to build my search engine.
In the results I found multiple URLs that point to the same page; these URLs are indexed in Solr as separate results.
For example:
http://www.adab.com/modules.php?name=Sh3er&doWhat=shqas&qid=83067&r=&rc=13
http://www.adab.com/modules.php?name=Sh3er&doWhat=shqas&qid=83067&r=&rc=15
How can I avoid this duplication in my search engine?

You could set up deduplication so that duplicates are discarded.
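
To make the idea concrete, here is a minimal sketch of content-based deduplication in Python. It is an illustration only, not the mechanism Solr or Nutch actually uses: the function names and the choice of MD5 over whitespace-normalized, lowercased text are assumptions made for the example.

```python
import hashlib


def content_signature(text):
    """Hash the whitespace-normalized, lowercased page text (assumed scheme)."""
    normalized = " ".join(text.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def deduplicate(pages):
    """Keep the first (url, text) pair seen for each content signature."""
    seen = set()
    unique = []
    for url, text in pages:
        signature = content_signature(text)
        if signature not in seen:
            seen.add(signature)
            unique.append((url, text))
    return unique
```

With the two example URLs above, both pages would produce the same signature if their content is identical, so only the first one would be kept.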

Related

How do you configure Apache Nutch 2.3 to honour robots metatag?

I have Nutch 2.3 set up with HBase as the backend, and I run a crawl that includes indexing to Solr and Solr deduplication.
I have recently noticed that the Solr index contains unwanted webpages.
In order to get Nutch to ignore these webpages I set the following metatag:
<meta name="robots" content="noindex,follow">
I have visited the official Apache Nutch website, which explains the following:
If you do not have permission to edit the /robots.txt file on your server, you can still tell robots not to index your pages or follow your links. The standard mechanism for this is the robots META tag
Searching the web for answers, I found recommendations to set Protocol.CHECK_ROBOTS or to set protocol.plugin.check.robots as a property in nutch-site.xml. Neither appears to work.
Currently Nutch 2.3 ignores the noindex rule and therefore indexes the content to the external datastore, i.e. Solr.
The question is: how do I configure Nutch 2.3 to honour robots metatags?
Also, suppose Nutch 2.3 was previously configured to ignore the robots metatag and indexed that webpage during a previous crawl cycle. Provided the rules in the robots metatag are correct, will the page be removed from the Solr index in future crawls?
I've created a plugin to overcome the problem of Apache Nutch 2.3 NOT honouring the robots metatag rule noindex. The metarobots plugin forces Nutch to discard qualifying documents during indexing. This prevents the qualifying documents from being indexed to your external datastore, i.e. Solr.
Please note: this plugin prevents the indexing of documents that contain the robots metatag rule noindex; it does NOT remove any documents that were previously indexed to your external datastore.
Visit this link for instructions
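
The check such a filter has to make can be sketched in a few lines of Python. This is not the plugin's code (a Nutch plugin would be written in Java); the class and function names below are hypothetical and only illustrate how a noindex directive can be detected in a page's HTML.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for directive in attributes.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def should_index(html):
    """Return False when the page carries a robots noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives
```

A page containing the metatag from the question, `<meta name="robots" content="noindex,follow">`, would make `should_index` return False, so the document would be skipped at indexing time.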

Nutch 2.X - Preferred urls to fetch

I have this situation: there are over 160 URLs in my seed list, and I started crawling one week ago. I now have a lot of crawled pages in my storage, but I can see in my Solr index that some URLs from the seed list have not been crawled at all (robots.txt does not restrict them), or only in very small numbers. Is it possible to tell Nutch to prefer some URLs?
Have you checked the TopN value?
Or is Nutch still crawling? Indexing and sending data to Solr are done at the end of the process.

Solr URL with '#' causing issue in multicore

Why does the Solr URL have a '#' as part of it? There have been quite a few posts about the same question in the past, e.g. http://lucene.472066.n3.nabble.com/Curious-why-Solr-Jetty-URL-has-a-sign-td4069434.html, but with no proper workaround.
I never had any problem when I was using a single core, but once I set Solr up as multicore I started having issues with the '#' (pound sign) in the Solr URL.
For example:
solr url - http://localhost:8983/solr/
(when the above solr admin url loads in a browser, it changes to this - http://localhost:8983/solr/#/)
When I click on individual collections to get their url, this is what I get as seen below -
solr url for collection1 (core 1)- http://localhost:8983/solr/#/collection1
solr url for collection2 (core 2)- http://localhost:8983/solr/#/collection2
I have two different applications, each of which should query its own Solr collection, which means I have to provide each with its collection-specific Solr URL. When I added the URL http://localhost:8983/solr/#/collection1, the application that should use the Solr collection 'collection1' was unable to connect to Solr. It returns 'Problems were found while connecting to the SOLR server HTTP code=404 Not Found'. The same happens with the other application, which uses 'collection2'.
Please tell me how I can get rid of '#' from the solr url or any possible fix for the above issue
The # is part of the URL generated by the admin dashboard. For actually interacting with a collection, the URL format is unchanged; just remove the #:
localhost:8983/solr/collection1
optionally followed by /select, /update, or whichever handler you need.
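
The reason the admin URL fails is that everything after # is a URL fragment, which the client never sends to the server. A small Python sketch of the conversion, assuming the admin URL layout shown in the question (the function name is hypothetical):

```python
from urllib.parse import urlsplit


def core_url_from_admin_url(admin_url):
    """Turn an admin dashboard URL such as .../solr/#/collection1
    into the URL that clients should use for that core."""
    parts = urlsplit(admin_url)
    # The core name is the first path segment inside the fragment.
    core = parts.fragment.strip("/").split("/")[0]
    base = f"{parts.scheme}://{parts.netloc}{parts.path.rstrip('/')}"
    return f"{base}/{core}" if core else base
```

For example, `core_url_from_admin_url("http://localhost:8983/solr/#/collection1")` gives `http://localhost:8983/solr/collection1`, which is the value to configure in the application.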

How do I tell Nutch to crawl *through* a url without storing it?

Let's say I have a Confluence instance, and I want to crawl it and store the results in Solr as part of an intranet search engine.
Now let's say I only want to store a subset of the pages (matching a regex) on the Confluence instance as part of the search engine.
But, I do want Nutch to crawl all the other pages, looking for links to pages that match—I just don't want Nutch to store them (or at least I don't want Solr to return them in the results).
What's the normal or least painful way to set Nutch->Solr up to work like this?
It looks like the only way to do this is to write your own IndexFilter plugin (or find someone's to copy from).
[Will add my sample plugin code here when it's working properly]
References:
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
http://florianhartl.com/nutch-plugin-tutorial.html
How to filter URLs in Nutch 2.1 solrindex command

Solr and Nutch - How to take control over Facets?

Sorry if this question might be too general. I'd be happy with good links to documentation, if there are any. Google won't help me find them.
I need to understand how facets can be extracted from a web site crawled by Nutch then indexed by Solr. On the web site, pages have meta tags, like <meta name="price" content="123.45"/> or <meta name="categories" content="category1, category2"/>. Can I tell Nutch to extract those and Solr to treat them as facets?
In the example above, I want to specify manually that the meta name "categories" is to be treated as a facet, but the content should be dynamically used as categories.
Does it make sense? Is it possible to do with Nutch and Solr, or should I rethink my way of using it?
I haven't used Nutch (I use Heritrix), but at the end of the day, Nutch needs to extract the "meta" tag values and index them in Solr (using SolrJ, for example) as separate Solr fields such as "price", "categories", etc.
Then you do
http://localhost:8080/solr/myrep/select?q=mobile&facet=true&facet.limit=10&facet.field=categories
to get facets per categories. Here is a page on facets:
http://wiki.apache.org/solr/SolrFacetingOverview
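
As a small sketch, the facet query above can be assembled in Python with the standard library. The helper name and its defaults are assumptions for illustration; the parameters are the ones shown in the URL above.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, query, facet_field, limit=10):
    """Build a Solr /select URL that requests facet counts for one field."""
    params = [
        ("q", query),
        ("facet", "true"),
        ("facet.limit", str(limit)),
        ("facet.field", facet_field),
    ]
    return f"{base_url}/select?{urlencode(params)}"
```

Calling `facet_query_url("http://localhost:8080/solr/myrep", "mobile", "categories")` reproduces the query shown above.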
One option is to use Nutch with the metadata plugin.
Although it is presented as an example, it is included with the distribution.
This assumes you already know how to configure Nutch and crawl data with it.
Before indexing, you need to configure Nutch to use the metadata plugin like this.
Edit conf/nutch-site.xml
<property>
<name>plugin.includes</name>
<value>urlmeta|(rest of the plugins)</value>
</property>
The metadata tags that need to be indexed, such as price, can be supplied as another property:
<property>
<name>urlmeta.tags</name>
<value>price</value>
</property>
Now, you can run the nutch crawl command. After crawling and indexing with solr, you should see a field price in the index. The facet search can be used by adding facet.field in your query.
Here are some links of interest.
Using Solr to index Nutch data
Help on Solr faceting queries
