Document index-time boosting in Solr via Apache Nutch

I am crawling some websites using Apache Nutch, and I need to boost one website above all the others. Suppose that, out of 100 seed URLs, one is a wiki URL. I want to give all data from the wiki a boost so that it is displayed at the top of the results. I am using Solr 4.10.3.
I recrawl these websites every few days, so I think an index-time boost applied in Solr alone will not work; Nutch would have to apply it. Any ideas?

Related

How can I integrate Solr and Nutch in Cloudera?

I have started the ZooKeeper and Solr services, and in Cloudera Manager I have also created an HDFS service. But I am still not able to get Nutch and Solr working together in Cloudera.
I do not know which steps to follow in order to crawl and index new URLs and get query results from the Solr index.
Does anyone know how to proceed?

Nutch not crawling URLs specified in seed.txt

I have just installed Nutch integrated with Solr and started crawling, but Nutch is not immediately crawling the URLs I specify in seed.txt. It is injecting old URLs that I may have given earlier but have since commented out. It looks like Nutch is injecting URLs in some strange order. What is the reason? Also, could anybody point me to a book or detailed tutorial on Nutch? Most of the tutorials available cover only installation.

As mentioned in an answer to a similar question, the old URLs are still in Nutch's crawldb.
You can nuke your previous runs completely like this user did and start fresh, or you can remove the unwanted URLs a few different ways via CrawlDbMerger:
CLI via bin/nutch mergedb
CLI via bin/nutch updatedb
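
As a rough illustration of the mergedb route, the sketch below assembles the command line from Python. It assumes that the unwanted URLs are excluded by the URL filter rules configured for Nutch (for example in conf/regex-urlfilter.txt) and that mergedb is run with its -filter option; the paths crawl/crawldb and crawl/crawldb_filtered are placeholders. None of these details come from the answer above, so treat this as a sketch rather than the answerer's exact procedure.

```python
def mergedb_command(output_db, input_dbs, nutch="bin/nutch", use_filter=True):
    """Return the argument list for `bin/nutch mergedb`.

    mergedb writes a new crawldb to output_db from one or more input
    crawldbs; with -filter, URLs rejected by the configured URL filters
    are left out of the output.
    """
    if not input_dbs:
        raise ValueError("at least one input crawldb is required")
    cmd = [nutch, "mergedb", output_db, *input_dbs]
    if use_filter:
        cmd.append("-filter")
    return cmd


if __name__ == "__main__":
    # Placeholder paths (assumptions); run from the Nutch installation directory.
    cmd = mergedb_command("crawl/crawldb_filtered", ["crawl/crawldb"])
    print(" ".join(cmd))
    # To actually run Nutch, pass cmd to subprocess.run(cmd, check=True).
```

The filtered output would then take the place of the old crawldb for later crawl rounds; keeping a backup of the original directory until the result has been checked seems prudent.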

How does the Apache Solr server connect to the Drupal DB?

First, thanks to Stack Overflow for supporting everyone.
I am new to Drupal and the Solr server.
I have successfully installed the Solr server on my system, and I am able to search the data using the "Apache Solr search module" in Drupal 7.
But I do not actually know what background process is running, and in order to work with it I need a basic understanding of it. Drupal connects to the Solr server using the URL that I provided in the admin UI.
As far as I know, the following is the backend flow of the Apache Solr server module:
1) It sends the search string request from Drupal to the Solr server.
2) The Solr server searches for the string and sends the result back to Drupal in JSON format.
3) Drupal displays the results.
But how does the Solr server connect to the Drupal DB in order to search for the string or content?
Please help with this. I really need to understand the backend flow of how the request is handled.
Thank you

I'm not a Drupal specialist, but from the Solr perspective you are searching documents that were previously indexed in Solr. That is, all documents must be indexed in Solr before they can be searched.
Therefore, you have two options here:
1) You call the Solr API from your backend and push documents to the Solr index. There are Drupal-specific solutions you may want to research, but here is the wiki article, from the Solr perspective, describing how to index documents using only the JSON API: http://wiki.apache.org/solr/UpdateJSON
2) You connect to your database directly from Solr and pull documents into the Solr index. Here is the related wiki page: http://wiki.apache.org/solr/DataImportHandler
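
To make the first (push) option more concrete, here is a minimal Python sketch of the index-then-search flow using only the standard library. The base URL, the core name collection1, and the field names id and title are assumptions for illustration; a real Drupal site would rely on the Drupal module's own schema and indexing code rather than anything like this.

```python
import json
import urllib.parse
import urllib.request

# Assumed Solr location and core name; adjust to your installation.
SOLR_BASE = "http://localhost:8983/solr/collection1"


def build_update_request(docs, base=SOLR_BASE):
    """Build (but do not send) a JSON update request that adds docs and commits."""
    url = base + "/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )


def build_select_url(query, base=SOLR_BASE, rows=10):
    """Build the URL for a search against the index, asking for JSON output."""
    params = urllib.parse.urlencode({"q": query, "wt": "json", "rows": rows})
    return base + "/select?" + params


def index_docs(docs):
    """Push documents to Solr; they become searchable after the commit."""
    with urllib.request.urlopen(build_update_request(docs)) as resp:
        return json.load(resp)


def search(query):
    """Search the documents already in the index; the database is not consulted."""
    with urllib.request.urlopen(build_select_url(query)) as resp:
        return json.load(resp)["response"]["docs"]
```

With a Solr instance running at the assumed address, index_docs([{"id": "1", "title": "hello"}]) followed by search("title:hello") would exercise both halves of the flow. The point of the split is that search() only ever talks to Solr's own index, which is why content has to be pushed (or pulled) into Solr first.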

Plugging a Solr index into a Django app using Haystack

I have some data in a MySQL database, and it is indexed through the Solr admin app. I want to expose this data as facets in a Django app. I explored AJAX Solr, and it seems to be a very good solution for integrating Solr into a web app with faceting, tags, etc. I was also evaluating Haystack to see whether it would provide a better solution.
Is there a way to integrate an existing Solr index with Haystack? It looks like we need to create a model and populate it, and Solr would then index the model. But in my case, I already have the index built and just want to integrate it into Django through Haystack. I would appreciate any thoughts on this.
Thanks,
Joe.

How to search for keywords using Solr in web pages crawled by Nutch?

I have an application that crawls websites using Apache Nutch 2.1 and persists the data to MySQL. I have to integrate Nutch and Solr, which is not a problem, as enough documentation is available on the internet.
After storing content from the web pages, I want to add search functionality based on Solr. I need to search for keywords in the web pages. For example, if I am crawling movie-related websites and I want to search the crawled data for a specific movie (as a keyword), what changes do I need to make to the Solr configuration? Do I need to write a separate plugin altogether, or can I use existing plugins? What type of indexing do I have to add to the Solr configuration?