I have tried indexing the public URL of a Google Drive document, but it does not seem to work. Is there any way to crawl Google Drive documents with Nutch and index them in Solr?
Use the Google Drive API to read/manage files:
https://developers.google.com/drive/web/about-sdk
A Drive public URL page won't have direct links to subdirectories, so you will get nothing if you crawl those pages.
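As a minimal sketch, assuming the folder is shared publicly and you have an API key (FOLDER_ID and YOUR_API_KEY are placeholders), you could list the files in a folder with the Drive v3 REST API and then feed the resulting file metadata and links to your indexer:
# list files inside a publicly shared folder; FOLDER_ID and YOUR_API_KEY are placeholders
curl "https://www.googleapis.com/drive/v3/files?q='FOLDER_ID'+in+parents&key=YOUR_API_KEY"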
I am new to Solr and I am struggling with testing search for the first time. I see in online tutorials that people test their cores in many different ways. Can I test a core on a Dropbox shared folder? If so, how?
This is going to be a search engine for a website that will have articles, blog posts, and other references.
Dropbox is simple storage, not a place where you can execute anything.
You need an environment where you can run the Solr server, for example a Linux server.
You can put your Solr home and your Solr binaries into a Dropbox folder and sync/mount that folder on a server. Then you can run the Solr service on the machine where the Dropbox folder is mounted.
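A rough sketch of that setup, assuming the Dropbox folder is already synced to a Linux server under ~/Dropbox and contains a Solr distribution (both paths here are hypothetical):
# run Solr from the synced folder, pointing solr.home at the synced config directory
cd ~/Dropbox/solr-4.10.3
bin/solr start -s ~/Dropbox/solr-home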
I am crawling some websites using Apache Nutch, and I have to boost one website above all the others. Suppose that out of 100 URLs there is a wiki URL in the seed list; I want to give all data from the wiki a boost so that it is displayed at the top. I am using Solr 4.10.3.
I recrawl these websites every few days, so I think an index-time boost via Solr will not work; it would have to be Nutch that does it. Any ideas?
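One approach that survives recrawls is a query-time boost on the Solr side instead of an index-time boost from Nutch. A sketch using the edismax bq parameter (the collection name is a placeholder; host is a field in the standard Nutch schema; %5E is a URL-encoded ^):
curl "http://localhost:8983/solr/collection1/select?defType=edismax&q=your+query&bq=host:en.wikipedia.org%5E10"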
I am implementing SolrCloud for the first time. I've worked with normal Solr and have that down pretty well, but I'm not finding a lot on what you can and can't do with SolrCloud. So my question is about managed resources. I know you can CRUD stop words and synonyms using the new RESTful API in Solr. But with the cloud, do I need to send my CRUD changes to each individual Solr server, or do I send them to a single URL that propagates them to every server? I'm new to the cloud and ZooKeeper, and I have not found anything in the Solr wiki about working with managed resources in a cloud setup. Any advice would be helpful.
In SolrCloud, configuration and other files like stopwords are stored and maintained by ZooKeeper, which means you do not need to send updates to each server individually.
Once you have SolrCloud, you create a collection before putting in any data. Each collection has its own config folder and set of resources.
So, for example, if you have a collection called techproducts on two servers, localhost1 and localhost2, the commands below will operate on the same resource from either server.
curl "http://localhost1:8983/solr/techproducts/schema/analysis/synonyms/english"
curl "http://localhost2:8983/solr/techproducts/schema/analysis/synonyms/english"
I have an application which crawls websites using Apache Nutch 2.1 and persists the data to MySQL. I have to integrate Nutch and Solr, which is not a problem, as enough documentation is available on the internet.
After storing the content from the webpages, I want to add search functionality based on Solr. I need to search for keywords in the webpages. For example, if I am crawling movie-related websites and I want to search for a specific movie (as a keyword) in the crawled data, what changes do I need to make to the Solr configuration? Do I need to write a separate plugin altogether, or can I use existing plugins? What type of indexing do I have to add to the Solr configuration?
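For what it's worth, the default Nutch schema for Solr already indexes the page text in a content field and the page title in title, so a plain keyword search usually needs no extra plugin. A sketch, assuming a core named nutch (placeholder) and an example movie title as the keyword:
curl "http://localhost:8983/solr/nutch/select?q=content:%22the+godfather%22&fl=url,title&wt=json"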
I want to create a custom search engine for multiple domains.
How can I use Solr with Nutch to create a custom search for 500+ domains, where each domain's search shows only its own data?
E.g. example.com, example2.com, example3.com, and so on: whenever a user searches on example.com, he should get data which belongs to example.com, and the same for example2.com and so on.
These websites may be blogs, e-commerce sites, classified sites, or hotel reservation sites.
Any suggestions would be appreciated.
This should be possible right out of the box. When you index to Solr using the Nutch schema, it has a field called site that stores the domain. In the search interface (which you will build), when a user searches on a given domain (aka site), you just have to pass a filter query like "site:domain" so that the results are restricted to that domain, as in the sketch below.
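A sketch of such a filtered query, assuming the Nutch schema's site field and a core named nutch (placeholder):
curl "http://localhost:8983/solr/nutch/select?q=some+keywords&fq=site:example.com"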
NOTE: If you want to restrict the crawl to the injected domains only, make sure you set db.ignore.external.links to true in nutch-site.xml so that Nutch does not follow external links.
Hope that answers your question.