Nutch and Solr for multiple domains - solr

I want to create a custom search engine for multiple domains.
How can I use Solr with Nutch to create a custom search for 500+ domains, where a search on each domain shows only that domain's own data?
For example, given example.com, example2.com, example3.com, and so on: whenever a user searches on example.com, they should get only data that belongs to example.com, and likewise for example2.com and the rest.
These websites may be blogs, e-commerce sites, classified sites, or hotel reservation sites.
Any suggestions would be appreciated.

This should be possible right out of the box. When you index to Solr using the Nutch schema, it has a field called site that stores the domain. On the search interface (which you will build), when a domain (aka site) is selected you just have to pass a filter query like "site:domain" so that the results are restricted to the domain searched.
NOTE: If you want to restrict crawls to the injected domains only, make sure Nutch does not follow external links, i.e. set the db.ignore.external.links property to true.
Hope that answers your question.
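As a rough illustration of the filter query described above, here is a minimal sketch that builds the Solr request URL. The endpoint (host, port, and a core named nutch) and the function name build_site_query are assumptions made for this example, not something from the question; it also assumes the domain comes from a trusted server-side list rather than from user input.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, port, and core name to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/nutch/select"


def build_site_query(user_query: str, domain: str, rows: int = 10) -> str:
    """Return a Solr select URL whose results are limited to one domain."""
    params = {
        "q": user_query,
        # Filter query on the `site` field from the Nutch schema.
        # The quotes keep the domain as a single phrase.
        "fq": f'site:"{domain}"',
        "rows": rows,
        "wt": "json",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # A search issued from example.com only sees example.com documents.
    print(build_site_query("cheap hotels", "example.com"))
```

Because fq is separate from q, the domain restriction does not influence relevance scoring, and Solr can cache the filter per domain.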

Related

Retrieve all domains information Google Apps

I would like to retrieve information about all the domains that have been added to the Domains section of the Admin panel.
In short, I need an API that returns the information displayed here: https://admin.google.com/AdminHome?fral=1#Domains:
I've already gone through the Admin SDK's Directory API as well as the Admin Settings API, but I can't find anything there.
I have been searching for a long time but haven't found anything useful. There are two questions on Stack Overflow as well, but they don't answer this.
I don't believe this information is available today via the APIs in the Admin SDK.

How to search for key words using Solr from crawled web pages by nutch?

I have an application that crawls websites using Apache Nutch 2.1 and persists the data to MySQL. I have to integrate Nutch and Solr, which is not a problem since enough documentation is available on the internet.
After storing content from web pages, I want to add search functionality based on Solr. I need to search for keywords in the web pages. For example, if I am crawling movie-related websites and want to search the crawled data for a specific movie (as a keyword), what changes do I need to make to the Solr configuration? Do I need to write a separate plugin altogether, or can I use existing plugins? What type of indexing do I have to add to the Solr configuration?

security threats in exposing solr query to customers in live website

I am going to build a website and I am planning to use Solr for search integration. It is an e-commerce website. I wanted to know if there is any problem in exposing the Solr query format to the users of this website.
You want to have your search app query Solr, or use a proxy, so that the Solr URL is not exposed to the web user. I'm not so concerned about the query syntax and parameters being visible, as long as the web user can't submit them in a query themselves. You certainly want to make sure that only the web app can reach the Solr server, however.
Even if you lock down the RequestHandlers so that only searches are available through the web, there still may be things in your index that you don't want to expose to customers.
For example, if two items score the same in a search, you'd like to boost the one with the higher margin. In order to do that you need to have the margin in your index, and that means it's available for all of your customers and competitors to see.
The JSON response writer is very handy for writing lightweight search apps. At the very least you'll want to implement a filtering proxy between the browser and Solr.
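One piece of such a filtering proxy could look like the sketch below, which drops internal fields (such as the margin) from Solr's JSON response before it reaches the browser. The field names in PUBLIC_FIELDS and the function name are hypothetical; only the response/docs/numFound layout comes from Solr's JSON response writer.

```python
# Hypothetical whitelist of fields that are safe to show to customers.
PUBLIC_FIELDS = {"id", "title", "url", "price"}


def strip_private_fields(solr_response: dict) -> dict:
    """Keep only whitelisted fields from each document in a Solr JSON response."""
    body = solr_response.get("response", {})
    public_docs = [
        {name: value for name, value in doc.items() if name in PUBLIC_FIELDS}
        for doc in body.get("docs", [])
    ]
    return {"numFound": body.get("numFound", 0), "docs": public_docs}


if __name__ == "__main__":
    raw = {
        "responseHeader": {"status": 0},
        "response": {
            "numFound": 1,
            "docs": [{"id": "42", "title": "Blue kettle", "margin": 0.35}],
        },
    }
    # The margin field is used for boosting inside Solr but never leaves the proxy.
    print(strip_private_fields(raw))
```

A whitelist is used rather than a blacklist so that fields added to the index later stay private until someone deliberately exposes them.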
You absolutely do not want to expose Solr directly to users. Nor do you want to pass the format through without evaluation.
One of the things that Solr supports is delete by query. There are other possibilities as well. You have to sanitize the content of queries.
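A minimal sketch of that kind of sanitizing, assuming the web app rebuilds the Solr parameters from scratch instead of forwarding whatever the browser sent. The function names, the default of 10 rows, and the cap of 50 rows are assumptions for illustration; the escaped characters are the ones the Lucene query parser treats as special.

```python
import re

# Characters with special meaning in the Lucene/Solr query syntax.
_SPECIAL_CHARS = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_query_text(text: str) -> str:
    """Backslash-escape query syntax so user input is treated as plain terms."""
    return _SPECIAL_CHARS.sub(r"\\\1", text)


def sanitize_params(raw: dict) -> dict:
    """Build Solr parameters from user input, ignoring everything unexpected."""
    clean = {"q": escape_query_text(str(raw.get("q", ""))), "wt": "json"}
    try:
        rows = int(raw.get("rows", 10))
    except (TypeError, ValueError):
        rows = 10
    # Clamp the page size so a client cannot request the whole index at once.
    clean["rows"] = max(1, min(rows, 50))
    return clean


if __name__ == "__main__":
    # Unknown parameters such as stream.body are never copied to the Solr request.
    print(sanitize_params({"q": "id:*", "rows": "9999", "stream.body": "<delete/>"}))
```

Only q and rows are ever read from the incoming request, so any other parameter a client adds simply has no effect.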

submitting sitemap using google-appengine custom domain

I submitted a sitemap using my custom domain name instead of my appspot domain. It has been a week and my site has not been indexed yet. I have over 2000 pages, so I am not sure if this is normal or not. I wanted to make sure that I was not supposed to submit my appspot domain instead. Does anybody know? Thank you.
You can use any domain you want - Google Search is distinct from Google App Engine. All you can do is wait to be indexed.

App Engine - Subdomain

I have deployed an application on Google App Engine and I want to link a subdomain to that application.
I currently have a domain that is linked to a "live" site. From the Google documentation I understand that I need to set up my domain with Google Apps:
To serve your app on a custom domain, the domain must be set up with Google Apps
(Source)
What exactly does that mean?
I've looked in the Google documentation and couldn't get a clear idea...
Will that affect my "live" site in some way?
Just to clarify: www.mydomain.com points to a site that I own, and I want sub.mydomain.com to point to my Google application.
You need to create a CNAME record that forwards to your app address.
Let's say your app address is https://yourapp.appspot.com and you want sub.mydomain.com to forward to it. Do the following:
Please read THIS first and follow the steps up to step 5. You'll need to type mydomain.com in step 3 and sub in step 5. After that, you'll see some steps on how to change the CNAME record; just follow them:
Host name: sub
Type: CNAME
IP address/host name: ghs.google.com.
Priority status: (any value, as long as it is a number)
OK, now you can visit your app at http://sub.mydomain.com. Different hosting providers take different amounts of time for the change to take effect. :)
BTW, it will not affect your "live" site in any way, since your main site uses mydomain.com and you only need sub.mydomain.com. What the GAE documentation says is that if you want to point mydomain.com itself to your app, you need to set an A record instead of a CNAME record at your host. That method involves more steps, which are described in the GAE doc you found, and it would affect your live site.
This means you have to register your domain with Google Apps here: https://www.google.com/a/cpanel/domain/new
You don't have to have your main website hosted on Google. Exactly how you arrange things is determined by how you configure your DNS, which you retain control of. The same goes for email: you can have it delivered to Google Apps or not, depending on your DNS MX records.
You need to verify your ownership using Webmasters by adding a TXT record to your DNS records. After that, the domain will appear in the list of domains under App Engine > Settings > Custom domains.
