Apache Solr: Can Apache Solr be used as a third-party system for indexing and searching documents from different websites?

I am working on implementing a research web application, or portal, that integrates different research portals and websites using an open-source platform called search kit. The web application will act as a central point of access to research publications on different research portals. To do this, I also need to implement a third-party system that does the following:
Searches for documents on the other research portals based on the user's query and presents the results to users on my web application.
Indexes the documents.
Can be used by system administrators to configure the web application, so that they can add, remove, or modify the URLs of the websites Solr pulls documents from.
Displays the results to the user in one standard format.
My question is: can Apache Solr be used to implement this third-party system? If not, what open-source platform or approach would you recommend I use instead?

In general, Solr seems like a good fit here, but you might need some custom code (apart from configuration) here and there. To go through the points:
Querying is one of the main features of Solr, so this is definitely possible.
Indexing is handled by Solr.
There was a component for Solr called the "Data Import Handler" that supported indexing from URLs (see the docs). However, it was removed from the main Solr distribution and moved to a separate package. That package doesn't seem to be actively maintained, so you will probably run into problems if you decide to use it. The alternative is to develop the document-pulling code yourself (a rough sketch follows this list).
Solr can return results in multiple formats, but it still might not support the exact format you would like. In that case, you need to build your own transformation of the Solr response.
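For the document-pulling and querying points, here is a minimal sketch, assuming SolrJ, a Solr core named "publications" at localhost:8983, and a placeholder portal URL; the field names are examples only, not a prescribed schema.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class PortalIndexer {
        public static void main(String[] args) throws Exception {
            // Core name and portal URL below are placeholders for illustration.
            SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/publications").build();

            // 1. Pull a document from one of the research portals.
            HttpClient http = HttpClient.newHttpClient();
            String portalUrl = "https://example-portal.org/paper/123";
            String body = http.send(HttpRequest.newBuilder(URI.create(portalUrl)).build(),
                                     HttpResponse.BodyHandlers.ofString()).body();

            // 2. Index it (a real implementation would extract title, authors, etc.).
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", portalUrl);
            doc.addField("url", portalUrl);
            doc.addField("content", body);
            solr.add(doc);
            solr.commit();

            // 3. Query it back, as the web application would for a user search.
            QueryResponse response = solr.query(new SolrQuery("content:research"));
            response.getResults().forEach(r -> System.out.println(r.getFieldValue("url")));

            solr.close();
        }
    }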

Related

Choose Lucene or Solr

We need to integrate a search engine into our platform, a catalog management system in SharePoint. The information is stored in multiple databases and in a file store (doc, ppt, pdf, ...). Our dev platform is ASP.NET and we have done some preliminary work with Lucene and found it to be good. However, we just came to know of Solr.
We want to continue using Lucene, but we need to defend that choice against Solr.
Any help is appreciated.
And sorry for my English.
Lucene is a full-text search library used to add search functionality to an application; it can't be used as an application by itself. Solr is a complete search engine built around Lucene, providing Lucene's search functionality and more. Solr is a web application that can be used on its own without any development around it.
If you need a search engine that is called by your application, I recommend you use Solr.
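To make the difference concrete, below is a minimal sketch of what using Lucene directly looks like (shown with the Lucene Java API for illustration; in an ASP.NET project you would use Lucene.NET, whose classes are very similar). The index directory, analyzer, and commit handling are all your application's responsibility, whereas with Solr they live inside the server and your application only sends it HTTP requests.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;

    public class EmbeddedLuceneExample {
        public static void main(String[] args) throws Exception {
            // Your application owns the index directory, the analyzer, and the commit cycle.
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/tmp/catalog-index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new TextField("title", "Catalog item", Field.Store.YES));
                writer.addDocument(doc);
                writer.commit();
            }
            // With Solr, all of the above runs inside the Solr server; your application
            // only issues HTTP (or SolrJ) requests such as /solr/<core>/select?q=title:catalog
        }
    }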

How to integrate Solr with a web application

After reading many Solr books and articles all over the net, I now have an idea of the power of this server.
But... how do I integrate it into a real application? For example, a web site written in PHP.
Right now, I understand that Solr produces results in XML, JSON, etc., so to integrate it into a web application, the "simple" approach is to convert that output for rendering in a page. Or is there another technique that avoids this?
In my case, I have to develop a search engine that scans many documents and finds results.
My idea was:
Use Solr to build an index and search documents
Use a web application to show the result
Looking around the net, I haven't found anything that explains how to integrate Solr into a real application; all the reading is about "how to use Solr... with Solr..." Nothing about a real integration.
Does someone have some useful resources on how to integrate Solr into a real application, with some clean examples?
Edit: It looks like Apache maintains their own list of recommended client APIs, and their recommended tool for PHP is Google's library (though they refer to it as SolPHP). Given this, I imagine that this is the best place to start.
A Solr library for the programming language you're using could save you some of the trouble of implementing the integration. For instance, if your site is written in PHP, you could try Google's Solr library for PHP.
I have done most of my Solr work in Java, so I have used SolrJ quite a bit. It is a well-supported tool because it comes from Apache, in parallel with Solr itself.
If you are working in other languages, you are likely to find libraries for them as well. How much time they save you will vary with the quality of the library.
When I was using Solr in my project, only my application server (Tomcat) communicated with the Solr server. I wrote a class that executes GET requests to the Solr server based on the input provided by the end user. When Solr returns XML/JSON to the application server, you can parse it and process it like any other business data (render an *.html). So, summing up, the web browser never communicates directly with Solr; everything goes through the application server:
WebBrowser -> GET to application server -> GET to Solr server
show *.html <- parse XML/JSON, render *.html <- return XML/JSON
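A rough sketch of that application-server class is shown below (in Java here; a PHP version would be a direct translation), assuming a core named "documents" on localhost:8983 and that you want the response as JSON.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class SearchController {

        // Called with the end user's query string; returns Solr's raw JSON response.
        static String search(String userQuery) throws Exception {
            String url = "http://localhost:8983/solr/documents/select?wt=json&q="
                    + URLEncoder.encode(userQuery, StandardCharsets.UTF_8);
            HttpResponse<String> response = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
            // Parse the JSON here and render it into your page template (*.html);
            // this sketch just hands the raw body back for brevity.
            return response.body();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(search("solr integration"));
        }
    }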

Is it possible to modify a schema using the REST API in Apache Solr?

I think the title is self-explanatory.
I don't see anything on the Apache Solr wiki that suggests you can maintain the schema of an Apache Solr instance using the REST API, but maybe (hopefully) you know something I don't.
I just found a section on the Solr wiki that describes this exact feature for release 4.4 (which is not released yet).
It does require some prerequisite configuration on the Solr instance, but it does allow you to add fields to the schema. Based on that information, I can't see why they wouldn't eventually extend the functionality to allow deleting fields as well. I guess we will have to wait and see.
Here is the link to that section: http://wiki.apache.org/solr/SchemaRESTAPI#Adding_fields_to_a_schema. It also references this JIRA issue: "In preparation for dynamic schema modification via REST API, add a "managed" schema facility".
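For what it's worth, the feature did land in later Solr releases as the Schema API, where adding a field is a JSON POST to a core's /schema endpoint (it requires the managed schema to be enabled). A minimal sketch, assuming a core named "collection1" and an example field definition:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class AddSchemaField {
        public static void main(String[] args) throws Exception {
            // Example field definition; the field name and type are placeholders.
            String payload = "{\"add-field\": {\"name\": \"keywords\", \"type\": \"text_general\", \"stored\": true}}";
            HttpRequest request = HttpRequest.newBuilder(
                            URI.create("http://localhost:8983/solr/collection1/schema"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(payload))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }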

Updating Solr from a Lucene index

I'm currently working on a web archiving project. Basically, what we try to do is archive a collection of websites (using the Heritrix crawler) and provide access to the archived contents through a web interface.
We also offer full-text search throughout the archives. Currently, the index is generated using NutchWAX (a customised version of Apache Nutch, tailored to index .warc files as generated by Heritrix). NutchWAX dumps out a Lucene index, and to use it in Solr, all that has to be done is to generate a correct schema.
This is all done and it's running like it should; however, the archive is not static, and new .warc files are generated periodically.
What I can do now is generate a new index, merge it with the existing one, and import it back into Solr. However, to do that, Solr has to be restarted.
It would be great if the index could be updated "on the fly", as is usually the case when updating the index via HTTP requests.
Does anyone have an idea how this can be done? My first thought was to generate .xml files from the Lucene index and post them to Solr. Is this worth a try, or are there more elegant solutions?
You could probably leverage multiple cores to accomplish what you need. See the Solr Wiki - CoreAdmin for more details. I think you could use the MergeIndexes capability, or swap cores, for a better experience in your scenario.
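A minimal sketch of the two CoreAdmin calls mentioned above, assuming example core names "archive" and "archive_new" and a freshly generated index at /data/new-index; both are plain HTTP GETs against the CoreAdmin handler, so curl would work just as well.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CoreAdminExample {
        static String get(String url) throws Exception {
            return HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString()).body();
        }

        public static void main(String[] args) throws Exception {
            // Option 1: merge the newly generated index directory into the live core.
            System.out.println(get("http://localhost:8983/solr/admin/cores"
                    + "?action=mergeindexes&core=archive&indexDir=/data/new-index"));

            // Option 2: build a second core offline, then swap it with the live one.
            System.out.println(get("http://localhost:8983/solr/admin/cores"
                    + "?action=SWAP&core=archive&other=archive_new"));
        }
    }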

Running Solr in read-only mode

I think I'm missing something obvious here. I have to imagine a lot of people open up their Solr servers to other developers and don't want them to be able to modify the index.
Is there something in solrconfig.xml that can be set to effectively make the index read-only?
Update for clarification:
My goal is to use Solr with an existing Lucene index managed by another application. This works just fine, but I want to be sure Solr never tries to write to this index.
Exposing a Solr instance to the public internet is a bad idea. Even though you can strip out some components to make it read-only, it just wasn't designed with security in mind; it's meant to be used as an internal service, just like you wouldn't expose an RDBMS.
From the Solr Security wiki page:
First and foremost, Solr does not concern itself with security either at the document level or the communication level. It is strongly recommended that the application server containing Solr be firewalled such that the only clients with access to Solr are your own. A default/example installation of Solr allows any client with access to it to add, update, and delete documents (and of course search/read too), including access to the Solr configuration and schema files and the administrative user interface.
Even ajax-solr, a Solr client for JavaScript meant to run in a browser, recommends talking to Solr through a proxy.
Take guardian.co.uk for example: it's well known that they use Solr for searching, but they built an API to let others access their content. This way they can define and control exactly what people can search for and how.
Otherwise, any script kiddie can write a trivial loop to DoS your Solr instance and therefore bring down your site.
You can probably just remove the line that defines your solr.XmlUpdateRequestHandler in solrconfig.xml.
Replication is a nice way to set up a read-only instance while still being able to index. Just set up a master with restricted access and a slave that is read-only (by removing the XmlUpdateRequestHandler from its config). The slave will replicate from the master but won't accept any indexing directly.
UPDATE
I just read that in Solr 1.4 you can disable a component. I just tried it on the /update requestHandler and I was no longer able to index.
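For illustration, a minimal sketch of what the read-only (slave) side of solrconfig.xml could look like, combining the two ideas above; the master URL and poll interval are placeholders, and the enable attribute is simply a way to switch a handler off without deleting its definition.

    <!-- Pull the index from the indexing master periodically (placeholder URL and interval). -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://indexing-master:8983/solr/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>

    <!-- Keep the update handler defined but switched off, so this instance stays read-only. -->
    <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" enable="false" />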

Resources