How to implement solr and solarium in cakePHP - cakephp

I have to integrate Solr search in one of my CakePHP project. I have installed Solr and Solarium using composer but could not find how to start searching using Solr. Is there any example code in CakePHP?

First thing you need to figure out is how to expose the Solarium API in your CakePHP application. Typically this means saving some third-party PHP files into the Vendor directory of your application (take peek here for more information).
Once you've done this, you have two options:
Interact with the Solarium API directly in your controller's actions.
Implement a Solarium datasource so you can use CakePHP's model constructs.
Option 1
This option is less consistent with how the developers of CakePHP would like you to do MVC and you will have to generate a fair bit of code each time you want to put something in Solr or query it (e.g. connect to the Solr database). If you have minimal interaction with your Solr database, then I would recommend going down this route. Perhaps you could wrap up your access in separate helper class or function so instead of this:
public function void myControllerAction() {
// create a client instance
$client = new Solarium\Client($config);
// get a select query instance
$query = $client->createQuery($client::QUERY_SELECT);
// this executes the query and returns the result
$resultset = $client->execute($query);
// expose the result set to your view
$this->set('records', $resultset);
}
you could have this:
public function void myControllerAction() {
$resultset = solarium_get_records();
// expose the result set to your view
$this->set('records', $resultset);
}
Option 2
This option is a bit more involved and requires you to write a Solarium datasource just like the developers have written for MySql and Postgres. This does require you to thoroughly understand the inner workings of CakePHP's model engine but by taking a look at how the other datasources work, it shouldn't be rocket science. Rest assured that if you did this and made your code open-source, other developers will love to use it in their own CakePHP applications!
The benefit of this approach is that you will successfully abstract your application from the specific database implementation. So if you decided you didn't fancy using Solr and preferred a different search engine, you could migrate your data, write a new datasource (if one didn't exist already) and you're all set.
This probably doesn't exactly answer your question but instead steer you in the right direction and highlight some aspects you should consider.

I have integrated Solr with Option 1. However option 2 is doable but due to some time restriction I have to choose option 1. We can directly include Solarium vendor and include its class in our controller wherever required and use solr's add/get queries.
There are basically 3 major steps: 1: Install Solr. 2: Install Solarium using composer 3: User your scripts within controller or component files to get results.
You can get complete reference and example codebase from here:
http://findnerd.com/list/view/Solr-integration-in-CakePHP-with-solarium-client/1946/
Thanks.

I've integrated Solr using Sam's Option 2 (as DataSource)
https://github.com/Highstrike/cakephp-solr-datasource
You can also find there instructions on how to use it with examples

Related

Apache Solr: Can apache solr be used as a third part system for indexing and searching for documents from different websites?

I am working on implementing a research web application or portal that integrates different research portal or website using an open source platform called search kit. The web application will act as a central point of access to research publications on different research portals. To do this, I also need to implement a third party system that does the following:
Searches for documents based on user query on the other different research portals and presents or displays the results to the users on my web application.
Index the documents
Should be used by system administrators to configure the web application. Whereby system administrators can add,remove or modify the URL of the website Solr is pulling documents from
Displays the results to the user in one standard format.
My question is, can apache solr be used to implement the third party system? if not, what open source platform or way would you recommend I used to implement the third party system?
In general, Solr seems like a good fit here, but you might need some custom code (apart from configuration) here and there. To go through the points:
Querying is one of the main features of Solr, so this is definitely possible.
Indexing is handled by Solr.
There was a component for Solr called "Data Import Handler" that supported indexing from URLs (see the docs). However, this was removed from the main Solr distribution, and was moved to a separate package. This package doesn't seem to be actively maintained though, so you will probably run into some problems if you decide to use it. The alternative is to develop your document-pulling code yourself.
Solr can display the results in multiple formats, but it still might not support the exact format you would like it to be. In this case, you need to build your transformation based on the result from Solr.

Solr luceneMatchVersion syntax

I have Solr 4.10 and I have collection on it with solorconfig.xml has the value for <luceneMatchVersion> as follows:
<luceneMatchVersion>4.7</luceneMatchVersion>
Is this correct? I saw other examples that has values such as LUCENE_35 What I need to know also, how could I express LUCENE_xx from my current Solr version?
You should use:
<luceneMatchVersion>4.10.4</luceneMatchVersion>
I recommend you to check your current solr version, in my case was 4.10.4.
if you are going to reindex, then both numbers should match. The only reason you might want to have them different, is if you had and index created with say Lucene 4.7, then you would have
<luceneMatchVersion>4.7</luceneMatchVersion>
Then, you upgrade lucene to 4.10.
Now, if among the changes in between 4.7 and 4.10 there are things that work differently regarding analysis (you get the same sentence analysed in both versions and get different output as a result), then, you might want to keep the version number at 4.7, otherwise some queries that contain affected terms might not work (as they were analysed at index time in a different way than at query time). You have to asses how critical that issue might be.
That is why the recommendation is to upgrade, change the setting to the current number, and reindex. This way you are sure to avoid any issue.
If anyone is using Drupal, the Search API Solr (search_api_solr) module has config templates by version in /sites/all/modules/search_api_solr/solr-conf/.
The template README.md states the following:
The solr-conf-templates directory contains config-set templates for
different Solr versions.
These are templates and are not to be used as config-sets!
To get a functional config-set you need to generate it via the Drupal
admin UI or with drush solr-gsc. See README.md in the module
directory for details.
The module's README.md lists these instructions:
Make sure you have Apache Solr started and accessible (i.e. via port 8983). You can start it without having a core configured at
this stage.
Visit Drupal configuration (/admin/config/search/search-api) and create a new Search API Server according to the search_api
documentation using "Solr" as Backend and the connector that
matches your setup. Input the correct core name (which you will
create at step 4, below).
Download the config.zip from the server's details page or by using drush solr-gsc with proper options, for example for a server named
"my_solr_server": drush solr-gsc my_solr_server config.zip 8.4.
Copy the config.zip to the Solr server and extract.
I generated a config file for 8.x, and it uses this:
<luceneMatchVersion>${solr.luceneMatchVersion:LUCENE_80}</luceneMatchVersion>

CakePHP 3 - Where is custom datasource?

I have a large number of CakePHP 2 web applications, many of them use remote data over the custom DataSource. I'm reading the documentation of CakePHP 3, but I can not find instructions for creating custom datesource.
My next project also requires a custom DataSource (ArangoDB), so I planned to build it with the new version of CakePHP.
Please do specify where and how to build a DataSource in CakePHP 3.
Thank You.
Just look at the code of any existing database driver and see how it is built? There is nothing in the book yet about how to create your own datasource.
ArangoDB seems to be yet another NoSQL DB, so take a look at how this Elastic Search datasource is done. By a quick look I think you can use it as a base for your implemention, they seem to be similar.
I was facing the same problem so I decided to write my own models in CakePHP 3.x. If you want you can use it from here. hope you like it.

How to integrate Solr with Web Application

After reading many Solr books and article all over on the net, now I have an idea of the power of this server.
But... how to integrate it in a real application? For example: a web site written in PHP, etc.
Right now, I understand that Solr produces XML, JSON etc results... so to integrate this in a web application, the "simple" work is to convert this information for render in a page or there are other technique to avoid this?
I'm my case, I have to develop a search engine to scan many documents and find result.
My idea was:
Use Solr to build an index and search documents
Use a web application to show the result
Looking on the net I haven't find anything that explains how to integrate Solr in a real application, all the reading are about "How to use Solr... with Solr..." Anything about a real integration.
Does someone have some useful resource how to integrate Solr in a real application, with some clean examples?
Edit: It looks like Apache maintains their own list of recommended
client APIs, and their recommended tool for PHP is Google's
library (though they refer to it as SolPHP). Given this, I imagine that this is the best place
to start.
A Solr library for the programming language you're using could save you some of the trouble in implementing the integration. For instance, if your site is written in PHP, you could try Google's Solr library for PHP.
I have done most of my Solr work in Java, so I have used SolrJ quite a bit. This is a well supported tool because it comes from Apache in parallel with the Solr product itself.
If you are doing work in any other languages, you are likely to find libraries available for them. The amount of time they save you may vary according to the quality of the library itself.
When I was using Solr in my project, only my application server (that is Tomcat) was communicating with Solr server. I wrote a class, which executes GET requests to Solr server based on input provided by end user. When Solr returns XML/JSON back to an application server you may parse it and process as every other bussiness data (render an *.html). So, summing up, Web Browser never communicates directly with Solr, all goes through an application server:
WebBrowser -> GET to application server -> GET to Solr server
show *.html <- parse XML/JSON, render *.html <- return XML/JSON

Running Solr in read-only mode

I think I'm missing something obvious here. I have to imagine a lot of people open up their Solr servers to other developers and don't want them to be able to modify the index.
Is there something in solrconfig.xml that can be set to effectively make the index read-only?
Update for clarification:
My goal is to use Solr with an existing Lucene index managed by another application. This works just fine, but I want to be sure Solr never tries to write to this index.
Exposing a Solr instance to the public internet is a bad idea. Even though you can strip some components to make it read-only, it just wasn't designed with security in mind, it's meant to be used as an internal service, just like you wouldn't expose a RDBMS.
From the Solr Security wiki page:
First and foremost, Solr does not
concern itself with security either at
the document level or the
communication level. It is strongly
recommended that the application
server containing Solr be firewalled
such the only clients with access to
Solr are your own. A default/example
installation of Solr allows any client
with access to it to add, update, and
delete documents (and of course
search/read too), including access to
the Solr configuration and schema
files and the administrative user
interface.
Even ajax-solr, a Solr client for javascript meant to run in a browser, recommends talking to Solr through a proxy.
Take for example guardian.co.uk: it's well-known that they use Solr for searching, but they built an API to let others access their content. This way they can define and control exactly what and how they want people to search for things.
Otherwise, any script kiddie can write a trivial loop to DoS your Solr instance and therefore bring down your site.
You can probably just remove the line that defines your solr.XmlUpdateRequestHandler in solrconfig.xml.
Replication is a nice way to setup read-only while being able to do indexation. Just setup a master with restricted access and a slave that is read-only (by removing your XmlUpdateRequestHandler from the config). The slave will be replicated from the master but won't accept any indexation directly.
UPDATE
I just read that in Solr 1.4, you can disable component. I just tried it on the /update requestHandler and I was not able to index anymore.

Resources