Solr healthcheck for >0 documents

The default /admin/ping handler that Solr provides for load balancer health checks integrates well with Amazon ELB health checks.
However, since we're using master-slave replication, when we provision a new node Solr starts up and replication begins, but in the meantime /admin/ping returns success before the index has replicated across from the master and contains any documents.
We'd like nodes to be brought live only once they have completed their first replication and have documents. I don't see any way of doing this with the /admin/ping PingRequestHandler - it always returns success if the search succeeds, even with zero results.
Nor is there any way of matching (or not matching) expected text in the response with the ELB health check configuration.
How can we achieve this?

To expand on the nature of the problem here, the PingRequestHandler will always return success unless:
Its query results in an exception being thrown.
It is configured to use a healthcheck file, and that file is not found.
Thus my suggestion is that you configure the PingRequestHandler to use a healthcheck file. You can then use a cron job on your Solr system whose job is to check for the existence of documents and create (or remove) the healthcheck file accordingly. If the healthcheck file is not present, the PingRequestHandler returns an HTTP 503, which should be sufficient for ELB.
The rough algorithm that I'd use...
Every minute, query http://localhost:8983/solr/select?q=*:*
If numDocs > 0 then touch /path/to/solr-enabled
Else rm /path/to/solr-enabled (optional, depending on your strictness)
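A minimal sketch of that cron job in Python - the core URL and the healthcheck path are placeholders to adjust for your setup:

#!/usr/bin/env python3
# Cron job sketch: create the healthcheck file only once the index has documents.
import json
import os
import urllib.request

SOLR_QUERY = "http://localhost:8983/solr/select?q=*:*&rows=0&wt=json"  # assumed core URL
HEALTHCHECK_FILE = "/path/to/solr-enabled"

try:
    with urllib.request.urlopen(SOLR_QUERY, timeout=10) as resp:
        num_docs = json.load(resp)["response"]["numFound"]
except Exception:
    num_docs = 0  # treat any failure as "not ready yet"

if num_docs > 0:
    open(HEALTHCHECK_FILE, "a").close()  # touch: /admin/ping now reports healthy
elif os.path.exists(HEALTHCHECK_FILE):
    os.remove(HEALTHCHECK_FILE)  # optional, depending on your strictness

Scheduled from cron once a minute, this keeps the node out of the ELB rotation until the first replication has landed.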
The healthcheck file can be configured in the <admin> block, and you can use an absolute path, or a filename relative to the directory from which you have started Solr.
<admin>
  <defaultQuery>solr</defaultQuery>
  <pingQuery>q=*:*</pingQuery>
  <healthcheck type="file">/path/to/solr-enabled</healthcheck>
</admin>
Let me know how that works out! I'm tempted to implement something similar for read slaves at Websolr.

I ran into an interesting solution here: https://jobs.zalando.com/tech/blog/zookeeper-less-solr-architecture-aws/?gh_src=4n3gxh1
It's basically a servlet that you could add to the Solr webapp to check all of the cores and make sure they have documents.
I'm toying with a more sophisticated solution but haven't tested it/made much progress yet: https://gist.github.com/er1c/e261939629d2a279a6d74231ce2969cf
What I like about this approach (in theory) is the ability to check the replication status/success for multiple cores. If anyone finds an actual implementation of this approach please let me know!
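In the meantime, here's a rough sketch of the document-count half of that check as an external Python script against the CoreAdmin STATUS API (host and port are assumptions; a replication-status check would bolt on similarly):

#!/usr/bin/env python3
# Sketch: exit non-zero unless every core on this node has at least one document.
import json
import sys
import urllib.request

STATUS_URL = "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"  # assumed node

with urllib.request.urlopen(STATUS_URL, timeout=10) as resp:
    cores = json.load(resp)["status"]

empty = [name for name, info in cores.items()
         if info.get("index", {}).get("numDocs", 0) == 0]

if empty:
    print("Empty cores: " + ", ".join(empty))
    sys.exit(1)  # not ready; a wrapping servlet/handler could map this to an HTTP 503
print("All %d cores have documents" % len(cores))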

Related

Idempotency in a Camel application running in a Kubernetes cluster

I am using Apache Camel as the integration framework in my microservice, deployed as multiple pods in a Kubernetes cluster. I wrote a route that reads files from a directory and writes them to another. But I am facing an issue: the different pods are picking up the same file. I need to avoid that - I want only one pod to pick up and process a given file, but currently all the pods are picking up and processing it. Can someone help with this? Please suggest some examples available on GitHub or elsewhere.
Thanks in advance.
Camel recently introduced some interesting clustering capabilities - see here.
In your particular case, you could model a route that takes the leadership when it starts polling the directory, thereby preventing other nodes from picking up the (same or other) files.
Setting it up is very easy: all you need is to prefix singleton endpoints according to the master component syntax:
master:namespace:delegateUri
This would result in something like this:
from("master:mycluster:file://...")
.routeId("clustered-route")
.log("Clustered file polling !");

SolrCloud: Unable to Create Collection, Locking Issues

I have been trying to implement a SolrCloud, and everything works fine until I try to create a collection with 6 shards. My setup is as follows:
5 virtual servers, all running Ubuntu 14.04, hosted by a single company across different data centers
3 servers running ZooKeeper 3.4.6 for the ensemble
2 servers, each running Solr 5.1.0 server (Jetty)
The Solr instances each have a 2TB, ext4 secondary disk for the indexes, mounted at /solrData/Indexes. I set this value in solrconfig.xml via <dataDir>/solrData/Indexes</dataDir>, and uploaded it to the ZooKeeper ensemble. Note that these secondary disks are neither NAS nor NFS, which I know can cause problems. The solr user owns /solrData.
All the intra-server communication is via private IP, since all are hosted by the same company. I'm using iptables for firewall, and the ports are open and all the servers are communicating successfully. Config upload to ZooKeeper is successful, and I can see via the Solr admin interface that both nodes are available.
The trouble starts when I try to create a collection using the following command:
http://xxx.xxx.xxx.xxx:8983/solr/admin/collections?action=CREATE&name=coll1&maxShardsPerNode=6&router.name=implicit&shards=shard1,shard2,shard3,shard4,shard5,shard6&router.field=shard&async=4444
Via the Solr UI logging, I see that multiple index creation commands are issued simultaneously, like so:
6/25/2015, 7:55:45 AM WARN SolrCore [coll1_shard2_replica1] Solr index directory '/solrData/Indexes/index' doesn't exist. Creating new index...
6/25/2015, 7:55:45 AM WARN SolrCore [coll1_shard1_replica2] Solr index directory '/solrData/Indexes/index' doesn't exist. Creating new index...
Ultimately the task gets reported as complete, but in the log, I have locking errors:
Error creating core [coll1_shard2_replica1]: Lock obtain timed out: SimpleFSLock#/solrData/Indexes/index/write.lock
SolrIndexWriter was not closed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
Error closing IndexWriter
If I look at the cloud graph, maybe a couple of the shards will have been created, others are closed or recovering, and if I restart Solr, none of the cores can fire up.
Now, I know what you're going to say: follow this SO post and change solrconfig.xml locking settings to this:
<unlockOnStartup>true</unlockOnStartup>
<lockType>simple</lockType>
I did that, and it had no impact whatsoever. Hence the question. I'm about to have to release a single Solr instance into production, which I hate to do. Does anybody know how to fix this?
Based on the log entry you supplied, it looks like Solr may be creating the data (index) directory for EACH shard in the same folder.
Solr index directory '/solrData/Indexes/index' doesn't exist. Creating new index...
This message was shown for two different cores, yet it references the same location. What I usually do is change my Solr home to a different directory, under which all the collection "instance" directories are created. Then I manually edit core.properties for each shard to specify the location of its index data.
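For illustration - the core name and path here are hypothetical - a core.properties along these lines gives each shard its own data directory instead of the shared /solrData/Indexes/index:

name=coll1_shard1_replica2
dataDir=/solrData/Indexes/coll1_shard1_replica2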

Which files are necessary to run Solr on a remote host, and how should security (permissions, etc.) be configured?

I have a MySQL database at localhost. I use Solr to index it and run queries, and it works fine.
Now I want to put the Solr index on a remote host. I know that I need permission to run Java, and SSH access. But I must admit that, even though I can make Solr work, I don't understand very well each of the files that make up Solr.
So what I want to know is: which files are strictly necessary for the Solr index to serve queries? The data files, yes, but what else? And how should I configure the permissions - read-only? execute? And I guess there are other security items I must pay attention to.
Usually you have to install at least a web container (e.g. Tomcat or Jetty) and deploy the solr-[version].war file onto it.
For security reasons you can restrict access to the server's web interface in the configuration of this web container.
Beyond that, Solr needs a home directory in which it stores its index and configuration. I think this must have read-write permissions, since Solr changes the index on import and maintains index-usage information.
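As a rough sketch (names vary by Solr version), a minimal home directory looks something like this:

solr/                  <- Solr home; needs rw for the Solr user
  conf/
    solrconfig.xml
    schema.xml
  data/                <- the index; created and written by Solr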

Updating Solr from a Lucene index

I'm currently working on a web archiving project. Basically, what we try to do is archive a collection of websites (using the Heritrix crawler) and provide access to the archived contents through a web interface.
We also offer full-text search throughout the archives. Currently, the index is generated using NutchWAX (a customised version of Apache Nutch, tailored to index .warc files as generated by Heritrix). NutchWAX dumps out a Lucene index, and to use it in Solr all that has to be done is to generate a matching schema.
This is all done and it's running like it should; however, the archive is not static, and new .warc files are generated periodically.
What I can do now is generate a new index, merge it with the existing one, and import it back into Solr. However, to do that, Solr has to be restarted.
It would be great if the index could be updated "on the fly", as is usually the case when updating the index via HTTP requests.
Does anyone have an idea how this can be done? My first shot at this was generating .xml files out of the Lucene index and posting them to Solr. Is this worth a try, or are there more elegant solutions?
You could probably leverage multiple cores to accomplish what you need. See the Solr Wiki - CoreAdmin for more details. I think you could use the MergeIndexes capability, or the ability to swap cores, for a better experience in your scenario.
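For example (core names and paths here are assumed), both are plain CoreAdmin HTTP calls, so no restart is needed: build the new index offline, then either merge it into the live core or swap a freshly built core in:

http://localhost:8983/solr/admin/cores?action=MERGEINDEXES&core=live&indexDir=/path/to/new/index
http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=staging

After a swap, queries are served from the freshly built core while the old one stays around for rollback.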

Running Solr in read-only mode

I think I'm missing something obvious here. I have to imagine a lot of people open up their Solr servers to other developers and don't want them to be able to modify the index.
Is there something in solrconfig.xml that can be set to effectively make the index read-only?
Update for clarification:
My goal is to use Solr with an existing Lucene index managed by another application. This works just fine, but I want to be sure Solr never tries to write to this index.
Exposing a Solr instance to the public internet is a bad idea. Even though you can strip some components to make it read-only, it just wasn't designed with security in mind; it's meant to be used as an internal service, just as you wouldn't expose an RDBMS.
From the Solr Security wiki page:
First and foremost, Solr does not concern itself with security either at the document level or the communication level. It is strongly recommended that the application server containing Solr be firewalled such that the only clients with access to Solr are your own. A default/example installation of Solr allows any client with access to it to add, update, and delete documents (and of course search/read too), including access to the Solr configuration and schema files and the administrative user interface.
Even ajax-solr, a Solr client for JavaScript meant to run in a browser, recommends talking to Solr through a proxy.
Take for example guardian.co.uk: it's well-known that they use Solr for searching, but they built an API to let others access their content. This way they can define and control exactly what and how they want people to search for things.
Otherwise, any script kiddie can write a trivial loop to DoS your Solr instance and therefore bring down your site.
You can probably just remove the line that defines your solr.XmlUpdateRequestHandler in solrconfig.xml.
Replication is a nice way to set up a read-only index while still being able to index. Just set up a master with restricted access and a slave that is read-only (by removing the XmlUpdateRequestHandler from its config). The slave will replicate from the master but won't accept any indexing directly.
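As a sketch, that master/slave pair looks roughly like this in each node's solrconfig.xml (Solr 1.4-era ReplicationHandler; the master URL and poll interval are placeholders):

On the master:
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>

On the slave:
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>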
UPDATE
I just read that in Solr 1.4 you can disable a component. I just tried it on the /update request handler, and I was no longer able to index.
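For reference, the declarative way to do that is the enable attribute on the handler definition in solrconfig.xml - something like this, with the handler name and class matching your config:

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" enable="false" />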
