Connecting to SolrCloud with SolrJ and ZooKeeper with ACLs

I'm trying to connect an application using SolrJ to a SolrCloud cluster via ZooKeeper which is secured using ACLs.
No problem, you say? Right, we could just use the relevant Java system properties (e.g. -DzkACLProvider etc.) and be done with it. Unfortunately, we do not want to set these options as system properties because they would also contain the password. Our preferred way would be to provide these values via environment variables. Looking into the SolrJ code, this looks impossible without implementing some wrappers/extensions of SolrJ classes.
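For illustration, such a wrapper might look roughly like this (a sketch against the ZkCredentialsProvider API as it exists in SolrJ 7.x/8.x; the ZK_ACL_USERNAME and ZK_ACL_PASSWORD variable names are made up):

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.Collection;

    import org.apache.solr.common.cloud.DefaultZkCredentialsProvider;

    // Supplies ZK digest credentials from environment variables instead of
    // the -DzkDigestUsername/-DzkDigestPassword system properties, so the
    // password never appears on the command line.
    public class EnvZkCredentialsProvider extends DefaultZkCredentialsProvider {

      @Override
      protected Collection<ZkCredentials> createCredentials() {
        Collection<ZkCredentials> credentials = new ArrayList<>();
        String user = System.getenv("ZK_ACL_USERNAME");   // made-up names
        String pass = System.getenv("ZK_ACL_PASSWORD");
        if (user != null && pass != null) {
          credentials.add(new ZkCredentials("digest",
              (user + ":" + pass).getBytes(StandardCharsets.UTF_8)));
        }
        return credentials;
      }
    }

The class would then be selected with -DzkCredentialsProvider=<fully.qualified.EnvZkCredentialsProvider>, which exposes only a class name on the command line, not the password.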
I've gotten to the point where I think maybe we're just doing it wrong, especially because I can't find any other questions about this. Is this even the right way to do it? Should we even connect via ZooKeeper? Or would you normally just connect to a load-balanced SolrCloud HTTP endpoint and not connect to ZooKeeper at all?
I'm thankful for any input!

Related

Change "Solr Cluster" in Lucidworks Fusion 4

I am running Fusion 4.2.4 with external Zookeeper (3.5.6) and Solr (7.7.2). I have been running a local set of servers and am trying to move to AWS instances. All of the configuration from my local Zookeepers has been duplicated to the AWS instances so they should be functionally equivalent.
I am to the point where I want to shut down the old (local) Zookeeper instances and just use the ones running in AWS. I have changed the configuration for Solr and Fusion (fusion.properties) so that they only use the AWS instances.
The problem I have is that Fusion's Solr cluster (System->Solr Clusters) associated with all of my collections is still set to the old Zookeepers :9983,:9983,:9983, so if I turn off all of the old instances of Zookeeper, my queries through Fusion's Query API no longer work. When I try to change the "Connect String" for that cluster, it fails because the cluster is currently in use by collections. I am able to create a new cluster, but there is no way that I can see to associate the new cluster with any of my collections.
In a test environment set up like production, I changed the searchClusterId for a specific collection using Fusion's Collections API; however, after doing so, the queries still fail when I turn off all of the "old" Zookeeper instances. It seems like this is the way to go, so I'm surprised that it doesn't seem to work.
So far, Lucidworks's support has not been able to provide a solution - I am open to suggestions.
This is what I came up with to solve this problem.
I created a test environment with an AWS Fusion UI/API/etc., local Solr, AWS Solr, local ZK, and AWS ZK.
1. Configure Fusion and Solr to use only the AWS ZK
2. Configure the two ZKs to be an ensemble
3. Create a new Solr Cluster in Fusion containing only the AWS ZK
4. For each collection in Solr (steps a-c are sketched in code below):
a. GET the JSON from <fusion_url>:8764/api/collections/<collection>
b. Edit the JSON to change "searchClusterId" to the new cluster defined in Fusion
c. PUT the new JSON to <fusion_url>:8764/api/collections/<collection>
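For reference, steps a-c might look roughly like this (a sketch only: the host, credentials, and "aws-cluster" ID are placeholders, and a real script should use a proper JSON library instead of a regex):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Base64;

    // Re-points one Fusion collection at a new Solr cluster via the
    // Collections API endpoints cited in the steps above.
    public class SwitchSearchCluster {
      public static void main(String[] args) throws Exception {
        String url = "http://fusion-host:8764/api/collections/my_collection";
        String auth = "Basic " + Base64.getEncoder()
            .encodeToString("admin:password".getBytes());
        HttpClient client = HttpClient.newHttpClient();

        // a. GET the collection's JSON definition
        HttpRequest get = HttpRequest.newBuilder(URI.create(url))
            .header("Authorization", auth).GET().build();
        String json = client.send(get, HttpResponse.BodyHandlers.ofString()).body();

        // b. Change searchClusterId to the new cluster (regex for brevity)
        String edited = json.replaceFirst(
            "\"searchClusterId\"\\s*:\\s*\"[^\"]*\"",
            "\"searchClusterId\":\"aws-cluster\"");

        // c. PUT the edited definition back
        HttpRequest put = HttpRequest.newBuilder(URI.create(url))
            .header("Authorization", auth)
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(edited)).build();
        System.out.println("PUT status: "
            + client.send(put, HttpResponse.BodyHandlers.ofString()).statusCode());
      }
    }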
After doing all of this, I was able to change the “default” Solr cluster in the Fusion Admin UI to confirm that no collections were using it (I wasn’t sure if anything would use the ‘default’ cluster so I thought it would be wise to not take the chance).
I was able to then stop the local ZK, put the AWS ZK in standalone mode, and restart Fusion. Everything seems to have started without issues.
I am not sure that this is the best way to do it, but it solved the problem as far as I could determine.

Need some information about Crawl Anywhere and Solr

I have gone through the Crawl Anywhere documentation, but I am very confused about its installation steps.
What I understood is that Apache is optional. But do we need an independent Tomcat instance for the crawler? Because from what I saw in the folder structure, there is already a tomcat folder present, and the war file is there as well.
Also, do we need an independent instance of Apache Solr?
If we want to add a PostgreSQL database to crawl, how can we do that?
Please provide some links as well, so that I can go through them and clarify any doubts I have in my mind.
Apache is needed to use the admin interface, and Tomcat is needed for some interactive features; you can crawl without either of them.
No, you do not need an independent Solr instance.
MySQL and MongoDB are supported. The code is open source, so you can add PostgreSQL support yourself.
Try Google Groups for other questions.

Which files are necessary to run Solr on a remote host, and how should security (permissions, etc.) be configured?

I have a MySQL database at localhost. I use Solr to index it and run queries, and it works fine.
Now I want to put the Solr index on a remote host. I know that I must have permission to run Java and have SSH access. But I must admit that, even though I can make Solr work, I don't understand very well each of the files that make up Solr.
So what I want to know is: which files are strictly necessary for the Solr index to serve queries? The data files, yes, but what else? And how should I configure the permissions? Read-only? Execute? And I guess there are other security items I must pay attention to.
Usually you have to install at least a web container (e.g. Tomcat or Jetty) and deploy the solr-[version].war file onto it.
For security reasons you can restrict access to the server in the configuration of this web container.
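With Tomcat, for example, that can be a security constraint in the webapp's web.xml (a minimal sketch; the role name is illustrative, and users/roles still have to be defined in the container's realm):

    <!-- Require BASIC auth for everything under the Solr context -->
    <security-constraint>
      <web-resource-collection>
        <web-resource-name>Solr</web-resource-name>
        <url-pattern>/*</url-pattern>
      </web-resource-collection>
      <auth-constraint>
        <role-name>solr-user</role-name> <!-- illustrative role name -->
      </auth-constraint>
    </security-constraint>
    <login-config>
      <auth-method>BASIC</auth-method>
    </login-config>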
Besides that, Solr needs a home directory in which it stores its index and configuration. I think this must have read/write permissions, since Solr changes the index on import and maintains index-usage information.

Solr: replication options

I've got a Solr instance running behind a firewall. I'm about to put up another instance which will not be firewalled. However, Solr appears to only support pull replication, not push replication.
What are my options with regard to maintaining the same level of security? I'd rather not open too many ports in the firewall. Would HTTP over an SSH tunnel be the best option? Would it also be possible to just replicate the index files using plain old rsync (not using any Solr-specific features), or would this break something?
"Would it also be possible to just replicate the index files using plain old rsync?"
Solr actually supports this kind of distribution with its snappuller mechanism, documented here: http://wiki.apache.org/solr/CollectionDistribution
I would open a port and specify the IP address of the slave, and just use ordinary HTTP-based replication; that would be quite secure, I think, and easier to maintain probably. I know it's not exactly where you were angling, but it's what I'd recommend.
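For illustration, such an HTTP-based setup is the stock replication handler in each instance's solrconfig.xml, along these lines (hostnames, ports, and the poll interval are placeholders):

    <!-- On the master (the firewalled instance): -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

    <!-- On the slave, polling the master over HTTP: -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master-host:8080/solr/replication</str>
        <str name="pollInterval">00:05:00</str>
      </lst>
    </requestHandler>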
I'm answering my own question, as the solution I went for is different from what the two other answers suggested. I ended up using an SSH tunnel for the HTTP traffic: I used SSH to forward port 8080 on hostA to port 8080 on hostB through the tunnel.
The solution appears to be working fine. I'm using a script which validates the tunnel every 5 minutes or so.
You could use HTTP basic authentication (see https://wiki.apache.org/solr/SolrReplication#Slave) but since the password will be passed in plain text, an SSH tunnel or secure VPN would also be required in order to deter more determined attackers.
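Per that wiki page, the slave supplies the credentials in its replication handler configuration, roughly like this (placeholder values):

    <lst name="slave">
      <str name="masterUrl">http://master-host:8080/solr/replication</str>
      <!-- sent as HTTP basic auth on every poll; plain text on the wire -->
      <str name="httpBasicAuthUser">repl-user</str>
      <str name="httpBasicAuthPassword">repl-password</str>
    </lst>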
I'll be going for a VPN solution to start with and consider an SSH tunnel before moving to production if we feel we are unable to place sufficient trust in our internal networks.

Running Solr in read-only mode

I think I'm missing something obvious here. I have to imagine a lot of people open up their Solr servers to other developers and don't want them to be able to modify the index.
Is there something in solrconfig.xml that can be set to effectively make the index read-only?
Update for clarification:
My goal is to use Solr with an existing Lucene index managed by another application. This works just fine, but I want to be sure Solr never tries to write to this index.
Exposing a Solr instance to the public internet is a bad idea. Even though you can strip some components to make it read-only, it just wasn't designed with security in mind; it's meant to be used as an internal service, just as you wouldn't expose an RDBMS.
From the Solr Security wiki page:
First and foremost, Solr does not concern itself with security either at the document level or the communication level. It is strongly recommended that the application server containing Solr be firewalled such that the only clients with access to Solr are your own. A default/example installation of Solr allows any client with access to it to add, update, and delete documents (and of course search/read too), including access to the Solr configuration and schema files and the administrative user interface.
Even ajax-solr, a Solr client for javascript meant to run in a browser, recommends talking to Solr through a proxy.
Take for example guardian.co.uk: it's well-known that they use Solr for searching, but they built an API to let others access their content. This way they can define and control exactly what and how they want people to search for things.
Otherwise, any script kiddie can write a trivial loop to DoS your Solr instance and therefore bring down your site.
You can probably just remove the line that defines your solr.XmlUpdateRequestHandler in solrconfig.xml.
Replication is a nice way to set up a read-only instance while still being able to index. Just set up a master with restricted access and a slave that is read-only (by removing the XmlUpdateRequestHandler from its config). The slave will replicate from the master but won't accept any direct indexing.
UPDATE
I just read that in Solr 1.4 you can disable a component. I just tried it on the /update requestHandler and I was no longer able to index.
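For reference, disabling it would look something like this in solrconfig.xml (a sketch; the enable attribute keeps the handler defined but switched off):

    <!-- Read-only instance: the update handler stays defined but disabled
         (the enable attribute is available from Solr 1.4 on, as noted above) -->
    <requestHandler name="/update" class="solr.XmlUpdateRequestHandler"
                    enable="false"/>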
