Solr - How do you get cores on different servers to have the same name when creating via HTTP

I have run the following via HTTP:
http://solr-uat.cambridgeassessment.org.uk/solr/admin/collections?action=create&name=ocr_education_and_learning_web8&numShards=1&maxShardsPerNode=8&replicationFactor=3&collection.configName=ocr_education_and_learning
and it created the collection, but the cores on each of the three servers have had a suffix appended to their names (e.g. ocr_education_and_learning_web8_shard1_replica1). I am integrating with SI4T, which seems to use the core name rather than the collection name, so the core names need to be the same across servers, but I can't find out how to do this.
Can anyone advise how best to do this?

As far as I know you can't do this. Core names must be unique; this naming scheme is internal to SolrCloud and is used to distinguish the individual indexes ('cores') that together make up the collection.
See this nice answer for more information.
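If the integration simply needs one stable name to talk to on every server, a possible workaround (untested here) is to address the collection rather than the individual cores: any SolrCloud node accepts requests at the collection name and routes them internally, so the per-node core names never appear in the URL. For example:
http://solr-uat.cambridgeassessment.org.uk/solr/ocr_education_and_learning_web8/select?q=*:*
The Collections API can also create an alias that behaves like a collection name (the alias name below is only an illustration):
http://solr-uat.cambridgeassessment.org.uk/solr/admin/collections?action=CREATEALIAS&name=ocr_web&collections=ocr_education_and_learning_web8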

Related

Apache Solr: Can Apache Solr be used as a third-party system for indexing and searching documents from different websites?

I am working on implementing a research web application, or portal, that integrates different research portals and websites using an open-source platform called search kit. The web application will act as a central point of access to research publications on different research portals. To do this, I also need to implement a third-party system that does the following:
Searches for documents on the other research portals based on the user's query and displays the results to the users on my web application.
Indexes the documents.
Can be used by system administrators to configure the web application, so that they can add, remove or modify the URLs of the websites Solr is pulling documents from.
Displays the results to the user in one standard format.
My question is: can Apache Solr be used to implement this third-party system? If not, what open-source platform or approach would you recommend I use?
In general, Solr seems like a good fit here, but you might need some custom code (apart from configuration) here and there. To go through the points:
Querying is one of the main features of Solr, so this is definitely possible.
Indexing is handled by Solr.
There was a component for Solr called "Data Import Handler" that supported indexing from URLs (see the docs). However, this was removed from the main Solr distribution, and was moved to a separate package. This package doesn't seem to be actively maintained though, so you will probably run into some problems if you decide to use it. The alternative is to develop your document-pulling code yourself.
Solr can return results in multiple formats (see the example below), but it still might not support the exact format you want. In that case you need to build your own transformation on top of the Solr response.
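For that last point, a quick sketch of what is built in (the collection name and XSLT file name here are just placeholders): the wt parameter selects a response writer, and the XSLT writer lets you ship a custom transformation with your config.
http://localhost:8983/solr/mycollection/select?q=education&wt=json
http://localhost:8983/solr/mycollection/select?q=education&wt=xslt&tr=custom-format.xsl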

How to Insert a document to a specific shard in Apache Solr Cloud Mode

Is there a way to add documents to a specific shard?
For example, documents of type A always get inserted into shard1 and documents of type B always go to shard2.
I have tried using a custom router, but it does not guarantee that different prefixes will route to different shards.
PS: I am on Solr 5 using cloud mode.
A caveat: I'm using SolrNet to access SolrCloud, and it doesn't integrate with ZooKeeper yet. For Java clients, this might be far easier.
Despite what I read here and here with regard to the CompositeId router, I could never get it to work. What @jay helped me figure out is a way to use "implicit" routing to achieve this. Create your collection like this (leaving out the numShards parameter):
http://localhost:8983/solr/admin/collections?action=CREATE&name=myCol&maxShardsPerNode=2&router.name=implicit&shards=shard1,shard2&router.field=shard
then add a field to your schema.xml named "shard" (matching the router.field parameter). You can then index to a specific shard simply by adding the shard field to the document being indexed and specifying the shard name. At query time, you can specify the shards to search -- more here (I was able to simply specify the shard name without a specific address).
I haven't tested this in production yet, but I have verified it using multiple VirtualBox instances, with ZooKeeper, HAProxy, and several Solr nodes, and it does exactly what I expected. Corrections and comments welcome.
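For reference, here is roughly what indexing and querying look like with that setup (the collection, shard, and field names follow the answer above; the document fields themselves are invented):
curl 'http://localhost:8983/solr/myCol/update?commit=true' -H 'Content-Type: application/json' -d '[{"id":"a1","type":"A","shard":"shard1"}]'
curl 'http://localhost:8983/solr/myCol/select?q=*:*&shards=shard1'
With the implicit router, the value in the shard field decides where the document lands, and the shards parameter restricts the query to that shard.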

Solr cloud distributed search on collections

Currently I have a ZooKeeper instance controlling replication on 3 physical servers. It is the ZooKeeper embedded in Solr. 1 shard, 1 collection.
I have a new requirement in which I will need a new static Solr instance (1 new collection, no replication), with the same schema as the previous collection. A copy of this instance will also be placed on the 3 physical servers mentioned above. A caveat is that I need to perform distributed searches across the 2 collections and have the results blended.
Thanks to javacreed I now know that sharding is not part of my solution. Previous questions and answers here and here.
In my current setup I run the following command on the server running ZooKeeper -
java -Dbootstrap_confdir=solr/myApp/conf -Dcollection.configName=myConfig -DzkRun -DnumShards=1 -jar start.jar
Am I correct in saying that this will not change, and that I will now also manually start the non-replicated collection? Do I really only need to change my search queries to include the 'collection' parameter? Something like -
http://localhost:8983/solr/collection1/select?collection=collection1,collection2
This example is from the Solr documentation. I am slightly confused as to whether it should be ...solr/collection1/select?... or ...solr/collection2/select?..., or whether it even matters.
Thanks
Thanks for your kind words, stewart. You can search it directly on Solr as
http://localhost:8983/solr/select?collection=collection1,collection2
There is no need to mention any collection path, since you are specifying the collections in the collection parameter.
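Either path also appears to work in practice: the core in the path only determines which node coordinates the request, while the collection parameter determines what is actually searched, so both of the following should return the same blended results. A hedged illustration:
http://localhost:8983/solr/collection1/select?q=*:*&collection=collection1,collection2
http://localhost:8983/solr/collection2/select?q=*:*&collection=collection1,collection2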

Apache Solr setup for two different projects

I just started using Apache Solr with its data import functionality on my project, following the steps in http://tudip.blogspot.in/2013/02/install-apache-solr-on-ubuntu.html.
Now I have to make two different instances of my project on the same server, with different databases but the same Solr configuration for both projects. How can I do that?
Can anyone help?
Probably the closest you can get is having two different Solr cores. They will run under the same server but have different configurations (which you can copy-paste).
When you say "different databases", do you mean you want to copy from several databases into one joint collection/core? If so, you just define multiple entities and possibly multiple datasources in your DataImportHandler config and run either all of them or each individually (a sketch follows below). See the other question for some tips.
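A minimal data-config.xml sketch of that multiple-datasource setup (the driver, connection details, and queries are all placeholders):
<dataConfig>
  <!-- one JDBC datasource per project database -->
  <dataSource name="db1" type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/project1" user="solr" password="secret"/>
  <dataSource name="db2" type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/project2" user="solr" password="secret"/>
  <document>
    <!-- each entity pulls from its own datasource; run them all or one at a time -->
    <entity name="itemsDb1" dataSource="db1" query="SELECT id, title FROM items"/>
    <entity name="itemsDb2" dataSource="db2" query="SELECT id, title FROM items"/>
  </document>
</dataConfig>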
If, on the other hand, you mean different Solr cores/collections, then you just want to run Solr with multiple cores. It is very easy: you just need a solr.xml file above your collection level, as described on the Solr wiki. Once you get the basics, you may want to look at sharing instance directories and having separate data directories to avoid changing the same config twice (instanceDir vs. dataDir settings on each core), as in the sketch below.
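A sketch of a legacy-style solr.xml for two cores that share one set of config files but keep separate indexes (all names and paths are placeholders):
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- same instanceDir (shared config), different dataDir (separate indexes) -->
    <core name="project1" instanceDir="shared/" dataDir="/var/solr/project1/data"/>
    <core name="project2" instanceDir="shared/" dataDir="/var/solr/project2/data"/>
  </cores>
</solr>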

Running Solr in read-only mode

I think I'm missing something obvious here. I have to imagine a lot of people open up their Solr servers to other developers and don't want them to be able to modify the index.
Is there something in solrconfig.xml that can be set to effectively make the index read-only?
Update for clarification:
My goal is to use Solr with an existing Lucene index managed by another application. This works just fine, but I want to be sure Solr never tries to write to this index.
Exposing a Solr instance to the public internet is a bad idea. Even though you can strip out some components to make it read-only, it just wasn't designed with security in mind; it's meant to be used as an internal service, just like you wouldn't expose an RDBMS.
From the Solr Security wiki page:
First and foremost, Solr does not concern itself with security either at the document level or the communication level. It is strongly recommended that the application server containing Solr be firewalled such that the only clients with access to Solr are your own. A default/example installation of Solr allows any client with access to it to add, update, and delete documents (and of course search/read too), including access to the Solr configuration and schema files and the administrative user interface.
Even ajax-solr, a Solr client for JavaScript meant to run in a browser, recommends talking to Solr through a proxy.
Take for example guardian.co.uk: it's well-known that they use Solr for searching, but they built an API to let others access their content. This way they can define and control exactly what and how they want people to search for things.
Otherwise, any script kiddie can write a trivial loop to DoS your Solr instance and therefore bring down your site.
You can probably just remove the line that defines your solr.XmlUpdateRequestHandler in solrconfig.xml.
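The line in question typically looks something like this (the exact handler name and path may differ per installation):
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler"/>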
Replication is a nice way to set up a read-only instance while still being able to index. Just set up a master with restricted access and a slave that is read-only (by removing the XmlUpdateRequestHandler from its config). The slave will replicate from the master but won't accept any indexing directly.
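A sketch of the solrconfig.xml pieces for that master/slave arrangement (the host name, core name, and poll interval are placeholders):
<!-- on the indexing master -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>
<!-- on the read-only slave -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/core0/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>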
UPDATE
I just read that in Solr 1.4 you can disable a component. I tried it on the /update requestHandler and was no longer able to index.
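In solrconfig.xml that looks something like the following (handler class name as it was in that Solr generation; treat this as a sketch):
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" enable="false"/>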
