Extending a solr collection across multiple machines - solr

I am trying to set up a solr collection that extends across multiple servers. If I am correct in understanding things, I am able to set up a collection, which consists of shards. Those shards consist of replicas, which correspond to cores. Please correct any holes in my understanding of this.
Ok.
So I've got solr set up and am able to create a collection on machine one by doing this.
bin/solr create_collection -c test_collection -shards 2 -replicationFactor 2 -d server/solr/configsets/basic_configs/conf
This appears to do something right, I am able to check the health and see something. I input
bin/solr healthcheck -c test_collection
and I see the shard information.
Now what I want to do, and this is the part I am stuck on, is to take this collection that I have created, and extend it across multiple servers. I'm not sure if I understand how this works correctly, but I think what I want to do is put shard1 on machine1, and shard2 on machine2.
I can't really figure out how to do this based on the documentation, although I am pretty sure this is what SolrCloud is meant to solve. Can someone give me a nudge in the right direction with this...? Either a way to extend the collection across multiple servers or a reason for not doing so.

When you say -shards 2, you're saying that you want your collection split into two shards, which SolrCloud will distribute across the available servers. -replicationFactor 2 says that you want each of those shards present on at least two servers as well.
A shard is a piece of the collection - without a shard, you won't have access to all the documents. The replicationFactor indicates how many copies should be made available of the same shard (or "partition", which is sometimes used to describe a piece of the index) in the collection, so two shards with two replicas will end up with four "cores" distributed across the available servers (these "cores" are managed internally by Solr).
Start a set of new SolrCloud instances in the same cluster and you should see that the documents are spread across your nodes as expected.
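As a concrete sketch of joining a second machine to the cluster (host names and ports here are assumptions, adjust for your install; the first node is assumed to have been started in cloud mode with embedded ZooKeeper on port 9983):

```shell
# On machine2: start a Solr node in cloud mode, pointing it at the
# ZooKeeper instance that machine1's node is registered with.
bin/solr start -c -p 8983 -z machine1:9983

# From either machine, re-run the health check to see the replicas
# now spread across both nodes.
bin/solr healthcheck -c test_collection -z machine1:9983
```

New replicas for a collection are placed on the additional nodes either when the collection is created (if the nodes already exist) or via the Collections API afterwards.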

As said before, the shards are pieces of the collection (data) in actual servers.
When you ran the command, you asked for your collection to be split into 2 shards across the machines available at that point in time.
Once you add more machines to the mix (by registering them with the same ZooKeeper), you can use the Collections API to manage the shards and bring the new machines into the fold as well.
https://cwiki.apache.org/confluence/display/solr/Collections+API
You can split shards into 2 (or more) new shards.
You can create new shards, or delete shards.
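For example (the collection and shard names below are assumptions), the shard operations just mentioned are plain HTTP calls to the Collections API:

```shell
# Split shard1 of test_collection into two new sub-shards.
curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=test_collection&shard=shard1"

# Delete a shard (allowed for inactive shards, e.g. a parent left over
# after a split, or for collections using the implicit router).
curl "http://localhost:8983/solr/admin/collections?action=DELETESHARD&collection=test_collection&shard=shard1"

# Create a new shard -- this action only works for collections
# that use the implicit router.
curl "http://localhost:8983/solr/admin/collections?action=CREATESHARD&collection=test_collection&shard=shard3"
```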
The question of course - is how do the documents split among the shards?
When you create the collection, you can define a router.name
router.name - The router name that will be used.
The router defines how documents will be distributed among the shards.
The value can be either compositeId (the default), which routes documents by a hash of the document id,
or implicit, which lets you assign documents to specific named shards.
When using the 'implicit' router, the shards parameter is required.
When using the 'compositeId' router, the numShards parameter is required.
For more information, see also the section Document Routing.
What this means is that you can let Solr distribute documents by hash across the number of shards you defined (like you did), or take a different approach that routes documents by a prefix in the document id.
For more information about the second approach see: https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud#ShardsandIndexingDatainSolrCloud-DocumentRouting
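A minimal sketch of the prefix approach: with the compositeId router, everything before the `!` in a document id is hashed to choose the shard, so documents that share a prefix are co-located (the tenant and id names below are made up):

```shell
# compositeId routing: Solr hashes the part before "!" to pick the shard,
# so all documents for one tenant land on the same shard.
TENANT="acme"
DOC_ID="${TENANT}!product-42"
echo "$DOC_ID"

# Indexing such a document would look like (assumes a collection
# named test_collection on localhost):
# curl "http://localhost:8983/solr/test_collection/update?commit=true" \
#   -H 'Content-Type: application/json' \
#   -d "[{\"id\": \"$DOC_ID\", \"name\": \"Widget\"}]"
```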

Related

Multiple collections in single SOLR instance

I'm using SOLR 4.0. I need to make 4 different indexes for searching, let's say, first is a list of students in a university, second is a list of products being sold on an online marketplace and so on. What I mean here is that they all hold completely different types of data.
Currently I'm running 4 instances of solr on 4 different ports each having a single collection serving one type of data. The problem is that running 4 instances of solr takes up a lot of memory space.
How can I run all 4 collections in a single solr instance? While searching, maybe I can specify in the url the collection that I'm interested in.
You can create multiple cores within a single Solr instance. There is a CoreAdmin API for such purposes.
It has a CREATE action which creates a new core and registers it. Here is the sample create core request:
http://localhost:8983/solr/admin/cores?action=CREATE&name=coreX&instanceDir=path/to/dir&config=config_file_name.xml&dataDir=data
Bear in mind that the CREATE call must be able to find the core's configuration (an existing instanceDir with a conf directory), or it will not succeed.
You can read documentation from here: https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-CREATE
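Once the cores exist, each one is addressed by name in the request URL, which is exactly the "specify in the url" behaviour the question asks about (the core names below are assumptions):

```shell
# One Solr instance, one URL path per core -- the core name in the
# path selects which index is searched.
SOLR="http://localhost:8983/solr"
for CORE in students products; do
  # In a live setup you would run: curl "$SOLR/$CORE/select?q=*:*"
  echo "$SOLR/$CORE/select?q=*:*"
done
```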

Solr Cloud Document Routing

Currently I have a zookeeper multi solr server, single shard setup. Unique ids are generated automatically by solr.
I now have a zookeeper multi solr server, multi shard requirement. I need to be able to route updates to a specific shard.
After reading http://searchhub.org/2013/06/13/solr-cloud-document-routing/ I am concerned that I cannot allow solr to generate random unique ids if I want to route updates to a specific shard.
Can anyone confirm this for me and perhaps explain the best approach?
Thanks
There is no way you can route your documents to a particular shard, since shard assignment is managed via ZooKeeper.
The solution to your problem is to create two collections instead of two shards. Use your 1st collection with two servers, and the 2nd collection can use the third server; then you can send your updates to particular servers. The design should look like
collection1---->shard1---->server1,server2
collection2---->shard1----->server3
This way you can separate your indexes as per your requirement.
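One way to sketch the layout above (node addresses are assumptions) is the Collections API's createNodeSet parameter, which pins a new collection to specific nodes:

```shell
# collection1 on server1 + server2 (one shard, replicated twice):
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=1&replicationFactor=2&createNodeSet=server1:8983_solr,server2:8983_solr"

# collection2 on server3 only:
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=collection2&numShards=1&replicationFactor=1&createNodeSet=server3:8983_solr"
```

Updates sent to /solr/collection1/update then only ever touch server1 and server2.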

Different shards in a SolrCloud set up with different solrconfig's?

I'd like to set up SolrCloud with one collection consisting of three different shards.
I understand that since a collection represents a single logical index, it must have a single schema. I'm wondering, however, if each shard can have a different solrconfig?
Despite a fair amount of searching, I haven't seen any examples where a collection consists of a single schema but multiple solrconfig's. The SolrCloud tutorials I've worked through all init the collection with one bootstrapping config:
java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
However, there are some elements in the SolrCloud documentation that lead me to believe a SolrCloud set up with a single schema yet different solrconfig files for each shard might be possible. From "Solr Glossary":
"Collection: In Solr, one or more documents grouped together in a single logical index. A collection must have a single schema, but can be spread across multiple cores."
If a collection must have a single schema, but can consist of multiple cores, is that an indication that these different cores can have different solrconfig's? If so, how can this be set up?
Any help would be much appreciated.
A collection is a logical container for a single configuration. You cannot have cores with different configurations in a single collection.
In general, you may query several collections at once (see the SolrCloud wiki for that), if those collections have the same schema. This will work only if both collections reside on the same zookeeper cluster. Give it a try.
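A sketch of that multi-collection query (the collection names are assumptions): the request goes to one collection's endpoint, and the collection parameter fans it out across both.

```shell
# Query collection1 and collection2 in a single request -- both must
# share a schema and live on the same ZooKeeper ensemble.
curl "http://localhost:8983/solr/collection1/select?q=*:*&collection=collection1,collection2"
```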

Solr 4.x and replicating collections

I have 3 questions related to replicating collections in Solr4
Each collection has its own solrconfig.xml, which has its own replication settings and has to point at a particular collection on the master server. Is it possible to have only one global setting that will handle multiple collections?
If I have 100 collections, they will make 100 requests every minute (or whatever the interval is) to the master server. Is it possible to make this work more cleverly and do only 1 request?
I need to create cores dynamically. One can add a collection through the admin panel (or a script), but it requires a collection directory with the configuration already in place. Is there a way to create collections "on the fly" and replicate them to slave servers?
I'm afraid the answer to (1) and (2) is no, you can't. Collection is a logical configuration boundary.
I found the solution. It's very easy to achieve all of that with SolrCloud http://wiki.apache.org/solr/SolrCloud
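The SolrCloud flow alluded to here is roughly: upload a config set to ZooKeeper once, then create collections against it on the fly (the names, paths, and ZooKeeper address below are assumptions; the zkcli.sh script ships with Solr under cloud-scripts):

```shell
# One-time: upload a named config set to ZooKeeper so it can be
# reused by any number of collections.
cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd upconfig \
  -confdir ./solr/collection1/conf -confname shared_conf

# Then create collections on the fly against that config; replicas are
# created and kept in sync by SolrCloud, with no master/slave polling.
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=new_collection&numShards=2&replicationFactor=2&collection.configName=shared_conf"
```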

Multiple index locations Solr

I am new to Solr, and am trying to figure out the best way to index and search our catalogs.
We have to index multiple manufacturers, and each manufacturer has a different catalog per country. Each catalog for each manufacturer per country is about 8GB of data.
I was thinking it might be easier to have an index per manufacturer per country and have some way to tell Solr in the URL which index to search from.
Is that the best way of doing this? If so, how would I do it? Where should I start looking? If not, what would be the best way?
I am using Solr 3.5
In general there are two ways of solving this:
Split each catalog into its own core, running a large multi-core setup. This will keep each index physically separated from the others, and will allow you to use different properties (language, etc.) and configuration for each core. This might be practical, but will require quite a bit of overhead if you plan on searching through all the cores at the same time. It'll be easy to split the different cores across different servers later - simply spin the cores up on a different server.
Run everything in a single core - if all the attributes and properties of the different catalogs are the same, add two fields - one containing the manufacturer and one containing the country. Filter on these values when you need to limit the hits to a particular country or manufacturer. It'll allow you to easily search the complete index, and scalability can be implemented by replication or something like SolrCloud (coming in 4.0). If you need multilanguage support you'll have to have a field for each language with the settings you need for that language (such as stemming).
There are a few tidbits of information about this on the Solr wiki, but my suggestion is to simply try one of the methods and see if that solves your issue. Moving to the other solution shouldn't be too much work. The simplest implementation is to keep everything in the same index.
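The query side of the single-core approach can be sketched like this (the core name and the manufacturer/country fields are the assumed additions from the second option above):

```shell
# Filter queries (fq) restrict hits to one catalog without affecting
# relevance scoring, and each fq is cached independently by Solr's
# filter cache, so repeated catalog filters are cheap.
curl "http://localhost:8983/solr/catalog/select?q=drill&fq=manufacturer:acme&fq=country:US"
```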
