Solr 4.x and replicating collections - solr

I have three questions about replicating collections in Solr 4:
1. Each collection has its own solrconfig.xml with its own replication settings, which have to point to a particular collection on the master server. Is it possible to have a single global setting that handles multiple collections?
2. With 100 collections, the slaves will send 100 requests to the master server every minute (or whatever the polling interval is). Is it possible to make this smarter and send only one request?
3. I need to create cores dynamically. A collection can be added through the admin panel (or a script), but that requires a collection directory with the configuration already in place. Is there a way to create collections "on the fly" and replicate them to the slave servers?

I'm afraid the answer to (1) and (2) is no, you can't. A collection is a logical configuration boundary.

I found the solution. It's very easy to achieve all of that with SolrCloud http://wiki.apache.org/solr/SolrCloud
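To illustrate the "on the fly" part: in SolrCloud a collection can be created with a single HTTP call to the Collections API, as long as a config set has already been uploaded to ZooKeeper. The sketch below only builds the request URL and sends nothing. The host, collection name and config name are placeholders I made up, not values from the question.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, config_name,
                          num_shards=1, replication_factor=2):
    """Build a Collections API CREATE request URL (nothing is sent)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        # Config set assumed to be stored in ZooKeeper already.
        "collection.configName": config_name,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical host and names, for illustration only.
print(create_collection_url("http://localhost:8983/solr",
                            "customer42", "shared_conf"))
```

Several collections can point at the same config set name, which is roughly the "one global setting" asked for in (1).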

Related

How to copy a Watson retrieve-and-rank solr collection on Bluemix

We have a large Solr collection on Watson's Retrieve and Rank service that I need to copy to another collection. Is there any way to do this in Retrieve and Rank? I know Solr has backup and restore capability, but it uses the file system and I don't think I have access to that in Bluemix.
I'm not aware of any way to do this, beyond the brute force approach of just fetching every doc in the index and adding the contents to a different collection. (And even this would be limited to only letting you fetch the fields that you have stored in the first collection).
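The brute force approach could look roughly like the sketch below. It pages through the source with a cursor (standard Solr has cursorMark for this; whether Retrieve and Rank exposes it is an assumption on my part) and re-adds each page to the target. fetch_page and add_docs are hypothetical callables you would write around your own HTTP client.

```python
def copy_documents(fetch_page, add_docs, rows=500):
    """Copy every stored document from one collection to another.

    fetch_page(cursor, rows) -> (docs, next_cursor)   # hypothetical reader
    add_docs(docs)                                    # hypothetical writer
    Returns the number of documents copied.
    """
    cursor = "*"  # Solr's starting value for cursorMark
    copied = 0
    while True:
        docs, next_cursor = fetch_page(cursor, rows)
        # _version_ belongs to the source index; drop it before re-adding.
        cleaned = [{k: v for k, v in d.items() if k != "_version_"}
                   for d in docs]
        if cleaned:
            add_docs(cleaned)
            copied += len(cleaned)
        if next_cursor == cursor:  # an unchanged cursor means no more results
            return copied
        cursor = next_cursor
```

Only stored fields come back from the source, so anything indexed but not stored is lost in the copy, as noted above.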

Extending a solr collection across multiple machines

I am trying to set up a Solr collection that extends across multiple servers. If I understand things correctly, I can set up a collection, which consists of shards. Those shards consist of replicas, which correspond to cores. Please correct any holes in my understanding of this.
Ok.
So I've got solr set up and am able to create a collection on machine one by doing this.
bin/solr create_collection -c test_collection -shards 2 -replicationFactor 2 -d server/solr/configsets/basic_configs/conf
This appears to work: I can check the health of the collection and get a result back. I run
bin/solr healthcheck -c test_collection
and I see the shard information.
Now what I want to do, and this is the part I am stuck on, is to take this collection that I have created, and extend it across multiple servers. I'm not sure if I understand how this works correctly, but I think what I want to do is put shard1 on machine1, and shard2 on machine2.
I can't really figure out how to do this based on the documentation, although I am pretty sure this is what SolrCloud is meant to solve. Can someone give me a nudge in the right direction with this...? Either a way to extend the collection across multiple servers or a reason for not doing so.
When you say -shards 2, you're saying that you want your collection to be split into two shards, which can then live on different servers. -replicationFactor 2 says that you want two copies of each shard, ideally on different servers as well.
A shard is a piece of the collection: if a shard is missing, you won't have access to all the documents. The replicationFactor indicates how many copies of each shard (sometimes called a "partition" of the index) should be available in the collection, so two shards with a replicationFactor of two end up as four "cores" distributed across the available servers (these "cores" are managed internally by Solr).
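The arithmetic is just shards times replicationFactor. A tiny sketch, with made-up core names (the names Solr actually generates depend on the version):

```python
def planned_cores(collection, num_shards, replication_factor):
    """List one illustrative core name per shard replica."""
    return [
        f"{collection}_shard{shard}_replica{replica}"
        for shard in range(1, num_shards + 1)
        for replica in range(1, replication_factor + 1)
    ]


cores = planned_cores("test_collection", 2, 2)
print(len(cores), "cores in total")
for core in cores:
    print(core)
```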
Start a set of new SolrCloud instances in the same cluster and you should see that the documents are spread across your nodes as expected.
As said before, the shards are pieces of the collection (the data) hosted on actual servers.
When you ran the command, you asked for the collection to be split into two shards across the machines available at that point in time.
Once you add more machines to the mix (by registering them with the same ZooKeeper), you can use the Collections API to manage them and add them to the fold as well.
https://cwiki.apache.org/confluence/display/solr/Collections+API
You can split shards into 2 (or more) new shards.
You can create new shards, or delete shards.
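For example, the Collections API calls for splitting a shard or placing an extra replica on a newly added machine are plain HTTP requests. This sketch only builds the URLs. The host names and the node name are placeholders, and ADDREPLICA needs a reasonably recent Solr version, so check the API page linked above for what your version supports.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Collections API request URL (nothing is sent)."""
    query = {"action": action, **params}
    return f"{base_url}/admin/collections?{urlencode(query)}"


base = "http://machine1:8983/solr"  # hypothetical host

# Split shard1 of the collection into two new shards.
split = collections_api_url(base, "SPLITSHARD",
                            collection="test_collection", shard="shard1")

# Put another copy of shard2 on a second machine (hypothetical node name).
add = collections_api_url(base, "ADDREPLICA",
                          collection="test_collection", shard="shard2",
                          node="machine2:8983_solr")

print(split)
print(add)
```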
The question, of course, is how the documents are split among the shards.
When you create the collection, you can define a router.name
router.name - The router name that will be used.
The router defines how documents will be distributed among the shards.
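The value can be either compositeId (the default), which hashes the document id to pick a shard and lets a prefix in the id keep related documents together,
or implicit, which does not hash at all and lets you name the specific shard to assign documents to.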
When using the 'implicit' router, the shards parameter is required.
When using the 'compositeId' router, the numShards parameter is required.
For more information, see also the section Document Routing.
What this means is that you can simply define the number of shards (as you did) and let Solr hash documents across them, or use a different approach that groups documents onto shards by a prefix in the document id.
For more information about the second approach see: https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud#ShardsandIndexingDatainSolrCloud-DocumentRouting
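A minimal sketch of the prefix approach with the compositeId router: the part of the id before "!" acts as the shard key, so all documents sharing it land on the same shard. The field names and values below are made up for illustration.

```python
def composite_id(shard_key, doc_id):
    """Build a compositeId-style document id: '<shard_key>!<doc_id>'."""
    if "!" in shard_key:
        raise ValueError("shard key must not contain '!'")
    return f"{shard_key}!{doc_id}"


def route_param(shard_key):
    """Query parameter that limits a search to the shard holding this key."""
    return {"_route_": f"{shard_key}!"}


# Hypothetical documents for two customers.
docs = [
    {"id": composite_id("customerA", "1001"), "title": "first"},
    {"id": composite_id("customerA", "1002"), "title": "second"},
    {"id": composite_id("customerB", "2001"), "title": "third"},
]
print([d["id"] for d in docs])
print(route_param("customerA"))
```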

How to replace a group of documents without "downtime" in Solr?

I have a standalone Solr server (not SolrCloud) holding documents from a few different sources.
I routinely need to update the documents for a source. Typically I do this by deleting all documents from that source/group and indexing the new documents for that source, but this creates a time gap during which I have no documents for that source, which is not ideal.
Some of these documents will probably remain from one update to the next, some change and could be updated, but some may disappear and need to be deleted.
What's the best way to do this?
Is there a way to delete all documents from a source without committing, index that source again in the same transaction, and only then commit? (That would not create a time gap with no information for that source.)
Is core swapping a solution? (Or am I overcomplicating this?)
It seems you need a live index that keeps serving queries while you update it, without any downtime. In a way, you are partially re-indexing your data.
You can look into maintaining two indices and interacting with them through ALIASES.
Check this link: https://www.elastic.co/guide/en/elasticsearch/guide/current/multiple-indices.html
Although it's on the Elasticsearch website, you can easily apply the concepts in Solr.
Here is another link on how to create and use ALIASES:
http://blog.cloudera.com/blog/2013/10/collection-aliasing-near-real-time-search-for-really-big-data/
Collection aliases are also useful for re-indexing – especially when
dealing with static indices. You can re-index in a new collection
while serving from the existing collection. Once the re-index is
complete, you simply swap in the new collection and then remove the
first collection using your read side aliases.
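As a rough sketch of the swap step: collection aliases are a SolrCloud feature, so the first function assumes SolrCloud, while the second shows the closest standalone equivalent, a CoreAdmin SWAP of the live core with a freshly rebuilt one. Only URLs are built here, and the host, alias, collection and core names are placeholders.

```python
from urllib.parse import urlencode


def alias_swap_urls(base_url, alias, new_collection, old_collection):
    """Return the Collections API calls for a re-index swap, in order."""
    def call(**params):
        return f"{base_url}/admin/collections?{urlencode(params)}"

    return [
        # Re-point the read alias at the freshly indexed collection.
        call(action="CREATEALIAS", name=alias, collections=new_collection),
        # Drop the old collection once nothing reads from it any more.
        call(action="DELETE", name=old_collection),
    ]


def core_swap_url(base_url, live_core, rebuilt_core):
    """Standalone variant: CoreAdmin SWAP exchanges the two core names."""
    params = {"action": "SWAP", "core": live_core, "other": rebuilt_core}
    return f"{base_url}/admin/cores?{urlencode(params)}"


base = "http://localhost:8983/solr"  # hypothetical host
for url in alias_swap_urls(base, "products", "products_v2", "products_v1"):
    print(url)
print(core_swap_url(base, "products", "products_rebuild"))
```

Either way, queries keep hitting the old index until the swap, so there is no window in which a source has no documents.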

Solr Cloud Document Routing

Currently I have a ZooKeeper-managed, multi-server, single-shard Solr setup. Unique ids are generated automatically by Solr.
I now have a requirement for a ZooKeeper-managed, multi-server, multi-shard setup. I need to be able to route updates to a specific shard.
After reading http://searchhub.org/2013/06/13/solr-cloud-document-routing/ I am concerned that I cannot let Solr generate random unique ids if I want to route updates to a specific shard.
Can anyone confirm this for me and perhaps explain the best approach?
Thanks
There is no way you can route your documents to a particular shard, since shard assignment is managed through ZooKeeper.
The solution to your problem is to create two collections instead of two shards. Put the first collection on two servers and the second collection on the third server, and then send your updates to the appropriate servers. The design should look like
collection1---->shard1---->server1,server2
collection2---->shard1----->server3
This way you can separate your indexes as per your requirement.
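On the client side, that design boils down to a lookup from collection to the servers hosting it. A minimal sketch, assuming the server names from the layout above and the default port (both of which you would replace with your own):

```python
# Hypothetical update endpoints matching the layout above.
UPDATE_TARGETS = {
    "collection1": [
        "http://server1:8983/solr/collection1/update",
        "http://server2:8983/solr/collection1/update",
    ],
    "collection2": [
        "http://server3:8983/solr/collection2/update",
    ],
}


def update_url(collection, attempt=0):
    """Pick an update endpoint, rotating through servers on retries."""
    nodes = UPDATE_TARGETS[collection]
    return nodes[attempt % len(nodes)]


print(update_url("collection1"))
print(update_url("collection1", attempt=1))
print(update_url("collection2"))
```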

building in support for future Solr sharding

We are building an application. Right now we have one Solr server, but we would like to design the app so that it can support multiple Solr shards in the future if we outgrow a single index.
What are the key things to keep in mind when developing an application that can support multiple shards in the future?
We store the Solr URL (/solr/) in a DB, and it is used to execute queries against Solr. There is one URL for updates and one URL for searches in the DB.
If we add shards to the Solr environment at a future date, will using the shards be as simple as updating the URLs in the DB, or are there other things that need to be updated? We are using SolrJ.
e.g. change the SolrSearchBaseURL in DB to:
https://solr2/solr/select?shards=solr1/solr,solr2/solr&indent=true&q={search_query}
And updating the SolrUpdateBaseURL in DB to
https://solr2/solr/
?
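For reference, the search URL in that example can be assembled in code instead of being stored as one long string. This is only a sketch: the hosts come from the example above, and the function name and defaults are made up.

```python
from urllib.parse import urlencode


def distributed_select_url(base_url, shards, query):
    """Build a /select URL that fans the query out over the given shards."""
    params = {
        "q": query,
        "shards": ",".join(shards),  # comma-separated list of shard addresses
        "indent": "true",
    }
    # Keep '/' and ',' readable; everything else is percent-encoded.
    return f"{base_url}/select?{urlencode(params, safe='/,')}"


print(distributed_select_url("https://solr2/solr",
                             ["solr1/solr", "solr2/solr"], "solr"))
```

Note that the shards parameter only affects queries. Without SolrCloud, updates are not distributed for you, so the application has to decide which shard each document is sent to.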
Basically, what you are describing has already been implemented in SolrCloud. There, ZooKeeper maintains the state of your search cluster (which shards belong to which collections, shard replicas, leader and slave nodes, and more). It spreads the load on both the indexing and the querying side by using hashing.
You could, in principle, get by with the system you have developed (at least in the early stages of your cluster's growth). But think about replication, adding load balancers and external cache servers (e.g. Varnish): in the long run you would end up implementing something like SolrCloud yourself.
Having said that, there are some caveats to hash-based indexing and hence searching. If you want to implement logical partitioning of your data (say, by date), at this point there is no way to do this other than writing custom code. There is some work planned around this, though.
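The custom code for date partitioning does not have to be large. A minimal sketch, assuming one core per year with a made-up naming scheme (docs_2014 and so on): indexing picks the core from the document date, and searching lists the cores that cover the requested range.

```python
from datetime import date


def core_for_date(day, prefix="docs"):
    """Name of the (hypothetical) yearly core a document dated `day` goes to."""
    return f"{prefix}_{day.year}"


def cores_for_range(start, end, prefix="docs"):
    """Names of all yearly cores a query over [start, end] has to search."""
    return [f"{prefix}_{year}" for year in range(start.year, end.year + 1)]


print(core_for_date(date(2014, 3, 1)))
print(cores_for_range(date(2013, 11, 1), date(2015, 1, 15)))
```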
