I've been using and watching Solr documentation for replication and shard handling in a Solr Cloud cluster.
I can create and remove replicas without problems.
I can split a shard easily (It create two new shards and disable the old one).
shard1 -> shard1_0 and shard1_1
But can I do the opposite operation ?
shard1_0 and shard1_1 -> shard 1
I have not seen anything in the documentation that answers my need.
Related
I'm having a bit of trouble understanding exactly how indexing and querying would work if I don't have a smart-client available. I'm using SolrNet with C#, which currently doesn't integrate with ZooKeeper.
As a basic example, let's say I have a single collection, split into two shards, replicated across two separate nodes/servers, and I have a standard HTTP load-balancer in front of the servers (a scenario mentioned here). If I use the standard compositeId router, I believe that indexing would work without issue and be replicated to both nodes by ZooKeeper behind the scenes. I wouldn't need to worry about which node received the "update" command -- ZooKeeper would handle document routing and replication automatically.
However, in this same scenario, would ZooKeeper handle query routing behind the scenes correctly? Given that I'm using built-in sharding and not custom sharding, would a query request to the load-balancer get routed to the correct shard, or would I have to include all known shards in the "shards" parameter (see here) to make sure I don't miss anything? Obviously this would be onerous to maintain as the number of shards grows.
Is seems like custom sharding would provide the greatest efficiency across indexing and querying, although then you run the risk wildly unequal shard sizes. Any thoughts on these matters would be appreciated.
Lets take the example of a two shard collection, with each shard on a separate node/server.
10.x.x.100:8983/solr/ --> shard 1 / node 1
10.x.x.101:8983/solr/ --> shard 2 / node 2
Using default routing you indexed 100 documents which got split into these two servers and now they have 50 documents each.
If you query any of the two servers for documents, solr will search in both the shards by default. You do not need to specify anything in shards parameter.
So
10.x.x.100:8983/solr/collection/select?q=solr rocks
will run this same query on 10.x.x.101:8983/solr/ also and the results returned will be a combination of results from both shards, sorted and ranked by score.
The &shards parameter comes into picture when you know which "group" of data is in which shard. For example using the above example, you have custom routing enabled and you use the field "city" to route the documents. For sake of example, lets assume there can be only two values for "city" field. Your documents will be routed to one of the shards based on this field.
On your application side, if you want to specifically query for documents belonging to a city, you can specify the &shard parameter, and all the results for the query will be only from that shard.
I am testing High Availability feature of SolrCloud. I am using the following setup
8 linux hosts
8 Shards
1 leader, 1 replica / host
Using Curl for update operation
I tried to index 80K documents on replicas (10K/replica in parallel). During indexing process, I stopped 4 Leader nodes. Once indexing is done, out of 80K docs only 79808 docs are indexed.
Is this an expected behaviour ? In my opinion replica should take care of indexing if leader is down.
If this is an expected behaviour, any steps that can be taken from the client side to avoid such a situation.
I suggest you should use CloudSolrServer to update solrcloud index.As it take cares of down nodes do not receive any update request and routes all further request to an appropriate node in the cluster.One more thing you need to ensure is all your 80k documents have unique field value, and its value is really unique across all documents
I struggle with understanding the difference between collections and cores. If I understand it correctly, cores are multiple indexes. Collection consists of cores, so essentially they share the same logic in separation, i.e. separate cores and collections have separate end-points.
I have the following scenario. I create a backend for cloud service for several online shops. Each shop has a set of products, to which customers can add reviews. I want to index static data (product information) separately from dynamic information(reviews) so I can improve performance.
How can I best separate in Solr???
From the SolrCloud Documentation
Collection: A single search index.
Shard: A logical section of a single collection (also called
Slice). Sometimes people will talk about "Shard" in a physical sense
(a manifestation of a logical shard)
Replica: A physical manifestation of a logical Shard, implemented
as a single Lucene index on a SolrCore
Leader: One Replica of every Shard will be designated as a Leader to
coordinate indexing for that Shard
SolrCore: Encapsulates a single physical index. One or more make up
logical shards (or slices) which make up a collection.
Node: A single instance of Solr. A single Solr instance can have
multiple SolrCores that can be part of any number of collections.
Cluster: All of the nodes you are using to host SolrCores.
So basically a Collection (Logical group) has multiple cores (physical indexes).
Also, check the discussion
Core
In Solr, a core is composed of a set of configuration files, Lucene index files, and Solr’s
transaction log.
a Solr core is a
uniquely named, managed, and configured index running in a Solr server; a Solr server
can host one or more cores. A core is typically used to separate documents that have
different schemas
collection
Solr also uses the term collection, which only has meaning in the context
of a Solr cluster in which a single index is distributed across multiple servers.
SolrCloud introduces the concept of a collection, which extends the concept of a uniquely
named, managed, and configured index to one that is split into shards and distributed
across multiple servers.
As per my understanding:
In distributed search,
Collection is a logical index spread across multiple servers.
Core is that part of server which runs one collection.
In non-distributed search,
Single server running the Solr can have multiple collections and each of those collection is also a core. So collection and core are same if search is not distributed.
Summary
Collection per server is called a core.
Collection is same as an index.
One Solr server can have many cores.
Collection is a logical index (Example usage for multiple collections: Say two teams in same group are not big enough to justify a full Solr server of their own. But they also do not want to mix their data in a single index. They can then create separate collections/indexes which will keep their data separate).
Its better to use a separate Solr Cloud rather than create collections if the data for a collection is big enough (not sure, comments please?)
Single instance
On a single instance, Solr has something called a SolrCore that is essentially a single index. If you want multiple indexes, you create multiple SolrCores.
Solr Cloud
With SolrCloud, a single index can span multiple Solr instances. This means that a single index can be made up of multiple SolrCore's on different machines. We call all of these SolrCores that make up one logical index a collection.
A collection is a essentially a single index that spans many SolrCore's, both for index scaling as well as redundancy. If you wanted to move your 2 SolrCore Solr setup to SolrCloud, you would have 2 collections, each made up of multiple individual SolrCores.
From Solr Wiki:
Collections are made up of one or more shards. Shards have one or
more replicas. Each replica is a core. A single collection represents
a single logical index.
This explains the use of cores and collections.
Single instance
When dealing with a single solr instance you query to cores.
The admin UI of a single Solr instance has no collection selector:
Solr Cloud
When dealing with Solr Cloud you query to collections.
The collections are organized in different cores (replicas, shards) on different solr instances.
The admin UI of a Solr Cloud instance has a collection and core selector. But cores are technically instances, here:
From the Solr docs:
Usage: solr create [-c name] [-d confdir] [-n configName] [-shards #]
[-replicationFactor #] [-p port] [-V]
Create a core or collection depending on whether Solr is running in
standalone (core) or SolrCloud mode (collection). In other words,
this action detects which mode Solr is running in, and then takes
the appropriate action (either create_core or create_collection).
I am looking at using distribution and sharding with Solr 3.6 vs Solr 4+ (SolrCloud)
I can see that 3.6 can have multiple shards set up, ideally each shard would rest on a different box. On a large scale once the boxes start to run low on memory I would like to add new shards to the index. From what I have seen this cannot be done/ isn't documented.
Does this require a full re index of the data?
Can a 3 shard indexed be re-indexed into a 4 shard instance?
Can queries still be invoked on the index during a re-index?
What are the space overheads required to re index?
The schema.xml (field names and types) would not be changed, just a new shard location added.
Self Answer- From what I have seen it would be best to stop filling the 3 shards and fill only the new 4th shard with data. Then update the shards parameter to include the new shard to search.
Background: I just finished reading the Apache Solr 4 Cookbook. In it the author mentions that setting up shards needs to be done wisely b/c new ones cannot be added to an existing cluster. However, this was written using Solr 4.0 and at the present I am using 4.1. Is this still the case? I wish I hadn't found this issue and I'm hoping someone can tell me otherwise.
Question: Am I expected to know how much data I'll store in the future when setting up shards in a SolrCloud cluster?
I have played with Solandra and read up on elastic search, but quite honestly I am a fan of Solr as it is (and its large community!). I also like Zookeeper. Am I stuck for now or is there a workaround/patch?
Edit: If Question above is NO, could I build a SolrCloud with a bunch (maybe 100 or more) shards and let them grow (internally) and while I grow my data start peeling them off one by one and put them into larger, faster servers with more resources?
Yes, of course you can. You have to setup a new Solr server pointing to the same zookeeper instance. During the bootstrap the server connects to zk ensemble and registers itself as a cluster member.
Once the registration process is complete, the server is ready to create new cores. You can create replicas of the existing shards using CoreAdmin. Also you can create new shards, but they won't be balanced due to Lucene index format (not all fields are stored), because it may not have all document information to rebalance the cluster, so only new indexed/updated documents will get to this server (doing this is not recommendable).
When you setup your SolrCloud you have to create the cluster taking into account your document number growth factor, so if you have 1M documents at first and it grows as 10k docs/day, setup the cluster with 5 shards, so at start you have to host this shards in your two machines initial setup, but in the future, as needed, you can add new servers to the cluster and move those shards to this new servers. Be careful to not overgrow you cluster because, in Lucene, a single 20Gb index split across 5 shards won't be a 4Gb index in every shard. Every shard will take about (single_index_size/num_shards)*1.1 (due to dictionary compression). This may change depending on your term frequency.
The last chance you have is to add the new servers to the cluster and instead of adding new shards/replicas to the existing server, setup a new different collection using your new shards and reindex in parallel to this new collection. Then, once your reindex process finished, swap this collection and the old one.
One solution to the problem is to use the "implicit router" when creating your Collection.
Lets say - you have to index all "Audit Trail" data of your application into Solr. New Data gets added every day. You might most probably want to shard by year.
You could do something like the below during the initial setup of your collection:
admin/collections?
action=CREATE&
name=AuditTrailIndex&
router.name=implicit&
shards=2010,2011,2012,2013,2014&
router.field=year
The above command:
a) Creates 5 shards - one each for the current and the last 4 years 2010,2011,2012,2013,2014
b) Routes data to the correct shard based on the value of the "year" field (specified as router.field)
In December 2014, you might add a new shard in preparation for 2015 using the CREATESHARD API (part of the Collections API) - Do something like:
/admin/collections?
action=CREATESHARD&
shard=2015&
collection=AuditTrailIndex
The above command creates a new shard on the same collection.
When its 2015, all data will get automatically indexed into the "2015" shard assuming your data has the "year" field populated correctly to 2015.
In 2015, if you think you don't need the 2010 shard (based on your data retention requirements) - you could always use the DELETESHARD API to do so:
/admin/collections?
action=DELETESHARD&
shard=2015&
collection=AuditTrailIndex
P.S. This solution only works if you used the "implicit router" when creating your collection. Does NOT work when you use the default "compositeId router" - i.e. collections created with the numshards parameter.
This feature is truly a game changer - allows shards to be added dynamically based on growing demands of your business.