mongodb sharding Collection(?) - database

Is it possible to have different collections on different machines in a sharded cluster? For example, nodeA would hold the collection rolesA, nodeB the collection rolesB, and so on, with the mongo router handling this transparently for my code. In other words, can I query rolesA without needing to know that it is stored on nodeA?
The reason I am asking is that I have a large collection of around 100GB, so I have two options:
1. Shard the large collection across, say, 5 nodes
2. Split the large collection into 10K collections on a single node
My queries are $sum aggregations. Right now I am using the second approach, so every aggregation query uses exactly one collection. The performance is very good (< 1 second), but in a production environment the CPU is always at 100%.
With approach 1 the load will be balanced, but I am worried about the response time.

Is it possible to have different collections in sharded machines?
In a sharded cluster a database can have number of sharded and unsharded collections. You can decide which collections need to be sharded (distribute a collection's data across shards). A collection is sharded based upon the Shard Key. Shard key determines how the collection data will be distributed across the shards. The shard key is also important in the read operation performance.
For example the nodeA to have the collection: rolesA, nodeB the
collection rolesB and so on..
You cannot specify that a specific collection's data be stored on a specific shard. Sharding is about distributing a collection's data among multiple shards.
...the mongo router to handle transparently for my code.
A mongos (the router) receives all the queries in a sharded cluster before routing to specific shard(s). When you submit a query, it is like submitting a query to a standalone server, the syntax is same; it is transparent to the application (or code). The router determines which shards the query must visit to get your data.
If your query's filter criteria include the shard key, the router will target only the specific shard(s) that can hold matching data; without the shard key in the filter criteria, the query is routed to all shards in the cluster.
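To make the targeting rule concrete, here is a toy model in Python of the decision the router makes. It is not MongoDB code: the shard key name `roleId`, the split points and the shard names are all made up for illustration, and a real cluster keeps this chunk metadata on its config servers.

```python
from bisect import bisect_right

# Toy chunk map: each shard owns one contiguous range of shard-key values.
# Split points, shard names and the key name are hypothetical.
CHUNK_BOUNDS = [1000, 2000, 3000, 4000]   # upper-exclusive split points
SHARDS = ["shardA", "shardB", "shardC", "shardD", "shardE"]
SHARD_KEY = "roleId"


def target_shards(query_filter):
    """Return the shards that must be visited for an equality filter."""
    if SHARD_KEY in query_filter:
        # Shard key present: exactly one range can hold the value.
        idx = bisect_right(CHUNK_BOUNDS, query_filter[SHARD_KEY])
        return [SHARDS[idx]]
    # No shard key in the filter: broadcast to every shard.
    return list(SHARDS)


targeted = target_shards({"roleId": 2500})
broadcast = target_shards({"name": "admin"})
```

A filter on the shard key touches a single shard, while a filter on any other field fans out to all five, which is why the choice of shard key matters for read performance.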
To shard the large collection let's say to 5 nodes
You can distribute a large collection's data across a cluster with 5 shards (or more as the data grows; this is called horizontal scaling).
Finally, I tried to answer some of your questions about sharding. Also, please browse the referenced links below for complete information.
References:
Advantages of Sharding
Considerations Before Sharding
Sharding Strategy
FAQ: Sharding with MongoDB

Related

Understanding some concepts of Apache Solr

I am new to Apache Solr. Can someone please explain the meaning of the following terms, with examples:
Solr Core
Solr Collection
Logical vs Physical index
Sharding
I went through various blog posts, but I am still not able to understand them.
The terminology is used a bit haphazardly, so you'll probably find texts that use a few of these terms interchangeably.
Solr core
A core is a named set of documents living on a single server. A server can have many cores. The core can be replicated to other servers (this is "old style" replication when done manually).
Solr Collection
A collection is a set of cores, from one to .. many. It's a logical description of "these cores together form the entire collection". This was introduced with SolrCloud, as that's the first time that Solr handles clustering for you.
Logical vs Physical
A collection is a logical index - it can span many cores. Each core is a physical index (it has the actual index files from Lucene on its disk). You interact with the collection as you'd interact with the core, and all the details of clustering are (usually) hidden from you by Solr (in SolrCloud mode).
Sharding
Since a collection can span many cores, sharding means that the documents that make up a single collection are present in many cores. Each core is a "shard" of the total index. Compare this to replication, where a copy of a core is distributed to many Solr instances (the same documents are present in both cores, while when sharding the documents are just present in one core and you need all cores to have a complete collection).
Sharding is what makes it possible to store more documents than a single server can handle (or keep in memory/cache to respond quickly enough).
SolrCloud (Added by me to make this all come together)
Previously (and still, if you're not using SolrCloud mode) sharding and replication were handled manually by the user when querying and configuring Solr. You set up replication to spread the same core across many servers, and you used sharding to make Solr query many Solr instances to get all the required documents. Today you'll usually just use SolrCloud and let Solr abstract away all these details. You'll come across these terms when creating a collection (numShards and replicationFactor) which tells Solr how many cores you want to spread the collection across, and how many servers should hold copies of these cores.
Collection -> Sharded across [1..N] cores, replicated [0..M] times for redundancy and higher query throughput.
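As a rough sketch of how numShards and replicationFactor show up in practice, the Python below builds a Collections API CREATE URL and counts the resulting cores. The base URL and the collection name `products` are placeholders, nothing is actually sent to a server, and the core count assumes replicationFactor means the total number of copies of each shard.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor):
    """Build a Collections API CREATE request URL (no request is sent)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


def total_cores(num_shards, replication_factor):
    """Every copy of every shard is one core somewhere in the cluster."""
    return num_shards * replication_factor


# Placeholder host and collection name.
url = create_collection_url("http://localhost:8983/solr", "products", 2, 2)
cores = total_cores(2, 2)
```

With two shards and a replication factor of two, the collection is made up of four cores spread over the cluster.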

SolrCloud Indexing/Querying without a Smart-Client

I'm having a bit of trouble understanding exactly how indexing and querying would work if I don't have a smart-client available. I'm using SolrNet with C#, which currently doesn't integrate with ZooKeeper.
As a basic example, let's say I have a single collection, split into two shards, replicated across two separate nodes/servers, and I have a standard HTTP load-balancer in front of the servers (a scenario mentioned here). If I use the standard compositeId router, I believe that indexing would work without issue and be replicated to both nodes by ZooKeeper behind the scenes. I wouldn't need to worry about which node received the "update" command -- ZooKeeper would handle document routing and replication automatically.
However, in this same scenario, would ZooKeeper handle query routing behind the scenes correctly? Given that I'm using built-in sharding and not custom sharding, would a query request to the load-balancer get routed to the correct shard, or would I have to include all known shards in the "shards" parameter (see here) to make sure I don't miss anything? Obviously this would be onerous to maintain as the number of shards grows.
It seems like custom sharding would provide the greatest efficiency across indexing and querying, although then you run the risk of wildly unequal shard sizes. Any thoughts on these matters would be appreciated.
Let's take the example of a two-shard collection, with each shard on a separate node/server.
10.x.x.100:8983/solr/ --> shard 1 / node 1
10.x.x.101:8983/solr/ --> shard 2 / node 2
Using default routing you indexed 100 documents which got split into these two servers and now they have 50 documents each.
If you query either of the two servers for documents, Solr will search both shards by default. You do not need to specify anything in the shards parameter.
So
10.x.x.100:8983/solr/collection/select?q=solr rocks
will run this same query on 10.x.x.101:8983/solr/ also and the results returned will be a combination of results from both shards, sorted and ranked by score.
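The merge step can be pictured with a small Python sketch. This is only a toy model of combining per-shard hit lists by score, not Solr's actual implementation; the document IDs and scores are invented.

```python
import heapq
from itertools import islice

# Invented per-shard results, each already sorted by descending score.
shard1_hits = [("doc7", 9.1), ("doc2", 4.0), ("doc5", 1.2)]
shard2_hits = [("doc9", 7.5), ("doc4", 4.2)]


def merge_by_score(*per_shard_hits, rows=10):
    """Merge sorted per-shard hit lists into one ranking, best score first."""
    merged = heapq.merge(*per_shard_hits, key=lambda hit: hit[1], reverse=True)
    return [doc_id for doc_id, _score in islice(merged, rows)]


top = merge_by_score(shard1_hits, shard2_hits, rows=4)
```

Because each shard returns its hits already ranked, the node that received the request only has to interleave the lists and cut off at the requested number of rows.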
The &shards parameter comes into the picture when you know which "group" of data is in which shard. Continuing the example above, suppose you have custom routing enabled and you use the field "city" to route the documents. For the sake of the example, let's assume there can be only two values for the "city" field. Your documents will be routed to one of the shards based on this field.
On the application side, if you want to query only for documents belonging to one city, you can specify the &shards parameter, and all the results for the query will come from that shard only.
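A minimal Python sketch of the two kinds of request, assuming the logical shard is named `shard1` and using a placeholder host; it only builds the URLs and does not contact Solr.

```python
from urllib.parse import urlencode


def select_url(base_url, collection, query, shards=None):
    """Build a /select URL; restrict it to given shards when a list is passed."""
    params = {"q": query}
    if shards:
        # Comma-separated list of shard names (hypothetical names here).
        params["shards"] = ",".join(shards)
    return f"{base_url}/{collection}/select?{urlencode(params, safe=',')}"


BASE = "http://localhost:8983/solr"   # placeholder host

# Default: no shards parameter, so the query fans out to all shards.
all_shards = select_url(BASE, "collection", "solr rocks")

# Restricted: only the shard known to hold the wanted group of documents.
one_shard = select_url(BASE, "collection", "solr rocks", shards=["shard1"])
```

The first URL is what a load balancer can forward to any node; the second only makes sense when the application knows which shard holds the data it wants.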

Multiple solr Shard on single Machine with a master shard exposed to outside

Though I am new to distributed search, I have a fair idea of sharding and SolrCloud.
Technically, when we have a bigger index we split it into multiple shards for faster distributed search (not including SolrCloud benefits).
I have a huge index (obviously with the same schema), but there is a logical separation of the data, which I call a bucket. Each of these buckets has its own updates/deletions/additions.
1st Approach:
Technically I can create N shards, depending on the bucket count. But this will lead to too many servers and will slow down the search because it has to merge the results.
2nd Approach:
To reduce the distribution I can logically combine these buckets and create a smaller number of parent buckets, which will reduce the overall shard count. But then I will lose some of the advantages of having individual buckets.
3rd Approach:
I was thinking of creating something like a shard of shards. On each machine I will have one parent bucket as a shard (call it the parent shard), which in turn will have multiple child shards hosted on the same Solr instance on the same machine (call them child shards). This would be like multiple cores in a single Solr instance with the same schema. In this scenario the parent shard will execute queries in parallel on the child shards and merge the records from each of them. As it is all on the same Solr instance, I believe it will be faster. I want the parent shard to be empty and just handle merging of results. Is this possible, and if so, will the performance match the 2nd approach? Can someone give me some idea of how to implement it? I am fine with customizing Solr for this implementation.

Solr Collection vs Cores

I struggle with understanding the difference between collections and cores. If I understand it correctly, cores are multiple indexes. A collection consists of cores, so essentially they share the same logic of separation, i.e. separate cores and collections have separate endpoints.
I have the following scenario. I am creating a backend for a cloud service for several online shops. Each shop has a set of products, to which customers can add reviews. I want to index static data (product information) separately from dynamic information (reviews) so I can improve performance.
How can I best separate these in Solr?
From the SolrCloud Documentation
Collection: A single search index.
Shard: A logical section of a single collection (also called
Slice). Sometimes people will talk about "Shard" in a physical sense
(a manifestation of a logical shard)
Replica: A physical manifestation of a logical Shard, implemented
as a single Lucene index on a SolrCore
Leader: One Replica of every Shard will be designated as a Leader to
coordinate indexing for that Shard
SolrCore: Encapsulates a single physical index. One or more make up
logical shards (or slices) which make up a collection.
Node: A single instance of Solr. A single Solr instance can have
multiple SolrCores that can be part of any number of collections.
Cluster: All of the nodes you are using to host SolrCores.
So basically a Collection (Logical group) has multiple cores (physical indexes).
Also, check the discussion
Core
In Solr, a core is composed of a set of configuration files, Lucene index files, and Solr’s
transaction log.
a Solr core is a
uniquely named, managed, and configured index running in a Solr server; a Solr server
can host one or more cores. A core is typically used to separate documents that have
different schemas
collection
Solr also uses the term collection, which only has meaning in the context
of a Solr cluster in which a single index is distributed across multiple servers.
SolrCloud introduces the concept of a collection, which extends the concept of a uniquely
named, managed, and configured index to one that is split into shards and distributed
across multiple servers.
As per my understanding:
In distributed search,
Collection is a logical index spread across multiple servers.
Core is that part of server which runs one collection.
In non-distributed search,
Single server running the Solr can have multiple collections and each of those collection is also a core. So collection and core are same if search is not distributed.
Summary
Collection per server is called a core.
Collection is same as an index.
One Solr server can have many cores.
Collection is a logical index (Example usage for multiple collections: Say two teams in same group are not big enough to justify a full Solr server of their own. But they also do not want to mix their data in a single index. They can then create separate collections/indexes which will keep their data separate).
It's better to use a separate Solr Cloud rather than create collections if the data for a collection is big enough (not sure, comments please?)
Single instance
On a single instance, Solr has something called a SolrCore that is essentially a single index. If you want multiple indexes, you create multiple SolrCores.
Solr Cloud
With SolrCloud, a single index can span multiple Solr instances. This means that a single index can be made up of multiple SolrCore's on different machines. We call all of these SolrCores that make up one logical index a collection.
A collection is essentially a single index that spans many SolrCores, both for index scaling and for redundancy. If you wanted to move your 2 SolrCore Solr setup to SolrCloud, you would have 2 collections, each made up of multiple individual SolrCores.
From Solr Wiki:
Collections are made up of one or more shards. Shards have one or
more replicas. Each replica is a core. A single collection represents
a single logical index.
This explains the use of cores and collections.
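The hierarchy in that quote can be written down as a small Python data model. This is only an illustration of the terminology: the core naming pattern, the node names and the round-robin placement are assumptions made for the sketch, not how Solr assigns replicas.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    core_name: str   # each replica is backed by exactly one core
    node: str        # the Solr instance hosting that core


@dataclass
class Shard:
    name: str
    replicas: list = field(default_factory=list)


@dataclass
class Collection:
    name: str
    shards: list = field(default_factory=list)

    def cores(self):
        """All cores that together make up this one logical index."""
        return [r.core_name for s in self.shards for r in s.replicas]


def build_collection(name, num_shards, replication_factor, nodes):
    """Lay out shards and replicas over nodes with naive round-robin."""
    collection = Collection(name)
    slot = 0
    for s in range(1, num_shards + 1):
        shard = Shard(f"shard{s}")
        for r in range(1, replication_factor + 1):
            node = nodes[slot % len(nodes)]
            # Hypothetical core naming pattern, for illustration only.
            shard.replicas.append(Replica(f"{name}_shard{s}_replica{r}", node))
            slot += 1
        collection.shards.append(shard)
    return collection


reviews = build_collection("reviews", 2, 2, ["node1", "node2"])
```

Here one collection called `reviews` is a single logical index, yet it is physically four cores: two shards, each with a replica on both nodes.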
Single instance
When dealing with a single Solr instance, you query cores.
The admin UI of a single Solr instance has no collection selector.
Solr Cloud
When dealing with Solr Cloud, you query collections.
The collections are organized into different cores (replicas, shards) on different Solr instances.
The admin UI of a Solr Cloud instance has a collection selector and a core selector. But here the cores are technically instances.
From the Solr docs:
Usage: solr create [-c name] [-d confdir] [-n configName] [-shards #]
[-replicationFactor #] [-p port] [-V]
Create a core or collection depending on whether Solr is running in
standalone (core) or SolrCloud mode (collection). In other words,
this action detects which mode Solr is running in, and then takes
the appropriate action (either create_core or create_collection).

Solr 4 Adding Shard to existing Cluster

Background: I just finished reading the Apache Solr 4 Cookbook. In it the author mentions that setting up shards needs to be done wisely b/c new ones cannot be added to an existing cluster. However, this was written using Solr 4.0 and at the present I am using 4.1. Is this still the case? I wish I hadn't found this issue and I'm hoping someone can tell me otherwise.
Question: Am I expected to know how much data I'll store in the future when setting up shards in a SolrCloud cluster?
I have played with Solandra and read up on elastic search, but quite honestly I am a fan of Solr as it is (and its large community!). I also like Zookeeper. Am I stuck for now or is there a workaround/patch?
Edit: If Question above is NO, could I build a SolrCloud with a bunch (maybe 100 or more) shards and let them grow (internally) and while I grow my data start peeling them off one by one and put them into larger, faster servers with more resources?
Yes, of course you can. You have to set up a new Solr server pointing to the same ZooKeeper instance. During bootstrap the server connects to the ZooKeeper ensemble and registers itself as a cluster member.
Once the registration process is complete, the server is ready to create new cores. You can create replicas of the existing shards using CoreAdmin. You can also create new shards, but existing documents will not be rebalanced onto them: because not all fields are stored in the Lucene index, Solr may not have all the document information it needs to rebalance the cluster. Only newly indexed or updated documents will reach this server, so doing this is not recommended.
When you set up your SolrCloud you have to size the cluster with your document growth in mind. For example, if you have 1M documents at first and the count grows by 10k docs/day, set up the cluster with 5 shards. At the start you host these shards on your initial two machines, and in the future, as needed, you can add new servers to the cluster and move shards onto them. Be careful not to overgrow your cluster because, in Lucene, a single 20GB index split across 5 shards won't be a 4GB index in every shard. Every shard will take about (single_index_size/num_shards)*1.1 (due to dictionary compression). This may change depending on your term frequency.
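The rule of thumb above is easy to check with a few lines of Python. The 1.1 factor is the answer's own estimate, not a measured constant, and the function name is made up for this sketch.

```python
def estimated_shard_size(single_index_size, num_shards, overhead=1.1):
    """Apply the rule of thumb (single_index_size / num_shards) * 1.1."""
    return (single_index_size / num_shards) * overhead


# The 20GB index split across 5 shards from the example above.
per_shard_gb = estimated_shard_size(20, 5)   # roughly 4.4, not 4
total_gb = per_shard_gb * 5                  # roughly 22, more than the original 20
```

So under this estimate, splitting costs about 10% extra disk overall, and the penalty is paid again on every replica of every shard.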
The last option you have is to add the new servers to the cluster and, instead of adding new shards/replicas to the existing server, set up a new, separate collection using your new shards and reindex into this new collection in parallel. Then, once the reindex process has finished, swap this collection with the old one.
One solution to the problem is to use the "implicit router" when creating your Collection.
Let's say you have to index all "Audit Trail" data of your application into Solr. New data gets added every day. You would most probably want to shard by year.
You could do something like the below during the initial setup of your collection:
/admin/collections?
action=CREATE&
name=AuditTrailIndex&
router.name=implicit&
shards=2010,2011,2012,2013,2014&
router.field=year
The above command:
a) Creates 5 shards - one each for the current and the last 4 years 2010,2011,2012,2013,2014
b) Routes data to the correct shard based on the value of the "year" field (specified as router.field)
In December 2014, you might add a new shard in preparation for 2015 using the CREATESHARD API (part of the Collections API) - Do something like:
/admin/collections?
action=CREATESHARD&
shard=2015&
collection=AuditTrailIndex
The above command creates a new shard on the same collection.
When it's 2015, all data will automatically get indexed into the "2015" shard, assuming your data has the "year" field populated correctly with 2015.
In 2015, if you think you don't need the 2010 shard (based on your data retention requirements), you could always use the DELETESHARD API to remove it:
/admin/collections?
action=DELETESHARD&
shard=2010&
collection=AuditTrailIndex
P.S. This solution only works if you used the "implicit router" when creating your collection. It does NOT work when you use the default "compositeId router", i.e. collections created with the numShards parameter.
This feature is truly a game changer - allows shards to be added dynamically based on growing demands of your business.
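For anyone scripting this yearly rotation, here is a small Python sketch that assembles the three requests described above. The base URL is a placeholder and the helper name is made up; the code only builds the URLs and does not call Solr.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr"   # placeholder host


def collections_api_url(base_url, action, params):
    """Build a Collections API URL for the given action (no request is sent)."""
    query = urlencode({"action": action, **params}, safe=",")
    return f"{base_url}/admin/collections?{query}"


# Initial setup: one shard per year, routed by the "year" field.
create = collections_api_url(BASE, "CREATE", {
    "name": "AuditTrailIndex",
    "router.name": "implicit",
    "shards": "2010,2011,2012,2013,2014",
    "router.field": "year",
})

# Ahead of the new year: add a shard for 2015.
add_2015 = collections_api_url(BASE, "CREATESHARD", {
    "shard": "2015",
    "collection": "AuditTrailIndex",
})

# Once the oldest data is past retention: drop the 2010 shard.
drop_2010 = collections_api_url(BASE, "DELETESHARD", {
    "shard": "2010",
    "collection": "AuditTrailIndex",
})
```

Keeping the add and drop steps as separate calls makes it easy to run them on different schedules, for example adding the next year's shard in December and dropping the oldest one only after the retention period has passed.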

Resources