Multiple Solr shards on a single machine with a master shard exposed to the outside

Though I am new to distributed search, I have a fair idea of sharding and SolrCloud.
Technically, when we have a bigger index, we split it into multiple shards for faster distributed search (not counting the other SolrCloud benefits).
I have a huge index (with a single schema), but there is a logical separation of the data, which I call buckets. Each bucket has its own updates, deletions, and additions.
1st Approach:
So technically I can create N shards, one per bucket. But this will lead to too many servers and will slow down the search, because the results from all of them have to be merged.
2nd Approach:
To reduce the distribution, I can logically combine these buckets into a smaller number of parent buckets, which reduces the overall shard count. But then I will lose some of the advantages of having individual buckets.
3rd Approach:
I was wondering whether I can create something like a shard of shards. Each machine would have one parent bucket as a shard (call it the parent shard), which in turn would have multiple child shards hosted on the same Solr instance on the same machine. This would be like multiple cores in a single Solr instance with the same schema. In this scenario the parent shard would run the query on each child shard in parallel and merge the records they return. As everything is on the same Solr instance, I believe it will be faster. I want the parent shard to be empty and to handle only the merging of results. Is this possible, and if so, will the performance match the 2nd approach? Can someone give me some idea of how to implement it? I am fine with customizing Solr for this implementation.
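The merge step that such a parent shard would perform can be sketched as follows. This is a minimal illustration in Python, not Solr code: `query_child` is a hypothetical stand-in for a per-core search, the bucket names and scores are made-up data, and each child is assumed to return its hits already sorted by descending score.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Hypothetical in-memory "child shards": each holds (score, doc_id) hits
# already sorted by descending score, as a per-core search would return them.
CHILD_SHARDS = {
    "bucket_a": [(9.1, "a1"), (4.0, "a2"), (1.2, "a3")],
    "bucket_b": [(7.5, "b1"), (6.3, "b2")],
    "bucket_c": [(8.8, "c1"), (0.5, "c2")],
}

def query_child(name, rows):
    """Stand-in for searching one child core; returns its top `rows` hits."""
    return CHILD_SHARDS[name][:rows]

def query_parent(rows):
    """Query every child in parallel, then merge the sorted partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda name: query_child(name, rows), CHILD_SHARDS))
    # Every partial list is sorted by descending score, so a k-way merge
    # produces the global order without re-sorting all hits.
    merged = heapq.merge(*partials, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, rows))

top3 = query_parent(3)
print(top3)
```

Note that every child has to return its own top `rows` hits, because the global top results could all come from a single child. The merge work therefore grows with the number of children, which is the cost described under the 1st approach.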

Related

mongodb sharding Collection(?)

Is it possible to have different collections in sharded machines? For example the nodeA to have the collection: rolesA, nodeB the collection rolesB and so on, but the mongo router to handle transparently for my code. In other words, can I query rolesA without needing to know that it is stored on nodeA?
The reason I am asking is that I have a large collection of around 100GB, so I have two options:
To shard the large collection let's say to 5 nodes
To split the large collection into 10K collections on a single node
My queries are $sum aggregations. Right now I am using the second approach, so every aggregation query uses exactly one collection. The performance so far is good (< 1 second), but in a production environment the CPU is always at 100%.
With approach 1 the load will be balanced, but I am worried about the response time.
Is it possible to have different collections in sharded machines?
In a sharded cluster a database can have a number of sharded and unsharded collections. You can decide which collections need to be sharded (that is, have their data distributed across shards). A collection is sharded based on its shard key. The shard key determines how the collection's data is distributed across the shards, and it also matters for the performance of read operations.
For example the nodeA to have the collection: rolesA, nodeB the collection rolesB and so on..
You cannot specify that a specific collection's data be stored on a specific shard. Sharding is about distributing a collection's data among multiple shards.
...the mongo router to handle transparently for my code.
A mongos (the router) receives all the queries in a sharded cluster before routing to specific shard(s). When you submit a query, it is like submitting a query to a standalone server, the syntax is same; it is transparent to the application (or code). The router determines which shards the query must visit to get your data.
If your query's filter criteria include the shard key, the router will determine the specific shard(s) to target. Without the shard key in the filter criteria, the query will be routed to all shards in the cluster.
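The difference between a targeted and a broadcast query can be sketched in a few lines of Python. This is only a toy model of the routing decision, not MongoDB's implementation: the shard key `role_id`, the chunk boundaries, and the shard names are all hypothetical, and only equality matches on non-negative key values are handled.

```python
import bisect

# Hypothetical range-based chunks on the shard key "role_id": chunk i covers
# [CHUNK_LOWER_BOUNDS[i], CHUNK_LOWER_BOUNDS[i + 1]) and lives on CHUNK_OWNERS[i].
SHARD_KEY = "role_id"
CHUNK_LOWER_BOUNDS = [0, 1000, 2000, 3000]
CHUNK_OWNERS = ["shardA", "shardB", "shardC", "shardA"]

def shards_for(query_filter):
    """Return the shards a router would have to contact for this filter."""
    if SHARD_KEY in query_filter:
        # Targeted query: find the chunk whose range contains the key value
        # (assumes the value is >= the first lower bound).
        idx = bisect.bisect_right(CHUNK_LOWER_BOUNDS, query_filter[SHARD_KEY]) - 1
        return [CHUNK_OWNERS[idx]]
    # No shard key in the filter: the query goes to every shard.
    return sorted(set(CHUNK_OWNERS))

targeted = shards_for({"role_id": 1500})
broadcast = shards_for({"name": "admin"})
print(targeted, broadcast)
```

For the aggregation workload in the question, this suggests that response time on a sharded collection depends heavily on whether each query's filter contains the shard key.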
To shard the large collection let's say to 5 nodes
You can distribute a large collection's data across a cluster with 5 shards (or more as the data grows; this is called horizontal scaling).
Finally, I tried to answer some of your questions about sharding. Also, please browse the referenced links below for complete information.
References:
Advantages of Sharding
Considerations Before Sharding
Sharding Strategy
FAQ: Sharding with MongoDB

Understanding some concepts of Apache Solr

I am new to Apache Solr. Can someone please explain the meaning of the following terms, with examples:
Solr Core
Solr Collection
Logical vs Physical index
Sharding
I went through various blog posts, but I am still not able to understand them.
The terminology is used a bit haphazardly, so you'll probably find texts that use a few of these terms interchangeably.
Solr core
A core is a named set of documents living on a single server. A server can have many cores. The core can be replicated to other servers (this is "old style" replication when done manually).
Solr Collection
A collection is a set of cores, from one to .. many. It's a logical description of "these cores together form the entire collection". This was introduced with SolrCloud, as that's the first time that Solr handles clustering for you.
Logical vs Physical
A collection is a logical index - it can span many cores. Each core is a physical index (it has the actual index files from Lucene on its disk). You interact with the collection as you'd interact with the core, and all the details of clustering are (usually) hidden from you by Solr (in SolrCloud mode).
Sharding
Since a collection can span many cores, sharding means that the documents that make up a single collection are present in many cores. Each core is a "shard" of the total index. Compare this to replication, where a copy of a core is distributed to many Solr instances (the same documents are present in both cores, while when sharding the documents are just present in one core and you need all cores to have a complete collection).
Sharding is what makes it possible to store more documents than a single server can handle (or keep in memory/cache to respond quickly enough).
SolrCloud (Added by me to make this all come together)
Previously (and still, if you're not using SolrCloud mode) sharding and replication were handled manually by the user when querying and configuring Solr. You set up replication to spread the same core across many servers, and you used sharding to make Solr query many Solr instances to get all the required documents. Today you'll usually just use SolrCloud and let Solr abstract away all these details. You'll come across these terms when creating a collection (numShards and replicationFactor), which tell Solr how many cores you want to spread the collection across, and how many servers should hold copies of these cores.
Collection -> Sharded across [1..N] cores, replicated [0..M] times for redundancy and higher query throughput.
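As a rough illustration of how numShards and replicationFactor come together, the following Python sketch builds a Collections API CREATE request URL and computes how many cores the collection would end up with. The base URL and the collection name `products` are assumptions for the example, and the core count assumes that replicationFactor counts every copy of a shard, the leader included.

```python
from urllib.parse import urlencode

def create_collection_url(base_url, name, num_shards, replication_factor):
    """Build a Collections API CREATE request URL (the request is not sent)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"

def total_cores(num_shards, replication_factor):
    """Each shard exists replication_factor times, and each copy is one core."""
    return num_shards * replication_factor

# Hypothetical host and collection name, for illustration only.
url = create_collection_url("http://localhost:8983/solr", "products", 2, 3)
cores = total_cores(2, 3)
print(url)
print(cores)
```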

SolrCloud Indexing/Querying without a Smart-Client

I'm having a bit of trouble understanding exactly how indexing and querying would work if I don't have a smart-client available. I'm using SolrNet with C#, which currently doesn't integrate with ZooKeeper.
As a basic example, let's say I have a single collection, split into two shards, replicated across two separate nodes/servers, and I have a standard HTTP load-balancer in front of the servers (a scenario mentioned here). If I use the standard compositeId router, I believe that indexing would work without issue and be replicated to both nodes by ZooKeeper behind the scenes. I wouldn't need to worry about which node received the "update" command -- ZooKeeper would handle document routing and replication automatically.
However, in this same scenario, would ZooKeeper handle query routing behind the scenes correctly? Given that I'm using built-in sharding and not custom sharding, would a query request to the load-balancer get routed to the correct shard, or would I have to include all known shards in the "shards" parameter (see here) to make sure I don't miss anything? Obviously this would be onerous to maintain as the number of shards grows.
It seems like custom sharding would provide the greatest efficiency across indexing and querying, although then you run the risk of wildly unequal shard sizes. Any thoughts on these matters would be appreciated.
Let's take the example of a two-shard collection, with each shard on a separate node/server.
10.x.x.100:8983/solr/ --> shard 1 / node 1
10.x.x.101:8983/solr/ --> shard 2 / node 2
Using default routing, you indexed 100 documents, which were split between these two servers. For the sake of the example, assume each now holds 50 documents.
If you query either of the two servers, Solr will search both shards by default. You do not need to specify anything in the shards parameter.
So
10.x.x.100:8983/solr/collection/select?q=solr rocks
will run this same query on 10.x.x.101:8983/solr/ also and the results returned will be a combination of results from both shards, sorted and ranked by score.
The &shards parameter comes into the picture when you know which "group" of data is in which shard. Continuing the example above, suppose you have custom routing enabled and you use the field "city" to route the documents. For the sake of the example, let's assume there can be only two values for the "city" field. Your documents will be routed to one of the shards based on this field.
On your application side, if you want to query specifically for documents belonging to one city, you can specify the &shards parameter, and all the results for the query will come only from that shard.
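The two kinds of request described above can be sketched as URLs built in Python. This is only an illustration: the host, the collection name, and the shard name `shard1` are assumptions, and it relies on SolrCloud accepting logical shard names in the shards parameter. Note that the query string has to be URL-encoded, so the space in "solr rocks" becomes a plus sign.

```python
from urllib.parse import urlencode

def select_url(base_url, collection, query, shards=None):
    """Build a /select URL; restrict it to the given shards if any are named."""
    params = {"q": query}
    if shards:
        params["shards"] = ",".join(shards)
    return f"{base_url}/{collection}/select?{urlencode(params)}"

# Hypothetical host, collection, and shard names, for illustration only.
BASE = "http://localhost:8983/solr"

# Default: no shards parameter, so Solr searches every shard of the collection.
all_shards_url = select_url(BASE, "collection", "solr rocks")

# Restricted: only the shard known to hold the documents for one city.
one_shard_url = select_url(BASE, "collection", "solr rocks", shards=["shard1"])

print(all_shards_url)
print(one_shard_url)
```

Either URL can be sent to any node behind the load balancer, since the node that receives the request forwards it to the shards involved.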

Keeping index optimized / merged in SolrCloud

With the master-slave implementation of distributed Solr (prior to Solr 4.x), it was a straightforward design to have a master that takes the load of indexing, merging, and optimizing the index. The index then gets copied to the replicas, which meanwhile keep serving searches.
Could someone explain how this is done now with SolrCloud?
It seems that SolrCloud sends indexing commands from the leader to each replica. But how is search performance maintained then? Indexing and searching on each replica puts load on every node (to index and to run merge threads in the background), and since my index is quite big, it usually takes a lot of time to merge segments or simply to optimize.
Should I leave all of that to the merge policy now and not worry at all? Does TieredMergePolicy provide both good search performance and low resource load (CPU, I/O) at the same time?
I'll try to answer part of your question: SolrCloud does indeed index on all nodes, and therefore indexing has a performance impact on the replicas. This is due to the 'hot replication' model, as opposed to the 'cold replication' you are used to. It is meant to solve data integrity issues as well as to enable real-time search on a cluster. You get consistent data and faster data availability at the price of a performance impact. You can always split the data into shards (at the price of additional hardware) and get comparable performance.
In either case, it's up to you to decide whether SolrCloud suits your needs. You can use Solr 4 without the cloud model and manage it yourself as before.

Solr Collection vs Cores

I struggle with understanding the difference between collections and cores. If I understand it correctly, cores are multiple indexes. Collection consists of cores, so essentially they share the same logic in separation, i.e. separate cores and collections have separate end-points.
I have the following scenario. I am building a backend for a cloud service for several online shops. Each shop has a set of products, to which customers can add reviews. I want to index the static data (product information) separately from the dynamic data (reviews) so that I can improve performance.
How can I best separate them in Solr?
From the SolrCloud documentation:
Collection: A single search index.
Shard: A logical section of a single collection (also called Slice). Sometimes people will talk about "Shard" in a physical sense (a manifestation of a logical shard).
Replica: A physical manifestation of a logical Shard, implemented as a single Lucene index on a SolrCore.
Leader: One Replica of every Shard will be designated as a Leader to coordinate indexing for that Shard.
SolrCore: Encapsulates a single physical index. One or more make up logical shards (or slices) which make up a collection.
Node: A single instance of Solr. A single Solr instance can have multiple SolrCores that can be part of any number of collections.
Cluster: All of the nodes you are using to host SolrCores.
So basically a Collection (Logical group) has multiple cores (physical indexes).
Also, check the discussion
Core
In Solr, a core is composed of a set of configuration files, Lucene index files, and Solr’s transaction log.
A Solr core is a uniquely named, managed, and configured index running in a Solr server; a Solr server can host one or more cores. A core is typically used to separate documents that have different schemas.
Collection
Solr also uses the term collection, which only has meaning in the context of a Solr cluster in which a single index is distributed across multiple servers.
SolrCloud introduces the concept of a collection, which extends the concept of a uniquely named, managed, and configured index to one that is split into shards and distributed across multiple servers.
As per my understanding:
In distributed search,
Collection is a logical index spread across multiple servers.
Core is that part of server which runs one collection.
In non-distributed search,
Single server running the Solr can have multiple collections and each of those collection is also a core. So collection and core are same if search is not distributed.
Summary
Collection per server is called a core.
Collection is same as an index.
One Solr server can have many cores.
Collection is a logical index (Example usage for multiple collections: Say two teams in same group are not big enough to justify a full Solr server of their own. But they also do not want to mix their data in a single index. They can then create separate collections/indexes which will keep their data separate).
It's better to use a separate SolrCloud cluster rather than create more collections if the data for a collection is big enough (not sure, comments please?)
Single instance
On a single instance, Solr has something called a SolrCore that is essentially a single index. If you want multiple indexes, you create multiple SolrCores.
Solr Cloud
With SolrCloud, a single index can span multiple Solr instances. This means that a single index can be made up of multiple SolrCores on different machines. We call all of these SolrCores that make up one logical index a collection.
A collection is essentially a single index that spans many SolrCores, both for index scaling and for redundancy. If you wanted to move your 2-SolrCore Solr setup to SolrCloud, you would have 2 collections, each made up of multiple individual SolrCores.
From Solr Wiki:
Collections are made up of one or more shards. Shards have one or more replicas. Each replica is a core. A single collection represents a single logical index.
This explains the use of cores and collections.
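The hierarchy in that quote (collection, shards, replicas, cores) can be modelled with a few Python dataclasses. This is only a sketch to make the relationships concrete: the class names, node names, and core names below are hypothetical and do not come from Solr's code or its naming rules.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    core_name: str        # each replica is one core (a physical Lucene index)
    node: str             # the Solr instance hosting this core
    leader: bool = False  # one replica per shard coordinates indexing

@dataclass
class Shard:
    name: str
    replicas: list = field(default_factory=list)

@dataclass
class Collection:
    name: str             # the single logical index that clients query
    shards: list = field(default_factory=list)

    def cores(self):
        """All physical cores that together make up this collection."""
        return [r.core_name for s in self.shards for r in s.replicas]

    def cores_on(self, node):
        """Cores of this collection hosted on one Solr instance."""
        return [r.core_name for s in self.shards for r in s.replicas if r.node == node]

# Hypothetical layout: 2 shards, each with 2 replicas, spread over 2 nodes.
products = Collection("products", [
    Shard("shard1", [Replica("products_s1_r1", "node1", leader=True),
                     Replica("products_s1_r2", "node2")]),
    Shard("shard2", [Replica("products_s2_r1", "node2", leader=True),
                     Replica("products_s2_r2", "node1")]),
])

print(products.cores())
print(products.cores_on("node1"))
```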
Single instance
When dealing with a single Solr instance, you query cores.
The admin UI of a single Solr instance has no collection selector.
Solr Cloud
When dealing with Solr Cloud, you query collections.
The collections are organized into different cores (replicas, shards) on different Solr instances.
The admin UI of a Solr Cloud instance has both a collection selector and a core selector. But cores are technically instances here.
From the Solr docs:
Usage: solr create [-c name] [-d confdir] [-n configName] [-shards #]
[-replicationFactor #] [-p port] [-V]
Create a core or collection depending on whether Solr is running in
standalone (core) or SolrCloud mode (collection). In other words,
this action detects which mode Solr is running in, and then takes
the appropriate action (either create_core or create_collection).
