SolrCloud Indexing/Querying without a Smart-Client - solr

I'm having a bit of trouble understanding exactly how indexing and querying would work if I don't have a smart-client available. I'm using SolrNet with C#, which currently doesn't integrate with ZooKeeper.
As a basic example, let's say I have a single collection, split into two shards, replicated across two separate nodes/servers, and I have a standard HTTP load-balancer in front of the servers (a scenario mentioned here). If I use the standard compositeId router, I believe that indexing would work without issue and be replicated to both nodes by ZooKeeper behind the scenes. I wouldn't need to worry about which node received the "update" command -- ZooKeeper would handle document routing and replication automatically.
However, in this same scenario, would ZooKeeper handle query routing behind the scenes correctly? Given that I'm using built-in sharding and not custom sharding, would a query request to the load-balancer get routed to the correct shard, or would I have to include all known shards in the "shards" parameter (see here) to make sure I don't miss anything? Obviously this would be onerous to maintain as the number of shards grows.
It seems like custom sharding would provide the greatest efficiency across indexing and querying, although then you run the risk of wildly unequal shard sizes. Any thoughts on these matters would be appreciated.

Let's take the example of a two-shard collection, with each shard on a separate node/server.
10.x.x.100:8983/solr/ --> shard 1 / node 1
10.x.x.101:8983/solr/ --> shard 2 / node 2
Using default routing, you indexed 100 documents, which were split across these two servers, so each now holds roughly 50 documents.
If you query either of the two servers for documents, Solr will search both shards by default. You do not need to specify anything in the shards parameter.
So
10.x.x.100:8983/solr/collection/select?q=solr rocks
will run the same query on 10.x.x.101:8983/solr/ as well, and the results returned will be a combination of results from both shards, sorted and ranked by score.
The &shards parameter comes into the picture when you know which "group" of data is in which shard. For example, building on the setup above, suppose you have custom routing enabled and you use the field "city" to route the documents. For the sake of example, let's assume the "city" field can have only two values. Your documents will be routed to one of the shards based on this field.
On the application side, if you want to query specifically for documents belonging to one city, you can specify the &shards parameter, and all the results for the query will come only from that shard.
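To make the two cases concrete, here is a minimal sketch that only builds the request URLs (it sends nothing). The helper name select_url, the shard name shard1 and the city value are assumptions made for illustration; the host and collection come from the example above.

```python
from urllib.parse import urlencode

def select_url(base, collection, query, shards=None):
    """Build a /select URL; `shards` optionally restricts which shards are searched."""
    params = {"q": query}
    if shards:
        # The shards parameter takes a comma-separated list.
        params["shards"] = ",".join(shards)
    return f"{base}/{collection}/select?{urlencode(params)}"

BASE = "http://10.x.x.100:8983/solr"

# No shards parameter: the receiving node searches every shard of the collection.
all_shards = select_url(BASE, "collection", "solr rocks")

# Restrict the search to one shard (the shard name is hypothetical).
one_shard = select_url(BASE, "collection", "city:paris", shards=["shard1"])

print(all_shards)
print(one_shard)
```

The first URL is what a load balancer can forward to any node; the second is only useful when the application already knows which shard holds the data it wants.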

Related

Load balancing and indexing in SolrCloud

I have some questions regarding SolrCloud:
1. If I send a request directly to a Solr node that belongs to a Solr cluster, does it delegate the query to the ZooKeeper ensemble to handle it?
2. I want to have a single URL to send requests to SolrCloud. Is there a better way of achieving this than setting up an external load balancer that balances directly between the individual Solr nodes? If 1 isn't true, this approach seems like a bad idea. On top of that, I feel like it would somewhat defeat the purpose of the ZooKeeper ensemble.
3. There is an option to break up a collection into shards. If I do so, how exactly does SolrCloud decide which document goes to which shard? Is there a need and/or an option to configure this process?
4. What happens if I send a collection of documents directly to one of the Solr nodes? Would the data set somehow distribute itself across the shards evenly? If so, how does it happen?
Thanks a lot!
1. ZooKeeper "just" keeps configuration data available for all nodes, i.e. the state of the cluster, etc. It does not get any queries "delegated" to it; it's just a way for Solr nodes and clients to know which collections are handled by which nodes in the cluster, and to have that information stored in a resilient and available manner (i.e. delegating the hard part of managing a cluster to ZooKeeper).
2. The best option is to use a cloud-aware Solr client. It will connect to any of the available ZooKeeper nodes given in its configuration, retrieve the cluster state and connect directly to one of the nodes that has the information it needs (i.e. the collection it needs to query). If you can't do that, you can either load balance with an external load balancer across all nodes in your cluster, or let the client load balance if the client you use supports round robin, etc. An external load balancer gives you other gains (such as being able to remove a node from load balancing for all clients at the same time, having dedicated HTTP caching in front of the nodes, etc.) for a bit more administration.
3. It will use the unique id field to decide which node a given document should be routed to. You don't have to configure anything, but you can tell Solr to use a specific field or a specific prefix of a field, etc. as the route key. See Document Routing for specific information. This allows you to make sure that all documents that belong to a specific client/application are placed on the same node (which is important for some calculations and possible operations).
4. It gets routed to the correct node. Whether the distribution is even depends on your routing key, but by default it'll be about as even as you can get it.
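As a rough illustration of hash-based routing, here is a simplified sketch. It is not Solr's router: Solr hashes the id with MurmurHash3 and assigns hash ranges to shards, whereas this stand-in uses MD5 modulo the shard count. The function name route and the "prefix!id" handling are illustrative assumptions that only mimic the idea of a route-key prefix.

```python
import hashlib
from collections import Counter

def route(doc_id: str, num_shards: int) -> int:
    """Pick a shard index from a document ID (stand-in for Solr's router)."""
    # For "prefix!id" style keys, hash only the prefix so related documents co-locate.
    key = doc_id.split("!", 1)[0] if "!" in doc_id else doc_id
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

# Plain IDs spread roughly evenly over the shards.
counts = Counter(route(f"doc-{i}", 2) for i in range(1000))
print(counts)

# IDs sharing a prefix always land on the same shard.
print(route("shopA!1", 2) == route("shopA!999", 2))
```

Because the shard is a pure function of the ID, any node that receives an update can work out where the document belongs and forward it there.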

mongodb sharding Collection(?)

Is it possible to have different collections on sharded machines? For example, nodeA would hold the collection rolesA, nodeB the collection rolesB, and so on, but the mongo router would handle this transparently for my code. In other words, can I query rolesA without needing to know that it's stored on nodeA?
The reason I am asking is that I have a large collection of around 100GB, so I have two options:
1. To shard the large collection across, let's say, 5 nodes
2. To split the large collection into 10K collections on a single node
My queries are $sum aggregations. Right now I am using the second approach, so every aggregation query uses exactly one collection. The performance is good (under 1 second), but in a production environment the CPU is always at 100%.
With approach 1, the load will be balanced, but I am worried about the response time.
Is it possible to have different collections in sharded machines?
In a sharded cluster, a database can have a number of sharded and unsharded collections. You can decide which collections need to be sharded (that is, have their data distributed across shards). A collection is sharded based on the shard key. The shard key determines how the collection data will be distributed across the shards. The shard key is also important for read performance.
For example the nodeA to have the collection: rolesA, nodeB the
collection rolesB and so on..
You cannot specify that a specific collection's data be stored on a specific shard. Sharding is about distributing a collection's data among multiple shards.
...the mongo router to handle transparently for my code.
A mongos (the router) receives all the queries in a sharded cluster before routing them to specific shard(s). When you submit a query, it is like submitting a query to a standalone server; the syntax is the same, and it is transparent to the application (or code). The router determines which shards the query must visit to get your data.
If your query's filter criteria include the shard key, the router will determine the specific shard(s) to target; without the shard key in the filter criteria, the query will be routed to all shards in the cluster.
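The difference between a targeted query and a broadcast one can be sketched with a toy router. This is not how mongos is implemented; the shard names, the shard key roleId, the chunk boundaries and the function target_shards are all hypothetical, and only equality filters on the shard key are handled.

```python
import bisect

# Hypothetical range-based layout for the shard key "roleId":
# shardA holds values < 1000, shardB holds 1000..1999, shardC holds >= 2000.
BOUNDS = [1000, 2000]
SHARDS = ["shardA", "shardB", "shardC"]

def target_shards(query_filter: dict, shard_key: str = "roleId") -> list:
    """Return the shards a query must visit (toy model of router targeting)."""
    if shard_key in query_filter:
        # The filter names the shard key, so exactly one shard is targeted.
        value = query_filter[shard_key]
        return [SHARDS[bisect.bisect_right(BOUNDS, value)]]
    # No shard key in the filter: the query is sent to every shard.
    return list(SHARDS)

print(target_shards({"roleId": 1500}))
print(target_shards({"name": "admin"}))
```

For aggregation-heavy workloads, filters that include the shard key keep each query on one shard, which is closest to the one-collection-per-query behaviour described in the question.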
To shard the large collection let's say to 5 nodes
You can distribute a large collection's data across a cluster with 5 shards (or more as the data grows; this is called horizontal scaling).
Finally, I tried to answer some of your questions about sharding. Also, please browse the referenced links below for complete information.
References:
Advantages of Sharding
Considerations Before Sharding
Sharding Strategy
FAQ: Sharding with MongoDB

Multiple solr Shard on single Machine with a master shard exposed to outside

Though I am new to distributed search, I have a fair idea of sharding and SolrCloud.
Technically, when we have a bigger index, we split it into multiple shards for faster distributed search (not counting the SolrCloud benefits).
I have a huge index (with a single schema, obviously), but there is a logical separation of the data, which I call buckets. Each of these buckets has its own updates, deletions and additions.
1st Approach:
So technically I can create N shards, one per bucket. But this will lead to too many servers and will slow down the search, because the results have to be merged.
2nd Approach:
To reduce the distribution, I can logically combine these buckets into a smaller number of parent buckets, which will then reduce the overall shard count. But then I will lose some of the advantages of having individual buckets.
3rd Approach:
I was wondering whether I can create something like a shard of shards. On each machine I would have one parent bucket as a shard (call it the parent shard), which in turn would have multiple child shards hosted on the same Solr instance on the same machine. This would be like multiple cores in a single Solr instance with the same schema. In this scenario the parent shard would execute queries on the child shards in parallel and merge the results. As everything is on the same Solr instance, I believe it would be faster. I want the parent shard to be empty and to handle only the merging of results. Is this possible, and if so, will the performance match the 2nd approach? Can someone give me some idea of how to implement it? I am fine with customizing Solr for this implementation.

Solr Collection vs Cores

I struggle with understanding the difference between collections and cores. If I understand it correctly, cores are multiple indexes. A collection consists of cores, so essentially they share the same logic of separation, i.e. separate cores and collections have separate endpoints.
I have the following scenario. I am creating a backend for a cloud service for several online shops. Each shop has a set of products, to which customers can add reviews. I want to index static data (product information) separately from dynamic data (reviews) so I can improve performance.
How can I best separate these in Solr?
From the SolrCloud Documentation
Collection: A single search index.
Shard: A logical section of a single collection (also called Slice). Sometimes people will talk about "Shard" in a physical sense (a manifestation of a logical shard)
Replica: A physical manifestation of a logical Shard, implemented as a single Lucene index on a SolrCore
Leader: One Replica of every Shard will be designated as a Leader to coordinate indexing for that Shard
SolrCore: Encapsulates a single physical index. One or more make up logical shards (or slices) which make up a collection.
Node: A single instance of Solr. A single Solr instance can have multiple SolrCores that can be part of any number of collections.
Cluster: All of the nodes you are using to host SolrCores.
So basically a Collection (Logical group) has multiple cores (physical indexes).
Also, check the discussion
Core
In Solr, a core is composed of a set of configuration files, Lucene index files, and Solr’s transaction log.
A Solr core is a uniquely named, managed, and configured index running in a Solr server; a Solr server can host one or more cores. A core is typically used to separate documents that have different schemas.
Collection
Solr also uses the term collection, which only has meaning in the context of a Solr cluster in which a single index is distributed across multiple servers.
SolrCloud introduces the concept of a collection, which extends the concept of a uniquely named, managed, and configured index to one that is split into shards and distributed across multiple servers.
As per my understanding:
In distributed search,
A collection is a logical index spread across multiple servers.
A core is the part of a server that runs one collection.
In non-distributed search,
A single server running Solr can have multiple collections, and each of those collections is also a core. So collection and core are the same if search is not distributed.
Summary
Collection per server is called a core.
Collection is same as an index.
One Solr server can have many cores.
Collection is a logical index (Example usage for multiple collections: say two teams in the same group are not big enough to justify a full Solr server of their own, but they also do not want to mix their data in a single index. They can then create separate collections/indexes, which will keep their data separate).
It's better to use a separate SolrCloud cluster rather than create collections if the data for a collection is big enough (not sure, comments please?)
Single instance
On a single instance, Solr has something called a SolrCore that is essentially a single index. If you want multiple indexes, you create multiple SolrCores.
Solr Cloud
With SolrCloud, a single index can span multiple Solr instances. This means that a single index can be made up of multiple SolrCores on different machines. We call all of these SolrCores that make up one logical index a collection.
A collection is essentially a single index that spans many SolrCores, both for index scaling and for redundancy. If you wanted to move your 2-SolrCore Solr setup to SolrCloud, you would have 2 collections, each made up of multiple individual SolrCores.
From Solr Wiki:
Collections are made up of one or more shards. Shards have one or
more replicas. Each replica is a core. A single collection represents
a single logical index.
This explains the use of cores and collections.
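The hierarchy in that quote can be written down as a small data model. This is only an illustrative sketch, not a Solr API: the class names, the node names and the core naming pattern are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    core_name: str        # each replica is backed by exactly one core
    node: str             # the Solr instance hosting that core
    leader: bool = False  # one replica per shard coordinates indexing

@dataclass
class Shard:
    name: str
    replicas: list = field(default_factory=list)

@dataclass
class Collection:
    name: str
    shards: list = field(default_factory=list)

    def cores(self):
        """All physical cores that together form this one logical index."""
        return [r.core_name for s in self.shards for r in s.replicas]

# A "products" collection with 2 shards and 2 replicas each (names are made up).
products = Collection("products", [
    Shard(f"shard{s}", [
        Replica(f"products_shard{s}_replica{r}", f"node{r}", leader=(r == 1))
        for r in (1, 2)
    ])
    for s in (1, 2)
])

print(products.cores())
```

In this model, the scenario from the question would be two collections (one for products, one for reviews), each with its own set of cores.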
Single instance
When dealing with a single Solr instance, you query cores.
The admin UI of a single Solr instance has no collection selector.
Solr Cloud
When dealing with Solr Cloud, you query collections.
The collections are organized into different cores (replicas, shards) on different Solr instances.
The admin UI of a Solr Cloud instance has both a collection selector and a core selector, though the cores are technically instances.
From the Solr docs:
Usage: solr create [-c name] [-d confdir] [-n configName] [-shards #]
[-replicationFactor #] [-p port] [-V]
Create a core or collection depending on whether Solr is running in
standalone (core) or SolrCloud mode (collection). In other words,
this action detects which mode Solr is running in, and then takes
the appropriate action (either create_core or create_collection).

Handling large number of ids in Solr

I need to perform an online-user search in Solr, i.e. a user needs to find the list of users who are online and match particular criteria.
How I am handling this: we store the IDs of online users in a table, and I send all the online user IDs in the Solr request, like
&fq=-id:(id1 id2 id3 ............id5000)
The problem with this approach is that when the list of IDs becomes large, Solr takes too much time to resolve it, and we need to transfer a large request over the network.
One solution could be to use a join in Solr, but the online data changes regularly and I can't reindex that often (say every 5-10 minutes; the interval should be at least an hour).
The other solution I can think of is firing this query internally from Solr based on a certain parameter in the URL. I don't know much about Solr internals, so I don't know how to proceed.
With Solr 4's soft commits, committing has become cheap enough that it might be feasible to store the "online" flag directly in the user record and just add &fq=online:true to your query. That removes the overhead of sending 5000 IDs over the wire and parsing them, and lets Solr optimize the query a bit. Whenever someone logs in or out, set their status and set commitWithin on the update. It's worth a shot, anyway.
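A status change like that could be sent as an atomic update with commitWithin. The sketch below only builds the URL and JSON body and sends nothing. The core name users, the base URL and the helper online_update are assumptions; the field name online comes from the answer above. Atomic updates also depend on the schema and update log being set up for them, which is not shown here.

```python
import json
from urllib.parse import urlencode

def online_update(base, core, user_id, online, commit_within_ms=1000):
    """Build the URL and JSON body for flipping one user's online flag."""
    # commitWithin asks Solr to make the change visible within the given time.
    url = f"{base}/{core}/update?{urlencode({'commitWithin': commit_within_ms})}"
    # {"set": value} is Solr's atomic-update syntax for replacing a field value.
    body = json.dumps([{"id": user_id, "online": {"set": online}}])
    return url, body

url, body = online_update("http://localhost:8983/solr", "users", "user42", True)
print(url)
print(body)
```

The body would be POSTed with Content-Type application/json whenever a user logs in or out; searches then only need fq=online:true.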
We worked around this issue by implementing sharding of the data.
Basically, without going heavily into code detail:
1. Write your own indexing code
   use consistent hashing to decide which ID goes to which Solr server
   index each user's data to the relevant shard (it can be several machines)
   make sure you have redundancy
2. Query Solr shards
   do sharded queries in Solr using the shards parameter
   start an EmbeddedSolr and use it to do a sharded query
   Solr will query all the shards and merge the results; it also provides timeouts if you need to limit the query time for each shard
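The consistent-hashing step mentioned above can be sketched as a small hash ring. This is a generic illustration, not the code we actually ran; the server names, the number of virtual nodes and the use of MD5 are assumptions.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring mapping document IDs to Solr servers."""

    def __init__(self, servers, vnodes=100):
        # Each server is placed on the ring many times to even out the load.
        self._ring = sorted(
            (self._hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")

    def server_for(self, doc_id):
        # Walk clockwise to the first ring position at or after the ID's hash.
        idx = bisect.bisect(self._keys, self._hash(doc_id)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["solr1", "solr2", "solr3"])
bigger = HashRing(["solr1", "solr2", "solr3", "solr4"])

ids = [f"user-{i}" for i in range(1000)]
moved = [i for i in ids if ring.server_for(i) != bigger.server_for(i)]
print(len(moved), "of", len(ids), "IDs move when solr4 is added")
```

The point of using a ring rather than a plain modulo is that adding a server relocates only the IDs that now belong to the new server, so most of the index stays where it is.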
Even with all of what I said above, I do not believe Solr is a good fit for this. Solr is not really well suited for searches on indexes that are constantly changing, and if you mainly search by IDs then a search engine is not needed.
For our project we basically implemented all the index building, load balancing and query engine ourselves and use Solr mostly as storage. But we started using Solr when sharding was flaky and not performant; I am not sure what the state of it is today.
Last note, if I was building this system today from scratch without all the work we did over the past 4 years I would advise using a cache to store all the users that are currently online (say memcached or redis) and at request time I would simply iterate over all of them and filter out according to the criteria. The filtering by criteria can be cached independently and updated incrementally, also iterating over 5000 records is not necessarily very time consuming if the matching logic is very simple.
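A minimal sketch of that cache-based approach follows, with a plain Python set standing in for the memcached or Redis structure. The profile fields, the sample data and the function online_matches are invented for illustration.

```python
# Stand-in for a cache-backed set of currently online user IDs.
online_ids = {"u1", "u3", "u4"}

# Stand-in for user records fetched from wherever profiles are stored.
profiles = {
    "u1": {"city": "Paris", "age": 31},
    "u2": {"city": "Paris", "age": 45},
    "u3": {"city": "Berlin", "age": 28},
    "u4": {"city": "Paris", "age": 22},
}

def online_matches(online_ids, profiles, predicate):
    """Iterate over online users only and keep those matching the criteria."""
    return sorted(
        uid for uid in online_ids
        if uid in profiles and predicate(profiles[uid])
    )

result = online_matches(online_ids, profiles, lambda p: p["city"] == "Paris")
print(result)
```

With a few thousand online users and a simple predicate, this loop is cheap, and no list of IDs ever has to be shipped to Solr.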
Any robust solution will involve bringing your data close to Solr (in batches) and using it internally, NOT sending a very large request at search time, which should be a low-latency operation.
You should develop your own filter. The filter will cache the online users data once in a while (say, every minute). If the data changes VERY frequently, consider implementing a PostFilter.
You can find a good example of filter implementation here:
http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/
one solution can be use of join in solr but online data change
regularly and i cant index data everytime(say 5-10 min, it should be
at-least an hr)
I think you could very well use Solr joins, with a little bit of improvisation.
The solution I propose is as follows:
You can have 2 Indexes (Solr Cores)
1. Primary Index (The one you have now)
2. Secondary Index with only two fields, "ID" and "IS_ONLINE"
You could now update the Secondary Index frequently (on the order of seconds) and keep it in sync with the table you have for storing online users.
NOTE: Even if it is updated frequently, this secondary index should not degrade performance, provided we make the necessary tweaks, such as using appropriate queries during delta-import.
You could now perform a Solr join on the ID field between these two indexes to achieve what you want. Here is the link on how to perform Solr joins between indexes/Solr cores.
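As a sketch of what such a cross-core join filter might look like, the snippet below builds the fq value and the encoded query string without sending a request. The core name online_users, the primary-index field id and the helper join_filter are assumptions; ID and IS_ONLINE are the secondary-index fields described above.

```python
from urllib.parse import urlencode

def join_filter(from_index, from_field, to_field, query):
    """Build a {!join} filter that selects IDs from another core."""
    return f"{{!join from={from_field} to={to_field} fromIndex={from_index}}}{query}"

# Keep only primary-index documents whose id appears as an online ID
# in the (hypothetical) online_users core.
fq = join_filter("online_users", "ID", "id", "IS_ONLINE:true")
params = urlencode({"q": "*:*", "fq": fq})

print(fq)
print(params)
```

The main query in q would carry the user's actual search criteria; the join filter only narrows the results to users currently flagged as online.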
