Understanding some concepts of Apache Solr

I am new to Apache Solr. Can someone please explain the meaning of the following terms, with examples?
Solr Core
Solr Collection
Logical vs Physical index
Sharding
I went through various blog posts, but I am not able to understand them.

The terminology is used a bit haphazardly, so you'll probably find texts that use a few of these terms interchangeably.
Solr core
A core is a named set of documents living on a single server. A server can have many cores. The core can be replicated to other servers (this is "old style" replication when done manually).
Solr Collection
A collection is a set of cores, from one to many. It's a logical description of "these cores together form the entire collection". This was introduced with SolrCloud, as that's the first time Solr handles clustering for you.
Logical vs Physical
A collection is a logical index - it can span many cores. Each core is a physical index (it has the actual index files from Lucene on its disk). You interact with the collection as you'd interact with the core, and all the details of clustering are (usually) hidden from you by Solr (in SolrCloud mode).
Sharding
Since a collection can span many cores, sharding means that the documents that make up a single collection are spread across many cores. Each core is a "shard" of the total index. Compare this to replication, where a copy of a core is distributed to many Solr instances (the same documents are present in both cores, whereas with sharding each document is present in only one core, and you need all the cores to have the complete collection).
Sharding is what makes it possible to store more documents than a single server can handle (or keep in memory/cache to respond quickly enough).
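For illustration, here is a minimal SolrJ sketch of such a manually sharded query, using the shards request parameter from pre-SolrCloud distributed search. The host names and core names are placeholders, and the Builder API shown is from newer SolrJ versions (older releases used HttpSolrServer):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ManualShardedQuery {
    public static void main(String[] args) throws Exception {
        // Send the query to one core; Solr fans it out to every listed
        // shard and merges the results (pre-SolrCloud distributed search).
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://server1:8983/solr/core1").build()) {
            SolrQuery query = new SolrQuery("*:*");
            query.set("shards",
                "server1:8983/solr/core1,server2:8983/solr/core2");
            long hits = client.query(query).getResults().getNumFound();
            System.out.println("Total documents across shards: " + hits);
        }
    }
}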
SolrCloud (added by me to make this all come together)
Previously (and still, if you're not using SolrCloud mode) sharding and replication were handled manually by the user when querying and configuring Solr. You set up replication to spread the same core across many servers, and you used sharding to make Solr query many Solr instances to get all the required documents. Today you'll usually just use SolrCloud and let Solr abstract away all these details. You'll come across these terms when creating a collection (numShards and replicationFactor), which tell Solr how many cores you want to spread the collection across and how many servers should hold copies of these cores.
Collection -> Sharded across [1..N] cores, replicated [0..M] times for redundancy and higher query throughput.
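To make those two knobs concrete, here is a minimal SolrJ sketch that creates a collection with numShards=3 and replicationFactor=2. The collection name, config set name, and ZooKeeper address are placeholders, and the CollectionAdminRequest/Builder APIs shown are from newer SolrJ versions:

import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateCollection {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster through ZooKeeper.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                List.of("localhost:2181"), Optional.empty()).build()) {
            // numShards = 3, replicationFactor = 2: the collection is split
            // into 3 shards, each held by 2 servers, so 6 cores in total
            // are spread across the cluster.
            CollectionAdminRequest
                .createCollection("mycollection", "_default", 3, 2)
                .process(client);
        }
    }
}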


Is there a way to index multiple Solr cores simultaneously?

I am developing an indexing application using Solr. Our current system has two live cores and indexes only one core at a time. It has recently become apparent that the current indexing system will not work long term. One of the live cores needs to be split into two new cores. They will have some overlapping information, but different schemas. Both will need to be updated quickly whenever a new project is ingested into the database.
Is there a way to simultaneously update multiple Solr cores using SolrJ?
All cores are in the same Solr instance.
We are not using SolrCloud.
The core that needs to be split currently contains approx. 2,500,000 documents.
Any help is appreciated.
Since you are indexing many documents on a single core, I would assume the indexing process takes quite some time and uses all system resources (if configured correctly). In that case, parallel indexing on the same instance will not help, as your multiple threads will be sharing the same resources.
But what you could do is index another core on another instance and then do replication of each core separately.
When you build a Solr client using SolrJ, it's specific to the core and not to your complete Solr instance. Having said that, you could have multiple processes updating any number of cores in your application.
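As a minimal sketch of that, assuming two cores named coreA and coreB on a local instance (the names, URL, and fields are placeholders, and the Builder API is from newer SolrJ versions):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class MultiCoreIndexing {
    public static void main(String[] args) throws Exception {
        // One client per core: a SolrJ client URL targets a specific core.
        try (HttpSolrClient coreA = new HttpSolrClient.Builder(
                 "http://localhost:8983/solr/coreA").build();
             HttpSolrClient coreB = new HttpSolrClient.Builder(
                 "http://localhost:8983/solr/coreB").build()) {

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "project-42");
            doc.addField("title", "Example project");

            // The same ingest event can be pushed to both cores; run these
            // calls on separate threads if the updates should be parallel.
            coreA.add(doc);
            coreB.add(doc);
            coreA.commit();
            coreB.commit();
        }
    }
}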

Correct use case of multiple cores in Solr 4

We use Solr 4.8 for our project.
One colleague created 2 cores in the same instance to index 80 GB of XML documents from the same source. He said that one core can contain a maximum of 50 GB of indexed data, so we split the 80 GB across 2 cores. These cores have the same config files and schema.
For indexing, he puts odd-numbered docs in the first core and even-numbered docs in the second core.
For search, he uses the SolrJ API to query all documents across both cores.
As we have only one server, distribution and replication aren't applied for the project.
My question: is this architecture a correct use case for Solr multiple cores? Does anyone have suggestions?
Instead of storing two indexes and manually managing which documents are stored in which core, you should use SolrCloud, which automatically distributes the data among the shards. It also allows you to distribute your data across multiple machines.
It will also improve performance, querying will be much easier, and you could add multiple collections (with different schemas) too.
You should be using SolrCloud, with a collection that has 2 shards. Take a look at https://cwiki.apache.org/confluence/display/solr/SolrCloud
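As a sketch of what this replaces: with SolrCloud the odd/even bookkeeping disappears, because each document is routed to a shard by a hash of its uniqueKey. The collection name, ZooKeeper address, and fields below are placeholder assumptions:

import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexing {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                List.of("localhost:2181"), Optional.empty()).build()) {
            client.setDefaultCollection("docs");

            // No manual odd/even split: SolrCloud hashes the uniqueKey and
            // routes each document to the correct shard by itself.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-12345");
            doc.addField("content_txt", "some indexed text");
            client.add(doc);
            client.commit();
        }
    }
}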
Generally, cores are created to separate application data into distinct indexes.
This also becomes useful when migrating core data from a lower version to a higher one. You can have many cores in Solr. Suppose you have data harvested from two different sources, X and Y; you would generally store them in 2 separate cores.
In your case it would be a good idea to have 2 cores over the same collection of data, as the data volume is huge. A single core can generally accommodate a large index; in my view it is just a matter of your resource capacity (hardware configuration such as RAM and HDD).

Combining Solr 3x-style Master/Slave "Repeater" to feed remote 4x SolrCloud instances?

Solr 3x "Repeaters" and Multiple Data Centers:
Solr 3x let a node behave as both a slave and a master: pull from one master, and then feed copies downstream to its own slaves. This was so common/useful it even had a name, a "Repeater".
This was useful if you wanted to span multiple data centers. You could have the real master in data center A (DCA) and a "repeater" in data center B (DCB). That repeater would then grab content from DCA and feed all of the other nodes in DCB, saving on bandwidth.
Suppose you want to upgrade this setup to Solr 4x and SolrCloud. (Note that Solr 4x still supports Solr 3x-style legacy replication)
It's said that you should NOT have a single SolrCloud cluster span disparate data centers. So data center B should have its own SolrCloud.
One idea is to have the DCA -> DCB link still use Solr 3x-style Master/Slave replication. The "repeater" in DCB, being also a SolrCloud node, would then automatically have its content propagated to the other nodes.
Main question:
Can a Solr node participate in both Solr 3x-style master/slave mode (as a slave) and also be part of a SolrCloud cluster? And if so, how is this configured?
Complications:
In the simple case, if it's just 1 shard with replicas, it's easy to see how that might work in terms of data. It's a little less clear if you have multiple shards in DCB: how do I tell each shard to replicate only its own share of the data? Note that SolrCloud normally replicates via transactions, whereas 3x copies binary indices.
Another complication is if you're also doing replication within DCB: how do you tell just the master node of each shard to pull from the remote DCA node?
Alternatives:
One solution is to upgrade to 4x but continue using 3x-style replication in DCB, so just don't use SolrCloud.
I realize that another solution would be to have the data feed send its updates to both data centers, or use something like RabbitMQ. For the sake of this question, let's assume that's not an option (long story...).
Maybe there's some other way I haven't thought of?
Has anybody actually tried having SolrCloud span data centers? How horrible is it?
Somebody must have asked this question before!
But I've looked on Google and, although it finds tons of pages with the keywords, I haven't seen this specific "hybrid" mode fleshed out. I found one thread from 2013 but it didn't really talk about the configuration and complexity.
To answer your first question, a Solr slave in 3.X style cannot be a node in a Solr Cloud. The reason is the slave in a master/slave 3.X Solr config simply replicates, byte for byte, all the index files on the master. That's all it does. It can, in the repeater config, then also be a master for others to replicate from, or be a dedicated query slave or both. But that's it.
A node in a Solr Cloud config is a full participant in a distributed computing cluster where indexing is generally intended to be distributed across all nodes, and all nodes participate in queries. It's a very powerful feature which automatically handles failed nodes and significantly eases the work load of scaling up that was very manual in 3.X style.
However, part of what you pay for that is increased complexity (Zookeeper), requirements for lower latency inter-node communications (because all the nodes now talk to each other and to Zookeeper) and the loss of the simplicity of Master/Slave replication.
At 20M docs you are well within the constraints of a single-node master index with an effectively unlimited number of slaves, and therefore very high query capacity. I do this today with a production environment where each master has on the order of 60M docs in it, with no significant problems.
The question is: do you need NRT, multi-node indexing, automated failover, or the ability to autoscale well past 100M docs? If so, then Master/Slave is probably not going to work for you.
You could take a look at writing the same data to two different SolrCloud clusters, one in each data center. You could do that directly, or use something like Apache Flume to do it for you. In either case there are some issues with doing this, so the real question is whether dealing with those issues is worth it to get the added benefit of SolrCloud.
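To make the direct option concrete, here is a naive dual-write sketch with one CloudSolrClient per cluster. The ZooKeeper addresses and collection name are placeholders, and the missing error handling (queueing and retrying the cluster that failed, so the two sides don't drift apart) is exactly the kind of issue mentioned above:

import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DualDataCenterWriter {
    public static void main(String[] args) throws Exception {
        // One client per cluster, each with its own ZooKeeper ensemble.
        try (CloudSolrClient dca = new CloudSolrClient.Builder(
                 List.of("zk-dca:2181"), Optional.empty()).build();
             CloudSolrClient dcb = new CloudSolrClient.Builder(
                 List.of("zk-dcb:2181"), Optional.empty()).build()) {
            dca.setDefaultCollection("content");
            dcb.setDefaultCollection("content");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "item-1");

            // Naive dual write: if one cluster is down, the clusters drift
            // apart; a real feed would queue and retry failed updates.
            dca.add(doc);
            dcb.add(doc);
            dca.commit();
            dcb.commit();
        }
    }
}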

Keeping index optimized / merged in SolrCloud

With the master-slave implementation of distributed Solr (prior to Solr 4.x), it was a straightforward design decision to have the master take the load of indexing, merging, and optimizing the index. The index then gets copied to the slaves, which meanwhile are always serving searches.
Could someone explain how this is done now with SolrCloud?
It seems like SolrCloud sends indexing commands to each replica from the leader. But how can good search performance be achieved then? Indexing and searching on each replica puts load on each node (to index and run the merge thread in the background), and since my index is quite big it usually takes a lot of time to merge segments or simply optimize.
Should I now leave all of that to the merge policy and not worry at all? Does TieredMergePolicy provide both good search performance and low resource load (CPU, I/O) at the same time?
I'll try to answer part of your questions: SolrCloud indeed indexes on all nodes, and therefore it has a performance impact on the replicas. This is due to its 'hot replication' model, instead of the 'cold replication' you are used to. It solves data-integrity issues and enables real-time search on a cluster. You get consistent data and faster data availability at the price of a performance impact. Actually, you can always split the data into shards (at the price of additional hardware) and get comparable performance.
In either case, it's up to you to decide whether SolrCloud suits your needs. You can use Solr 4 without the cloud model and manage it yourself as before.

Solr Collection vs Cores

I struggle with understanding the difference between collections and cores. If I understand it correctly, cores are separate indexes. A collection consists of cores, so essentially they share the same logic in separation, i.e. separate cores and collections have separate endpoints.
I have the following scenario. I am creating a backend for a cloud service for several online shops. Each shop has a set of products, to which customers can add reviews. I want to index static data (product information) separately from dynamic information (reviews) so I can improve performance.
How can I best separate these in Solr?
From the SolrCloud Documentation
Collection: A single search index.
Shard: A logical section of a single collection (also called a Slice). Sometimes people will talk about "Shard" in a physical sense (a manifestation of a logical shard).
Replica: A physical manifestation of a logical Shard, implemented as a single Lucene index on a SolrCore.
Leader: One Replica of every Shard will be designated as a Leader to coordinate indexing for that Shard.
SolrCore: Encapsulates a single physical index. One or more make up logical shards (or slices), which make up a collection.
Node: A single instance of Solr. A single Solr instance can have multiple SolrCores that can be part of any number of collections.
Cluster: All of the nodes you are using to host SolrCores.
So basically a Collection (Logical group) has multiple cores (physical indexes).
Also, check the discussion
Core
In Solr, a core is composed of a set of configuration files, Lucene index files, and Solr's transaction log.
A Solr core is a uniquely named, managed, and configured index running in a Solr server; a Solr server can host one or more cores. A core is typically used to separate documents that have different schemas.
Collection
Solr also uses the term collection, which only has meaning in the context of a Solr cluster, in which a single index is distributed across multiple servers.
SolrCloud introduces the concept of a collection, which extends the concept of a uniquely named, managed, and configured index to one that is split into shards and distributed across multiple servers.
As per my understanding:
In distributed search,
Collection is a logical index spread across multiple servers.
A core is the part of a server that holds one part of that collection.
In non-distributed search,
A single server running Solr can have multiple collections, and each of those collections is also a core. So collection and core are the same if the search is not distributed.
Summary
The part of a collection on a single server is called a core.
A collection is the same as an index.
One Solr server can have many cores.
Collection is a logical index (Example usage for multiple collections: Say two teams in same group are not big enough to justify a full Solr server of their own. But they also do not want to mix their data in a single index. They can then create separate collections/indexes which will keep their data separate).
It's better to use a separate SolrCloud cluster rather than create more collections if the data for a collection is big enough (not sure, comments please?).
Single instance
On a single instance, Solr has something called a SolrCore that is essentially a single index. If you want multiple indexes, you create multiple SolrCores.
Solr Cloud
With SolrCloud, a single index can span multiple Solr instances. This means that a single index can be made up of multiple SolrCores on different machines. We call all of the SolrCores that make up one logical index a collection.
A collection is essentially a single index that spans many SolrCores, both for index scaling and for redundancy. If you wanted to move your 2-SolrCore Solr setup to SolrCloud, you would have 2 collections, each made up of multiple individual SolrCores.
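A small SolrJ sketch of the difference in practice, assuming a core named mycore and a collection named mycollection (both placeholders): in standalone mode the client URL targets a single core, while in SolrCloud mode the client targets a collection and Solr fans the query out across its shards:

import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class CoreVsCollectionQuery {
    public static void main(String[] args) throws Exception {
        SolrQuery query = new SolrQuery("*:*");

        // Standalone: the URL points at one core on one instance.
        try (HttpSolrClient core = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycore").build()) {
            System.out.println(core.query(query).getResults().getNumFound());
        }

        // SolrCloud: the client targets a collection; Solr queries all
        // shards and merges the results transparently.
        try (CloudSolrClient collection = new CloudSolrClient.Builder(
                List.of("localhost:2181"), Optional.empty()).build()) {
            System.out.println(collection.query("mycollection", query)
                .getResults().getNumFound());
        }
    }
}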
From Solr Wiki:
Collections are made up of one or more shards. Shards have one or more replicas. Each replica is a core. A single collection represents a single logical index.
This explains the use of cores and collections.
Single instance
When dealing with a single Solr instance, you query cores.
The admin UI of a single Solr instance has no collection selector.
Solr Cloud
When dealing with Solr Cloud, you query collections.
The collections are organized as different cores (replicas, shards) on different Solr instances.
The admin UI of a Solr Cloud instance has both a collection selector and a core selector, but the cores shown there belong to the individual instances.
From the Solr docs:
Usage: solr create [-c name] [-d confdir] [-n configName] [-shards #] [-replicationFactor #] [-p port] [-V]
Create a core or collection depending on whether Solr is running in standalone (core) or SolrCloud mode (collection). In other words, this action detects which mode Solr is running in, and then takes the appropriate action (either create_core or create_collection).
