Why do we need sharding in Solr and what is the benefit of it? - solr

I am a beginner in Solr and I have no idea how to do sharding in Solr. My question is: why do we need sharding when we create a collection, and what is the benefit of it? If I don't create shards, what happens?

Sharding allows us to have indexes that span more than a single instance of Solr - i.e. multiple servers or multiple running instances of Solr (which could be useful under specific conditions because of some single thread limitations in Lucene, as well as some memory usage patterns).
If we didn't have sharding, the total size of your index would be limited to whatever you could fit on a single server. Sharding means that one part of the index (for example, half of all your documents) will be located on one server, while the other half will be located on another server. When you query Solr, each shard receives the query, and the results are then merged before being returned to you.
There are a few features that won't work properly when an index is sharded (and scores are calculated locally on each shard, which is why you usually want your documents spread as evenly as possible), but in those cases where sharding is useful (and it very often is!), there really isn't any better solution.
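To make the fan-out concrete, here is a minimal sketch (not from the original answer) of a manually distributed query in standalone Solr, assuming two hypothetical hosts solr1 and solr2 that each hold half of a core named products; in SolrCloud this distribution and merge happens automatically:

    # Sketch of a distributed Solr query across two shards.
    # Host names, core name and field names are illustrative assumptions.
    import requests

    SHARDS = "solr1:8983/solr/products,solr2:8983/solr/products"

    resp = requests.get(
        "http://solr1:8983/solr/products/select",
        params={
            "q": "title:laptop",   # the query is sent to every listed shard
            "shards": SHARDS,      # standalone distributed search; SolrCloud does this for you
            "rows": 10,
            "wt": "json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    # Each shard scores its own documents locally; Solr merges the per-shard
    # results into a single ranked list before returning them here.
    print(resp.json()["response"]["numFound"])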

Sharding splits a collection's data across multiple shards; replicas are copies of each shard.
e.g. if you have a collection named Employee with 1 shard and 2 replicas,
then assuming there are 100 records,
Employee_shard1_replica1 will have 100 records and
Employee_shard1_replica2 will have 100 records.
Each replica is a full copy of the shard's records in another core, which gives you load balancing as well as fault tolerance.
Now, for a second example, if you have the same collection Employee with 2 shards and 2 replicas, the data will be split across both shards:
Employee_shard1_replica1 will have 50 records
Employee_shard1_replica2 will have 50 records
Employee_shard2_replica1 will have 50 records
Employee_shard2_replica2 will have 50 records
Note: shard 1's replicas hold the same data, and shard 2's replicas hold the same data.
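As a minimal sketch of how these two layouts would be created with the SolrCloud Collections API (the collection name comes from the example above; the host/port are assumptions):

    # Sketch: creating the Employee collection from the example above via the
    # SolrCloud Collections API. In practice you would create only one of these
    # (or use different names), since the collection name must be unique.
    import requests

    BASE = "http://localhost:8983/solr/admin/collections"

    # Example 1: one shard, two replicas -> every replica holds all 100 records.
    requests.get(BASE, params={
        "action": "CREATE",
        "name": "Employee",
        "numShards": 1,
        "replicationFactor": 2,
    }, timeout=30).raise_for_status()

    # Example 2: two shards, two replicas each -> each shard holds ~50 records,
    # and each of its replicas is a full copy of that shard.
    requests.get(BASE, params={
        "action": "CREATE",
        "name": "Employee",
        "numShards": 2,
        "replicationFactor": 2,
    }, timeout=30).raise_for_status()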

Related

Elasticsearch merge data from multiple indexes into merged index

My company uses an out-of-the-box software product that exports logs to Elasticsearch (and consumes these logs). The software creates an index per day for every data type, for example:
"A" record data => A_Data_2022_12_13, A_Data_2022_12_14 and so on..
Because of this data storage method, our Elasticsearch cluster has thousands of shards for 100 GB of data.
I want to merge all those shards into a small number of shards, 1 or 2 for every data type.
I thought about reindexing, but I think it is overkill for my purpose, because I want the data to stay as it is now, just merged into one shard.
What is the best practice to do it?
Thanks!
I tried reindex, but it takes a lot of time, and I think it is not the right solution.
Too many shards can cause excessive heap usage, and unbalanced shards can cause hot spots in the cluster. Your decision is right: you should combine the small indices into one or a few indices. That way you will have more stable shards and, as a result, a more stable cluster.
What can you do?
1) Create a rollover index and point your indexer to that index. That way, new data will be stored in the new index, so you only need to be concerned about the existing data.
2) Use a filtered alias to search your data.
3) Reindex or wait. The new data is being indexed into the new index, but what are you going to do with the existing indices? There are two options: assuming you have an index retention period, you can wait until all the separate indices have been deleted, or you can reindex your data directly.
Note: you can tune the reindex speed with slices and by setting number_of_replicas to 0; a rough sketch follows.
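A rough sketch of that tuning, assuming a merged target index named a_data_merged, a local host, and a reasonably recent Elasticsearch (slices=auto needs 6.1+):

    # Sketch: merging the daily A_Data_* indices into one index, with the
    # tuning from the note above. Target index name and host are assumptions.
    import requests

    ES = "http://localhost:9200"

    # Create the target with no replicas to speed up indexing during the copy.
    requests.put(f"{ES}/a_data_merged",
                 json={"settings": {"index": {"number_of_replicas": 0}}},
                 timeout=30).raise_for_status()

    # Reindex all daily indices into the merged one, sliced for parallelism.
    requests.post(f"{ES}/_reindex",
                  params={"slices": "auto", "wait_for_completion": "false"},
                  json={"source": {"index": "A_Data_*"},
                        "dest": {"index": "a_data_merged"}},
                  timeout=30).raise_for_status()

    # Once the reindex task has finished (check it via GET /_tasks),
    # put the replicas back:
    requests.put(f"{ES}/a_data_merged/_settings",
                 json={"index": {"number_of_replicas": 1}},
                 timeout=30).raise_for_status()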

solr multicore vs sharding vs 1 big collection

I currently have a single collection with 40 million documents and an index size of 25 GB. The collection gets updated every n minutes, and as a result the number of deleted documents is constantly growing.
The data in the collection is an amalgamation of records from more than 1,000 customers. The number of documents per customer is around 100,000 records on average.
That being said, I'm trying to get a handle on the growing number of deleted documents. Because of the growing index size, both disk space and memory are being used up, and I would like to reduce it to a manageable size.
I have been thinking of splitting the data into multiple cores, one for each customer. This would allow me to manage the smaller collections easily and to create/update them quickly. My concern is that the number of collections might become an issue. Any suggestions on how to address this problem?
Solr: 4.9
Index size:25 GB
Max doc: 40 million
Doc count:29 million
Thanks
I had a similar sort of issue with multiple customers and a large amount of indexed data.
I implemented it with version 3.4 by creating a separate core per customer,
i.e. one core per customer. Creating a core essentially creates a separate index, splitting the data much like we do with sharding.
Here you are splitting the large index into different, smaller segments.
Whatever search happens will run against the smaller index, so the response time will be faster.
I have almost 700 cores created as of now, and it is running fine for me.
So far I have not faced any issues with managing the cores.
I would suggest going with a combination of cores and sharding.
It will help you achieve the following:
It allows a different configuration for each core, with different behavior, without impacting the other cores.
You can perform actions like update, load, etc. on each core independently.
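For illustration only, a minimal sketch of creating a per-customer core through the CoreAdmin API on standalone Solr, assuming a Solr version with configsets support and a shared configset named shared_config (the configset name, host and customer id are all assumptions):

    # Sketch: one core per customer via the CoreAdmin API (standalone Solr).
    # The configset name and customer id are illustrative assumptions.
    import requests

    def create_customer_core(customer_id: str) -> None:
        """Create a dedicated core for one customer."""
        resp = requests.get(
            "http://localhost:8983/solr/admin/cores",
            params={
                "action": "CREATE",
                "name": f"customer_{customer_id}",
                "configSet": "shared_config",  # all customer cores share one config
            },
            timeout=30,
        )
        resp.raise_for_status()

    create_customer_core("acme")
    # Queries for this customer then hit only the small per-customer index:
    # http://localhost:8983/solr/customer_acme/select?q=...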

Merging collections split across multiple shards

Brief overview of the setup:
5 x SolrCloud (Solr 4.6.1) node instances (separate machines).
The setup is intended to store the last 48 hours of webapp logs (which are pretty intense... ~3 MB/sec).
The "logs" collection has 5 shards (one per node instance).
One log line represents one document in the "logs" collection.
If I keep storing log documents in this "logs" collection, the cores behind the shards get really big, and CPU graphs show that the instances spend more and more time waiting for disk I/O.
So, my idea is to create a new collection every 15 minutes and name it "logs-201402051400", with shards spread across the 5 instances. Document writers will start writing to the new collection as soon as it is created. At some point I will have a list of collections like this:
...
logs-201402051400
logs-201402051415
logs-201402051430
logs-201402051445
logs-201402051500
...
Since there will be at most 192 collections (~1000 cores) in the SolrCloud at any given time, it seems that search performance would degrade drastically.
So, I would like to merge the collections that are not currently being written to into one large collection (but still sharded across the 5 instances). I have found information on how to merge cores, but how can I merge collections?
This might NOT be a complete answer to your query - but something tells me that you need to redo the design of your collection.
This is a classic debate between using a Single Collection with Multiple Shards versus Multiple Collections.
I think you ought to set up a single collection and then use SolrCloud's dynamic sharding capability (the implicit router) to add new shards (for newer 15-minute intervals) and delete old shards (for older 15-minute intervals).
Managing a single collection means that you will have a single endpoint, and it will save you from the complexity of querying multiple collections.
Take a look at one of the answers on this link that talks about using the implicit router for dynamic sharding in SolrCloud.
How to add shards dynamically to collection in solr?
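A minimal sketch of that implicit-router approach, using the 15-minute window names from the question; the host, router field and maxShardsPerNode value are assumptions:

    # Sketch: one collection, one shard per 15-minute window, shards added and
    # removed over time via the Collections API. Names/host are illustrative.
    import requests

    BASE = "http://localhost:8983/solr/admin/collections"

    def call(params):
        resp = requests.get(BASE, params=params, timeout=60)
        resp.raise_for_status()

    # Create the collection once with the implicit router, seeding the first shard.
    call({
        "action": "CREATE",
        "name": "logs",
        "router.name": "implicit",
        "shards": "logs-201402051400",
        "router.field": "window",     # each document carries its target shard name here
        "maxShardsPerNode": 200,      # allow many shards per node
    })

    # Every 15 minutes, add a shard for the new window...
    call({"action": "CREATESHARD", "collection": "logs", "shard": "logs-201402051415"})

    # ...and drop shards that have aged out of the 48-hour retention window.
    call({"action": "DELETESHARD", "collection": "logs", "shard": "logs-201402051400"})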

Query about Elasticsearch

I am writing a service that will be creating and managing user records. 100+ million of them.
For each new user, the service will generate a unique user id and write it to the database. The database is sharded based on the unique user id that gets generated.
Each user record has several fields. Now one of the requirements is that the service be able to check whether a user exists with a matching field value, so those fields are declared as indexes in the database schema.
However, since the database is sharded on the primary key (the unique user id), I would need to search all shards to find a user record that matches a particular column.
So, to make that lookup fast, one thing I am thinking of doing is setting up an Elasticsearch cluster. The service will write to the ES cluster every time it creates a new user record, and the ES cluster will index the user record based on the relevant fields.
My questions are:
-- What kind of performance can I expect from ES here? Assume I have 100+ million user records where 5 columns of each user record need to be indexed. I know it depends on the hardware configuration as well, but please assume well-tuned hardware.
-- Here I am trying to use ES as a memcached alternative that supports multiple keys, so I want the whole dataset to be in memory; it does not need to be durable. Is ES the right tool for that?
Any comment/recommendation based on experience with Elasticsearch for large datasets is very much appreciated.
ES is not explicitly designed to run completely in memory - you normally wouldn't want to do that with large, unbounded datasets in a Java application (though you can, using off-heap memory). Rather, it will cache what it can and rely on the OS's disk cache for the rest.
100+ million records shouldn't be an issue at all, even on a single machine. I run an index consisting of 15 million records with ~100 small fields (no large text fields), amounting to 65 GB of data on disk, on a single machine. Fairly complex queries that just return id/score execute in less than 500 ms; queries that require loading the documents return in 1-1.5 seconds on a warmed-up VM against a single SSD. I tend to give the JVM 12-16 GB of memory - beyond that I find it's better to scale out via a cluster than to run a single huge VM.
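For what it's worth, a sketch of the kind of existence lookup the question describes, returning only ids/scores as in the timings above; the index name users and the field email are illustrative assumptions, and the field is assumed to be not_analyzed/keyword:

    # Sketch: checking whether a user with a given field value exists, without
    # loading the documents themselves. Index and field names are assumptions.
    import requests

    def user_exists(email: str) -> bool:
        resp = requests.post(
            "http://localhost:9200/users/_search",
            json={
                "query": {"term": {"email": email}},  # exact match on a keyword field
                "size": 1,
                "_source": False,  # ids/scores only; don't load the document
            },
            timeout=5,
        )
        resp.raise_for_status()
        total = resp.json()["hits"]["total"]
        # hits.total is an object in ES 7+, a plain integer in older versions.
        return (total["value"] if isinstance(total, dict) else total) > 0

    print(user_exists("someone@example.com"))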

Search using Solr vs Map Reduce on Files - which is reliable?

I have an application which needs to store a huge volume of data (around 200,000 transactions per day), with each record around 100 KB to 200 KB in size. The format of the data is going to be JSON/XML.
The application should be highly available, so we plan to store the data on S3 or AWS DynamoDB.
We have use-cases where we may need to search the data based on a few attributes (date ranges, status, etc.). Most searches will be on few common attributes but there may be some arbitrary queries for certain operational use cases.
I researched the ways to search non-relational data and so far found two ways being used by most technologies
1) Build an index (Solr/CloudSearch,etc.)
2) Run a Map Reduce job (Hive/Hbase, etc.)
Our requirement is for the search results to be reliable (consistent with the data in S3/DB - something like an Oracle query; it is okay to be slow, but when we get the data, we should have everything that matched the query returned, or at least be told that some results were skipped).
At the outset it looks like the index-based approach would be faster than MR, but I am not sure if it is reliable - the index may be stale. (Is there a way to know the index was stale when we do the search so that we can correct it? Is there a way to keep the index always consistent with the values in the DB/S3? Something similar to the indexes on Oracle DBs.)
The MR job seems to always be reliable (as it fetches the data from S3 for each query) - is that assumption right? Is there any way to speed up this query - maybe partition the data in S3 and run multiple MR jobs based on each partition?
You can <commit /> and <optimize /> the Solr index after you add documents, so I'm not sure a stale index is a concern. I set up a Solr instance that handled maybe 100,000 additional documents per day. At the time I left the job we had 1.4 million documents in the index. It was used for internal reporting and it was performant (the most complex query took under a minute). I just asked a former coworker and it's still doing fine a year later.
I can't speak to the map reduce software, though.
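A quick sketch of that add/commit/optimize cycle using Solr's JSON update handler; the core name and document fields are illustrative assumptions:

    # Sketch: adding documents, committing, and optionally optimizing.
    # Core name and document fields are illustrative.
    import requests

    CORE = "http://localhost:8983/solr/txns"

    # Add a batch of documents and make them visible to searches in one request.
    requests.post(
        f"{CORE}/update",
        params={"commit": "true"},
        json=[{"id": "txn-1", "status": "COMPLETED", "date": "2015-06-01T00:00:00Z"}],
        timeout=30,
    ).raise_for_status()

    # Optionally merge index segments after large batches; on recent Solr
    # versions this is rarely needed and can be expensive, so use sparingly.
    requests.post(f"{CORE}/update", params={"optimize": "true"}, timeout=300).raise_for_status()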
You should think about having one Solr core per week or month, for instance; that way older cores will be read-only, easier to manage, and very easy to spread over several Solr instances. If 200k docs are to be added per day forever, you need either that or Solr sharding - a single core will not be enough forever.
