Brief overview of the setup:
5 x SolrCloud (Solr 4.6.1) node instances (separate machines).
The setup is intended to store the last 48 hours of webapp logs (which are pretty intense... ~3 MB/sec).
"logs" collection has 5 shards (one per node instance).
One log line corresponds to one document in the "logs" collection.
If I keep storing log documents in this "logs" collection, the cores on the shards grow really big, and CPU graphs show that the instances spend more and more time waiting on disk I/O.
So my idea is to create a new collection every 15 minutes, named e.g. "logs-201402051400", with its shards spread across the 5 instances. Document writers will start writing to the new collection as soon as it is created. At some point I will end up with a list of collections like this:
...
logs-201402051400
logs-201402051415
logs-201402051430
logs-201402051445
logs-201402051500
...
That means there will be at most 192 collections (~1000 cores) in the SolrCloud at any given time, and it seems that search performance would degrade drastically.
So I would like to merge the collections that are no longer being written to into one large collection (still sharded across the 5 instances). I have found information on how to merge cores, but how can I merge collections?
This might NOT be a complete answer to your query, but something tells me you need to redo the design of your collection.
This is the classic debate between a single collection with multiple shards versus multiple collections.
I think you ought to set up a single collection and then use SolrCloud's dynamic sharding capability (the implicit router) to add new shards (for newer 15-minute intervals) and delete old shards (for older 15-minute intervals).
Managing a single collection means you have a single endpoint and are saved from the complexity of querying multiple collections.
Take a look at one of the answers on this link that talks about using the implicit router for dynamic sharding in SolrCloud.
How to add shards dynamically to collection in solr?
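To make that suggestion a bit more concrete, here is a minimal sketch against the Collections API (Python with requests), assuming a Solr node at localhost:8983, a configset named logs_conf already uploaded to ZooKeeper, and a routing field named interval; these names are placeholders, not part of your setup:

    import requests

    SOLR = "http://localhost:8983/solr"  # assumed address of one SolrCloud node

    def create_logs_collection(first_shard):
        # Create the collection once, with the implicit router so shards can be
        # added and removed later. "logs_conf" is an assumed configset name.
        requests.get(f"{SOLR}/admin/collections", params={
            "action": "CREATE",
            "name": "logs",
            "router.name": "implicit",
            "shards": first_shard,                # e.g. "logs-201402051400"
            "router.field": "interval",           # each document carries its 15-minute bucket here
            "collection.configName": "logs_conf",
            "maxShardsPerNode": 200,
        }).raise_for_status()

    def add_shard(interval):
        # Called every 15 minutes, e.g. add_shard("logs-201402051415").
        requests.get(f"{SOLR}/admin/collections", params={
            "action": "CREATESHARD", "collection": "logs", "shard": interval,
        }).raise_for_status()

    def drop_shard(interval):
        # Drop a shard once it falls outside the 48-hour window.
        requests.get(f"{SOLR}/admin/collections", params={
            "action": "DELETESHARD", "collection": "logs", "shard": interval,
        }).raise_for_status()

Writers keep indexing into the single "logs" collection; the value of the routing field decides which shard a document lands in, so there is nothing to merge afterwards and old intervals are simply dropped.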
My company uses an off-the-shelf piece of software that exports logs to Elasticsearch (and also consumes them). The software creates one index per day for every data type, for example:
"A" record data => A_Data_2022_12_13, A_Data_2022_12_14 and so on.
Because of this storage pattern, our Elasticsearch cluster has thousands of shards for 100 GB of data.
I want to merge all those shards into a small amount of shards, 1 or 2 for every data type.
I thought about reindexing, but I think it is overkill for my purpose, because I want the data to stay exactly as it is now, just merged into one shard.
What is the best practice to do it?
Thanks!
I tried reindexing, but it takes a lot of time, and I don't think it is the right solution.
Too many shards can cause excessive heap usage, and unbalanced shards can cause hot spots in the cluster. Your decision is right: you should combine the small indices into one or a few indices. You will end up with more stable shards and, therefore, a more stable cluster.
What can you do?
1. Create a rollover index and point your indexer at it. That way, new data will be stored in the new index, so you only need to worry about the existing data.
2. Use a filtered alias to search your data.
3. Reindex or wait. The new data is being indexed into the new index, but what do you do with the existing indices? There are two options: assuming you have an index retention period, you can wait until all the separate indices are deleted, or you can reindex the data directly.
Note: you can tune the reindex speed with slicing and by setting number_of_replicas to 0 (see the sketch right after this list).
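As a rough illustration of those steps, here is a sketch against the Elasticsearch REST API (Python with requests, recent Elasticsearch versions); the index, alias and field names as well as the rollover thresholds are placeholders, not taken from your cluster:

    import requests

    ES = "http://localhost:9200"  # assumed Elasticsearch endpoint

    # Rollover target for new "A" data: the write alias rolls to a_data-000002, etc.
    requests.put(f"{ES}/a_data-000001", json={
        "aliases": {"a_data_write": {"is_write_index": True}},
        "settings": {"number_of_shards": 1, "number_of_replicas": 1},
    }).raise_for_status()
    requests.post(f"{ES}/a_data_write/_rollover", json={
        "conditions": {"max_size": "50gb", "max_age": "30d"},
    }).raise_for_status()

    # Merge the existing daily indices into one target index via reindex,
    # with replicas off and slicing on to speed it up.
    requests.put(f"{ES}/a_data_merged", json={
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    }).raise_for_status()
    requests.post(f"{ES}/_reindex",
                  params={"slices": "auto", "wait_for_completion": "false"},
                  json={"source": {"index": "A_Data_2022_12_*"},
                        "dest": {"index": "a_data_merged"}}).raise_for_status()

    # Restore replicas and search through a (filtered) alias instead of the old index names.
    requests.put(f"{ES}/a_data_merged/_settings",
                 json={"index": {"number_of_replicas": 1}}).raise_for_status()
    requests.post(f"{ES}/_aliases", json={
        "actions": [{"add": {"index": "a_data_merged", "alias": "a_data_search",
                             "filter": {"range": {"@timestamp": {"gte": "2022-12-01"}}}}}],
    }).raise_for_status()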
I am a beginner in Solr and I have no idea how to do sharding in Solr. My question is: why do we need sharding when we create a collection, and what is the benefit of it? What happens if I don't create shards?
Sharding allows us to have indexes that span more than a single instance of Solr - i.e. multiple servers or multiple running instances of Solr (which could be useful under specific conditions because of some single thread limitations in Lucene, as well as some memory usage patterns).
If we didn't have sharding, the total size of your index would be limited to whatever you could fit on a single server. Sharding means that one part of the index (for example, half of all your documents) is located on one server, while the other half is located on another server. When you query Solr, each shard receives the query, and the results are then merged before being returned to you.
There are a few features that won't work properly when an index is sharded (and scores are calculated locally on each server, which is why you usually want your documents spread as evenly as possible), but in those cases where sharding is useful (and it very often is!), there really isn't any better solution.
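For a beginner-level illustration, here is a minimal sketch that creates a two-shard collection through the Collections API and queries it; the localhost URL, the mydocs collection name and the _default configset are assumptions, not part of the question:

    import requests

    SOLR = "http://localhost:8983/solr"  # assumed SolrCloud node

    # Create a collection whose index is split into two shards, so it can span two servers.
    requests.get(f"{SOLR}/admin/collections", params={
        "action": "CREATE",
        "name": "mydocs",
        "numShards": 2,
        "replicationFactor": 1,
        "collection.configName": "_default",  # assumed configset
    }).raise_for_status()

    # A normal query is distributed automatically: each shard runs it over its
    # half of the documents and the partial results are merged before returning.
    r = requests.get(f"{SOLR}/mydocs/select", params={"q": "*:*", "rows": 10})
    print(r.json()["response"]["numFound"])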
Sharding lets us split the data across multiple shards, while replication creates copies of each shard.
E.g. if you have a collection named Employee with 1 shard and 2 replicas,
then, assuming there are 100 records,
Employee_shard1_replica1 will have 100 records and
Employee_shard1_replica2 will have 100 records.
Each replica copies the entire set of records into another core, so you get load balancing as well as fault tolerance.
Now a second example: if you have the same Employee collection with 2 shards and 2 replicas, the data will be split across both shards.
Employee_shard1_replica1 will have 50 records
Employee_shard1_replica2 will have 50 records
Employee_shard2_replica1 will have 50 records
Employee_shard2_replica2 will have 50 records
Note: the shard 1 replicas hold the same data here, and the shard 2 replicas hold the same data.
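If you want to see this layout for yourself, here is a small sketch (assuming a SolrCloud cluster reachable at localhost:8983 with enough nodes) that creates the Employee collection from the second example and prints which core ended up on which node:

    import requests

    SOLR = "http://localhost:8983/solr"  # assumed SolrCloud node

    # 2 shards x 2 replicas = 4 cores spread over the cluster.
    requests.get(f"{SOLR}/admin/collections", params={
        "action": "CREATE", "name": "Employee",
        "numShards": 2, "replicationFactor": 2,
    }).raise_for_status()

    # CLUSTERSTATUS shows which core (Employee_shard1_replica1, ...) lives on which node.
    status = requests.get(f"{SOLR}/admin/collections",
                          params={"action": "CLUSTERSTATUS",
                                  "collection": "Employee", "wt": "json"}).json()
    shards = status["cluster"]["collections"]["Employee"]["shards"]
    for shard_name, shard in shards.items():
        for replica in shard["replicas"].values():
            print(shard_name, replica["core"], replica["node_name"])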
I currently have a single collection with 40 million documents and an index size of 25 GB. The collection gets updated every n minutes, and as a result the number of deleted documents keeps growing.
The data in the collection is an amalgamation of records from more than 1000 customers. The number of documents per customer is around 100,000 records on average.
That being said, I'm trying to get a handle on the growing number of deleted documents. Because of the growing index size, both disk space and memory are being used up, and I would like to reduce it to a manageable size.
I have been thinking of splitting the data into multiple cores, one per customer. This would let me manage the smaller collections easily and create/update them quickly. My concern is that the number of collections might become an issue. Any suggestions on how to address this problem?
Solr: 4.9
Index size:25 GB
Max doc: 40 million
Doc count:29 million
Thanks
I had a similar sort of issue with multiple customers and a large amount of indexed data.
I implemented it with version 3.4 by creating a separate core per customer,
i.e. one core per customer. Creating a core is essentially creating a separate index, splitting the data much like we do with sharding.
Here you are splitting the large index into smaller segments.
Any search will run against the smaller indexed segment, so the response time will be faster.
I have almost 700 cores created as of now and it's running fine for me.
So far I have not faced any issues managing the cores.
I would suggest going with a combination of cores and sharding (see the sketch below).
It will help you achieve the following:
A different configuration for each core, with different behavior, without impacting the other cores.
The ability to perform actions like updates, loads, etc. on each core independently.
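As a rough sketch of the per-customer layout on a SolrCloud setup like yours (the customer ids, collection names and the customer_conf configset below are invented for illustration):

    import requests

    SOLR = "http://localhost:8983/solr"  # assumed SolrCloud node
    CUSTOMERS = ["acme", "globex", "initech"]  # illustrative customer ids

    for customer in CUSTOMERS:
        # One collection per customer, all sharing one (assumed) configset in ZooKeeper.
        requests.get(f"{SOLR}/admin/collections", params={
            "action": "CREATE",
            "name": f"customer_{customer}",
            "numShards": 1,
            "replicationFactor": 2,
            "collection.configName": "customer_conf",
        }).raise_for_status()

    # A churned customer's data, including all its deleted documents, can then be
    # reclaimed in one shot by dropping that customer's collection.
    requests.get(f"{SOLR}/admin/collections",
                 params={"action": "DELETE", "name": "customer_acme"}).raise_for_status()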
I've set up a SolrCloud cluster with 3 shards. Each shard consists of 2 nodes: one leader and one replica. Each Solr instance (node) runs on a separate machine. Now I need to add more machines as my data volume increases, but if I add a new node without creating a new shard, it simply adds more replicas of the existing shards. I want to create more shards on the new machines and have the data distributed among them.
For testing purposes, I created a SolrCloud with one shard (2 nodes). I tried Solr's SPLITSHARD with Solr 4.5.1. Afterwards I see a total of 3 shards (shard1, shard1_0 and shard1_1) in the admin window, and it now shows 6 nodes in total.
In the background, it has created the following folders under each node.
node1 :
solr/collection1
solr/collection1_shard1_0_replica1
solr/collection1_shard1_1_replica1
node2 :
solr/collection1
solr/collection1_shard1_0_replica2
solr/collection1_shard1_1_replica2
That means it created 2 new cores under each instance, but I want to run a single core on each machine.
We have run into the same problem. The only solution I can see for the current version of Solr is to add replicas on the new machines, wait for replication to finish, and then delete the original replicas.
In addition, if you split only one shard in the collection, the cluster will not be uniformly distributed, so you have to split every shard by the same factor.
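Here is a sketch of that replica shuffle using the Collections API (ADDREPLICA is available in later 4.x releases and newer); the collection, shard, node and replica names below are placeholders you would read from CLUSTERSTATUS:

    import requests

    SOLR = "http://localhost:8983/solr"  # any node can receive Collections API calls

    # 1) Put a copy of shard1_0 on the new machine. The node name is the
    #    "host:port_solr" entry you see under live_nodes in CLUSTERSTATUS.
    requests.get(f"{SOLR}/admin/collections", params={
        "action": "ADDREPLICA",
        "collection": "collection1",
        "shard": "shard1_0",
        "node": "newhost:8983_solr",   # placeholder node name
    }).raise_for_status()

    # 2) Once the new replica reports "active" in CLUSTERSTATUS, drop the old copy.
    requests.get(f"{SOLR}/admin/collections", params={
        "action": "DELETEREPLICA",
        "collection": "collection1",
        "shard": "shard1_0",
        "replica": "core_node1",       # placeholder replica name, also from CLUSTERSTATUS
    }).raise_for_status()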
Once you set the numShards property when creating a collection, what you intend becomes impossible. The other answers only describe splitting the original number of shards into more shards, and the data won't be distributed evenly. Suppose your data starts with 2 shards, say S1 and S2. When you split S1, you end up with S11, S12 and S2, where S2 holds much more data than S11 or S12. But I think what you want is for the data in S1 and S2 to be cut evenly into S11, S12 and S2, with S11, S12 and S2 running on different nodes on different machines. That is NOT possible in current Solr (even v6), AFAIK.
What you want is what I and many other SolrCloud users want as well, and I think it's a very reasonable expectation. Let's hope a future version of SolrCloud provides this functionality.
I have an application which needs to store a huge volume of data (around 200,000 transactions per day), each record around 100 KB to 200 KB in size. The data format will be JSON/XML.
The application should be highly available, so we plan to store the data on S3 or AWS DynamoDB.
We have use cases where we may need to search the data by a few attributes (date ranges, status, etc.). Most searches will be on a few common attributes, but there may be some arbitrary queries for certain operational use cases.
I researched ways to search non-relational data and so far found two approaches used by most technologies:
1) Build an index (Solr/CloudSearch, etc.)
2) Run a MapReduce job (Hive/HBase, etc.)
Our requirement is for the search results to be reliable (consistent with the data in S3/DB, something like an Oracle query: it is okay to be slow, but when we get the data, we should have everything that matched the query, or at least be told that some results were skipped).
At the outset it looks like the index-based approach would be faster than MR, but I am not sure if it is reliable. The index may be stale? (Is there a way to know the index was stale when we search, so that we can correct it? Is there a way to keep the index always consistent with the values in the DB/S3, similar to the indexes on Oracle DBs?)
The MR job seems to always be reliable (as it fetches data from S3 for each query); is that assumption right? Is there any way to speed up such a query, perhaps by partitioning the data in S3 and running multiple MR jobs, one per partition?
You can <commit /> and <optimize /> the Solr index after you add documents, so I'm not sure a stale index is a concern. I set up a Solr instance that handled maybe 100,000 additional documents per day. At the time I left that job we had 1.4 million documents in the index. It was used for internal reporting and it was performant (the most complex query took under a minute). I just asked a former coworker and it's still doing fine a year later.
I can't speak to the map reduce software, though.
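For what it's worth, here is a minimal sketch of the add-then-commit pattern described above, assuming a Solr collection named transactions at localhost:8983 (both the collection name and the document fields are invented):

    import requests

    SOLR = "http://localhost:8983/solr/transactions"  # assumed collection name

    # Index a batch of documents and commit in the same request so they become
    # searchable immediately, i.e. the index does not sit around stale.
    docs = [{"id": "txn-0001", "status": "SETTLED", "created": "2016-01-15T10:00:00Z"}]
    requests.post(f"{SOLR}/update", params={"commit": "true"}, json=docs).raise_for_status()

    # An explicit optimize merges index segments; it is expensive, so run it sparingly.
    requests.get(f"{SOLR}/update", params={"optimize": "true"}).raise_for_status()

In practice you would usually rely on autoCommit/autoSoftCommit in solrconfig.xml rather than committing on every request, but either way the index lags the source data only by the commit interval.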
You should think about having one Solr core per week or month, for instance; that way older cores become read-only, easier to manage, and very easy to spread over several Solr instances. If 200k docs are to be added per day forever, you need either that or Solr sharding; a single core will not be enough forever.