I have a scenario where I need to merge Solr indexes online.
I have a primary Solr index of 100 GB that serves end users and cannot go offline even for a moment. Every day a new Lucene index (about 2 GB) is generated separately.
I have tried merging indexes via the CoreAdmin API, and I have also tried the IndexWriter.addIndexes API, but no luck.
Both create a new core or a new folder, which means copying the full 100 GB to a new location every time.
Is there a way I can do segment-level merging?
Your question is about merging two cores. I will answer for Solr 5.
You can merge with the CoreAdmin API.
You can merge with Lucene outside of Solr, create a new core, and then swap it with the old one.
If you are using SolrCloud, you can serve your collection from a list of cores via an ALIAS, or migrate documents from the new core into your central core.
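As a sketch of the CoreAdmin route (the first option above): MERGEINDEXES folds an external index directory into the target core's existing index in place, so the 100 GB core is not copied. The host, core name, and path below are placeholder assumptions; the block prints the requests it would send, with the live calls left commented:

```shell
SOLR="http://localhost:8983/solr"    # placeholder host; adjust to your setup
DAILY_INDEX="/data/daily/index"      # placeholder path to the day's 2 GB Lucene index

# Merge the daily index straight into the live "primary" core's index,
# then commit so the merged segments become visible to searchers.
# The target core should not receive updates while the merge runs.
MERGE_URL="$SOLR/admin/cores?action=MERGEINDEXES&core=primary&indexDir=$DAILY_INDEX"
COMMIT_URL="$SOLR/primary/update?commit=true"

echo "$MERGE_URL"
echo "$COMMIT_URL"
# curl "$MERGE_URL" && curl "$COMMIT_URL"   # run against your live instance
```

MERGEINDEXES appends the source segments to the target index rather than rewriting it, which is about as close to segment-level merging as the stock APIs get.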
We need to move our SolrCloud cluster from one cloud vendor to another. The cluster is composed of 8 shards with a replication factor of 2, spread across 8 servers, with roughly 500 GB of data in total.
I wonder what the common approaches are to migrating the cluster, and especially its data, with the least impact on availability and performance.
I was thinking of some sort of initial dump copy, then synchronizing to catch up the diff (which could be huge); after keeping them in sync, just switch over whenever everything is ready on the other side.
Is that doable? What tools should/could I use?
Thanks!
You have multiple choices depending on your existing setup and Solr version:
As mentioned earlier, make use of the backup and restore endpoints of the Collections API.
If you have Solr 6 or above, I would recommend exploring CDCR, Solr's native Cross Data Center Replication.
Reindex onto the new cluster, then use Solr collection aliasing to switch your application endpoints to the target provider once reindexing completes.
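The backup/restore route looks roughly like the sketch below, using the Collections API of recent Solr versions. The collection name, backup name, and shared location are placeholder assumptions; the live calls are commented out:

```shell
SOLR="http://localhost:8983/solr"    # placeholder host
SHARED="/mnt/shared/backups"         # placeholder location reachable by all nodes

# 1) Snapshot the collection on the old cluster into a shared location.
BACKUP_URL="$SOLR/admin/collections?action=BACKUP&name=mig1&collection=products&location=$SHARED"

# 2) Restore it on the new cluster from the same (copied-across) location.
RESTORE_URL="$SOLR/admin/collections?action=RESTORE&name=mig1&collection=products&location=$SHARED"

echo "$BACKUP_URL"
echo "$RESTORE_URL"
# curl "$BACKUP_URL"    # run on the source cluster
# curl "$RESTORE_URL"   # run on the target cluster after transferring $SHARED
```

Note that any diff accumulated after the snapshot still has to be reindexed or replayed on the new cluster; automating that catch-up is essentially what CDCR provides.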
I am developing an indexing application using Solr. Our current system has two live cores and indexes only one core at a time. It has recently become apparent that the current indexing system will not work long term: one of the live cores needs to be split into two new cores. They will have some overlapping information but different schemas, and both will need to be updated quickly whenever a new project is ingested into the database.
Is there a way to simultaneously update multiple Solr cores using SolrJ?
All cores are in the same Solr instance.
We are not using SolrCloud.
The core that needs to be split currently contains approximately 2,500,000 documents.
Any help is appreciated.
Since you are indexing many documents on a single core, I would assume the indexing process takes quite some time and uses all system resources (if configured correctly). In that case, parallel indexing on the same instance will not help, because your multiple threads will be sharing the same resources.
What you could do instead is index another core on another instance and then replicate each core separately.
When you build a Solr client using SolrJ, it is specific to a core, not to your complete Solr instance. That said, you can have multiple processes updating any number of cores in your application.
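In other words, each core exposes its own update endpoint, so nothing stops separate clients or threads from posting to different cores concurrently. The core names below are hypothetical; the live calls are commented:

```shell
SOLR="http://localhost:8983/solr"         # placeholder host

# One SolrJ client (or curl endpoint) per core; updates can run in parallel,
# but on a single instance they still share CPU and disk I/O.
CORE_A_UPDATE="$SOLR/projects/update"     # hypothetical first core
CORE_B_UPDATE="$SOLR/documents/update"    # hypothetical second core

echo "$CORE_A_UPDATE"
echo "$CORE_B_UPDATE"
# curl "$CORE_A_UPDATE?commit=true" -H 'Content-Type: application/json' -d @batch_a.json &
# curl "$CORE_B_UPDATE?commit=true" -H 'Content-Type: application/json' -d @batch_b.json &
# wait    # both cores indexed concurrently
```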
I have a Solr core with hundreds of millions of documents.
I want to create 100 duplicates of this core where I change only 2-3 fields (time and ID) on the original docs and save them to the new cores, so that each core contains different time data for testing.
I need it to work as fast as possible.
I was thinking of opening the core files with Lucene and reading the entire content while writing the altered documents to a new index, but I realized I would need to configure all the analyzers of the destination core, which may be complex; additionally, not all my fields are stored.
If there is a low-level API in Lucene to alter documents or indexes, I could copy the index files and change the documents at the lowest level.
Is anyone familiar with such an API?
I have done the following steps to create a multicore setup of Solr.
I have a Solr instance running on Jetty (I have used the default configurations).
I have copied one core to another.
1) Now in this scenario, if I run the post.jar command to add a document to an index, will it be added to both cores?
2) If I query the Solr index, which core will fetch the result?
3) Which command should I use to post a new document for indexing in a particular core?
Did you shard it or was it replicated? Read up on this if you don't know.
1) Whether sharded or replicated, the cores are synchronized internally by Solr, so the document will be divided between or added to both cores.
2) It doesn't matter which one; Solr does that for you. You just need to have ZooKeeper ready to accept and balance requests between the cores.
3) You can't if you're adding data to a core that is replicated, but if you're sharding cores I think it's possible; it was answered here: How to index data in a specific shard using solrj
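For question 3, in a plain (non-cloud) multicore setup each core has its own update handler, so you point post.jar at that core's URL. The core name and document file below are placeholders; the live commands are commented:

```shell
# Post doc.xml to the "core1" core only (placeholder names).
CORE1_UPDATE="http://localhost:8983/solr/core1/update"

echo "$CORE1_UPDATE"
# java -Durl="$CORE1_UPDATE" -jar post.jar doc.xml
# or, equivalently, with curl:
# curl "$CORE1_UPDATE?commit=true" -H 'Content-Type: text/xml' -d @doc.xml
```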
Background: I just finished reading the Apache Solr 4 Cookbook. In it, the author mentions that setting up shards needs to be done wisely, because new ones cannot be added to an existing cluster. However, this was written for Solr 4.0, and at present I am using 4.1. Is this still the case? I wish I hadn't found this issue, and I'm hoping someone can tell me otherwise.
Question: Am I expected to know how much data I'll store in the future when setting up shards in a SolrCloud cluster?
I have played with Solandra and read up on Elasticsearch, but quite honestly I am a fan of Solr as it is (and of its large community!). I also like ZooKeeper. Am I stuck for now, or is there a workaround/patch?
Edit: If the answer to the question above is no, could I build a SolrCloud with a bunch of shards (maybe 100 or more), let them grow (internally), and as my data grows start peeling them off one by one and putting them on larger, faster servers with more resources?
Yes, of course you can. You have to set up a new Solr server pointing to the same ZooKeeper instance. During bootstrap the server connects to the ZK ensemble and registers itself as a cluster member.
Once the registration process is complete, the server is ready to create new cores. You can create replicas of the existing shards using CoreAdmin. You can also create new shards, but they won't be rebalanced: because of the Lucene index format (not all fields are stored), Solr may not have all the document information needed to rebalance the cluster, so only newly indexed or updated documents will reach the new server (this is not recommended).
When you set up your SolrCloud cluster, you have to plan for your document growth. For example, if you have 1M documents at first and you grow by 10k docs/day, set up the cluster with 5 shards; at the start you can host those shards on your initial two machines, and in the future, as needed, you can add new servers to the cluster and move shards onto them. Be careful not to overgrow your cluster: in Lucene, a single 20 GB index split across 5 shards won't be a 4 GB index on every shard. Each shard will take about (single_index_size / num_shards) * 1.1 (due to dictionary compression), and this may vary depending on your term frequencies.
The last option you have is to add the new servers to the cluster and, instead of adding new shards/replicas to the existing servers, set up a different collection using your new shards and reindex into this new collection in parallel. Then, once your reindexing process finishes, swap this collection with the old one.
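In SolrCloud, that final swap is usually done with a collection alias rather than a literal core swap. The collection and alias names below are placeholders; the live call is commented:

```shell
SOLR="http://localhost:8983/solr"    # placeholder host

# After reindexing into "products_v2", repoint the "products" alias at it;
# clients keep querying "products" and transparently hit the new collection.
ALIAS_URL="$SOLR/admin/collections?action=CREATEALIAS&name=products&collections=products_v2"

echo "$ALIAS_URL"
# curl "$ALIAS_URL"
```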
One solution to the problem is to use the "implicit router" when creating your Collection.
Let's say you have to index all the "Audit Trail" data of your application into Solr, and new data gets added every day. You would most probably want to shard by year.
You could do something like the below during the initial setup of your collection:
admin/collections?
action=CREATE&
name=AuditTrailIndex&
router.name=implicit&
shards=2010,2011,2012,2013,2014&
router.field=year
The above command:
a) Creates 5 shards - one each for the current year and the four before it: 2010, 2011, 2012, 2013, 2014
b) Routes data to the correct shard based on the value of the "year" field (specified as router.field)
In December 2014, you might add a new shard in preparation for 2015 using the CREATESHARD API (part of the Collections API). Do something like:
/admin/collections?
action=CREATESHARD&
shard=2015&
collection=AuditTrailIndex
The above command creates a new shard on the same collection.
When it's 2015, all data will automatically get indexed into the "2015" shard, assuming your data has the "year" field correctly populated with 2015.
In 2015, if you decide you no longer need the 2010 shard (based on your data retention requirements), you can use the DELETESHARD API to remove it:
/admin/collections?
action=DELETESHARD&
shard=2010&
collection=AuditTrailIndex
P.S. This solution only works if you used the "implicit" router when creating your collection. It does NOT work when you use the default "compositeId" router - i.e., collections created with the numShards parameter.
This feature is truly a game changer: it allows shards to be added dynamically based on the growing demands of your business.