Generate Solr cores from an existing one

I have a Solr core with hundreds of millions of documents.
I want to create 100 duplicates of this core where I only change 2-3 fields (time and ID) on the original docs and save them to the new cores (so each core contains different time data for testing).
I need this to work as fast as possible.
I was thinking of opening the core files with Lucene, reading the entire content, and writing the altered documents to a new index, but I realized I would need to configure all the analyzers of the destination core, which may be complex, and in addition not all of my fields are stored.
If there is a low-level API in Lucene to alter documents/indexes, I could copy the index files and change the documents at the lowest level.
Is anyone familiar with such an API?
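For reference, here is a minimal sketch of the read-and-rewrite approach described above, using plain Lucene. Everything concrete in it is assumed: the index paths, the "id" field name, the new ID scheme, and the single StandardAnalyzer standing in for the destination core's per-field analyzer configuration. As noted, the reader only hands back stored fields, so this does not solve the non-stored-field problem:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class CopyWithNewIds {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                 FSDirectory.open(Paths.get("/path/to/source/index")));
             IndexWriter writer = new IndexWriter(
                 FSDirectory.open(Paths.get("/path/to/target/index")),
                 new IndexWriterConfig(new StandardAnalyzer()))) {
            // Assumes the source index has no deletions; otherwise check liveDocs.
            for (int i = 0; i < reader.maxDoc(); i++) {
                Document doc = reader.document(i); // only stored fields come back
                doc.removeField("id");
                doc.add(new StringField("id", "copy-" + i, Field.Store.YES));
                // ...rewrite the time field the same way...
                writer.addDocument(doc);
            }
            writer.commit();
        }
    }
}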

Related

Full build of a Solr index with a large amount of data

I have a text file containing over 10 million records of web pages.
I want to build a Solr index from this file every day (because the file is updated daily).
Are there any effective solutions for fully building the Solr index in one pass, such as using a map-reduce model to accelerate the build process?
I think using the Solr API to add documents is a little slow.
It is not clear how much content is in those 10 million records, but it may actually be simple enough to index them in bulk. Just check your solrconfig.xml for your commit settings; you may, for example, have autoCommit configured with a low maxDocs setting. In your case, you may want to disable autoCommit completely and just commit manually at the end.
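For illustration, an autoCommit block in solrconfig.xml looks like the following (the values here are just examples). Removing or relaxing it and issuing a single manual commit after the bulk load avoids frequent intermediate commits:

<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>60000</maxTime>
</autoCommit>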
However, if it is still a bit slow, before going to map-reduce you could think about building a separate index and then swapping it with the current one.
This way, you still have the previous index to roll back to and/or compare against if needed. The new index can even be built on a different machine and/or closer to the data.
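The swap itself can be done with the CoreAdmin SWAP action once the new index is built (core names here are hypothetical):

/admin/cores?
action=SWAP&
core=webpages&
other=webpages_rebuild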

Understanding some concepts of Apache Solr

I am new to Apache Solr; can someone please explain the meaning of the following terms, with examples:
Solr Core
Solr Collection
Logical vs Physical index
Sharding
I went through various blog posts but I am not able to understand them.
The terminology is used a bit haphazardly, so you'll probably find texts that use a few of these terms interchangeably.
Solr core
A core is a named set of documents living on a single server. A server can have many cores. The core can be replicated to other servers (this is "old style" replication when done manually).
Solr Collection
A collection is a set of cores, from one to many. It is a logical description saying "these cores together form the entire collection". The term was introduced with SolrCloud, as that is the first time Solr handles clustering for you.
Logical vs Physical
A collection is a logical index - it can span many cores. Each core is a physical index (it has the actual index files from Lucene on its disk). You interact with the collection as you'd interact with the core, and all the details of clustering are (usually) hidden from you by Solr (in SolrCloud mode).
Sharding
Since a collection can span many cores, sharding means that the documents that make up a single collection are present in many cores. Each core is a "shard" of the total index. Compare this to replication, where a copy of a core is distributed to many Solr instances (the same documents are present in both cores, while when sharding the documents are just present in one core and you need all cores to have a complete collection).
Sharding is what makes it possible to store more documents than a single server can handle (or keep in memory/cache to respond quickly enough).
SolrCloud (added by me to make this all come together)
Previously (and still, if you're not using SolrCloud mode), sharding and replication were handled manually by the user when querying and configuring Solr. You set up replication to spread the same core across many servers, and you used sharding to make Solr query many Solr instances to get all the required documents. Today you'll usually just use SolrCloud and let Solr abstract away all these details. You'll come across these terms when creating a collection (numShards and replicationFactor), which tell Solr how many cores you want to spread the collection across and how many servers should hold copies of these cores.
Collection -> Sharded across [1..N] cores, replicated [0..M] times for redundancy and higher query throughput.
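For example, a Collections API call like the following (the collection name and the numbers are purely illustrative) creates a collection sharded across 3 cores, with each shard replicated onto 2 servers:

/admin/collections?
action=CREATE&
name=mycollection&
numShards=3&
replicationFactor=2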

How to merge the segments in Lucene (Solr)

I have a scenario that I need to merge the solr indexes online.
I have a primary Solr index of 100 GB; it is serving end users and cannot go offline for a moment. Every day, new Lucene indexes (2 GB) are generated separately.
I have tried Merging Indexes with CoreAdmin.
I even tried the IndexWriter addIndexes API, but no luck.
Both approaches create a new core or new folder, which means copying 100 GB to a new folder every time.
Is there a way I can do a segment level merging?
Your question is about merging two cores.
I will answer for Solr 5.
You can merge with the core API (see the example after this list).
You can merge with Lucene outside of Solr, create a core, and then swap it with the old one.
If you are using SolrCloud, you can use a list of cores for your collection via an ALIAS, or migrate documents from the new core to your central core.
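As a sketch of the core API option, the CoreAdmin MERGEINDEXES action merges a source core's index into a target core. The core names below are made up, and the source core should be committed and not receiving updates while the merge runs:

/admin/cores?
action=MERGEINDEXES&
core=primary&
srcCore=daily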

Correct use case of multiple cores in Solr 4

We use Solr 4.8 for our project.
One colleague created 2 cores in the same instance to index 80 GB of XML documents from the same source. He said that one core can contain a maximum of 50 GB of indexed data, so we split the 80 GB across 2 cores. These cores have the same config files and schema.
For indexing, he puts odd docs in the 1st core and even docs in the 2nd core.
For search, he uses the SolrJ API to query all documents from each core.
As we have only one server, distribution and replication aren't applied to the project.
My question: is this architecture a correct use case for Solr multiple cores? Does anyone have suggestions?
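For reference, a single request can already cover both cores using Solr's distributed search, by listing both cores in the shards parameter. Host and core names below are placeholders:

http://localhost:8983/solr/core1/select?
q=*:*&
shards=localhost:8983/solr/core1,localhost:8983/solr/core2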
Instead of storing two indexes and manually managing the placement of documents on different cores, you should use SolrCloud, which automatically distributes the data among the shards. It also allows you to distribute your data across multiple machines.
It will also improve your performance: querying becomes much easier, and you can add multiple collections (with different schemas) too.
You should be using SolrCloud, with a collection that has 2 shards. Take a look at https://cwiki.apache.org/confluence/display/solr/SolrCloud
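With SolrCloud, SolrJ can then treat the collection as a single logical index instead of two cores queried separately. A minimal sketch, assuming a ZooKeeper at localhost:2181 and a collection named "documents" (both invented here); CloudSolrServer is the SolrJ client class of the 4.x line:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class QueryCollection {
    public static void main(String[] args) throws Exception {
        // Connect through ZooKeeper; Solr fans the query out to both shards.
        CloudSolrServer server = new CloudSolrServer("localhost:2181");
        server.setDefaultCollection("documents");
        QueryResponse rsp = server.query(new SolrQuery("*:*"));
        System.out.println("Docs across all shards: " + rsp.getResults().getNumFound());
        server.shutdown();
    }
}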
Generally, cores are created to separate application data into different collection entities.
This also becomes useful when migrating core data from a lower version to a higher one. You can have many cores in Solr. Suppose you have data harvested from two different sources, say X and Y; we would generally store them in 2 separate cores.
In your case, having 2 cores over the same collection of data is a reasonable idea given how large the data is. Generally, a single core can accommodate a huge amount of data; in my opinion it is just a matter of your resource capacity (hardware configuration such as RAM and HDD).

Solr 4 Adding Shard to existing Cluster

Background: I just finished reading the Apache Solr 4 Cookbook. In it the author mentions that setting up shards needs to be done wisely, because new ones cannot be added to an existing cluster. However, this was written for Solr 4.0, and at present I am using 4.1. Is this still the case? I wish I hadn't found this issue, and I'm hoping someone can tell me otherwise.
Question: Am I expected to know how much data I'll store in the future when setting up shards in a SolrCloud cluster?
I have played with Solandra and read up on Elasticsearch, but quite honestly I am a fan of Solr as it is (and its large community!). I also like ZooKeeper. Am I stuck for now, or is there a workaround/patch?
Edit: If the answer to the question above is NO, could I build a SolrCloud cluster with a bunch of shards (maybe 100 or more), let them grow (internally) as my data grows, and start peeling them off one by one onto larger, faster servers with more resources?
Yes, of course you can. You have to set up a new Solr server pointing to the same ZooKeeper instance. During bootstrap, the server connects to the ZooKeeper ensemble and registers itself as a cluster member.
Once the registration process is complete, the server is ready to create new cores. You can create replicas of the existing shards using CoreAdmin. You can also create new shards, but they won't be balanced: because of the Lucene index format (not all fields are stored), Solr may not have all the document information needed to rebalance the cluster, so only newly indexed/updated documents will land on the new server (doing this is not recommended).
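In the pre-Collections-API style described here, a replica is added by creating a core that names the collection and shard it should join. The names below are illustrative:

/admin/cores?
action=CREATE&
name=mycollection_shard2_replica2&
collection=mycollection&
shard=shard2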
When you set up SolrCloud, you have to create the cluster taking your document growth into account: if you have 1M documents at first and they grow by 10k docs/day, set up the cluster with 5 shards. At the start you host these shards on your initial two machines, but in the future, as needed, you can add new servers to the cluster and move those shards onto them. Be careful not to overgrow your cluster, because in Lucene a single 20 GB index split across 5 shards won't be a 4 GB index on every shard. Every shard will take about (single_index_size/num_shards)*1.1 (due to dictionary compression). This may change depending on your term frequencies.
The last option is to add the new servers to the cluster and, instead of adding new shards/replicas to the existing servers, set up a new, separate collection using your new shards and reindex into this new collection in parallel. Then, once your reindexing process has finished, swap this collection with the old one.
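In SolrCloud, that final swap is typically done with a collection alias, so clients keep using one stable name while you repoint it at the newly built collection. The alias and collection names below are made up; re-issuing CREATEALIAS with the same alias name repoints it:

/admin/collections?
action=CREATEALIAS&
name=production&
collections=mycollection_v2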
One solution to the problem is to use the "implicit router" when creating your Collection.
Let's say you have to index all the "Audit Trail" data of your application into Solr, and new data gets added every day. You would most probably want to shard by year.
You could do something like the below during the initial setup of your collection:
/admin/collections?
action=CREATE&
name=AuditTrailIndex&
router.name=implicit&
shards=2010,2011,2012,2013,2014&
router.field=year
The above command:
a) Creates 5 shards, one each for the current year and the previous four: 2010, 2011, 2012, 2013, 2014
b) Routes data to the correct shard based on the value of the "year" field (specified as router.field)
In December 2014, you might add a new shard in preparation for 2015 using the CREATESHARD API (part of the Collections API). Do something like:
/admin/collections?
action=CREATESHARD&
shard=2015&
collection=AuditTrailIndex
The above command creates a new shard on the same collection.
When it's 2015, all data will automatically be indexed into the "2015" shard, assuming your data has the "year" field correctly populated with 2015.
In 2015, if you decide you no longer need the 2010 shard (based on your data retention requirements), you can use the DELETESHARD API to remove it:
/admin/collections?
action=DELETESHARD&
shard=2010&
collection=AuditTrailIndex
P.S. This solution only works if you used the "implicit" router when creating your collection. It does NOT work with the default "compositeId" router, i.e. collections created with the numShards parameter.
This feature is truly a game changer - it allows shards to be added dynamically based on the growing demands of your business.
