How to add shards dynamically to a collection in Solr?

When I create the collection with the following query, I set two shards for collection10.
/solr/admin/collections?action=CREATE&name=collection10&numShards=2&replicationFactor=2
But my requirement is that I have to add a 3rd shard dynamically after 10,000 documents have been indexed into the first two shards.
Is it possible to add shards dynamically once the collection has been created and indexing has started on the existing shards? If it is possible, how do I add shards dynamically to a running collection?
Also, is it possible to add replicas dynamically to a running collection? For example, I set replicationFactor=2, but later I need to add a new replica to the already running collection. Is that possible, and if so, how?

One solution to the problem is to use the "implicit router" when creating your Collection.
Solr does support adding new shards to (or deleting existing shards from) your index whenever you want via the "implicit router" configuration (CREATE COLLECTION API).
Let's say you have to index all "Audit Trail" data of your application into Solr, and new data gets added every day. You would most likely want to shard by year.
You could do something like the below during the initial setup of your collection:
admin/collections?
action=CREATE&
name=AuditTrailIndex&
router.name=implicit&
shards=2010,2011,2012,2013,2014&
router.field=year
The above command:
a) Creates 5 shards - one for the current year and one for each of the previous 4 years: 2010, 2011, 2012, 2013, 2014
b) Routes data to the correct shard based on the value of the "year" field (specified as router.field)
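For instance, a document indexed with year=2013 would land in the 2013 shard. A minimal sketch with curl (the host, port, and the fields other than "year" are assumptions for illustration):
curl -X POST 'http://localhost:8983/solr/AuditTrailIndex/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "audit-1", "year": "2013", "action": "login"}]'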
In December 2014, you might add a new shard in preparation for 2015 using the CREATESHARD API (part of the Collections API) - Do something like:
/admin/collections?
action=CREATESHARD&
shard=2015&
collection=AuditTrailIndex
The above command creates a new shard on the same collection.
When it's 2015, all data will get automatically indexed into the "2015" shard, assuming your documents have the "year" field correctly populated with 2015.
In 2015, if you think you don't need the 2010 shard (based on your data retention requirements) - you could always use the DELETESHARD API to do so:
/admin/collections?
action=DELETESHARD&
shard=2010&
collection=AuditTrailIndex
P.S. This solution only works if you used the "implicit router" when creating your collection. It does NOT work with the default "compositeId" router, i.e. collections created with the numShards parameter.
This feature is truly a game changer: it allows shards to be added dynamically based on the growing demands of your business.
Is this feature available in Elasticsearch? If not, I am sure it will be in time.

Now it's possible with Solr 4.4.0 (feature introduced in 4.3):
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-SplitaShard
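For reference, a shard split on the collection from the question might look like this (the shard name is assumed; SPLITSHARD creates two sub-shards, e.g. shard1_0 and shard1_1, covering the original hash range):
/admin/collections?
action=SPLITSHARD&
collection=collection10&
shard=shard1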

Currently you cannot add new shards once the collection is made
https://issues.apache.org/jira/browse/SOLR-3755

Related

Why are Solr time-series logs stored in different collections based on time, instead of in different shards based on time?

If you look at Lucidworks Time Based Partitioning or Large Scale Log Analytics with Solr, multiple Solr "collections" are created, partitioned on time.
My questions are:
Why not, in such cases, just create multiple shards based on time?
In the case of multiple collections, how would a query spanning multiple collections/time ranges be done?
There is not much difference between multiple shards with implicit routing or multiple collections. When you issue a query, you can (optionally) specify which shards or which collections to search.
Alternatively you can set up an alias containing multiple collections, thus hiding the logistics from the search client. This makes it easy to create custom views over the full data set, such as an alias for each year, one for everything and one for the last quarter. If you later decide to slice your data differently, e.g. make a collection for each week instead of each month, this change will be transparent to the client application. Aliases do not work for shards, so that is one reason to prefer collections.
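As a sketch, creating such an alias might look like this (the alias and collection names are made up for illustration):
/admin/collections?
action=CREATEALIAS&
name=logs_all&
collections=logs2014,logs2015
Queries sent to logs_all are then transparently fanned out to the underlying collections.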

When is Luke data distributed across Solr cores?

On a Solr installation with 2+ shards, when is the data returned by the LukeRequestHandler distributed across the shards? I ask because I want to be able to detect new (previously unseen) dynamic fields within a short amount of time after they are added.
Example desired sequence of events:
Assume dynamic field *_s
Query Luke and receive list of dynamic fields
Add document with field example_s
Query Luke and receive same list as before but with additional example_s in result (this currently doesn't happen)
Query collection for example_s:* and match the document added above
I am aware that newly added documents become immediately searchable even before being hard committed, but I am looking for a way to have that info appear in Luke too.
Info on the following would be useful:
Does Luke query all shards at request time, or just one? It would appear to only query one at random.
Exactly when does knowledge of previously unseen dynamic fields become distributed across all shards (equivalently, available to Luke)?
Can I configure the delay/trigger for this supposed Luke propagation in order to minimize the delay between addition of a document with a new dynamic field on an arbitrary shard and the moment it becomes visible in Luke responses on every other shard?
See https://issues.apache.org/jira/browse/SOLR-8127
Never.
As indicated by responses on the linked ticket, the Luke request handler isn't at a high enough level to understand multiple shards. Luke provides information about an index, not a collection, and certainly not a cluster.
You need to query each shard directly. This can be done by using the exact core path /solr/collection_shard1_replica1/admin/luke
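For example (host, port, and core names are assumptions; numTerms=0 just skips the per-field top-terms report to keep the responses small):
curl 'http://localhost:8983/solr/collection_shard1_replica1/admin/luke?numTerms=0&wt=json'
curl 'http://localhost:8983/solr/collection_shard2_replica1/admin/luke?numTerms=0&wt=json'
Merging the "fields" sections of the responses yourself gives the collection-wide view that Luke alone does not provide.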

Custom sharding or auto sharding on SolrCloud?

I want to establish a SolrCloud cluster for over 10 million news articles. After reading this article: Shards and Indexing Data in SolrCloud, I have a plan as follows:
Add a prefix such as ED2001! to the document ID, where ED identifies a newspaper source and 2001 is the year part of the article's published date, i.e. I want to put all news articles of a specific newspaper source published in a specific year into one shard.
Create the collection with router.name set to compositeId.
Add documents?
Query Collection?
Practically, I got some questions:
How do I add documents based on this plan? Do I have to specify special parameters when updating the collection/core?
Is this called "custom sharding"? If not, what is "custom sharding"?
Is auto sharding a better choice for my case, since there's a shard-splitting feature for when a shard grows too big?
Can I query without the _route_ parameter?
EDIT # 2015/9/2:
This is how I think SolrCloud will behave: "The number of news articles from a specific newspaper source in a specific year tends to be around a fixed number, e.g. every year ED has around 80,000 articles, so each shard's size won't increase dramatically. For next year's ED news articles, I only have to prefix the document IDs with 'ED2016!', SolrCloud will create a new shard for me (which contains all ED2016 articles), and later the leader will spread replicas of this new shard to other nodes (one replica per node other than the leader?)". Am I right? If yes, it seems there is no need for shard-splitting.
Answer-1: If you have the schema (structure) of the documents, you can provide it in the schema.xml configuration, or you can use Solr's schemaless mode for indexing. Schemaless mode will automatically identify the fields in your documents and index them; its configuration is a little different from schema-based configuration in Solr. Afterwards, you need to send the documents to Solr for indexing using curl or the SolrJ Java API. Essentially, Solr provides REST endpoints for all the different operations, so you can write the client in whichever language suits you best.
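As a rough sketch of that indexing step using the composite IDs from the plan (host, port, collection name, and fields are assumptions):
curl -X POST 'http://localhost:8983/solr/news/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "ED2001!article-42", "title": "Some headline"}]'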
Answer-2: What you have mentioned in your plan, the use of compositeId, is called custom sharding, because you are deciding which shard a particular document should go to.
Answer-3: I would suggest going with the auto-sharding feature if you are not certain how much data you will need to index now and in the future. As the index size increases you can split the shards and scale Solr horizontally.
Answer-4: I went through the Solr documentation and did not find _route_ mentioned as a mandatory parameter anywhere. In some situations, though, supplying it may improve query performance because it avoids the network latency of querying all the shards.
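For illustration, a query restricted to the shard(s) holding the ED2001! documents might look like this (collection and field names are assumptions; note the trailing ! is part of the route value):
/solr/news/select?q=title:election&_route_=ED2001!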
Answer-5: Auto-sharding means routing each document to a shard based on the hash range assigned when the shards were created. It does not create new shards automatically just because you specify a new prefix for the compositeId, so once the index grows large enough in size, you might need to split it. Check here for more.
This is actually a guide to answer my own question:
I kinda understand some concepts:
"custom sharding" IS NOT "custom hashing".
By default, Solr splits the hash value space evenly across the shards.
The compositeId router applies "custom hashing", because it changes the default hashing behavior by prefixing shard_key/num-of-bits (see the sketch after this list).
The implicit router applies "custom sharding", since we must manually specify which shard each doc is sent to.
The compositeId router is still auto sharding, since it is Solr that sees the shard_key prefix and routes the docs to specific shards.
The compositeId router requires the numShards parameter (possibly because Solr needs to distribute ranges of the hash value space across the shards).
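The /num-of-bits syntax mentioned above looks like this (the ID is made up; per the Solr docs' /N syntax, this takes 4 bits from the hash of ED2015 and the remaining bits from the document ID, spreading that prefix's documents over 1/16th of the hash range instead of pinning them all to one spot):
ED2015/4!article-123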
So obviously my strategy doesn't work, since I always need to add new years' news articles to Solr and there's no way I can predict the number of shards in advance. The implicit router therefore seems like a workable choice for me (we create the shards we need and add docs to the shards we intend).

How to add a node to SolrCloud dynamically without SPLITSHARD?

I have set up SolrCloud with 4 shards and added 8 nodes to it (4 leaders and 4 replicas), each node running on a different machine. But later I realized that my data is growing quickly (4 million documents daily), so my 4 shards are not sufficient, and I want to add one more shard to this SolrCloud dynamically. When I add a new node, it is created as a replica, which is not what I want. When I searched for this on Google, the answer I got was to use the Collections API SPLITSHARD. But SPLITSHARD splits an already existing shard, whereas my requirement is to add a new shard to this SolrCloud. How can I do this?
Any suggestion will be appreciated. Thanks in advance.
The answer is buried in the SolrCloud docs: see the section 'Resizing a Cluster' at https://cwiki.apache.org/confluence/display/solr/Nodes,+Cores,+Clusters+and+Leaders
Basically the process is:
Split a shard - now you will have two shards on that one machine
Set up a replica of this new shard on your new machine
Remove the new shard from the original machine; ZooKeeper will promote the replica to the leader for that shard
Set up a replica for that new shard
Very kludgy and manual process.
SolrCloud isn't very "Cloudy" i.e. elastic.
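For concreteness, a hedged sketch of the Collections API calls behind the steps above (the collection, shard, node, and replica names are made up; SPLITSHARD names the sub-shards shard1_0 and shard1_1):
/admin/collections?action=SPLITSHARD&collection=mycoll&shard=shard1
/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1_1&node=newhost:8983_solr
/admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1_1&replica=core_node5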
When you create the collection for the first time you make a very important decision: the sharding technique. Solr provides two different routers, implicit and compositeId.
If you set it to compositeId, you are asking Solr to calculate the shard from a field of your choice (the id by default): Solr computes a 32-bit integer hash from that field and allocates a range of hash values to each shard. You also need to specify the number of shards in advance, so that Solr can allocate a slice of the 32-bit range to each shard and route each document to the proper shard according to its hash value. For example, with 4 shards, if the hash happens to fall in the first quarter of the 32-bit range, the document goes to the first shard, and so on.
With this approach you cannot change the number of shards later on, because that would break the whole structure. You can still split one range into two separate sub-ranges, but you cannot simply extend the existing structure.
With the second way, implicit routing, you don't have to specify the number of shards in advance. Instead, you do the sharding manually in your application and provide a field containing the name of the target shard, so Solr can route each document directly without calculating anything. This way, you can create as many shards as you like in the future without affecting existing ones: you simply create a new shard by name, and your application starts populating future documents with the new name.
So, in your situation, if you already chose compositeId, you cannot add shards; you can only split existing ones. If you think your shard layout will change much in the future, I'd suggest rebuilding your cloud using implicit sharding.
Check out the Solr Collections API for more details: https://cwiki.apache.org/confluence/display/solr/Collections+API

Partial Update of documents

We have a requirement that documents that we currently index in SOLR may periodically need to be PARTIALLY UPDATED. The updates can either be
a. add new fields
b. update the content of existing fields.
Some of the fields in our schema are stored, others are not.
SOLR 4 does allow this but all the fields must be stored. See Update a new field to existing document and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
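For reference, a Solr 4 atomic update sends modifier maps such as "set" and "add" instead of plain values; a minimal sketch (host, port, collection, ID, and fields are assumptions):
curl -X POST 'http://localhost:8983/solr/collection1/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "doc1", "price": {"set": 99.9}, "tags": {"add": "sale"}}]'
Solr reconstructs the rest of the document from its stored fields, which is exactly why every field must be stored.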
Questions:
1. Is there a way that SOLR can achieve this? We've tried SOLR JOINs in the past but they weren't the right fit for all our use cases.
2. Alternatively, can Elasticsearch, LinkedIn's SenseiDB, or other text search engines achieve this?
For now, we manage by re-indexing the affected documents in full whenever they need to be updated.
Thanks
Solr has the limitation of stored fields, that's correct. The underlying Lucene always requires deleting the old document and indexing the new one. In fact, Lucene segments are write-once: it never goes back to modify existing ones; it only marks documents as deleted and removes them for real when a merge happens.
Search servers on top of Lucene try to work around this problem by exposing a single endpoint that is able to delete the old document and reindex the new one automatically, but there must be a way to retrieve the old document somehow. Solr can do that only if you store all the fields.
Elasticsearch works around it by storing the source documents by default, in a special field called _source. That is exactly the document you sent to the search engine in the first place, while indexing. This is, by the way, one of the features that make Elasticsearch similar to NoSQL databases. The Elasticsearch Update API allows you to update a document in two ways:
Sending a new partial document that will be merged with the existing one (still deleting the old one and indexing the result of the merge)
Executing a script on the existing document and indexing the result after deleting the old one
Both options rely on the presence of the _source field. Storing the source can be disabled, but if you disable it you of course lose this great feature.
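As an illustration of the first option, the Elasticsearch Update API of that era (pre-5.x URL layout) accepted a partial document under a "doc" key (index, type, ID, and field are made up):
curl -X POST 'http://localhost:9200/myindex/mytype/1/_update' \
  -d '{"doc": {"status": "archived"}}'
Elasticsearch fetches the stored _source, merges in the partial document, deletes the old version, and indexes the merged result.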
