I have set up SolrCloud with 4 shards and added 8 nodes to it (4 leaders and 4 replicas), each running on a different machine. But later I realized that my data is growing quickly (4 million files daily), so my 4 shards are not sufficient. I want to add one more shard to this SolrCloud dynamically, but when I add a new node it is created as a replica, which is not what I want. When I searched for this on Google, the answer I found was to use the Collection API's SPLITSHARD action. But SPLITSHARD splits an already existing shard, while my requirement is to add a new shard to the SolrCloud. How do I do this?
Any suggestion will be appreciated. Thanks in advance.
The answer is buried in the SolrCloud docs: see the 'Resizing a Cluster' section at https://cwiki.apache.org/confluence/display/solr/Nodes,+Cores,+Clusters+and+Leaders
Basically the process is:
1. Split a shard - now you will have two subshards on that one machine.
2. Set up a replica of the new subshard on your new machine.
3. Remove the new subshard from the original machine; ZooKeeper will promote the replica to leader for that shard.
4. Set up a further replica for that new shard.
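The manual steps above map onto Collections API requests. A minimal sketch of constructing those request URLs (the collection, shard, node, and replica names here are hypothetical placeholders; adjust them to your cluster):

```python
from urllib.parse import urlencode

# Hypothetical base URL and names: adjust to your cluster.
BASE = "http://localhost:8983/solr/admin/collections"

def collections_api(action, **params):
    """Build a Collections API request URL for the given action."""
    query = urlencode({"action": action, **params})
    return f"{BASE}?{query}"

# 1. Split shard1 into two subshards (shard1_0 and shard1_1) in place.
split_url = collections_api("SPLITSHARD", collection="collection1",
                            shard="shard1")

# 2. Add a replica of one subshard on the new machine.
add_url = collections_api("ADDREPLICA", collection="collection1",
                          shard="shard1_0", node="newnode:8983_solr")

# 3. Remove the subshard's replica from the original machine;
#    ZooKeeper promotes the remaining replica to leader.
del_url = collections_api("DELETEREPLICA", collection="collection1",
                          shard="shard1_0", replica="core_node1")

print(split_url)
```

Each URL would then be issued with curl or an HTTP client against a live cluster.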
Very kludgy and manual process.
SolrCloud isn't very "Cloudy" i.e. elastic.
When you create the collection for the first time, you make a very important decision: the sharding technique. Solr provides two different routers, implicit and compositeId.
If you set it to compositeId, you are asking Solr to calculate the shard from a field of your choice (the id by default). Solr computes a 32-bit integer hash from that field and allocates a sub-range of the hash space to each shard, which is why you must also specify the number of shards in advance. According to the hash value, each document is routed to the shard owning its range. For example, with 4 shards, if the hash happens to fall in the first quarter of the 32-bit range, the document goes to the first shard, and so on.
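The range allocation described above can be sketched in a few lines. This is an illustration of the even-split idea only, not Solr's actual MurmurHash3-based implementation:

```python
# Illustrative sketch of compositeId-style range allocation: the signed
# 32-bit hash space is split evenly across the shards, and a document's
# hash value picks the shard owning its range.

RANGE_MIN, RANGE_MAX = -(1 << 31), (1 << 31) - 1

def shard_ranges(num_shards):
    """Split the 32-bit hash space into num_shards contiguous ranges."""
    step = (1 << 32) // num_shards
    ranges = []
    for i in range(num_shards):
        lower = RANGE_MIN + i * step
        upper = RANGE_MIN + (i + 1) * step - 1 if i < num_shards - 1 else RANGE_MAX
        ranges.append((lower, upper))
    return ranges

def route(hash_value, ranges):
    """Return the index of the shard whose range contains hash_value."""
    for i, (lo, hi) in enumerate(ranges):
        if lo <= hash_value <= hi:
            return i
    raise ValueError("hash outside 32-bit range")

ranges = shard_ranges(4)
# A hash in the first quarter of the space goes to shard 0, and so on.
```

This also makes it clear why the shard count is fixed: changing `num_shards` redraws every boundary, which is exactly the "break the whole structure" problem described below.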
With this approach you cannot change the number of shards later on, because that would break the whole structure. You can still split one range into two separate sub-ranges, but you cannot simply extend the existing structure.
The second router, implicit, does not require you to specify the number of shards in advance. Instead, you do the sharding manually in your application and provide a field that carries the name of the target shard, so Solr can route the document directly without calculating anything. This way you can create as many shards as you like in the future without affecting existing ones: you simply create a new shard by name, and your application starts populating future documents with that name.
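The application-side half of implicit routing is trivial to sketch. Here the routing field and shard names are hypothetical; the point is that the application, not Solr, owns the shard decision:

```python
# With implicit routing the application picks the shard; Solr just reads
# the routing field (router.field) and sends the document to that shard.
# The field name "shard_name" and the shard names are hypothetical.

shards = {"2014", "2015"}  # shards created so far

def target_shard(doc, routing_field="shard_name"):
    """Return the shard a document should go to, or raise if unknown."""
    shard = doc[routing_field]
    if shard not in shards:
        raise ValueError(f"no such shard: {shard}")
    return shard

# Adding capacity later is just creating a new shard by name and
# populating future documents with that name.
shards.add("2016")
doc = {"id": "article-1", "shard_name": "2016"}
```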
So, in your situation, if you already chose compositeId, you cannot add shards; you can only split existing ones. If you expect your shard layout to change much in the future, I'd suggest rebuilding your cloud using implicit sharding.
Check out the Solr Collections API for more details: https://cwiki.apache.org/confluence/display/solr/Collections+API
I'm trying to set up multiple solr cores (the data for each core is indexed using norconex, crawling entirely separate sites). The schema and solrconfig files are the same for all cores but there is a copy in each of their respective conf folders.
When I run a query in the admin UI for core 1, I'm getting a mix of results from info indexed to cores 2 and 3 as well. How do I keep them entirely separate? It was my understanding that having separate cores would do this by default?
I've tried clearing all documents from cores 2 and 3, but core 1 still pulls up their docs. Thanks for any help anyone can provide.
This should not be happening. So, something has gone wrong. Possible options, from most likely and down:
You are - accidentally - indexing into a single core (as mentioned in the comments). This is the most likely explanation. Perhaps you got a URL wrong, or the software uses some old convention of naming the core through URL parameters. Try to intercept the URLs actually used for indexing and see how they differ when the software thinks it is indexing into different cores. The core name should be in the URL itself (e.g. http://server:8983/solr/core1).
You have created a SolrCloud collection but are trying to index into individual cores of that collection. You should be able to check that in Admin UI and usually the core names are quite noticeably specific.
You have created an alias that spans multiple cores and are querying that instead of individual cores.
You have accidentally pointed several of your cores to the same data directory.
You did not say what happens when you query core2. If it has no documents, then the first explanation above is the most likely one. If it does, other issues may be in play.
The issue you're describing above sounds like it could be that you have cores 1 through 3 on the same shard. That means that they would be replicas of each other and have the same data. If core1 were to be killed and replaced with another core, then data from the other cores would be replicated to the new core when the new core was added to the collection.
If you want subsets of documents in three separate cores (the physical locations), then those cores need to live in three separate shards (the logical locations). This can be accomplished using routing.
The compositeId router will let you send documents or queries to specific shards. The documentation shows an example of using data from a company field as part of the routing key value like this: "IBM!12345"
The exclamation point is a separator to break the key into the various parts used for creating the shard hash value. This allows sending "IBM" data to one shard, and "YOYODYNE" can be sent to another shard.
If "YOYODYNE" had way more documents than "IBM", then you might want to spread documents for "YOYODYNE" across multiple shards. The documentation says to use something like this:
Another use case could be if the customer "IBM" has a lot of documents and you want to spread it across multiple shards. The syntax for such a use case would be shard_key/num!document_id, where /num is the number of bits from the shard key to use in the composite hash. So IBM/3!12345 will take 3 bits from the shard key and 29 bits from the unique doc id, spreading the tenant over 1/8th of the shards in the collection. Likewise, if the num value was 2, it would spread the documents across 1/4th the number of shards. At query time, you include the prefix(es) along with the number of bits in your query with the _route_ parameter (i.e., q=solr&_route_=IBM/3!) to direct queries to specific shards.
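The arithmetic in that quote is easy to check directly: spending num bits of the composite hash on the shard key confines a tenant to 1/2^num of the shards. A sketch of the math only, not of Solr internals:

```python
def shard_fraction(num_bits):
    """Fraction of the collection's shards a tenant with num_bits of
    shard-key bits can land on: 1 / 2**num_bits."""
    return 1 / (2 ** num_bits)

def shards_used(num_bits, total_shards):
    """Approximate number of shards the tenant spreads across."""
    return max(1, int(total_shards * shard_fraction(num_bits)))

# IBM/3!... -> 3 shard-key bits -> 1/8th of the shards.
# In a hypothetical 32-shard collection, that is about 4 shards.
```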
I want to set up a SolrCloud cluster for over 10 million news articles. After reading this article: Shards and Indexing Data in SolrCloud, I have a plan as follows:
Add a prefix like ED2001! to the document ID, where ED identifies a newspaper source and 2001 is the year from the article's publication date; i.e. I want to put all news articles of a specific newspaper source published in a specific year into one shard.
Create the collection with router.name set to compositeId.
Add documents?
Query Collection?
Practically, I got some questions:
How do I add documents based on this plan? Do I have to specify special parameters when updating the collection/core?
Is this called "custom sharding"? If not, what is "custom sharding"?
Is auto sharding a better choice for my case since there's a shard-splitting feature for auto sharding when the shard is too big?
Can I query without the _route_ parameter?
EDIT # 2015/9/2:
This is how I think SolrCloud will behave: "The number of news articles from a specific newspaper source in a specific year tends to stay around a fixed number, e.g. every year ED has around 80,000 articles, so each shard's size won't increase dramatically. For the next year's ED articles, I only have to add the prefix 'ED2016!' to the document ID; SolrCloud will create a new shard for me (containing all ED2016 articles), and later the leader will spread replicas of this new shard to other nodes (one replica per node other than the leader?)." Am I right? If yes, it seems there is no need for shard splitting.
Answer-1: If you have the schema (structure) of the documents, you can describe it in the schema.xml configuration, or you can use Solr's schemaless mode for indexing. Schemaless mode automatically identifies the fields in your documents and indexes them; its configuration is a little different from schema-based configuration. Afterwards, you send the documents to Solr for indexing using curl or the SolrJ Java API. Essentially, Solr provides REST endpoints for all the different operations, so you can write the client in whichever language suits you best.
Answer-2: What you have mentioned in your plan, use of compositeId, is called custom sharding. Because you are deciding to which shard a particular document should go.
Answer-3: I would suggest going with the auto-sharding feature if you are not certain how much data you need to index now and in the future. As the index size increases, you can split the shards and scale Solr horizontally.
Answer-4: I went through the Solr documentation and did not find _route_ mentioned anywhere as a mandatory parameter. But in some situations it may improve query performance, because it avoids the network latency of querying all the shards.
Answer-5: Auto-sharding means routing a document to a shard based on the hash range assigned when the shards were created. It does not create new shards automatically just because you specify a new prefix for the compositeId. So once the index grows large enough, you might need to split it. Check here for more.
This is actually a guide to answer my own question:
I kinda understand some concepts:
"custom sharding" IS NOT "custom hashing".
By default, Solr splits the hash range evenly across shards.
The compositeId router applies "custom hashing", because it changes the default hashing behavior via the shard_key/num-of-bits prefix.
The implicit router applies "custom sharding", since we manually specify which shard our docs are sent to.
The compositeId router is still auto-sharding, since it is Solr that sees the shard_key prefix and routes the docs to specific shards.
The compositeId router requires the numShards parameter (presumably because Solr needs to distribute the hash value ranges across the shards).
So obviously my strategy doesn't work, since I always need to add the new year's news articles to Solr and there is no way to predict the number of shards in advance. In other words, the implicit router seems a possible choice for me (we create the shards we need and add docs to the shards we intend).
What happens if the physical disk space of one of the shards' nodes in SolrCloud fills up? Will index requests to those nodes, or that shard, be redirected to other shards that still have space?
The short answer is: not easily, and not automatically just because a specific shard is full. The reason is that the 32-bit hash range is split evenly between the shards; Solr uses the murmur hash algorithm, which keeps the number of documents in each shard roughly balanced, so most of your nodes will start hitting the same limits at almost the same time. You therefore need to monitor your indexes and plan ahead. You have two options in this context.
First, custom hashing allows you to route documents to specific shards based on some common field value, such as a tenant ID; another example would be routing documents based on category. The biggest concern with custom hashing is that it may create unbalanced shards in your cluster.
The second option is shard splitting, which lets you split an existing shard into two subshards: use the SPLITSHARD action of the Collections API to split the shard, issue a "hard" commit after the split completes to make the new subshards active, then unload the original shard from the cluster.
But if you still choose to force documents to a specific shard because you know another shard is full, you can do it this way: Solr 4.5 added the ability to specify the router implementation with the router.name parameter. If you use the "compositeId" router, you can send documents with a prefix in the document ID, which will be used to calculate the hash Solr uses to determine the shard a document is sent to for indexing. The prefix can be anything you like (it doesn't have to be the shard name, for example), but it must be consistent so that Solr behaves consistently. For example, if you wanted to co-locate documents for a customer, you could use the customer name or ID as the prefix. If your customer is "IBM" and a document has the ID "12345", you would insert the prefix into the document ID field as "IBM!12345". The exclamation mark ('!') is critical here, as it marks the shard key that determines where the document is directed.
You can read more about it here: https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
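Constructing such routed IDs is plain string composition. A small sketch (the customer names and doc IDs are hypothetical examples):

```python
def routed_id(prefix, doc_id, bits=None):
    """Compose a compositeId document ID, e.g. 'IBM!12345' or,
    with a bit count for spreading a tenant, 'IBM/3!12345'."""
    shard_key = f"{prefix}/{bits}" if bits is not None else prefix
    return f"{shard_key}!{doc_id}"

def split_routed_id(full_id):
    """Split a routed ID back into (shard_key, doc_id) at the first '!'."""
    shard_key, _, doc_id = full_id.partition("!")
    return shard_key, doc_id
```

The '!' separator must be used consistently on every document for the co-location to hold.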
When I create the collection, I set two shards for collection10 using the following request:
/solr/admin/collections?action=CREATE&name=collection10&numShards=2&replicationFactor=2
But my requirement is that I have to add a 3rd shard dynamically after 10,000 documents have been indexed into the first two shards.
Is it possible to add shards dynamically once the collection has been created and indexing has started on the existing shards? If so, how?
Also, is it possible to add replicas dynamically after the collection is started? For example, I set replicationFactor=2; later I need to add a new replica to the already-running collection. Is that possible, and if so, how?
One solution to the problem is to use the "implicit router" when creating your Collection.
Solr does support the ability to add new shards (or DELETE existing shards) to your index, whenever you want, via the "implicit router" configuration (CREATE COLLECTION API).
Let's say you have to index all the "Audit Trail" data of your application into Solr, and new data gets added every day. You would most probably want to shard by year.
You could do something like the below during the initial setup of your collection:
admin/collections?
action=CREATE&
name=AuditTrailIndex&
router.name=implicit&
shards=2010,2011,2012,2013,2014&
router.field=year
The above command:
a) Creates 5 shards - one each for the current and the last 4 years 2010,2011,2012,2013,2014
b) Routes data to the correct shard based on the value of the "year" field (specified as router.field)
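What router.field=year means can be sketched in a few lines: the value of the "year" field names the target shard directly. Shard names match the CREATE example above; the document contents are hypothetical:

```python
# Sketch of router.field=year for the AuditTrailIndex example: Solr
# routes each document to the shard named by its "year" field.

shards = {"2010", "2011", "2012", "2013", "2014"}

def route_by_year(doc):
    """Return the shard name for a document, per router.field=year."""
    shard = str(doc["year"])
    if shard not in shards:
        raise ValueError(f"shard {shard} does not exist yet; "
                         "create it with CREATESHARD first")
    return shard

doc = {"id": "audit-1", "year": 2013}
```

Note that a document for a year with no matching shard cannot be routed, which is why the 2015 shard must be created ahead of time, as described next.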
In December 2014, you might add a new shard in preparation for 2015 using the CREATESHARD API (part of the Collections API) - Do something like:
/admin/collections?
action=CREATESHARD&
shard=2015&
collection=AuditTrailIndex
The above command creates a new shard on the same collection.
When it's 2015, all data will automatically be indexed into the "2015" shard, assuming your documents have the "year" field correctly populated with 2015.
In 2015, if you think you don't need the 2010 shard (based on your data retention requirements) - you could always use the DELETESHARD API to do so:
/admin/collections?
action=DELETESHARD&
shard=2010&
collection=AuditTrailIndex
P.S. This solution only works if you used the "implicit router" when creating your collection. It does NOT work with the default "compositeId" router - i.e. collections created with the numShards parameter.
This feature is truly a gamechanger - allows shards to be added dynamically based on growing demands of your business.
Is this feature available in Elasticsearch? If not, I am sure they will add it in time.
Now it's possible with Solr 4.4.0 (feature introduced in 4.3):
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-SplitaShard
Currently you cannot add new shards once the collection is made
https://issues.apache.org/jira/browse/SOLR-3755
We have a requirement that documents that we currently index in SOLR may periodically need to be PARTIALLY UPDATED. The updates can either be
a. add new fields
b. update the content of existing fields.
Some of the fields in our schema are stored, others are not.
SOLR 4 does allow this but all the fields must be stored. See Update a new field to existing document and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
Questions:
1. Is there a way that SOLR can achieve this? We've tried SOLR JOINs in the past but they weren't the right fit for all our use cases.
On the other hand, can Elasticsearch, LinkedIn's SenseiDB or other text search engines achieve this?
For now, we manage by re-indexing the affected documents when they need to be updated.
Thanks
Solr has the stored-fields limitation, that's correct. The underlying Lucene always requires deleting the old document and indexing the new one. In fact, Lucene segments are write-once; it never goes back to modify existing ones, so it only marks documents as deleted and removes them for real when a merge happens.
Search servers on top of lucene try to work around this problem by exposing a single endpoint that's able to delete the old document and reindex the new one automatically, but there must be a way to retrieve the old document somehow. Solr can do that only if you store all the fields.
Elasticsearch works around it storing the source documents by default, in a special field called _source. That's exactly the document that you sent to the search engine in the first place, while indexing. This is by the way one of the features that make elasticsearch similar to NoSQL databases. The elasticsearch Update API allows you to update a document in two ways:
Sending a new partial document that will be merged with the existing one (still deleting the old document and indexing the result of the merge)
Executing a script on the existing document and indexing the result after deleting the old one
Both options rely on the presence of the _source field. Storing the source can be disabled, if you disable it you of course lose this great feature.
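The "merged with the existing one" behavior above can be sketched as a shallow merge over the stored _source. This is an illustration of the semantics only, not Elasticsearch's implementation (field names are hypothetical):

```python
def partial_update(stored_source, partial_doc):
    """Sketch of a partial update: merge the partial document into the
    stored _source; the merged result would then be reindexed whole
    (after the old document is deleted)."""
    merged = dict(stored_source)   # start from the stored _source
    merged.update(partial_doc)     # new fields added, existing overwritten
    return merged

old = {"id": "1", "title": "Solr sharding", "views": 10}
patch = {"views": 11, "tags": ["solr"]}
```

This is why disabling _source removes the feature: without the stored original, there is nothing to merge the partial document into.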