Solr node's physical space utilization in a shard

What will happen if the physical disk space on a node hosting one of the shards in SolrCloud fills up? Will index requests to that node or shard be redirected to other shards that still have space?

The short answer is: not easily, and not automatically just because a specific shard is full. The 32-bit hash range is split evenly between the shards, and Solr uses the MurmurHash algorithm, which keeps the number of documents in each shard (roughly) balanced. Most of your nodes will therefore start hitting the same limits at almost the same time, so you need to monitor your indexes and plan ahead. You have two options in this context.
First, custom hashing allows you to route documents to specific shards based on some common field value, such as a tenant ID; another example would be routing documents by category. The biggest concern with custom hashing is that it may create unbalanced shards in your cluster.
The second option is shard splitting, which allows you to split an existing shard into two subshards. To do this, use the SPLITSHARD action of the Collections API to split the existing shard, issue a "hard" commit after the split process completes to make the new subshards active, and then unload the original shard from the cluster.
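The three-step split flow above can be sketched as the requests it boils down to. This is a sketch only: the host, collection, shard, and core names are assumptions, and the URLs are built but not sent.

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumed host/port

def splitshard_url(collection, shard):
    """Collections API request that splits `shard` into two subshards."""
    return f"{SOLR}/admin/collections?" + urlencode(
        {"action": "SPLITSHARD", "collection": collection, "shard": shard})

def hard_commit_url(collection):
    """Hard commit that makes the new subshards' data visible."""
    return f"{SOLR}/{collection}/update?" + urlencode({"commit": "true"})

def unload_url(core):
    """CoreAdmin request that unloads the original shard's core."""
    return f"{SOLR}/admin/cores?" + urlencode({"action": "UNLOAD", "core": core})
```

Issuing these three in order (and waiting for the split to finish before committing) mirrors the recipe described above.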
But if you still choose to force a document to a specific shard because you know another shard is full, you can do it this way: Solr 4.5 added the ability to specify the router implementation with the router.name parameter. If you use the "compositeId" router, you can send documents with a prefix in the document ID, which is used to calculate the hash Solr uses to determine the shard a document is sent to for indexing. The prefix can be anything you'd like it to be (it doesn't have to be a shard name, for example), but it must be consistent so that Solr behaves consistently. For example, if you wanted to co-locate documents for a customer, you could use the customer name or ID as the prefix. If your customer is "IBM", for example, with a document with the ID "12345", you would insert the prefix into the document id field: "IBM!12345". The exclamation mark ('!') is critical here, as it defines the shard to direct the document to.
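A minimal sketch of building such prefixed IDs; the `routed_id` helper and the field names are illustrative, not part of any Solr API:

```python
def routed_id(shard_key, doc_id, bits=None):
    """Build a compositeId document ID: 'key!id', or 'key/bits!id' when a
    bit count is given. The '!' tells the compositeId router to hash the
    shard_key portion when choosing the target shard."""
    prefix = f"{shard_key}/{bits}" if bits is not None else shard_key
    return f"{prefix}!{doc_id}"

# all documents prefixed with the same customer land on the same shard
doc = {"id": routed_id("IBM", "12345"), "customer": "IBM"}
```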
You can read more about it here: https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud

Related

How do I separate solr cores so their results don't mix?

I'm trying to set up multiple solr cores (the data for each core is indexed using norconex, crawling entirely separate sites). The schema and solrconfig files are the same for all cores but there is a copy in each of their respective conf folders.
When I run a query in the admin UI for core 1, I'm getting a mix of results from info indexed to cores 2 and 3 as well. How do I keep them entirely separate? It was my understanding that having separate cores would do this by default?
I've tried clearing all documents from cores 2 and 3, but core 1 still pulls up their docs. Thanks for any help anyone can provide.
This should not be happening. So, something has gone wrong. Possible options, from most likely and down:
You are - accidentally - indexing into a single core (as mentioned in the comments). This is most likely. Perhaps you got the URL wrong, or the software is using some old convention of naming the core through URL parameters. Try to intercept the URLs actually used for indexing and see how they differ when the software thinks it is indexing into different cores. The core name should be in the URL itself (e.g. http://server:8983/solr/core1).
You have created a SolrCloud collection but are trying to index into individual cores of that collection. You should be able to check that in Admin UI and usually the core names are quite noticeably specific.
You have created an alias that spans multiple cores and are querying that instead of individual cores.
You have accidentally pointed several of your cores to the same data directory.
You did not say what happens when you query core2. If it does not have any documents, then the first explanation is most likely. If it does, there may be other issues in play.
The issue you're describing above sounds like it could be that you have cores 1 through 3 on the same shard. That means that they would be replicas of each other and have the same data. If core1 were to be killed and replaced with another core, then data from the other cores would be replicated to the new core when the new core was added to the collection.
If you want subsets of documents in three separate cores (the physical locations), then those cores need to live in three separate shards (the logical locations). This can be accomplished using routing.
The compositeId router will let you send documents or queries to specific shards. The documentation shows an example of using data from a company field as part of the routing key value like this: "IBM!12345"
The exclamation point is a separator to break the key into the various parts used for creating the shard hash value. This allows sending "IBM" data to one shard, and "YOYODYNE" can be sent to another shard.
If "YOYODYNE" had way more documents than "IBM", then you might want to spread documents for "YOYODYNE" across multiple shards. The documentation says to use something like this:
Another use case could be if the customer "IBM" has a lot of documents
and you want to spread it across multiple shards. The syntax for such
a use case would be: shard_key/num!document_id where the /num is the
number of bits from the shard key to use in the composite hash.
So IBM/3!12345 will take 3 bits from the shard key and 29 bits from the unique doc id, spreading the tenant over 1/8th of the shards
in the collection. Likewise if the num value was 2 it would spread the
documents across 1/4th the number of shards. At query time, you
include the prefix(es) along with the number of bits into your query
with the _route_ parameter (i.e., q=solr&_route_=IBM/3!) to direct
queries to specific shards.
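The bit arithmetic the documentation describes can be simulated. Note that Solr really uses MurmurHash3; the FNV-1a hash below is a stand-in purely to keep the sketch self-contained, so only the bit-splitting behavior (not the actual shard assignments) matches Solr.

```python
MASK32 = 0xFFFFFFFF

def fnv1a_32(s):
    """Toy 32-bit hash (FNV-1a). Stand-in for Solr's MurmurHash3."""
    h = 0x811C9DC5
    for b in s.encode("utf-8"):
        h = ((h ^ b) * 0x01000193) & MASK32
    return h

def composite_hash(shard_key, doc_id, bits):
    """Take the top `bits` bits from the shard-key hash and the remaining
    (32 - bits) bits from the doc-id hash, as 'key/bits!id' routing does."""
    high_mask = (MASK32 << (32 - bits)) & MASK32
    return (fnv1a_32(shard_key) & high_mask) | (fnv1a_32(doc_id) & ~high_mask & MASK32)

# with /3, every "IBM" document shares its top 3 bits, so all IBM docs
# fall into the same 1/8th of the hash range
```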

Custom sharding or auto Sharding on SolrCloud?

I want to establish a SolrCloud cluster for over 10 million news articles. After reading this article: Shards and Indexing Data in SolrCloud, I have a plan as follows:
Add the prefix ED2001! to the document ID, where ED identifies the newspaper source and 2001 is the year from the article's publication date; i.e., I want to put all news articles from a specific newspaper source published in a specific year into one shard.
Create the collection with router.name set to compositeId.
Add documents?
Query Collection?
Practically, I got some questions:
How do I add documents based on this plan? Do I have to specify special parameters when updating the collection/core?
Is this called "custom sharding"? If not, what is "custom sharding"?
Is auto sharding a better choice for my case since there's a shard-splitting feature for auto sharding when the shard is too big?
Can I query without the _route_ parameter?
EDIT # 2015/9/2:
This is how I think SolrCloud will behave: the number of news articles from a specific newspaper source in a specific year tends to be roughly fixed, e.g. every year ED has around 80,000 articles, so each shard's size won't grow dramatically. For next year's ED articles, I only have to add the prefix 'ED2016!' to the document ID; SolrCloud will create a new shard for me (containing all ED2016 articles), and later the leader will spread replicas of this new shard to other nodes (one replica per node other than the leader?). Am I right? If yes, there seems to be no need for shard splitting.
Answer-1: If you have the schema (structure) of the documents, you can define it in the schema.xml configuration, or you can use Solr's schemaless mode for indexing. Schemaless mode automatically identifies the fields in your documents and indexes them; its configuration is a little different from schema-based mode. Afterwards, you send the documents to Solr for indexing using curl or the SolrJ Java API. Essentially, Solr provides REST endpoints for all the different operations, so you can write the client in whichever language suits you best.
Answer-2: What you have described in your plan, the use of compositeId, is called custom sharding, because you are deciding which shard a particular document should go to.
Answer-3: I would suggest going with the auto-sharding feature if you are not certain how much data you will need to index now and in the future. As the index size grows, you can split the shards and scale Solr horizontally.
Answer-4: I went through the Solr documentation and did not find _route_ mentioned anywhere as a mandatory parameter. But in some situations it may improve query performance, because it avoids the network latency of querying all the shards.
Answer-5: Auto-sharding means routing a document to a shard based on the hash range assigned when the shards were created. It does not create new shards automatically just because you specify a new compositeId prefix. So once the index grows large enough in size, you might need to split it. Check here for more.
This is actually a guide to answer my own question:
I kinda understand some concepts:
"custom sharding" IS NOT "custom hashing".
Solr splits the hash range evenly across shards as its default hashing behavior.
The compositeId router applies "custom hashing" because it changes the default hashing behavior via the shard_key/num-of-bits prefix.
The implicit router applies "custom sharding", since we must manually specify which shard each doc is sent to.
The compositeId router is still auto-sharding, since it is Solr that sees the shard_key prefix and routes the docs to specific shards.
The compositeId router requires the numShards parameter (possibly because Solr needs to distribute the hash value ranges across the shards).
So obviously my strategy doesn't work, since I will always be adding new years' news articles to Solr and there is no way to predict the number of shards in advance. The implicit router therefore seems the right choice for me (we create the shards we need and add docs to the shards we intend).
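Under that conclusion, the implicit-router setup could look like the following request sketch. The collection name, shard names, and the router.field name are assumptions; the URLs are only built, not sent.

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr/admin/collections"  # assumed host/port

def create_implicit_collection(name, shards, route_field):
    """CREATE with router.name=implicit: shards are named up front, and
    each document's `route_field` value names its target shard."""
    return f"{SOLR}?" + urlencode({
        "action": "CREATE", "name": name, "router.name": "implicit",
        "shards": ",".join(shards), "router.field": route_field})

def create_shard_url(collection, shard):
    """CREATESHARD adds a brand-new shard later, e.g. for a new year.
    This is only possible with the implicit router."""
    return f"{SOLR}?" + urlencode(
        {"action": "CREATESHARD", "collection": collection, "shard": shard})
```

When 2017 arrives, a single CREATESHARD call for "ED2017" extends the collection without touching the existing shards, which is exactly what the yearly-ingest plan needs.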

How to add a node to SolrCloud dynamically without SPLITSHARD?

I have set up SolrCloud with 4 shards and added 8 nodes (4 leaders and 4 replicas), each running on a different machine. But later I realized that my data is growing quickly (4 million documents daily), so my 4 shards are not sufficient. I want to add one more shard to this SolrCloud cluster dynamically. When I add a new node, it is created as a replica, which is not what I want. When I search for this on Google, the answer I get is to use the Collections API SPLITSHARD action, but that splits an already existing shard. My requirement is to add a new shard to this SolrCloud cluster. How do I do this?
Any suggestion will be appreciated. Thanks in advance.
The answer is buried in the SolrCloud docs; see the 'Resizing a Cluster' section of https://cwiki.apache.org/confluence/display/solr/Nodes,+Cores,+Clusters+and+Leaders
Basically the process is:
Split a Shard - now you will have two shards on that one machine
Setup a replica of this new shard on your new machine
Remove the new shard from the original machine. ZooKeeper will promote the replica to the leader for that shard.
Setup a replica for that new shard
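The steps above can be sketched as an ordered list of Collections API calls. The collection, shard, node, and replica names are placeholders (the real subshard and replica names come back in the SPLITSHARD response), and the URLs are built but not sent.

```python
from urllib.parse import urlencode

API = "http://localhost:8983/solr/admin/collections"  # assumed host/port

def resize_plan(collection, shard, new_node):
    """The manual 'resize a cluster' recipe as ordered API calls:
    split the shard, put a replica of a subshard on the new machine,
    then drop the now-redundant copy on the original machine."""
    sub = shard + "_0"  # placeholder subshard name
    return [
        ("split", f"{API}?" + urlencode({"action": "SPLITSHARD",
            "collection": collection, "shard": shard})),
        ("replicate", f"{API}?" + urlencode({"action": "ADDREPLICA",
            "collection": collection, "shard": sub, "node": new_node})),
        ("cleanup", f"{API}?" + urlencode({"action": "DELETEREPLICA",
            "collection": collection, "shard": sub,
            "replica": "core_node1"})),  # placeholder replica name
    ]
```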
Very kludgy and manual process.
SolrCloud isn't very "Cloudy" i.e. elastic.
When you create the collection for the first time, you make a very important decision: the sharding technique. Solr provides two different approaches, implicit and compositeId.
If you set it to compositeId, you are asking Solr to calculate the shard from a field of your choice (the id by default). Solr computes a 32-bit integer hash of that field and allocates a range of hash values to each shard, which is why you must specify the number of shards in advance. According to the hash value, Solr routes the document to the proper shard: for example, with 4 shards, a hash that falls in the first quarter of the 32-bit range goes to the first shard, and so on.
With this approach you cannot change the number of shards later, because that would break the whole structure. You can still split one range into two separate sub-ranges, but you cannot simply extend the existing structure.
The second approach, implicit, does not require specifying the number of shards in advance. Instead, you do the sharding manually in your application and provide a field holding the name of the target shard, so Solr can route the document directly without calculating anything. This way you can create as many shards as you like in the future without affecting existing ones: you simply create a new shard by name, and your application starts populating future documents with the new name.
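The even split of the 32-bit range can be sketched as follows (a simplification: real Solr tracks these ranges in cluster state, and the hash is MurmurHash3):

```python
def shard_ranges(num_shards):
    """Divide the 2**32 hash space into equal contiguous ranges, as Solr
    does when a compositeId collection is created with numShards set."""
    span = 2**32 // num_shards
    return [(i * span, (i + 1) * span - 1) for i in range(num_shards)]

def shard_for(doc_hash, ranges):
    """Route a 32-bit hash value to the shard whose range contains it."""
    for idx, (lo, hi) in enumerate(ranges):
        if lo <= doc_hash <= hi:
            return idx

ranges = shard_ranges(4)
# a hash in the first quarter of the 32-bit range lands on shard 0
```

This also makes the limitation visible: adding a fifth shard would require recomputing every boundary, which is why only splitting an existing range is supported.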
So, in your situation, if you already chose compositeId, you cannot add shards; you can only split existing ones. If you think your shard layout will change much in the future, I'd suggest rebuilding your cloud using implicit sharding.
Check out the Solr Collections API for more details: https://cwiki.apache.org/confluence/display/solr/Collections+API

ElasticSearch - workaround of unique constraint

I am thinking about some smart workaround of "no unique constraint" problem in ElasticSearch.
I can't use _id to store my unique field, because I am using _id for other purpose.
I crawl Internet pages and store them in an ElasticSearch index. My rule is that the url must be unique (only one document with a given url in the index). Since ElasticSearch doesn't allow setting a unique constraint on a field, I must query the index before inserting a new page to check whether a document with that url already exists.
So adding new page to document looks like that:
Query(match) index in ES to check if there is document with given url field.
If not, I insert new document.
The solution has two disadvantages:
I must execute an extra query to check whether a document with the given url already exists. This slows down the inserting process and generates extra load.
If I try to add two documents with the same url within a short amount of time and the index doesn't refresh before the second document is added, the second query reports that there is no document with that url, and I end up with two documents with the same url.
So I am looking for something else. Please tell me if you have any idea or please tell me what do you think about such solutions:
Solution 1
Use another database system (or maybe another ES index with the url as _id) to store only urls, and query it to check whether a url already exists.
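For the auxiliary-index variant of Solution 1, a deterministic document ID derived from the url makes that index enforce uniqueness for you. The sha1 scheme below is just one possible choice, not something the question prescribes:

```python
import hashlib

def url_doc_id(url):
    """Stable _id for a dedup index: the same URL always maps to the same
    _id, so a second insert overwrites the first (or can be rejected
    outright by creating with op_type=create) instead of duplicating."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()
```

The main index keeps its own _id scheme; only the small dedup index keys on the url hash.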
Solution 2
Queue documents before inserting, and disable index refreshing while another process works through the queue and adds the queued documents to the index.
You've hit upon one of the things that Elasticsearch does not do well (secondary indexes and constraints) when compared to some other NoSQL solutions. In addition to Solution 1 and Solution 2 I'd suggest you look at Elasticsearch Rivers:
Rivers
A river is a pluggable service running within elasticsearch cluster
pulling data (or being pushed with data) that is then indexed into the
cluster.
For example, you could use the MongoDB river and then insert your data into MongoDB. MongoDB supports secondary unique indexes so you could prevent insertion of duplicate urls. The River will then take care of pushing the data to Elasticsearch in realtime.
https://github.com/richardwilly98/elasticsearch-river-mongodb
ES supports CouchDB officially, and a number of other databases have rivers too.

List all document keys in a Solr index for the purpose of database synchronisation

I need to synchronize a Solr index with a database table. At any given time, the Solr index may need to have documents added or removed. The nature of the database prevents the Data Import Handler's Delta Import functionality from being able to detect changes.
My proposed solution was to retrieve a list of all primary keys of the database table and all unique keys of the Solr index (which contain the same integer value) and compare these lists. I would use SolrJ for this.
However, to get all Solr documents requires the infamous approach of hard-coding the maximum integer value as the result count limit. Using this approach seems to be frowned upon. Does my situation have cause to ignore this advice, or is there another approach?
You can execute two queries to list all keys from Solr in one batch: the first with rows=0 gives you the number of hits, and the second uses that number as the rows parameter. It's not a very optimal solution, but it works.
A second possibility is to store an update date in the Solr index and fetch only the documents changed since the last synchronisation.
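Once both key lists are in hand, the reconciliation itself is plain set arithmetic; the helper name and return shape below are illustrative:

```python
def diff_keys(db_keys, solr_keys):
    """Compare database primary keys with Solr unique keys. Returns
    (to_add, to_delete): rows present in the database but missing from
    Solr, and Solr docs whose rows disappeared from the database."""
    db, solr = set(db_keys), set(solr_keys)
    return sorted(db - solr), sorted(solr - db)

to_add, to_delete = diff_keys([1, 2, 3, 5], [2, 3, 4])
# to_add == [1, 5]; to_delete == [4]
```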