I want to set up a SolrCloud cluster for more than 10 million news articles. After reading this article: Shards and Indexing Data in SolrCloud, I have a plan as follows:
Add a prefix like ED2001! to the document ID, where ED identifies a newspaper source and 2001 is the year part of the article's published date, i.e. I want to put all news articles from a specific newspaper source published in a specific year into one shard.
Create collection with router.name set to compositeId.
Add documents?
Query Collection?
Practically, I got some questions:
How do I add documents based on this plan? Do I have to specify special parameters when updating the collection/core?
Is this called "custom sharding"? If not, what is "custom sharding"?
Is auto sharding a better choice for my case since there's a shard-splitting feature for auto sharding when the shard is too big?
Can I query without the _route_ parameter?
EDIT # 2015/9/2:
This is how I think SolrCloud will behave: "The number of news articles from a specific newspaper source in a specific year tends to stay around a fixed number, e.g. every year ED has around 80,000 articles, so each shard's size won't increase dramatically. For next year's ED articles I only have to add the prefix 'ED2016!' to the document ID, and SolrCloud will create a new shard for me (which contains all ED2016 articles); later the Leader will spread replicas of this new shard to other nodes (one replica per node other than the leader?)". Am I right? If yes, it seems there is no need for shard splitting.
Answer-1: If you have the schema (structure) of the documents, you can provide it in the schema.xml configuration, or you can use Solr's schemaless mode for indexing. Schemaless mode will automatically identify the fields in your documents and index them; its configuration is a little different from schema-based configuration. Afterwards, you need to send the documents to Solr for indexing using curl or the SolrJ Java API. Essentially, Solr provides REST endpoints for all the different operations, so you can write the client in whichever language suits you best.
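To make the indexing part concrete, here is a minimal sketch of building the JSON body you would POST to Solr's update endpoint (`/solr/<collection>/update`), with the composite-ID prefixes from your plan. The collection name, field names, and IDs are made up for illustration:

```python
import json

def make_doc(source, year, local_id, title):
    # Composite ID: '<source><year>!<localId>' puts every article from
    # one source/year behind the same routing prefix.
    return {"id": f"{source}{year}!{local_id}", "title": title}

docs = [
    make_doc("ED", 2001, "12345", "Some headline"),
    make_doc("ED", 2001, "12346", "Another headline"),
]

# This is the JSON body you would POST to /solr/<collection>/update
payload = json.dumps(docs)
print(payload)
```

No special update parameters are needed beyond this: the prefix inside the `id` field itself is what drives routing with the compositeId router.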
Answer-2: What you have mentioned in your plan, the use of compositeId, is called custom sharding, because you are deciding which shard a particular document should go to.
Answer-3: I would suggest going with the auto-sharding feature if you are not certain how much data you need to index now and in the future. As the index size increases you can split the shards and scale Solr horizontally.
Answer-4: I went through the Solr documentation and did not find _route_ mentioned anywhere as a mandatory parameter. In some situations, though, supplying it may improve query performance because it avoids the network latency of fanning the query out to all the shards.
Answer-5: Auto-sharding means routing a document to a shard based on the hash range assigned when the shards were created. It does not create new shards automatically just because you specify a new compositeId prefix, so once the index grows large enough you might need to split it. Check here for more.
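To illustrate what "hash range assigned while creating the shards" means: at collection-creation time Solr divides the signed 32-bit hash space evenly among the shards. A simplified sketch of that assignment:

```python
def shard_ranges(num_shards):
    """Split the signed 32-bit hash space evenly among shards, the way
    Solr assigns ranges at collection-creation time (simplified)."""
    lo, hi = -(1 << 31), (1 << 31) - 1
    step = (1 << 32) // num_shards
    ranges = []
    start = lo
    for i in range(num_shards):
        end = hi if i == num_shards - 1 else start + step - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

for start, end in shard_ranges(4):
    # Display as unsigned hex, the way ranges appear in the cluster state
    print(f"{start & 0xffffffff:08x}-{end & 0xffffffff:08x}")
```

For 4 shards this prints the familiar ranges 80000000-bfffffff, c0000000-ffffffff, 00000000-3fffffff, 40000000-7fffffff. New prefixes only pick a slot inside these fixed ranges; they never add a range, which is why no shard gets created automatically.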
This is actually a guide to answering my own question:
I now understand some of the concepts:
"custom sharding" IS NOT "custom hashing".
By default, Solr splits the hash value space evenly across shards.
The compositeId router applies "custom hashing" because it changes the default hashing behavior by prefixing shard_key/num-of-bits.
The implicit router applies "custom sharding" since we have to manually specify which shard our docs will be sent to.
The compositeId router is still auto sharding, since it is Solr that sees the shard_key prefix and routes the docs to specific shards.
The compositeId router requires the numShards parameter (possibly because Solr needs to distribute the hash value ranges across the shards).
So obviously my strategy doesn't work, since I always need to add new years' news articles to Solr and there's no way I can predict the number of shards in advance. In other words, the implicit router seems a possible choice for me (we create the shards we need and add docs to the shard we intend to).
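If I go with the implicit router, the Collections API calls would look roughly like this (collection and shard names are mine, and I'm just building the URLs in Python for illustration):

```python
from urllib.parse import urlencode

# Create an implicit-router collection with one shard per source/year:
create = urlencode({
    "action": "CREATE",
    "name": "news",
    "router.name": "implicit",
    "shards": "ED2014,ED2015",
})
print("/admin/collections?" + create)

# Next year, add a shard instead of splitting one:
add_shard = urlencode({
    "action": "CREATESHARD",
    "collection": "news",
    "shard": "ED2016",
})
print("/admin/collections?" + add_shard)
```

CREATESHARD is only available for implicit-router collections, which is exactly the property my use case needs: new shards on demand rather than shard splitting.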
Related
I'm currently trying to figure out if Solr is the right tool for me. I have the following setup:
There is the primary document type "blog". Then there are two additional document types "user" and "category". Both of these are parents of the "blog" document type.
Now when searching the "blog" documents, I not only want to search in those fields (e.g. title and content), but also in the parent fields (user>name and category>name).
Of course, I could just flatten that down to a single document for Solr, which would ease the search a lot. The downside to this is though, that when e.g. a user updates their name, I have to run through all blog posts of them and update the documents for that in Solr, instead of just updating a single document.
This becomes even worse when the user has another parent, on which I need to search as well.
Do you have any recommendations about how to handle this use case? Maybe my Google foo is just not good enough, but what I found (block joins, etc.) doesn't seem to do the trick.
The most performant and easiest solution by far is to flatten everything into a single document. It turns out that these relations aren't updated as often as people think, and that searches are performed far more often than documents are updated. Even if one of the values that is identical across a large set of documents changes, reindexing from the most recent documents (for a blog) and working backwards will appear rather performant to most users. This assumes that you actually have to search the values and don't just need them for display - in that case you could look them up from secondary storage when displaying an item (and just store the never-changing id in the document).
Another option is to turn this into a multi-search problem: one collection for blog posts, one for users, and one for categories. You then search each collection for the relevant data and merge the results in your search model. You can also use Streaming Expressions to hand off most of this processing to the Solr cluster.
The reason why I always recommend flattening when possible is that most features in Solr (and Lucene) are written for a flat document structure, which lets you fully leverage everything available. Since Lucene is by design a flat document store, most features require special care to support block joins and parent/child relationships, and you end up experimenting a lot to get the queries and feature set you want (if it's possible at all). If the documents are flat, it just works.
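The multi-search approach above can be sketched as a simple client-side join. All field and structure names here are made up; the point is only where the merge happens (in your application, not in Solr):

```python
def merge_blog_results(blogs, users_by_id, categories_by_id):
    """Enrich blog hits with user/category data fetched from the
    other collections (or from secondary storage)."""
    merged = []
    for blog in blogs:
        enriched = dict(blog)
        enriched["user_name"] = users_by_id.get(blog["user_id"], {}).get("name")
        enriched["category_name"] = categories_by_id.get(blog["category_id"], {}).get("name")
        merged.append(enriched)
    return merged

blogs = [{"id": "b1", "title": "Hello", "user_id": "u1", "category_id": "c1"}]
users = {"u1": {"name": "Alice"}}
cats = {"c1": {"name": "Tech"}}
print(merge_blog_results(blogs, users, cats))
```

Note this only covers *displaying* parent data; if you need to *search* on user>name as well, you would first query the users collection for matching IDs, then filter the blog query by those IDs.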
On a Solr installation with 2+ shards, when is the data returned by the LukeRequestHandler distributed across the shards? I ask because I want to be able to detect new (previously unseen) dynamic fields within a short amount of time after they are added.
Example desired sequence of events:
Assume dynamic field *_s
Query Luke and receive list of dynamic fields
Add document with field example_s
Query Luke and receive same list as before but with additional example_s in result (this currently doesn't happen)
Query collection for example_s:* and match the document added above
I am aware that newly added documents become immediately searchable even before being hard committed, but I am looking for a way to have that info appear in Luke too.
Info on the following would be useful:
Does Luke query all shards at request time, or just one? It would appear to only query one at random.
Exactly when does knowledge of previously unseen dynamic fields become distributed across all shards (equivalently, available to Luke)?
Can I configure the delay/trigger for this supposed Luke propagation in order to minimize the delay between addition of a document with a new dynamic field on an arbitrary shard and the moment it becomes visible in Luke responses on every other shard?
See https://issues.apache.org/jira/browse/SOLR-8127
Never.
As indicated by responses on the linked ticket, the Luke request handler isn't at a high enough level to understand multiple shards. Luke provides information about an index, not a collection, and certainly not a cluster.
You need to query each shard directly. This can be done by using the exact core path /solr/collection_shard1_replica1/admin/luke
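Querying each core's Luke endpoint and unioning the results can be sketched like this. The response shape below is a simplified stand-in for the real `/admin/luke` JSON, and the core names are examples:

```python
def collect_dynamic_fields(luke_responses):
    """Union the field lists from each core's /admin/luke response,
    so a field seen by any shard shows up in the combined view."""
    seen = set()
    for resp in luke_responses:
        seen.update(resp.get("fields", {}).keys())
    return sorted(seen)

# Simulated responses from collection_shard1_replica1 and shard2:
shard1 = {"fields": {"id": {}, "title_s": {}}}
shard2 = {"fields": {"id": {}, "example_s": {}}}
print(collect_dynamic_fields([shard1, shard2]))
```

In a real deployment you would fetch each core's `/solr/<core>/admin/luke?wt=json` response (core names are listed in the cluster state) and feed the parsed JSON into this union.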
I have set up SolrCloud with 4 shards and added 8 nodes (4 leaders and 4 replicas), each running on a different machine. But later I noticed that my data keeps growing (4 million files daily), so my 4 shards are not sufficient. I want to add one more shard to this SolrCloud dynamically, but when I add a new node it is created as a replica, which is not what I want. When I searched for this on Google, the answer I got was to use the Collections API SPLITSHARD action. But SPLITSHARD splits an already existing shard, while my requirement is to add a new shard to this SolrCloud. How do I do this?
Any suggestion will be appreciated. Thanks in advance.
The answer is buried in the SolrCloud docs. See https://cwiki.apache.org/confluence/display/solr/Nodes,+Cores,+Clusters+and+Leaders the section 'Resizing a Cluster'
Basically the process is
Split a Shard - now you will have two shards on that one machine
Setup a replica of this new shard on your new machine
Remove the new shard from the original machine. ZooKeeper will promote the replica to the leader for that shard.
Setup a replica for that new shard
Very kludgy and manual process.
SolrCloud isn't very "Cloudy" i.e. elastic.
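The manual resizing flow above can be expressed as a sequence of Collections API calls. Collection, shard, node, and replica names below are placeholders you would substitute from your own cluster state:

```python
from urllib.parse import urlencode

# 1) split shard1 into shard1_0/shard1_1 on the original machine,
# 2) add a replica of a subshard on the new machine,
# 3) delete the subshard's replica on the original machine so the
#    new one gets promoted.
steps = [
    ("SPLITSHARD",    {"collection": "mycoll", "shard": "shard1"}),
    ("ADDREPLICA",    {"collection": "mycoll", "shard": "shard1_0",
                       "node": "newhost:8983_solr"}),
    ("DELETEREPLICA", {"collection": "mycoll", "shard": "shard1_0",
                       "replica": "core_node3"}),
]
urls = ["/admin/collections?" + urlencode({"action": a, **p})
        for a, p in steps]
for u in urls:
    print(u)
```

You would repeat the ADDREPLICA/DELETEREPLICA pair for the second subshard, and finally add replicas for the new shards to restore redundancy.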
When you create a collection for the first time you make a very important decision: the sharding technique. Solr provides two different routers, implicit and compositeId.
If you set it to compositeId, you want Solr to calculate the shard from a field of your choice (the id by default). Solr calculates a 32-bit integer hash from that field and allocates a range of the hash space to each shard, so you need to specify the number of shards in advance. According to the hash value, Solr then routes each document to the proper shard. For example, with 4 shards, if the hash value happens to fall in the first quarter of the 32-bit range, the document goes to the first shard, and so on.
With this approach you cannot change the number of shards later, because that would break the whole structure. You can still split one range into two separate sub-ranges, but you cannot simply extend the existing structure.
The second way, implicit routing, means you don't have to specify the number of shards in advance; instead you do the sharding manually in your application and provide a field containing the name of the target shard, so Solr can route the document directly without calculating anything. This way you can create as many shards as you like in the future without affecting existing ones: you simply create a new shard by name, and your application starts populating future documents with the new name.
So, in your situation, if you already chose compositeId, you cannot add shards; you can only split existing ones. If you think your shard layout will change much in the future, I'd suggest rebuilding your cloud with implicit sharding.
Check out the Solr Collections API for more details: https://cwiki.apache.org/confluence/display/solr/Collections+API
What will happen if the physical disk space on one of the shards' nodes in SolrCloud is full? Will index requests to those nodes, or that shard, be redirected to other shards that still have space?
The short answer is: not easily, and not automatically just because a specific shard is full. The 32-bit hash range is split evenly between the shards, and Solr uses the murmur hash algorithm, which keeps the number of documents in each shard roughly balanced, so most of your nodes will start hitting the same limits at almost the same time. You need to monitor your indexes and plan ahead. You have two options in this context.

First, custom hashing allows you to route documents to specific shards based on some common field value, such as a tenant ID; another example would be routing documents based on category. The biggest concern with custom hashing is that it may create unbalanced shards in your cluster.

The second option is shard splitting, which allows you to split an existing shard into two subshards: use the SPLITSHARD action of the Collections API to split the shard, issue a "hard" commit after the split completes to make the new subshards active, then unload the original shard from the cluster.
But if you still choose to force documents to a specific shard because you know another shard is full, you can do it this way: Solr 4.5 added the ability to specify the router implementation with the router.name parameter. If you use the compositeId router, you can send documents with a prefix in the document ID, which is used to calculate the hash that determines the shard a document is sent to for indexing. The prefix can be anything you like (it doesn't have to be a shard name, for example), but it must be consistent so that Solr behaves consistently. For example, if you wanted to co-locate documents for a customer, you could use the customer name or ID as the prefix. If your customer is "IBM" and you have a document with the ID "12345", you would insert the prefix into the document ID field as "IBM!12345". The exclamation mark ('!') is critical here, as it marks the prefix used to direct the document.
You can read more about it here: https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
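A sketch of how the compositeId scheme combines the two hashes: the top 16 bits come from the prefix before the '!' and the bottom 16 bits from the rest of the ID, so all documents sharing a prefix land in the same narrow slice of the hash space. Note this is a byte-oriented sketch; Solr's actual implementation hashes the Java string's chars, so exact hash values may differ, but the bit-masking scheme is the point:

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3 x86 32-bit, the hash family Solr's
    compositeId router is based on."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h, length = seed, len(data)
    rounded = length & ~3
    for i in range(0, rounded, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    k, tail = 0, data[rounded:]
    if len(tail) >= 3: k ^= tail[2] << 16
    if len(tail) >= 2: k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    h ^= length
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    return h ^ (h >> 16)

def composite_hash(doc_id: str) -> int:
    """Top 16 bits from the prefix before '!', bottom 16 from the rest."""
    if "!" in doc_id:
        prefix, rest = doc_id.split("!", 1)
        return ((murmur3_32(prefix.encode()) & 0xFFFF0000)
                | (murmur3_32(rest.encode()) >> 16))
    return murmur3_32(doc_id.encode())

# Both IBM documents share the same top 16 bits, hence the same shard:
print(hex(composite_hash("IBM!12345")), hex(composite_hash("IBM!67890")))
```

Whichever shard's range covers that 16-bit slice receives every "IBM!" document, which is what makes the prefix both a co-location tool and a potential source of imbalance.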
We have a requirement that documents that we currently index in SOLR may periodically need to be PARTIALLY UPDATED. The updates can either be
a. add new fields
b. update the content of existing fields.
Some of the fields in our schema are stored, others are not.
SOLR 4 does allow this but all the fields must be stored. See Update a new field to existing document and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
Questions:
1. Is there a way that SOLR can achieve this? We've tried SOLR JOINs in the past but they weren't the right fit for all our use cases.
Alternatively, can Elasticsearch, LinkedIn's SenseiDB, or other text search engines achieve this?
For now, we manage by re-indexing the affected documents when they need to be updated.
Thanks
Solr does have that stored-fields limitation, that's correct. The underlying Lucene always requires deleting the old document and indexing the new one. In fact Lucene segments are write-once; it never goes back to modify existing ones, so it only marks documents as deleted and deletes them for real when a merge happens.
Search servers built on top of Lucene try to work around this problem by exposing a single endpoint that deletes the old document and reindexes the new one automatically, but there must be a way to retrieve the old document somehow. Solr can do that only if you store all the fields.
Elasticsearch works around it by storing the source document by default, in a special field called _source. That's exactly the document that you sent to the search engine in the first place while indexing. This is, by the way, one of the features that make Elasticsearch similar to NoSQL databases. The Elasticsearch Update API allows you to update a document in two ways:
Sending a new partial document that will be merged with the existing one (still deleting the old one and indexing the result of the merge)
Executing a script on the existing document and indexing the result after deleting the old one
Both options rely on the presence of the _source field. Storing the source can be disabled, but if you disable it you of course lose this feature.
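The partial-document merge semantics can be sketched like this. It's a simplified model of what the Update API does with _source before reindexing, not Elasticsearch's actual code:

```python
def apply_partial_update(source, partial):
    """Merge a partial document into the stored _source: nested objects
    are merged recursively, everything else is overwritten."""
    merged = dict(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_partial_update(merged[key], value)
        else:
            merged[key] = value
    return merged

stored = {"title": "Old title", "body": "text", "meta": {"views": 10}}
print(apply_partial_update(stored, {"title": "New title", "tags": ["a"]}))
```

The merged result is then indexed as a brand-new document and the old one deleted, which is why the full _source (not just the changed fields) must be available.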