Currently I have a ZooKeeper-managed, multi-server, single-shard Solr setup. Unique ids are generated automatically by Solr.
I now have a ZooKeeper-managed, multi-server, multi-shard requirement. I need to be able to route updates to a specific shard.
After reading http://searchhub.org/2013/06/13/solr-cloud-document-routing/ I am concerned that I cannot allow solr to generate random unique ids if I want to route updates to a specific shard.
Can anyone confirm this for me and perhaps explain the best approach?
Thanks
If you let Solr generate random unique ids, there is no way to route a document to a particular shard: the router hashes the id, and the ZooKeeper-managed cluster state determines where it lands.
The solution to your problem is to create two collections instead of two shards. Use your first collection with two servers, and let the second collection use the third server; then you can send your updates to a particular collection. The design should look like:
collection1---->shard1---->server1,server2
collection2---->shard1----->server3
This way you can separate your indexes as per your requirement.
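The "route by collection" idea above can be sketched in a few lines: each logical data set maps to its own collection endpoint, and the client picks the update URL accordingly. The host names, ports, and data-set names below are hypothetical examples, not part of the original setup.

```java
// Sketch: pick the target collection (and thus the server group) per update.
// Host names, ports, and data-set names are made-up placeholders.
public class CollectionRouter {

    // Map each logical data set to the update endpoint of its own collection.
    static String updateUrlFor(String dataSet) {
        switch (dataSet) {
            case "indexA": return "http://server1:8983/solr/collection1/update";
            case "indexB": return "http://server3:8983/solr/collection2/update";
            default:
                throw new IllegalArgumentException("unknown data set: " + dataSet);
        }
    }
}
```

With this split, "routing" becomes an application-level decision made before the document ever reaches Solr, so the hash-based router inside each collection never has to be overridden.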
Related
We are planning to setup a Solr cluster which will have around 30 machines. We have a zookeeper ensemble of 3 nodes which will be managing Solr.
We will have new production data every few days, which is going to be quite different from the data that is in prod. Since the data difference is
quite large, we are planning to use Hadoop to create the entire Solr index dump, copy these binaries to each machine, and maybe do some kind of core swap.
I am still new to Solr and was wondering if this is a good idea. I could HTTP POST my data to the prod cluster, but each update could span multiple documents.
I am not sure how this will impact the read traffic while the write happens.
Any pointers ?
Thanks
I am not sure I completely understand your explanation, but it seems to me that you would like to migrate to a new SolrCloud environment with zero downtime.
First, you need to know how many shards you want, how many replicas, etc.
You need to deploy the solr nodes, then you need to use the collection admin API to create the collection as desired (https://cwiki.apache.org/confluence/display/solr/Collections+API).
After all this you should be ready to add content to your new solr environment.
You can use Hadoop to populate the new SolrCloud cluster, for instance by using SolrJ. Or you can use the Data Import Handler to migrate data from another Solr instance (or a relational database, etc.).
How you create your SolrCloud collection matters a great deal in terms of document routing, because the router controls which shard each document is stored in.
This is why it is not a good idea to copy raw index data to a Solr node: you may mess up the routing.
I found these explanations very useful about routing: https://lucidworks.com/blog/solr-cloud-document-routing/
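The routing article above describes the compositeId scheme: if you prefix the unique key with a shard key and `!`, SolrCloud hashes the prefix, so all documents sharing a prefix land on the same shard. A minimal sketch of building such ids (the tenant/document names are invented for illustration):

```java
// Sketch of compositeId routing: documents whose ids share the same
// "shardKey!" prefix are hashed to the same shard by SolrCloud's
// default router. Names here are illustrative only.
public class CompositeIds {

    static String compositeId(String shardKey, String docId) {
        return shardKey + "!" + docId;
    }
}
```

For example, `compositeId("tenantA", "1234")` yields `tenantA!1234`, so every `tenantA!...` document co-locates on one shard.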
We are building an application. Right now we have one Solr server, but we would like to design the app so that it can support multiple Solr shards in the future if we outgrow our indexing needs.
What are keys things to keep in mind when developing an application that can support multiple shards in future?
We store the Solr URL (/solr/) in a DB, which is used to execute queries against Solr. There is one URL for updates and one URL for searches in the DB.
If we add shards to the Solr environment at a future date, will the process for using the shards be as simple as updating the URLs in the DB, or are there other things that need to be updated? We are using SolrJ.
e.g. change the SolrSearchBaseURL in DB to:
https://solr2/solr/select?shards=solr1/solr,solr2/solr&indent=true&q={search_query}
And updating the SolrUpdateBaseURL in the DB to:
https://solr2/solr/
Basically, what you are describing has already been implemented in SolrCloud. There, ZooKeeper maintains the state of your search cluster (which shards are in which collections, shard replicas, leader and replica nodes, and more). It handles the load on both the indexing and querying sides by using hashing.
You could, in principle, get by (at least in the beginning of your cluster growth) with the system you have developed. But think about replication, adding load balancers, external cache servers (e.g. Varnish): in the long run you would end up implementing something like SolrCloud yourself.
Having said that, there are some caveats to using hash-based indexing and hence searching. If you want to implement logical partitioning of your data (say, by date), at this point there is no way to do this except with custom code. There is some work planned around this, though.
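The hashing idea mentioned above can be illustrated in a few lines: the document id alone determines the shard, so every client and node computes the same answer with no coordination. This is only a toy version of the concept (SolrCloud's real router uses MurmurHash3 over hash ranges, not `hashCode` modulo):

```java
// Toy illustration of hash-based partitioning, the idea behind
// SolrCloud's default router (not its actual murmur3 implementation):
// the id alone determines the shard, deterministically.
public class HashPartitioner {

    static int shardFor(String docId, int numShards) {
        // Math.floorMod keeps the result non-negative even when
        // hashCode() is negative.
        return Math.floorMod(docId.hashCode(), numShards);
    }
}
```

Because the mapping is a pure function of the id, indexing and querying agree on placement; the downside, as noted, is that you cannot impose a logical partitioning (by date, tenant, etc.) on top of it without custom code.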
Currently I have two different schema sets (setA/ and setB/) sitting under the multicore/ folder in a Jetty Solr path, /opt/solr/example/multicore.
If I want to create shards for each schema, how should I go about it?
Thanks,
Two shards will have the same configuration, but different documents. So you make a copy of your configuration on a new server, then put half the documents on each server.
The Solr page on distributed search gives a little bit of information about querying across multiple shards.
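Per that distributed-search page, a cross-shard query is just a normal query with a `shards` parameter listing the shard base URLs, comma-separated. A minimal sketch of assembling such a URL (the host names are placeholders, and the query string is assumed to be already URL-encoded):

```java
// Sketch: build a distributed-search URL with the shards parameter,
// as described on the Solr distributed search wiki page.
// Host names are placeholders; q is assumed already URL-encoded.
public class DistributedQuery {

    static String queryUrl(String q, String... shards) {
        return "http://solr1:8983/solr/select?q=" + q
             + "&shards=" + String.join(",", shards);
    }
}
```

Note that the entries in `shards` are given without the `http://` scheme, matching the wiki's examples.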
I am using Solr 1.3.0 for performing a distributed search over already existing Lucene indices. The question is: is there any way to find out which shard a result came from after the search?
P.S : I am using the REST api.
For Solr sharding -
Documents must have a unique key and the unique key must be stored
(stored="true" in schema.xml)
I think the logic by which you feed data to the shards should already be on your side, since the ids need to be unique.
E.g. the simplest is an odd/even split, but you may have more complex schemes for distributing the data across the shards.
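The odd/even split mentioned above is trivial to express: numeric ids route to one of two shards by parity. A purely illustrative sketch (the shard names are invented):

```java
// A minimal version of the "odd/even" split: numeric ids route to one
// of two shards by parity. Shard names are illustrative placeholders.
public class OddEvenRouter {

    static String shardFor(long id) {
        return (id % 2 == 0) ? "shardEven" : "shardOdd";
    }
}
```

Since the application applied this rule at feed time, it can reapply the same rule at read time to recover which shard holds a given id, without asking Solr.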
You may be able to get some information using debugQuery=on, but if this is something that you'll query often I'd add a specific stored field for the shard name.
PS: Solr doesn't have a REST API.
From the Solr wiki, I don't understand whether Solr takes one schema.xml or can have multiple ones.
I took the schema from Nutch and placed it in Solr, and later tried to run the examples from Solr. The message was clear that there was an error in the schema.
If I have a Solr instance, am I stuck with a specific schema? If not, where is the information on using multiple ones?
From the Solr Wiki - SchemaXml page:
The schema.xml file contains all of the details about which fields
your documents can contain, and how those fields should be dealt with
when adding documents to the index, or when querying those fields.
Now, you can have only one schema.xml file per instance/index within Solr. You can implement multiple instances/indexes within Solr by using the following strategies:
Running Multiple Indexes - please see this Solr Wiki page for more details.
There are various strategies to take when you want to manage multiple "indexes" in a Single Servlet Container
Running Multiple Cores within a Solr instance. - Again, see the Solr Wiki page for more details...
Multiple cores let you have a single Solr instance with separate
configurations and indexes, with their own config and schema for very
different applications, but still have the convenience of unified
administration. Individual indexes are still fairly isolated, but you
can manage them as a single application, create new indexes on the fly
by spinning up new SolrCores, and even make one SolrCore replace
another SolrCore without ever restarting your Servlet Container.
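For the multicore approach, the cores are declared in a solr.xml file at the Solr home. A sketch for the setA/setB layout from the question, using the legacy multicore format (the instanceDir values assume the core folders sit next to solr.xml, each with its own conf/schema.xml):

```xml
<!-- solr.xml in /opt/solr/example/multicore/ (legacy multicore format).
     Each core gets its own config and schema under its instanceDir. -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="setA" instanceDir="setA" />
    <core name="setB" instanceDir="setB" />
  </cores>
</solr>
```

Each core then answers at its own path (e.g. /solr/setA/select and /solr/setB/select), with a separate schema per core.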