I'd like to set up SolrCloud with one collection consisting of three different shards.
I understand that since a collection represents a single logical index, it must have a single schema. I'm wondering, however, if each shard can have a different solrconfig?
Despite a fair amount of searching, I haven't seen any examples where a collection consists of a single schema but multiple solrconfigs. The SolrCloud tutorials I've worked through all initialize the collection with one bootstrapping config:
java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
However, there are some elements in the SolrCloud documentation that lead me to believe a SolrCloud setup with a single schema yet different solrconfig files for each shard might be possible. From the "Solr Glossary":
"Collection: In Solr, one or more documents grouped together in a single logical index. A collection must have a single schema, but can be spread across multiple cores."
If a collection must have a single schema, but can consist of multiple cores, is that an indication that these different cores can have different solrconfigs? If so, how can this be set up?
Any help would be much appreciated.
A collection is a logical container for cores that share the same configuration. You cannot have cores with different configurations in a single collection.
In general, you may query several collections at once (see the SolrCloud wiki for that) if those collections have the same schema. This will work only if both collections reside on the same ZooKeeper cluster. Give it a try.
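As a sketch of that cross-collection query, SolrCloud accepts a collection parameter listing the collections to search. The collection names here are hypothetical, and a running SolrCloud node on localhost:8983 is assumed:

```shell
# Query across two collections with the same schema in one request;
# both collections must be registered in the same ZooKeeper ensemble.
curl "http://localhost:8983/solr/collection1/select?q=*:*&collection=collection1,collection2"
```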
Related
I am trying to set up a Solr collection that extends across multiple servers. If I understand things correctly, I am able to set up a collection, which consists of shards. Those shards consist of replicas, which correspond to cores. Please correct any holes in my understanding of this.
Ok.
So I've got Solr set up and am able to create a collection on machine one by doing this:
bin/solr create_collection -c test_collection -shards 2 -replicationFactor 2 -d server/solr/configsets/basic_configs/conf
This appears to do something right, since I am able to check the collection's health. I input
bin/solr healthcheck -c test_collection
and I see the shard information.
Now what I want to do, and this is the part I am stuck on, is to take this collection that I have created, and extend it across multiple servers. I'm not sure if I understand how this works correctly, but I think what I want to do is put shard1 on machine1, and shard2 on machine2.
I can't really figure out how to do this based on the documentation, although I am pretty sure this is what SolrCloud is meant to solve. Can someone give me a nudge in the right direction with this...? Either a way to extend the collection across multiple servers or a reason for not doing so.
When you say -shards 2, you're saying that you want your collection to be split into two shards, which can be placed on separate servers. -replicationFactor 2 says that you want each of those shards present on at least two servers as well.
A shard is a piece of the collection - without a shard, you won't have access to all the documents. The replicationFactor indicates how many copies of the same shard (or "partition", a term sometimes used for a piece of the index) should be made available in the collection, so two shards with two replicas will end up as four "cores" distributed across the available servers (these "cores" are managed internally by Solr).
Start a set of new SolrCloud instances in the same cluster and you should see that the documents are spread across your nodes as expected.
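To make that concrete: the usual way to grow the cluster is to start Solr on the second machine pointed at the same ZooKeeper. A minimal sketch, assuming machine1 runs the embedded ZooKeeper on port 9983 (the node name in the second command follows Solr's host:port_solr convention and is an assumption):

```shell
# On machine2: join the existing cluster by pointing at machine1's ZooKeeper
bin/solr start -cloud -z machine1:9983 -p 8983

# Then place a replica of shard2 on the new node via the Collections API
curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=test_collection&shard=shard2&node=machine2:8983_solr"
```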
As said before, the shards are pieces of the collection (data) in actual servers.
When you ran the command, you asked for your collection to be split into 2 shards on the machines available at that point in time.
Once you add more machines to the mix, (by registering them to the same zookeeper), you can use the collection API to manage and add them to the fold as well.
https://cwiki.apache.org/confluence/display/solr/Collections+API
You can split shards into 2 (or more) new shards.
You can create new shards, or delete shards.
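For example, shard management goes through the Collections API over HTTP. Sketches against an assumed local node, with the collection and shard names from the question:

```shell
# Split shard1 of test_collection into two new sub-shards
curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=test_collection&shard=shard1"

# Delete a shard (only allowed for inactive shards, or shards of implicitly-routed collections)
curl "http://localhost:8983/solr/admin/collections?action=DELETESHARD&collection=test_collection&shard=shard1"
```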
The question of course - is how do the documents split among the shards?
When you create the collection, you can define a router.name
router.name - The router name that will be used.
The router defines how documents will be distributed among the shards.
The value can be either compositeId, which routes documents using an internal hash of the document id,
or implicit, which allows assigning documents to specific shards.
When using the 'implicit' router, the shards parameter is required.
When using the 'compositeId' router, the numShards parameter is required.
For more information, see also the section Document Routing.
What this means is that you can let Solr distribute documents across the number of shards you defined (like you did), or take a totally different approach that assigns documents to shards by a prefix in the document id.
For more information about the second approach see: https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud#ShardsandIndexingDatainSolrCloud-DocumentRouting
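As an illustration of the prefix approach: with the compositeId router, the part of the id before the '!' separator is hashed to pick the shard, so documents sharing a prefix land on the same shard. A sketch against an assumed local node (the ids are hypothetical):

```shell
# Both documents route to the same shard because they share the 'customer1!' prefix
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/test_collection/update?commit=true' \
  -d '[{"id":"customer1!order1"},{"id":"customer1!order2"}]'
```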
I am using solr version 3.0.1, and I am about to change to solr 4.6.0.
Usually I just use Solr without defining a core (I think Solr 3.0.1 doesn't have cores yet).
And now I want to upgrade my Solr to version 4.6.0, and there is something new in it.
So I have 3 questions:
What exactly is a Solr core?
When should I use a Solr core?
Is it right that each Solr core is like a table in a (relational) database? That is, can I save different types of data in different cores?
Thanks in advance.
A core is basically an index with a given schema and will hold a set of documents.
You should use different cores for different collections of documents, but that doesn't necessarily mean you should store different kinds of documents in different indexes.
Some examples:
you could have the same documents in different languages stored in different cores and select the core based on the configured language;
you could have different types of documents stored in different cores to keep them physically separated;
but at the same time you could have different documents stored in the same index and differentiate them by a field value;
it really depends on your use-case.
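The last option, differentiating documents in one index by a field value, usually just means filtering on that field at query time. A sketch, with hypothetical core and field names:

```shell
# Mixed document types live in one core, each tagged with a 'doctype' field;
# a filter query restricts results to one type
curl "http://localhost:8983/solr/mycore/select?q=*:*&fq=doctype:car"
```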
You have to think up-front about what type of queries you are going to execute against your Solr index. You then lay out the schema of a core, or of several cores, accordingly.
If you, for example, execute JOIN queries on your relational DB, those won't be very efficient (if possible at all) with lots of documents in the Solr index, because it is a NoSQL world (here read as: non-relational). In such a case you might need to duplicate your data from several DB tables into one core's schema.
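As a sketch of that denormalization, fields from two DB tables (orders and customers; all names here are hypothetical) could be flattened into a single document schema:

```xml
<!-- schema.xml sketch: order and customer data denormalized into one document -->
<field name="order_id"      type="string"       indexed="true" stored="true"/>
<field name="order_date"    type="date"         indexed="true" stored="true"/>
<field name="customer_name" type="text_general" indexed="true" stored="true"/>
<field name="customer_city" type="string"       indexed="true" stored="true"/>
```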
As Francisco has already mentioned physically core is represented as an independent entity with its own schema, config and index data.
One caution with a multi-core setup: all the cores configured under the same container instance will share the same JVM. This means you should be careful with the amount of data you store on those cores. Lucene, the indexing engine inside Solr, has really neat and fast (de)compression algorithms (in versions 4.x), so your disk space will last longer, but the JVM heap is something to care about.
The goodies of cores coupled with the Solr admin UI are things like:
core reload after schema / solrconfig changes
core hot swap (if you have a live core serving queries you can hot swap it with a new core with same data and some modifications)
core index optimization
core renaming
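Most of these operations are also scriptable through the CoreAdmin API rather than the admin UI. Sketches against an assumed local instance, with hypothetical core names:

```shell
# Reload a core after editing its schema.xml or solrconfig.xml
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0"

# Hot-swap two cores (core1 starts serving under core0's name)
curl "http://localhost:8983/solr/admin/cores?action=SWAP&core=core0&other=core1"

# Rename a core
curl "http://localhost:8983/solr/admin/cores?action=RENAME&core=core0&other=archive0"
```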
Can I use a different format (schema.xml) for a document (like a car document)?
So that I can use different indexes to query the same class of documents differently?
(OK, I could use two instances of Solr, but is that the only way?)
Only one schema is possible for a Core.
You can always have different cores within the same Solr instance with a multicore configuration.
However, if you have the same entity and want to query it differently, you can use a single schema.xml that holds the values in different fields and field types (check copyField) and define different query handlers for weighted queries, depending on your needs.
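A sketch of that idea: copy one source field into a differently analyzed field in schema.xml, then weight the fields differently in a handler in solrconfig.xml (field and handler names here are hypothetical):

```xml
<!-- schema.xml: one source field analyzed two ways -->
<field name="name"      type="string"       indexed="true" stored="true"/>
<field name="name_text" type="text_general" indexed="true" stored="false"/>
<copyField source="name" dest="name_text"/>

<!-- solrconfig.xml: a handler that weights the exact-match field higher -->
<requestHandler name="/weighted" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">name^10 name_text^2</str>
  </lst>
</requestHandler>
```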
As far as I know you can only have one schema file per Solr core.
Each core uses its own schema file, so if you want to have two different schema files then either set up a second Solr core or run another instance of Solr.
In SOLR, what is multicore?
Is it a way to create multiple tables (inside a single solr app) with their own set of schema files, or is it about creating different databases (inside a single solr app)?
If we want to create multiple tables (with their respective schema.xml files) for solr web app then what is the best way to do this, or how can we achieve this in SOLR?
Solr multicore is basically a setup that allows Solr to host multiple cores.
Each of these cores can host a completely different set of unrelated entities.
You can have a separate core for each table as well.
For example, if you have collections for Documents, People, and Stocks, which are completely unrelated entities, you would want to host them in different collections.
A multicore setup allows you to:
Host unrelated entities separately so that they don't impact each other
Have a different configuration for each core, with different behavior
Perform activities on each core independently (updating data, loading, reloading, replication)
Keep the size of each core in check and configure caching accordingly
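For reference, a multicore setup in the Solr 4.x era could be declared in solr.xml roughly like this. The core names follow the example above; the instance directories are assumptions:

```xml
<!-- solr.xml sketch (legacy 4.x format): three unrelated cores in one instance -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="documents" instanceDir="documents"/>
    <core name="people"    instanceDir="people"/>
    <core name="stocks"    instanceDir="stocks"/>
  </cores>
</solr>
```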
From the Solr wiki, I don't understand whether Solr takes one schema.xml or can have multiple ones.
I took the schema from Nutch and placed it in Solr, and later tried to run the examples from Solr. The message was clear that there was an error in the schema.
If I have a Solr instance, am I stuck with a specific schema? If not, where is the information on using multiple ones?
From the Solr Wiki - SchemaXml page:
The schema.xml file contains all of the details about which fields
your documents can contain, and how those fields should be dealt with
when adding documents to the index, or when querying those fields.
Now you can only have one schema.xml file per instance/index within Solr. You can implement multiple instances/indexes within Solr by using the following strategies:
Running Multiple Indexes - please see this Solr Wiki page for more details.
There are various strategies to take when you want to manage multiple "indexes" in a Single Servlet Container
Running Multiple Cores within a Solr instance. - Again, see the Solr Wiki page for more details...
Multiple cores let you have a single Solr instance with separate
configurations and indexes, with their own config and schema for very
different applications, but still have the convenience of unified
administration. Individual indexes are still fairly isolated, but you
can manage them as a single application, create new indexes on the fly
by spinning up new SolrCores, and even make one SolrCore replace
another SolrCore without ever restarting your Servlet Container.