I'm using Solr 4.0. I need four different indexes for searching; let's say the first is a list of students at a university, the second is a list of products sold on an online marketplace, and so on. What I mean is that they all hold completely different types of data.
Currently I'm running 4 instances of Solr on 4 different ports, each with a single collection serving one type of data. The problem is that running 4 instances of Solr takes up a lot of memory.
How can I run all 4 collections in a single Solr instance? When searching, maybe I can specify in the URL which collection I'm interested in.
You can create multiple cores within a single Solr instance; the CoreAdmin API exists for exactly this purpose.
It has a CREATE action which creates a new core and registers it. Here is a sample create-core request:
http://localhost:8983/solr/admin/cores?action=CREATE&name=coreX&instanceDir=path/to/dir&config=config_file_name.xml&dataDir=data
Bear in mind that the CREATE call must be able to find the core's configuration (under instanceDir), or it will not succeed.
You can read documentation from here: https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-CREATE
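Once the cores exist, each one gets its own URL path, so the client simply picks the right index per request. A minimal sketch, assuming hypothetical core names students and products, each with an instanceDir that already contains a conf directory (solrconfig.xml and schema.xml):

curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=students&instanceDir=students"
curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=products&instanceDir=products"

# the core name in the path selects the index to search
curl "http://localhost:8983/solr/students/select?q=name:alice"
curl "http://localhost:8983/solr/products/select?q=category:books"

Since every core is a separate index with its own schema, the searches stay completely independent while sharing a single JVM.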
So we have multiple Solr instances located in different data centers. Each Solr instance has the same collections and schemas, but the data we store in them is different (we only store EU customers in the Solr instance located in the EU, and we only store US customer data in the Solr instance located in the US, etc.).
I'm looking for a way to run a query across all the Solr instances in each data center and get a combined result (i.e., the final result will contain both EU and US data). I don't want to query each Solr instance separately and combine the results on my side, since I would like to still be able to use Solr's sorting and other query parameters on the final result set.
Does Solr have something built in that will help me achieve this, or maybe a third-party tool I could use?
There are a few ways. One is to manually use the shards query parameter: first fetch the set of cores and hosts for each collection through the CLUSTERSTATUS action of the Collections API (or directly from ZooKeeper), then list those hosts in the shards parameter of your query, and Solr will run a distributed search across them and merge the results.
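A minimal sketch, assuming hypothetical hosts solr-eu and solr-us that each serve a customers collection; sorting and paging are applied to the merged result, not per instance:

curl "http://solr-eu:8983/solr/customers/select?q=*:*&sort=name+asc&shards=solr-eu:8983/solr/customers,solr-us:8983/solr/customers"

Note that distributed search assumes document ids are unique across all the shards listed.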
Another option is to use the Solr Streaming Expressions API. There are a few limitations to consider when using it, and the result set will be formatted differently from a regular query result. The search stream source accepts a zkHost parameter, telling the function which ZooKeeper to contact to find out where the collection lives and which nodes answer for it. After that you'll have to add stream decorators and filters to shape the result you want.
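A minimal sketch, assuming hypothetical ZooKeeper addresses zk-eu:2181 and zk-us:2181, one customers collection per cluster, and docValues on the exported fields (required by the /export handler); the merge decorator combines the two sorted streams into one:

curl --data-urlencode 'expr=merge(
    search(customers, zkHost="zk-eu:2181", qt="/export", q="*:*", fl="id,name", sort="name asc"),
    search(customers, zkHost="zk-us:2181", qt="/export", q="*:*", fl="id,name", sort="name asc"),
    on="name asc")' "http://localhost:8983/solr/customers/stream"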
I am trying to set up a Solr collection that extends across multiple servers. If I understand things correctly, I am able to set up a collection, which consists of shards. Those shards consist of replicas, which correspond to cores. Please correct any holes in my understanding of this.
Ok.
So I've got solr set up and am able to create a collection on machine one by doing this.
bin/solr create_collection -c test_collection -shards 2 -replicationFactor 2 -d server/solr/configsets/basic_configs/conf
This appears to do something right: I am able to check the collection's health. I input
bin/solr healthcheck -c test_collection
and I see the shard information.
Now what I want to do, and this is the part I am stuck on, is to take this collection that I have created, and extend it across multiple servers. I'm not sure if I understand how this works correctly, but I think what I want to do is put shard1 on machine1, and shard2 on machine2.
I can't really figure out how to do this based on the documentation, although I am pretty sure this is what SolrCloud is meant to solve. Can someone give me a nudge in the right direction with this...? Either a way to extend the collection across multiple servers or a reason for not doing so.
When you say -shards 2, you're asking for your collection to be split into two pieces (shards). -replicationFactor 2 says that you want two copies of each of those shards, so they can live on at least two servers.
A shard is a piece of the collection; without every shard, you won't have access to all the documents. The replicationFactor indicates how many copies should be made available of the same shard (or "partition", which is sometimes used to describe a piece of the index) in the collection, so two shards with two replicas each will end up as four "cores" distributed across the available servers (these cores are managed internally by Solr).
Start the additional SolrCloud nodes against the same ZooKeeper before creating the collection, or add replicas to the new nodes afterwards through the Collections API, and you should see the shards spread across your nodes as expected.
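A minimal sketch, assuming machine1 was started in cloud mode with the embedded ZooKeeper on its default port 9983 and the hostnames are resolvable; on machine2:

bin/solr start -cloud -p 8983 -z machine1:9983

If test_collection currently lives only on machine1, a replica of shard2 can then be placed on the new node (the node name machine2:8983_solr here is hypothetical):

curl "http://machine1:8983/solr/admin/collections?action=ADDREPLICA&collection=test_collection&shard=shard2&node=machine2:8983_solr"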
As said before, the shards are pieces of the collection (the data), hosted on actual servers.
When you ran the command, you asked for your collection to be split into 2 shards across the machines available at that point in time.
Once you add more machines to the mix (by registering them with the same ZooKeeper), you can use the Collections API to manage them and add them to the fold as well.
https://cwiki.apache.org/confluence/display/solr/Collections+API
You can split shards into 2 (or more) new shards.
You can create new shards, or delete shards.
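For example, a minimal sketch of splitting shard1 of the collection from above into two new shards:

curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=test_collection&shard=shard1"

The parent shard keeps serving queries while the two sub-shards are built, and can be deleted once they become active.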
The question of course - is how do the documents split among the shards?
When you create the collection, you can define a router.name
router.name - The router name that will be used.
The router defines how documents will be distributed among the shards.
The value can be either compositeId (the default), which hashes the document's unique key to pick a shard,
or implicit, which leaves the shard assignment to whoever indexes the document.
When using the 'implicit' router, the shards parameter is required.
When using the 'compositeId' router, the numShards parameter is required.
For more information, see also the section Document Routing.
What this means is that you can simply define the number of shards (as you did) and let Solr hash documents across them, or take a totally different approach in which a prefix in the document id determines the shard a document goes to.
For more information about the second approach see: https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud#ShardsandIndexingDatainSolrCloud-DocumentRouting
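A minimal sketch of both approaches, assuming hypothetical collection names and a configset called myconf already uploaded to ZooKeeper; with compositeId routing, every id sharing the customerA! prefix lands on the same shard:

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=hashed&numShards=2&replicationFactor=2&router.name=compositeId&collection.configName=myconf"
curl "http://localhost:8983/solr/hashed/update?commit=true" -H "Content-Type: application/json" -d '[{"id":"customerA!doc1"}]'

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=manual&router.name=implicit&shards=east,west&replicationFactor=2&collection.configName=myconf"
curl "http://localhost:8983/solr/manual/update?commit=true&_route_=east" -H "Content-Type: application/json" -d '[{"id":"doc1"}]'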
We are planning to set up a Solr cluster of around 30 machines, with a ZooKeeper ensemble of 3 nodes managing Solr.
We will get new production data every few days, and it is going to be quite different from what is currently in prod. Since the difference is quite large, we are planning to use Hadoop to build the entire Solr index dump offline, copy those binaries to each machine, and maybe do some kind of core swap.
I am still new to Solr and was wondering if this is a good idea. I could HTTP POST my data to the prod cluster, but each update could span multiple documents.
I am not sure how this will impact the read traffic while the write happens.
Any pointers ?
Thanks
I am not sure I completely understand your explanation, but it seems to me that you would like to migrate to a new SolrCloud environment with zero downtime.
First, you need to know how many shards you want, how many replicas, etc.
You need to deploy the solr nodes, then you need to use the collection admin API to create the collection as desired (https://cwiki.apache.org/confluence/display/solr/Collections+API).
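For instance, a minimal sketch of creating a collection sized for a large cluster; the names and numbers here are hypothetical, and collection.configName refers to a configset previously uploaded to ZooKeeper:

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=prod_collection&numShards=10&replicationFactor=3&maxShardsPerNode=1&collection.configName=prod_conf"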
After all this you should be ready to add content to your new solr environment.
You can use Hadoop to populate the new SolrCloud, for instance by using SolrJ. Or you can use the Data Import Handler to migrate data from another Solr (or a relational database, etc.).
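However the documents are produced, they end up going through the normal update handler, which lets SolrCloud route each document to its proper shard. A minimal sketch of posting a batch over HTTP, assuming the hypothetical collection name from above:

curl "http://localhost:8983/solr/prod_collection/update?commit=true" -H "Content-Type: application/json" -d '[{"id":"1","name_s":"first doc"},{"id":"2","name_s":"second doc"}]'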
It is very important how you create your SolrCloud collection in terms of document routing, because routing controls which shard each document is stored in.
This is why it is not a good idea to copy raw index data onto a Solr node: you may mess up the routing.
I found these explanations very useful about routing: https://lucidworks.com/blog/solr-cloud-document-routing/
What are the pros and cons of having multiple Solr applications for completely different searches, compared to having a single Solr application with the different searches set up as separate cores?
What is Solr's preferred method? Is a single Solr application with a multicore setup (one core per search index) always the right way?
There is no preferred method; it depends on what you are trying to solve. Solr can, by its nature, run multiple cores on a single instance, spread cores across several Solr application servers, or manage collections (in SolrCloud).
Having said that, usually you go for
1) A single core on a single Solr instance if your data is fairly small - a few million documents.
2) Multiple Solr instances with a single core each if you want to shard your data - in the case of billions of documents - to get better indexing and query performance.
3) Multiple cores on one or more Solr instances if you need multi-tenant separation - for example, a core for each customer, or one core for the catalog and another core for SKUs.
It depends on your use case, the volume of data and query response times etc.
I am new to Solr, and am trying to figure out the best way to index and search our catalogs.
We have to index multiple manufacturers, and each manufacturer has a different catalog per country. Each catalog for each manufacturer per country is about 8GB of data.
I was thinking it might be easier to have an index per manufacturer per country and have some way to tell Solr in the URL which index to search.
Is that the best way of doing this? If so, how would I do it? Where should I start looking? If not, what would be the best way?
I am using Solr 3.5
In general there are two ways of solving this:
Split each catalog into its own core, running a large multicore setup. This will keep each index physically separated from the others, and will allow you to use different properties (language, etc.) and configuration for each core. This might be practical, but it will require quite a bit of overhead if you plan on searching through all the cores at the same time. It'll be easy to move the different cores onto different servers later - simply spin the cores up on another server.
Run everything in a single core - if all the attributes and properties of the different catalogs are the same, add two fields - one containing the manufacturer and one containing the country. Filter on these values when you need to limit the hits to a particular country or manufacturer. It'll allow you to easily search the complete index, and scalability can be implemented by replication or something like SolrCloud (coming in 4.0). If you need multilanguage support you'll have to have a field for each language with the settings you need for that language (such as stemming).
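A minimal sketch of the single-core approach, assuming hypothetical field names manufacturer and country (on Solr 3.5 the default core answers at /solr/select):

http://localhost:8983/solr/select?q=drill&fq=manufacturer:acme&fq=country:DE

Using fq rather than folding the restrictions into q keeps the filters out of relevance scoring and lets Solr cache each filter independently.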
There are a few tidbits of information about this on the Solr wiki, but my suggestion is to simply try one of the methods and see if that solves your issue. Moving to the other solution shouldn't be too much work. The simplest implementation is to keep everything in the same index.