So we have multiple Solr instances located in different data centers. Each instance has the same collections and schemas, but the data stored in them differs: the EU instance holds only EU customer data, the US instance holds only US customer data, and so on.
I'm looking for a way to run a query across all the Solr instances in every data center and get a combined result (i.e., the final result will contain both EU and US data). I don't want to query each instance separately and combine the results on my side, since I would still like to be able to use Solr's sorting and other query parameters on the final result set.
Does Solr have something built in that will help me achieve this, or is there a third-party tool I could use?
There are a few ways. One is to use the shards parameter manually: first fetch the set of cores and hosts for each collection through CLUSTERSTATUS in the Collections API (or directly from ZooKeeper), then list those shards in your query.
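For example, a distributed request against two hypothetical hosts might look like this (the hostnames and the customers collection are made up; each entry in shards is host:port plus the core/collection path):

http://solr-eu.example.com:8983/solr/customers/select?q=*:*&sort=name+asc&shards=solr-eu.example.com:8983/solr/customers,solr-us.example.com:8983/solr/customers

Solr fans the query out to every listed shard and merges the sorted results for you; just make sure document IDs are unique across all the instances.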
Another option is to use the Solr streaming expressions API. There are a few limitations to consider, and the result set is formatted differently from a regular query result. The search stream source accepts a zkHost parameter, telling the function which ZooKeeper ensemble to contact to discover where the collection lives and which nodes answer for it. After that you add stream decorators and filters to shape the result you want.
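As a sketch, a merge decorator can combine one search stream per data center into a single sorted stream (the collection name and ZooKeeper addresses below are made up):

merge(
  search(customers, zkHost="zk-eu.example.com:2181", qt="/export", q="*:*", fl="id,name", sort="name asc"),
  search(customers, zkHost="zk-us.example.com:2181", qt="/export", q="*:*", fl="id,name", sort="name asc"),
  on="name asc"
)

You would POST this expression to the /stream handler on one of your SolrCloud nodes; note that the on clause must match the sort of the underlying searches.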
Related
I am building a user-facing search engine for movies, music, and art where users perform free-text queries (like Google) and get the desired results. Right now, I have movie, music, and art data indexed separately in different cores, and they do not share a similar schema. For ease of maintenance, I would prefer to keep them in separate cores as they are now.
To date, I have been running my queries individually against each core, but I want to expand this to a single query that runs across multiple cores/indexes. Say I query by the name of an artist and the search engine returns all the relevant movies, music, and artwork they have done. Things get tricky here.
Based on my research, I see that there are two options in this case.
1. Create a fourth core with a shards attribute that points to my other three cores, and redirect all my queries to this core to return the required results.
2. Create a hybrid index merging all three schemas and run queries against that index.
With the first option, the downside I see is that keys need to be unique across my cores for this to work. Since I am going to have the key artistName in all my cores, this isn't going to help me.
I really prefer keeping my schemas separate, so I would rather not go down the second route. Is there a middle ground here? What would be considered best practice in this case?
Linking other SO questions that I referred here:
Best way to search multiple solr core
Solr Search Across Multiple Cores
Search multiple SOLR core's and return one result set
I am of the opinion that you should not be searching across multiple cores.
Solr and other NoSQL databases are not meant for it. These databases are preferred when we want faster responses than an RDBMS can deliver, because RDBMS queries involve joins.
Joins in an RDBMS slow down query performance as the data grows in size.
To achieve faster responses, we flatten the data into denormalized documents and store them in a NoSQL store like MongoDB or Solr.
You should convert your data so that everything a query needs is part of a single document.
If that is not possible, create individual cores and retrieve the specific data from the specific core with multiple calls.
You can also look into parent-child (nested) documents in Solr.
Or use SolrCloud with streaming expressions.
Every option has its pros and cons. It all depends on your requirements and what you can compromise on.
I'm using Solr 4.0. I need to build 4 different indexes; say the first is a list of students at a university, the second is a list of products sold on an online marketplace, and so on. What I mean is that they all hold completely different types of data.
Currently I'm running 4 instances of Solr on 4 different ports, each with a single collection serving one type of data. The problem is that running 4 instances of Solr takes up a lot of memory.
How can I run all 4 collections in a single Solr instance? When searching, I could then specify in the URL which collection I'm interested in.
You can create multiple cores within a single Solr instance; there is a CoreAdmin API for exactly this purpose.
Its CREATE action creates a new core and registers it. Here is a sample create-core request:
http://localhost:8983/solr/admin/cores?action=CREATE&name=coreX&instanceDir=path/to/dir&config=config_file_name.xml&dataDir=data
Bear in mind that the CREATE call must be able to find a configuration, or it will not succeed.
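Once the cores exist, you select which one to search simply by naming it in the request path, e.g. (core names here are hypothetical):

http://localhost:8983/solr/students/select?q=name:smith
http://localhost:8983/solr/products/select?q=category:laptops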
You can read the documentation here: https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-CREATE
We are using Apache Solr to implement search in our application.
Users will be able to search for employees, offices, or both, and we need an auto-suggest feature on top of the search.
My question is how to import data from two tables without using a join (as offices and employees are not directly related) in the db-data-config file. I tried using two entities, but that gave me an error saying the unique key needed to be the same.
Also, how do I configure the fields of these two entities in the schema.xml file?
Please help.
You should be perfectly fine with a single core and multiple entities.
You just need a discriminator that you append to the ID column from your database (especially if it is numeric and you want to use it as the identity in Solr), so that keys from different tables cannot collide.
You will also want a column that records each row's type, and you should declare the fields from all tables in the Solr document.
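A minimal sketch of what this could look like in db-data-config.xml, with made-up table and column names; the literal type value and the ID prefix act as the discriminator:

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/hr" user="solr" password="secret"/>
  <document>
    <!-- prefixing the numeric id keeps keys from the two tables from colliding -->
    <entity name="employee" query="SELECT CONCAT('emp-', id) AS id, name, 'employee' AS type FROM employees"/>
    <entity name="office" query="SELECT CONCAT('off-', id) AS id, name, 'office' AS type FROM offices"/>
  </document>
</dataConfig>

In schema.xml you then declare id, name, and type (plus any table-specific fields), and at query time you can filter with fq=type:employee when you only want one kind of document.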
Keep in mind that a Solr schema is not the same as an SQL schema. You can declare many fields in schema.xml but use only a few of them in any given document; that costs nothing, since only the fields you actually set are stored.
In a previous project I loaded data for many data types with different schemas into Solr. Let me know if you need more examples, and I'll try to find them.
More info about data import in Solr:
http://wiki.apache.org/solr/DataImportHandler
It sounds like what you have are two different types of document that you want to index in Solr. To do this, I believe you will need to set up a multi-core Solr instance with a separate schema.xml for each one.
For more information see this question:
what is SOLR multicore exactly
And here:
https://wiki.apache.org/solr/CoreAdmin
Currently I have a ZooKeeper-managed, multi-server Solr setup with a single shard. Unique IDs are generated automatically by Solr.
I now have a requirement for a ZooKeeper-managed, multi-server, multi-shard setup, and I need to be able to route updates to a specific shard.
After reading http://searchhub.org/2013/06/13/solr-cloud-document-routing/ I am concerned that I cannot let Solr generate random unique IDs if I want to route updates to a specific shard.
Can anyone confirm this for me and perhaps explain the best approach?
Thanks
There is no way to route your documents to a particular shard, since shard assignment is managed by ZooKeeper.
The solution to your problem is to create two collections instead of two shards. Put your first collection on two servers and the second collection on the third server; then you can send your updates to the servers you want. The design would look like:
collection1 ----> shard1 ----> server1, server2
collection2 ----> shard1 ----> server3
This way you can separate your indexes as per your requirement.
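If your Solr version's Collections API supports it, createNodeSet lets you pin each collection to specific nodes at creation time (the host names below are made up, and node names use the host:port_solr format):

http://server1:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=1&replicationFactor=2&createNodeSet=server1:8983_solr,server2:8983_solr
http://server1:8983/solr/admin/collections?action=CREATE&name=collection2&numShards=1&replicationFactor=1&createNodeSet=server3:8983_solr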
Can I use a different format (schema.xml) for a document type (like a car document), so that I can use different indexes to query the same class of documents in different ways?
(OK, I could use two instances of Solr, but is that the only way?)
Only one schema is possible per core.
You can always have different cores within the same Solr instance using the multicore configuration.
However, if you have the same entity and want to query it differently, you can use a single schema.xml that holds the values in different fields with different field types (check copyField) and define different query handlers with weighted queries depending on your needs.
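As a sketch, the schema could copy the raw fields into two catch-all fields with different types, and solrconfig.xml could define handlers that weight them differently (field and handler names below are invented):

<!-- schema.xml -->
<field name="make" type="string" indexed="true" stored="true"/>
<field name="text_exact" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="text_fuzzy" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="make" dest="text_exact"/>
<copyField source="make" dest="text_fuzzy"/>

<!-- solrconfig.xml -->
<requestHandler name="/search-exact" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">text_exact^10 text_fuzzy</str>
  </lst>
</requestHandler>
<requestHandler name="/search-fuzzy" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">text_fuzzy^10 text_exact</str>
  </lst>
</requestHandler>

The same car document is indexed once, but /search-exact and /search-fuzzy score it against differently analyzed copies of the same values.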
As far as I know, you can only have one schema file per Solr core.
Each core uses its own schema file, so if you want two different schema files, either set up a second Solr core or run another instance of Solr.