I cannot decide which of the two ways of creating a collection in SolrCloud I should go for.
I want to be able to add shards to an existing collection on the fly, so that I can scale up the cluster as the index grows. Since this is only possible in a collection created with implicit routing, I am planning to use it.
I just want to know how a collection created through implicit routing will perform in terms of query time. Will it be the same as a collection created with Solr's default routing? Are there any drawbacks in terms of performance?
Solr query time is set by the slowest shard's response time.
When you use implicit routing you are responsible for the number of documents in each shard, and if your routing strategy is poor you will end up with unbalanced shards, which perform slower.
When you use Solr's default strategy, Solr decides where to send each document according to hash(docId) % numShards; those shards are usually balanced, so you get better performance.
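As a toy illustration of why hash-based routing tends to produce balanced shards (Solr's compositeId router actually uses MurmurHash3 over the id; simple modulo over MD5 here just shows the idea):

```python
import hashlib

NUM_SHARDS = 4

def route(doc_id: str) -> int:
    """Pick a shard by hashing the document id (a stand-in for
    Solr's MurmurHash3-based default router)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Distribute 10,000 synthetic ids and count documents per shard.
counts = [0] * NUM_SHARDS
for i in range(10_000):
    counts[route(f"doc-{i}")] += 1

print(counts)  # roughly 2,500 per shard
```

With implicit routing, by contrast, nothing balances the shards for you: the distribution is exactly whatever your own routing logic produces.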
Both strategies are good depending on your use case. I would choose implicit routing for multi-tenancy (a shard per tenant) or if I needed to add shards every month/day. Usually I go with the default routing and scale up by multiplying the number of nodes by 2 (I know it's a costly solution).
I suggested another scale-out option in the JIRA issue SOLR-5025; you are welcome to add your comments or vote: https://issues.apache.org/jira/browse/SOLR-5025
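For reference, adding shards on the fly goes through the Collections API, and the CREATESHARD action is only supported for collections using the implicit router. A sketch of the two calls (the host name and collection/shard names are made up for illustration):

```python
from urllib.parse import urlencode

# Assumed Solr host; substitute your own.
SOLR = "http://solr-host:8983/solr/admin/collections"

# 1. Create a collection with the implicit router, naming shards up front.
create = SOLR + "?" + urlencode({
    "action": "CREATE",
    "name": "mycollection",
    "router.name": "implicit",
    "shards": "shard-2014-01,shard-2014-02",
})

# 2. Later, grow the collection by adding a shard (implicit router only).
add_shard = SOLR + "?" + urlencode({
    "action": "CREATESHARD",
    "collection": "mycollection",
    "shard": "shard-2014-03",
})

print(create)
print(add_shard)
```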
I am building a user-facing search engine for movies, music and art where users perform free-text queries (like Google) and get the desired results. Right now, I have movies, music and art data indexed separately on different cores and they do not share a similar schema. For ease of maintenance, I would prefer having them in separate cores as it is now.
To date, I have been performing my queries individually on each core, but I want to expand this capability to perform a single query that runs across multiple cores/indexes. Say I run a query on the name of an artist and the search engine returns all the relevant movies, music and artwork they have done. Things get tricky here.
Based on my research, I see that there are two options in this case.
1. Create a fourth core and add a shards attribute that points to my other three cores, then redirect all my queries to this core to return the required results.
2. Create a hybrid index merging all three schemas and perform queries on this index.
With the first option, the downside I see is that keys need to be unique across my schemas for this to work. Since I am going to have the key artistName across all my cores, this isn't going to help me.
I would really prefer keeping my schemas separate, so I do not want to delve into the second option as such. Is there a middle ground here? What would be considered best practice in this case?
Linking other SO questions that I referred here:
Best way to search multiple solr core
Solr Search Across Multiple Cores
Search multiple SOLR core's and return one result set
I am of the opinion that you should not be searching across multiple cores.
Solr and NoSQL databases are not meant for it. These databases are preferred when we want to achieve a faster response, which is not possible with an RDBMS because it involves joins.
Joins in an RDBMS slow down query performance as the data grows in size.
To achieve a faster response we try to convert the data into flat documents and store them in a NoSQL store like MongoDB, Solr, etc.
You should convert your data in such a way that it is part of a single document.
If the above is not possible, then create individual cores and retrieve the specific data from the specific core with multiple calls.
You can also look into creating parent/child (nested) documents in Solr.
Another option is SolrCloud with Solr streaming expressions.
Every option has its pros and cons. It all depends on your requirement and what you can compromise.
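As a sketch of the parent/child option mentioned above: Solr's JSON update format lets you nest child documents under a parent using the `_childDocuments_` key (the field names here are invented for illustration):

```python
import json

# One artist (parent) with their works nested as child documents.
doc = {
    "id": "artist-1",
    "type": "artist",
    "artistName": "Andy Warhol",
    "_childDocuments_": [
        {"id": "movie-1", "type": "movie", "title": "Empire"},
        {"id": "art-1", "type": "art", "title": "Campbell's Soup Cans"},
    ],
}

# Body for a POST to /solr/<core>/update (a list of documents).
payload = json.dumps([doc])
print(payload)
```

Parents can then be retrieved via the block join query parser, e.g. `q={!parent which="type:artist"}title:Empire`, though child documents must be indexed together with their parent in the same update.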
I was wondering which scenario (or combination of them) would be better for my application, from the standpoint of performance, scalability and high availability.
Here is my application:
Suppose I am going to have more than 10 million documents, and the index grows every day (probably reaching more than 100 million docs within a year). I want to use Solr as the tool for indexing these documents, but the problem is that I have some data fields that could change frequently (not too much, but they could change).
Scenarios:
1- Using SolrCloud as the database for all data (even the data that could change).
2- Using SolrCloud as the database for static data and an RDBMS (such as Oracle) for storing the dynamic fields.
3- Using an integration of SolrCloud and Hadoop (HDFS + MapReduce) for all data.
Best regards.
I'm not sure how SolrCloud works with DIH (you might face a situation where indexing happens only on one instance).
On the other hand, I would store the data in an RDBMS, because from time to time you will need to reindex Solr to add some new functionality to the index.
At the end of the day I would use a DB + Solr (with all the fields), with either Hadoop (I have not used it yet) or some other piece of software to post the data into SolrCloud.
Currently I have a ZooKeeper-managed, multi-Solr-server, single-shard setup. Unique ids are generated automatically by Solr.
I now have a ZooKeeper-managed, multi-Solr-server, multi-shard requirement. I need to be able to route updates to a specific shard.
After reading http://searchhub.org/2013/06/13/solr-cloud-document-routing/ I am concerned that I cannot allow solr to generate random unique ids if I want to route updates to a specific shard.
Can anyone confirm this for me and perhaps give an explanation of the best approach?
Thanks
There is no way you can route your documents to a particular shard, since shard assignment is managed by ZooKeeper.
The solution to your problem is to create two collections instead of two shards. Use your first collection with two servers, and the second collection can use the third server; then you can send your updates to particular servers. The design should look like:
collection1---->shard1---->server1,server2
collection2---->shard1---->server3
This way you can separate your indexes as per your requirement.
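A minimal sketch of that collection-per-index setup: each logical index gets its own collection, and updates are sent to whichever collection's endpoint applies (host names and the type-to-collection mapping are assumed for illustration):

```python
# Each document type lives in its own collection on its own servers.
UPDATE_URLS = {
    "typeA": "http://server1:8983/solr/collection1/update",
    "typeB": "http://server3:8983/solr/collection2/update",
}

def update_url(doc_type: str) -> str:
    """Return the update endpoint for a document of the given type."""
    return UPDATE_URLS[doc_type]

print(update_url("typeB"))
```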
I would like to know the performance difference between Cassandra's secondary indexes and DSE's Solr indexing placed on CFs.
We have a few CFs that we did not place secondary indexes on, because we were under the impression that secondary indexes would (eventually) cause significant performance issues for heavy read/write CFs. We are trying to turn to Solr to allow searching these CFs, but it looks like loading an index schema modifies the CFs to have secondary indexes on the columns of interest.
I would like to know whether Solr indexing is different from Cassandra's secondary indexing. And will it eventually cause slow queries (inserts/reads) for CFs with large data sets and heavy reads/writes? If so, would you advise custom indexing (which we wanted to avoid)? By the way, we're also using (trying to use) Solr for its spatial searching.
Thanks for any advice/links you can give.
UPDATE: To better understand why I’m asking these questions and to see if I am asking the right question(s) – description of our use case:
We’re collecting sensor events – many! We are storing them in both a time series CF (EventTL) and skinny CF (Event). Because we are writing (inserting and updating) heavily in the Event CF, we are not placing any secondary indices. Our queries right now are limited to single events via Event or time range of events through EventTL (unless we create additional fat CF’s to allow range queries on other properties of the events).
That’s where DSE (Solr+Cassandra) might help us. We thought that leveraging Solr searching would allow us to avoid creating extra fat CF’s to allow searches on other properties of the events AND allow us to search on multiple properties at once (location + text/properties). However, looking at how the definition of the Event CF changes after adding an index schema for Event via Solr shows that secondary indices were created. This leads to the question of whether these indices will create issues for inserting/updating rows in Event (eventually). We require being able to insert new events ‘quickly’ – because events can potentially come in at 1000+ per sec.
Would like to know if Solr indexing is different than Cassandra's secondary indexing?
DSE Search uses the Cassandra secondary indexing API.
And, will it eventually cause slow queries (inserts/reads) for CFs w/ large data sets and heavy read/writes?
Lucene and Solr capacity planning is a good idea prior to exceeding the optimal performance threshold of a given server cluster.
If so, would you advise custom indexing (which we wanted to avoid)? Btw -- we're also using (trying to use) Solr for its spatial searching.
DSE Search queries are as fast as Apache Solr queries.
Since your use case is spatial search, I don't think Cassandra's secondary index feature will work for you. Here's a fairly concise article on secondary indexes that you may find useful: http://www.datastax.com/docs/1.1/ddl/indexes
You should be able to do this with Solr.
Here's a post that should be relevant for you:
http://digbigdata.com/geospatial-search-cassandra-datastax-enterprise/
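For reference, spatial queries in Solr (and hence DSE Search) use the standard geofilt filter syntax. A sketch of such a request, with the core name and spatial field name as placeholders:

```python
from urllib.parse import urlencode

# Find events within 5 km of a point; 'location' is assumed to be
# a spatial (e.g. LatLonType) field in the schema.
params = urlencode({
    "q": "*:*",
    "fq": "{!geofilt sfield=location pt=45.15,-93.85 d=5}",
    "wt": "json",
})
url = "http://localhost:8983/solr/events/select?" + params
print(url)
```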
I'm building an application. Right now we have one Solr server, but we would like to design the app so that it can support multiple Solr shards in the future if we outgrow our indexing needs.
What are keys things to keep in mind when developing an application that can support multiple shards in future?
We store the Solr URL (/solr/) in a DB, which is used to execute queries against Solr. There is one URL for updates and one URL for searches in the DB.
If we add shards to the Solr environment at a future date, will the process for using the shards be as simple as updating the URLs in the DB? Or are there other things that need to be updated? We are using SolrJ.
e.g. change the SolrSearchBaseURL in DB to:
https://solr2/solr/select?shards=solr1/solr,solr2/solr&indent=true&q={search_query}
And update the SolrUpdateBaseURL in the DB to:
https://solr2/solr/
Is that all it would take?
Basically, what you are describing has already been implemented in SolrCloud. There, ZooKeeper maintains the state of your search cluster (which shards are in which collections, shard replicas, leader and replica nodes, and more). It handles the load on the indexing and querying sides by using hashing.
You could, in principle, get by (at least at the beginning of your cluster's growth) with the system you have developed. But think about replication, adding load balancers, external cache servers (e.g. Varnish): in the long run you would end up implementing something like SolrCloud yourself.
Having said that, there are some caveats to hash-based indexing and hence searching. If you want to implement logical partitioning of your data (say, by date), at this point there is no way to do this other than with custom code. There is some work planned around this, though.
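As a sketch of what such custom code for logical (date-based) partitioning could look like: the target shard name is derived from the document's timestamp rather than from a hash (the naming scheme here is invented for illustration):

```python
from datetime import date

def shard_for(doc_date: date) -> str:
    """Route a document to a per-month shard based on its date."""
    return f"events-{doc_date.year}-{doc_date.month:02d}"

print(shard_for(date(2013, 6, 13)))  # events-2013-06
```

Queries over a date range would then fan out only to the shards whose months overlap the range, which is the main payoff of logical partitioning over hash-based routing.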