I would like to know the performance difference between Cassandra's secondary indexes and DSE's Solr indexing placed on CFs (column families).
We have a few CFs that we did not place secondary indexes on, because we were under the impression that secondary indexes would (eventually) cause significant performance issues for heavy read/write CFs. We are trying to turn to Solr to allow searching these CFs, but it looks like loading an index schema modifies the CFs to have secondary indexes on the columns of interest.
I'd like to know whether Solr indexing is different from Cassandra's secondary indexing, and whether it will eventually cause slow operations (inserts/reads) for CFs with large data sets and heavy read/writes. If so, would you advise custom indexing (which we wanted to avoid)? Btw -- we're also using (trying to use) Solr for its spatial searching.
Thanks for any advice/links you can give.
UPDATE: To better understand why I'm asking these questions, and to check whether I'm asking the right question(s), here is a description of our use case:
We're collecting sensor events – many! We are storing them in both a time series CF (EventTL) and a skinny CF (Event). Because we are writing (inserting and updating) heavily to the Event CF, we are not placing any secondary indexes on it. Our queries right now are limited to single events via Event or to time ranges of events through EventTL (unless we create additional fat CFs to allow range queries on other properties of the events).
That's where DSE (Solr+Cassandra) might help us. We thought that leveraging Solr searching would allow us to avoid creating extra fat CFs for searches on other properties of the events AND allow us to search on multiple properties at once (location + text/properties). However, looking at how the definition of the Event CF changes after adding an index schema for Event via Solr shows that secondary indexes were created. This leads to the question of whether these indexes will (eventually) create issues for inserting/updating rows in Event. We require being able to insert new events 'quickly', because events can potentially come in at 1000+ per second.
Would like to know if Solr indexing is different from Cassandra's secondary indexing?
DSE Search uses the Cassandra secondary indexing API.
And, will it eventually cause slow queries (inserts/reads) for CFs w/ large data sets and heavy read/writes?
Capacity planning for Lucene and Solr is a good idea before a given server cluster exceeds its optimal performance threshold.
If so, would you advise custom indexing (which we wanted to avoid)? Btw -- we're also using (trying to use) Solr for its spatial searching.
DSE Search queries are as fast as Apache Solr queries.
Since your use case is spatial search, I don't think Cassandra's secondary index feature will work for you. Here's a fairly concise article on secondary indexes that you may find useful: http://www.datastax.com/docs/1.1/ddl/indexes
You should be able to do this with Solr.
Here's a post that should be relevant for you:
http://digbigdata.com/geospatial-search-cassandra-datastax-enterprise/
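For what it's worth, spatial filtering through DSE Search uses standard Solr query syntax. Below is a minimal SolrJ sketch; the core name events.Event (DSE exposes each indexed table as a core named keyspace.table), the location and eventType field names, and the coordinates are illustrative assumptions, and the exact client class depends on your SolrJ version:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SpatialSearchExample {
    public static void main(String[] args) throws Exception {
        // "events.Event" is a hypothetical DSE Search core name for this sketch.
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/events.Event").build();

        SolrQuery query = new SolrQuery("*:*");
        // geofilt restricts results to a radius (in km) around a point;
        // "location" is assumed to be a spatial field declared in the Solr schema.
        query.addFilterQuery("{!geofilt sfield=location pt=37.7749,-122.4194 d=10}");
        // Spatial and text/property filters can be combined in one request.
        query.addFilterQuery("eventType:temperature");
        query.setRows(20);

        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id"));
        }
        solr.close();
    }
}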
Related
I am building a user-facing search engine for movies, music and art where users perform free-text queries (like Google) and get the desired results. Right now, I have movies, music and art data indexed separately on different cores and they do not share a similar schema. For ease of maintenance, I would prefer having them in separate cores as it is now.
To date, I have been performing my queries individually on each core, but I want to expand this capability so that a single query runs across multiple cores/indexes. Say I run a query on an artist's name, and the search engine returns all the relevant movies, music and artwork they have done. Things get tricky here.
Based on my research, I see that there are two options in this case.
Create a fourth core, add a shards attribute that points to my other three cores, and redirect all my queries to this core to return the required results.
Create a hybrid index merging all three schemas and perform queries on this index.
With the first option, the downside I see is that the keys need to be unique across my schemas for this to work. I am going to have the key artistName across all my cores, so this isn't going to help me.
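For reference, option 1 boils down to a standard Solr distributed request: one core receives the query and fans it out via the shards parameter. A rough SolrJ sketch (the host, port, and the core names aggregator, movies, music and art are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MultiCoreSearchExample {
    public static void main(String[] args) throws Exception {
        // The "aggregator" core only exists to receive the request and fan it out.
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/aggregator").build();

        SolrQuery query = new SolrQuery("artistName:\"Jane Doe\"");
        // Distributed search merges results from the listed cores; this is why
        // the uniqueKey values must not collide across the cores.
        query.set("shards",
                "localhost:8983/solr/movies,"
              + "localhost:8983/solr/music,"
              + "localhost:8983/solr/art");

        QueryResponse response = solr.query(query);
        System.out.println("Total hits: " + response.getResults().getNumFound());
        solr.close();
    }
}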
I really prefer keeping my schemas separate, so I do not want to delve into the second option as such. Is there a middle ground here? What would be considered best practice in this case?
Linking other SO questions that I referred here:
Best way to search multiple solr core
Solr Search Across Multiple Cores
Search multiple SOLR core's and return one result set
I am of the opinion that you should not be doing searches across multiple cores.
Solr and NoSQL databases are not meant for it. These databases are preferred when we want to achieve a faster response than is possible with an RDBMS, where queries involve joins.
Joins in an RDBMS slow down your queries as the data grows in size.
To achieve a faster response, we try to flatten the data into documents and store them in a NoSQL database like MongoDB, Solr, etc.
You should convert your data in such a way that it all becomes part of a single document.
If the above option is not possible, then keep individual cores and retrieve the specific data from each core with multiple calls.
You can also look at creating parent-child (nested) documents in Solr; a rough sketch follows below.
Another option is SolrCloud with Solr streaming expressions.
Every option has its pros and cons. It all depends on your requirements and what you can compromise on.
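If the parent-child route is of interest, here is the sketch referred to above: a minimal SolrJ example of indexing nested documents (the core name, field names and values are made up for illustration; such documents are then queried with block-join query parsers such as {!parent which=...}):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class NestedDocsExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/catalog").build();

        // Parent document: the artist.
        SolrInputDocument artist = new SolrInputDocument();
        artist.addField("id", "artist-1");
        artist.addField("docType", "artist");
        artist.addField("artistName", "Jane Doe");

        // Child documents: works by that artist, each with its own fields.
        SolrInputDocument movie = new SolrInputDocument();
        movie.addField("id", "movie-42");
        movie.addField("docType", "movie");
        movie.addField("title", "Some Movie");

        SolrInputDocument album = new SolrInputDocument();
        album.addField("id", "album-7");
        album.addField("docType", "music");
        album.addField("title", "Some Album");

        artist.addChildDocument(movie);
        artist.addChildDocument(album);

        solr.add(artist);
        solr.commit();
        solr.close();
    }
}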
I am developing a web application where I want to use Solr for search only and keep my data in another database.
I will have two databases: one relational (SQL Server), and the other a copy of it in the NoSQL Solr database.
I'll be searching for specific fields in the Solr documents, e.g. by id, name, type, and join queries, i.e. NOT full-text search.
I know Solr's strength is full-text search, via the inverted index it creates on the documents' data. Now I want to know whether it also helps in my case by creating another type of index on my documents that makes regular searching faster than a SQL Server index.
Yes, it will help you.
You need to consider what your requirements are. What are your priorities?
If you add Solr as an additional component used for searching the application data, keep in mind that you will have to constantly keep Solr in sync, and you will need additional infrastructure.
If performance is your main criterion and you don't want to put any search load on your RDBMS, then you can add Solr to your system. Also consider how big your data in the RDBMS is, because RDBMS systems are themselves strong enough to support searching data.
Considering all the above aspects, you can make your decision.
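To make the kind of non-full-text lookup from the question concrete, structured matching in Solr is typically done with filter queries on exact-match fields; a minimal SolrJ sketch (the core name, field names and values are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class StructuredLookupExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/products").build();

        SolrQuery query = new SolrQuery("*:*");
        // Exact-match filters on non-analyzed fields; filter queries are cached
        // independently of the main query, which helps repeated lookups.
        query.addFilterQuery("type:book");
        query.addFilterQuery("name:\"Clean Code\"");
        query.setFields("id", "name", "type");

        QueryResponse response = solr.query(query);
        response.getResults().forEach(doc ->
                System.out.println(doc.getFieldValue("id")));
        solr.close();
    }
}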
I was wondering which scenario (or combination of them) would be better for my application, from the standpoint of performance, scalability and high availability.
Here is my application:
Suppose I am going to have more than 10M documents, and the number grows every day (probably reaching more than 100M docs within a year). I want to use Solr as the tool for indexing these documents, but the problem is that I have some data fields that could change frequently (not too often, but they could change).
Scenarios:
1- Using SolrCloud as the database for all data (even the data that could change).
2- Using SolrCloud as the database for static data, and an RDBMS (such as Oracle) for storing the dynamic fields.
3- Using an integration of SolrCloud and Hadoop (HDFS+MapReduce) for all data.
Best regards.
I'm not sure how SolrCloud works with DIH (the DataImportHandler); you might face a situation where indexing happens only on one instance.
On the other hand, I would store the data in an RDBMS, because from time to time you will need to reindex Solr to add some new functionality to the index.
At the end of the day, I would use DB + Solr (with all the fields), together with either Hadoop (I have not used it yet) or some other piece of software to post the data into SolrCloud.
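For the "other piece of software to post data into SolrCloud" part, a small SolrJ client is often enough. A minimal sketch using CloudSolrClient (the ZooKeeper address, collection and field names are assumptions, and the builder API differs slightly between SolrJ versions):

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrCloudIndexerExample {
    public static void main(String[] args) throws Exception {
        // Connect through ZooKeeper so requests are routed to the right shard leaders.
        CloudSolrClient solr = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build();
        solr.setDefaultCollection("documents");

        // In practice this document would be built from a row read out of the RDBMS.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "Example document");
        doc.addField("status", "active");   // a field that may change later

        solr.add(doc);
        solr.commit();                      // or rely on autoCommit in solrconfig.xml
        solr.close();
    }
}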
I have a Solr core with 100K-1000K documents.
I have a scenario where I need to add or set a field value on most documents.
Doing it through Solr takes too much time.
I was wondering if there is a way to do such a task with the Lucene library, accessing the Solr index directly (with less overhead).
If needed, I can shut down the core, run my code, and reload the core afterwards (hoping it will take less time than doing it through Solr).
It would be great to hear whether someone has already done such a thing, and what the major pitfalls are along the way.
A similar problem has been discussed multiple times on the Lucene Java mailing list. The underlying problem is that you cannot update a document in place in Lucene (and hence Solr).
Instead, you need to delete the document and insert a new one. This obviously adds the overhead of analyzing, merging index segments, etc. Still, the specified number of documents isn't anything major and should not take days (have you tried updating Solr with multiple threads?).
You can of course try doing this via Lucene and see if this makes any difference, but you need to be absolutely sure you will be using the same analyzers as Solr does.
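One more thing worth trying before dropping down to Lucene is Solr's atomic update syntax, which lets you send only the id plus a "set" on the field you are changing (internally Solr still rewrites the document, so the other fields must be stored or have docValues). A rough SolrJ sketch (the core, field name and value are illustrative):

import java.util.Collections;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {

    // Set "flagged" to true on every document whose id is in the batch.
    // Several such batches can be run from multiple threads, as suggested above.
    static void flagBatch(SolrClient solr, List<String> ids) throws Exception {
        for (String id : ids) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            // Atomic update: only the named field is replaced; Solr reconstructs
            // the rest of the document from stored fields / docValues.
            doc.addField("flagged", Collections.singletonMap("set", true));
            solr.add(doc);
        }
        // Commit once per batch (or rely on autoCommit) rather than per document.
        solr.commit();
    }
}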
I have a scenario where I need to add or set a field value on most documents.
If you have to do it often, maybe you need to look at things like ExternalFileField. There are limitations, but it may be better than hacking around Solr's infrastructure by going directly to Lucene.
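If the values are numeric and only needed for sorting or boosting, ExternalFileField keeps them out of the index entirely: the per-document values live in a plain external_<fieldname> text file in the core's data directory and are re-read when a new searcher opens. A hypothetical sketch of generating such a file (the field name popularity, the path, and the value source are all assumptions, and the field must be declared as solr.ExternalFileField in the schema):

import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

public class ExternalFileWriterExample {

    // Writes one "docId=value" line per document; Solr reads the file itself,
    // so regenerating it does not touch the Lucene index at all.
    static void writeExternalFile(Map<String, Float> popularityById) throws Exception {
        Path file = Paths.get("/var/solr/data/mycore/data", "external_popularity");
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Float> entry : popularityById.entrySet()) {
                writer.write(entry.getKey() + "=" + entry.getValue());
                writer.newLine();
            }
        }
    }
}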
As part of a refactoring project, I'm moving our querying end to Elasticsearch. The goal is to refactor the indexing end to ES as well eventually, but this is pretty involved, and the indexing part is running stably, so it has lower priority.
This leads to a situation where a Lucene index is created/indexed using Solr and queried using Elasticsearch. To my understanding this should be possible, since ES and Solr both create Lucene-compatible indexes.
Just to be sure: besides some housekeeping in ES to point to the correct index, is there any unforeseen trouble I should be aware of when doing this?
You are correct that a Lucene index is part of an Elasticsearch index. However, you need to consider that an Elasticsearch index also contains Elasticsearch-specific index metadata, which will have to be recreated. The trickiest part of the metadata is the mapping, which will have to precisely match the Solr schema for all fields that you care about, and that might not be easy for some data types. Moreover, Elasticsearch expects to find certain internal fields in the index. For example, it wouldn't be able to function without the _uid field being indexed and stored for every record.
In the end, even if you overcome all these hurdles, you might end up with a fairly brittle solution, and you will not be able to take advantage of many advanced Elasticsearch features. I would suggest looking into migrating the indexing portion first.
Have you seen ElasticSearch Mock Solr Plugin? I think it might help you in the migration process.