What to be aware of when querying an index with Elasticsearch when indexing with Solr?

As part of a refactoring project I'm moving our querying end to Elasticsearch. The goal is to refactor the indexing end to ES as well in the end, but that is pretty involved and the indexing part is running stably, so it has lower priority.
This leads to a situation where a Lucene index is created / indexed using Solr and queried using Elasticsearch. To my understanding this should be possible, since ES and Solr both create Lucene-compatible indexes.
Just to be sure, besides some housekeeping in ES to point to the correct index, is there any unforeseen trouble I should be aware of when doing this?

You are correct: a Lucene index is part of an Elasticsearch index. However, you need to consider that an Elasticsearch index also contains Elasticsearch-specific metadata, which will have to be recreated. The trickiest part of the metadata is the mapping, which will have to match the Solr schema precisely for all fields that you care about, and that might not be easy for some data types. Moreover, Elasticsearch expects to find certain internal fields in the index. For example, it wouldn't be able to function without a _uid field indexed and stored for every record.
In the end, even if you overcome all these hurdles, you might end up with a fairly brittle solution, and you will not be able to take advantage of many advanced Elasticsearch features. I would suggest looking into migrating the indexing portion first.
Have you seen ElasticSearch Mock Solr Plugin? I think it might help you in the migration process.
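To make the mapping concern concrete, here is a rough sketch of what would have to line up (the field name `title` and the types are hypothetical; the hard part in practice is making the ES analyzer chain reproduce the Solr field type's analysis exactly). A Solr field declared like this in schema.xml:

```xml
<field name="title" type="text_general" indexed="true" stored="true"/>
```

would need a hand-written Elasticsearch mapping along these lines (pre-2.x style, matching the era of the question), with an analyzer configured to match `text_general`'s tokenizer and filters:

```json
{
  "mappings": {
    "doc": {
      "properties": {
        "title": { "type": "string", "index": "analyzed", "store": true }
      }
    }
  }
}
```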

Related

How lucene works with Neo4j

I am new to Neo4j and Solr/Lucene. I have read that we can use Lucene queries in Neo4j. How does this work? What is the use of a Lucene query in Neo4j?
I also need a suggestion: I need to write an application to search and analyse data. Which might help me more, Neo4j or Solr?
Neo4j uses Lucene as part of its legacy indexing. Right now, Neo4j supports several kinds of indexes, such as labels on nodes and indexes on node properties.
But before Neo4j supported those newer features, it primarily used (and still uses) Lucene for indexing. Most developers would create Lucene indexes on particular node properties, to enable them to use Lucene's query syntax to find nodes within a Cypher query.
For example, if you created an index according to the documentation, you could then search the index for particular values like this:
// Look up nodes in the legacy "actors" index by the exact value of the "name" property
IndexHits<Node> hits = actors.get( "name", "Keanu Reeves" );
Node reeves = hits.getSingle();
It's lucene behind the scenes that's actually doing that finding.
In cypher, it might look like this:
start n=node:node_auto_index('name:M* OR name:N*')
return n;
In this case, you're searching a particular index for all nodes that have a name property that starts either with an "M" or an "N". What's inside of that single quote expression there is just a query according to the lucene query syntax.
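To illustrate the semantics of that expression without Lucene itself, here is a plain Java sketch (the class and names are hypothetical) of which values `name:M* OR name:N*` would match:

```java
import java.util.List;
import java.util.stream.Collectors;

public class PrefixQueryDemo {
    // Mimics what the Lucene query "name:M* OR name:N*" matches:
    // any name starting with "M" or with "N".
    static List<String> matches(List<String> names) {
        return names.stream()
                .filter(n -> n.startsWith("M") || n.startsWith("N"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [Morpheus, Neo]
        System.out.println(matches(List.of("Morpheus", "Neo", "Trinity")));
    }
}
```

In the real index, of course, Lucene answers this with an inverted-index lookup rather than a linear scan.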
OK, so that's how Neo4J uses lucene. In recent versions, I only use these "legacy indexes" for fulltext indexing, which is where lucene's strength is. If I just want fast equality checks (where name="Neo") then I use regular neo4j schema indexes.
As for Solr, I haven't seen it used in conjunction with neo4j - maybe someone will jump in and provide a counter-example, but usually I think of Solr as running on top of a big lucene index, and in the case of neo4j, it's kind of in the middle there, and I'm not sure running Solr would be a good fit.
As for your need to write an application to search and analyze data, I can't give you a recommendation - either Neo4j or Solr might help, depending on your application and what you want to do. In general, use Neo4j when you need to express and search graphs. Use Solr when you need to organize and search large volumes of text documents.

Manipulate Solr index with lucene

I have a Solr core with 100K-1000K documents.
I have a scenario where I need to add or set a field value on most documents.
Doing it through Solr takes too much time.
I was wondering if there is a way to do such a task with the Lucene library and access the Solr index directly (with less overhead).
If needed, I can shut down the core, run my code and reload the core afterwards (hoping it will take less time than doing it with Solr).
It would be great to hear if someone has already done such a thing and what the major pitfalls are along the way.
A similar problem has been discussed multiple times on the Lucene Java mailing list. The underlying problem is that you cannot update a document in place in Lucene (and hence Solr).
Instead, you need to delete the document and insert a new one. This obviously adds the overhead of analyzing, merging index segments, etc. Yet, the specified number of documents isn't anything major and should not take days (have you tried updating Solr with multiple threads?).
You can of course try doing this via Lucene and see if this makes any difference, but you need to be absolutely sure you will be using the same analyzers as Solr does.
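As a sketch of the multi-threaded approach suggested above, here is a self-contained Java outline that splits the document-id range into batches and processes them on a thread pool; `reindexDoc` is a hypothetical stand-in for the real delete + re-add (or atomic update) call against Solr:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelReindex {
    // Hypothetical stand-in for re-adding one document to Solr; a real
    // implementation would issue the delete + add (or atomic update) here.
    static void reindexDoc(int docId, AtomicInteger counter) {
        counter.incrementAndGet();
    }

    // Split the doc-id range into batches and reindex them on a thread pool.
    static int reindexAll(int totalDocs, int batchSize, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger updated = new AtomicInteger();
        for (int start = 0; start < totalDocs; start += batchSize) {
            final int from = start, to = Math.min(start + batchSize, totalDocs);
            pool.submit(() -> {
                for (int id = from; id < to; id++) reindexDoc(id, updated);
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return updated.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(reindexAll(100_000, 1_000, 4)); // prints 100000
    }
}
```

Batching keeps the per-request overhead down, and a handful of threads is usually enough to saturate Solr's indexing pipeline for an update of this size.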
I have a scenario where I need to add or set a field value on most documents.
If you have to do it often, maybe you need to look at things like ExternalFileField. There are limitations, but it may be better than hacking around Solr's infrastructure by going directly to Lucene.
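For reference, an ExternalFileField is declared in schema.xml roughly like this (the field and type names here are hypothetical); the values live in a file named external_&lt;fieldname&gt; in the index directory, so you can rewrite that file without re-indexing any documents:

```xml
<fieldType name="extFloat" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="float"/>
<field name="popularity" type="extFloat" indexed="false" stored="false"/>
```

The main limitation is that such a field is usable only in function queries (e.g. for boosting), not for searching or returning directly.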

Cassandra's secondary index Vs DSE solr indexing

I would like to know the performance difference for Cassandra's secondary index vs. DSE's solr indexing placed on CF's.
We have a few CF's that we did not place secondary indices on because we were under the impression that secondary indices would (eventually) cause significant performance issues for heavy read/write CF's. We are trying to turn to Solr to allow for searching these CF's but it looks like loading an index schema modifies the CF's to have secondary indices on the columns of interest.
Would like to know if Solr indexing is different than Cassandra's secondary indexing? And, will it eventually cause slow queries (inserts/reads) for CFs w/ large data sets and heavy read/writes? If so, would you advise custom indexing (which we wanted to avoid)? Btw -- we're also using (trying to use) Solr for its spatial searching.
Thanks for any advice/links you can give.
UPDATE: To better understand why I’m asking these questions and to see if I am asking the right question(s) – description of our use case:
We’re collecting sensor events – many! We are storing them in both a time series CF (EventTL) and skinny CF (Event). Because we are writing (inserting and updating) heavily in the Event CF, we are not placing any secondary indices. Our queries right now are limited to single events via Event or time range of events through EventTL (unless we create additional fat CF’s to allow range queries on other properties of the events).
That’s where DSE (Solr+Cassandra) might help us. We thought that leveraging Solr searching would allow us to avoid creating extra fat CF’s to allow searches on other properties of the events AND allow us to search on multiple properties at once (location + text/properties). However, looking at how the definition of the Event CF changes after adding an index schema for Event via Solr shows that secondary indices were created. This leads to the question of whether these indices will create issues for inserting/updating rows in Event (eventually). We require being able to insert new events ‘quickly’ – because events can potentially come in at 1000+ per sec.
Would like to know if Solr indexing is different than Cassandra's secondary indexing?
DSE Search uses the Cassandra secondary indexing API.
And, will it eventually cause slow queries (inserts/reads) for CFs w/ large data sets and heavy read/writes?
Capacity planning for Lucene and Solr is a good idea before you exceed the optimal performance threshold of a given server cluster.
If so, would you advise custom indexing (which we wanted to avoid)? Btw -- we're also using (trying to use) Solr for its spatial searching.
DSE Search queries are as fast as Apache Solr queries.
Since your use case is spatial search, I don't think Cassandra's secondary index feature will work for you. Here's a fairly concise article on secondary indexes that you may find useful: http://www.datastax.com/docs/1.1/ddl/indexes
You should be able to do this with Solr.
Here's a post that should be relevant for you:
http://digbigdata.com/geospatial-search-cassandra-datastax-enterprise/

Document tagging

I have a very large Solr index. I want to tag all documents with terms that better represent each document, like this. Do clustering results of this type also count as document tagging?
Which approach is better: index-time document tagging or query-time document tagging, as carrot2 does?
Query time has the obvious drawback that this makes the query more expensive.
However, the clustering results at query time are supposedly better, because at that time, more information has been seen and user feedback can be incorporated.
Note that technically, this is probably more frequent pattern mining than cluster analysis.
Maybe you should just try this variant of frequent pattern mining on your whole data set. You might not even need to store which documents were tagged which way - the solr engine should already be optimized to retrieve them again when needed.
I understood from your question that you want to know how to implement something similar to carrot2 faceting using Solr.
IMO you can add a multivalued field tag to your documents (see this Stack Overflow question for an example) with the cluster names for that doc, and then build facets using that field as explained in the Solr wiki here and here.
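A minimal sketch of that setup (the field name `tag` is hypothetical): declare a multivalued string field in schema.xml,

```xml
<field name="tag" type="string" indexed="true" stored="true" multiValued="true"/>
```

populate it with the cluster/tag names when indexing each document, and then request facet counts with a query like `/select?q=*:*&facet=true&facet.field=tag`.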

Which is better: enabling indexing on an RDBMS or Lucene indexing?

I have an application which uses a traditional database for all of its data, and I need to develop search functionality. I did a small prototype with Lucene and the results are great. Now the bigger question arises: for each user's add/delete/update operation I need to update the DB and the Lucene index too. Would I get similar search performance if I just enabled indexing on a few fields in the traditional DB instead of moving to Lucene? Is it worth the effort?
It depends entirely on the size of the corpus and on the type and frequency of updates.
A separate full-text search solution like Lucene gives you much more flexibility when tweaking relevance, and decoupling the updates of the RDBMS and the full-text index gives you more options when trying to optimize performance.
If you've never played with Lucene, I would strongly recommend using a more high-level solution, like Solr (or websolr), Sphinx, Elasticsearch or IndexTank. Lucene is very low level.
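One common way to get the decoupling mentioned above is to queue index updates and apply them asynchronously, so the RDBMS write path never waits on Lucene. A minimal Java sketch with the actual DB and index calls stubbed out (all names are hypothetical):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class DecoupledIndexer {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private volatile int indexed = 0;

    // The database write returns immediately; the search-index update is queued.
    public void saveRecord(String id) {
        // ... write the record to the RDBMS here ...
        pending.add(id);   // enqueue for asynchronous indexing
    }

    // Background worker that drains the queue and updates the search index.
    public void start() {
        worker.submit(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    String id = pending.take();
                    // ... delete + re-add the document in Lucene/Solr here ...
                    indexed++;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    // Wait (up to a timeout) until at least n records have been indexed.
    public boolean awaitIndexed(int n, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (indexed < n && System.currentTimeMillis() < deadline) Thread.sleep(10);
        return indexed >= n;
    }

    public void stop() { worker.shutdownNow(); }
}
```

The trade-off is eventual consistency: a freshly written record becomes searchable a moment later, which is usually acceptable in exchange for faster writes.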