Which is better: enabling indexing on an RDBMS, or Lucene indexing?

I have an application which uses a traditional database for all of its data, and I need to develop a search functionality. I did a small prototype with Lucene and the results are great. Now the bigger question arises: for each of the users' add/delete/update operations I need to update both the DB and the Lucene index. Will I get similar search performance if I just enable indexing on a few fields in the traditional DB instead of moving to Lucene? Is it worth the effort?

It depends entirely on the size of the corpus and on the type and frequency of updates.
A separate full-text search solution like Lucene gives you much more flexibility when tweaking relevance, and decoupling the updates of the RDBMS from those of the full-text index gives you more options when trying to optimize performance.
If you have never worked with Lucene, I would strongly recommend using a higher-level solution, like Solr (or websolr), Sphinx, Elasticsearch, or IndexTank. Lucene itself is very low level.
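To make the dual-write concern from the question concrete, here is a minimal Python sketch of keeping a relational table and a full-text index in sync on every add/update. The `ToyIndex` class is a stand-in for Lucene (its `add_doc`/`remove_doc`/`search` names are illustrative, not a real Lucene API), and a plain in-memory inverted index plays the role of the search engine:

```python
import sqlite3
from collections import defaultdict

class ToyIndex:
    """Stand-in for a Lucene index: term -> set of document ids."""
    def __init__(self):
        self.postings = defaultdict(set)

    def add_doc(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def remove_doc(self, doc_id):
        for ids in self.postings.values():
            ids.discard(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
index = ToyIndex()

def add_document(doc_id, body):
    db.execute("INSERT INTO docs VALUES (?, ?)", (doc_id, body))
    index.add_doc(doc_id, body)       # second write: keep the index in sync

def update_document(doc_id, body):
    db.execute("UPDATE docs SET body = ? WHERE id = ?", (body, doc_id))
    index.remove_doc(doc_id)          # re-index the changed document
    index.add_doc(doc_id, body)

add_document(1, "lucene search prototype")
add_document(2, "database indexing basics")
update_document(2, "full text search basics")
print(index.search("search"))  # -> [1, 2]
```

Every write path has to touch both stores, which is exactly the coupling the answer suggests managing separately (e.g. with batched or asynchronous index updates).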

Related

Usage of Solr in the e-commerce area: what to index and what not to index

I work in e-commerce, dealing with a huge number of products [300,000] and related data such as attributes, prices, promotions, and collections.
We are thinking of indexing all this information in Solr and querying Solr instead of making any DB hits. We are considering going this way to get good performance, easy sorting, and faceting features [some example]. We also have delta indexing for near-real-time accuracy.
Is this a good decision, or should it be a mix of Solr and the DB? I need some suggestions.
It all depends on your requirements. Solr is not a full replacement for your RDBMS.
You can migrate the relevant or important data from the RDBMS to Solr using Solr's DataImportHandler (DIH) feature.
Whatever searching you want to execute, you can perform it on the Solr server via HTTP requests.
You can also achieve sorting and faceting with Solr.
And yes, delta indexing is also possible in Solr.
Your decision to migrate the search-relevant data is correct in order to improve the search performance of your application.
Yes, it should be a mix of RDBMS and Solr for your application.
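Since the answer mentions querying Solr over HTTP with sorting and faceting, here is a small sketch of building such a request URL. The host, core name (`products`), and field names are hypothetical; the parameter names (`q`, `fq`, `sort`, `facet`, `facet.field`, `rows`) are standard Solr query parameters:

```python
from urllib.parse import urlencode

# Hypothetical Solr core and fields; adjust to your own schema.
base = "http://localhost:8983/solr/products/select"

params = {
    "q": "title:shoes",            # main query on the title field
    "fq": "price:[10 TO 100]",     # filter query: restrict to a price range
    "sort": "price asc",           # sort cheapest first
    "facet": "true",
    "facet.field": "brand",        # facet counts per brand
    "rows": 20,                    # page size
}

url = base + "?" + urlencode(params)
print(url)
```

Sending a GET request to this URL returns matching products plus per-brand facet counts, which covers the sorting and faceting requirements without a DB hit.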

How lucene works with Neo4j

I am new to Neo4j and Solr/Lucene. I have read that we can use Lucene queries in Neo4j. How does this work? What is the use of Lucene queries in Neo4j?
I also need a suggestion: I need to write an application to search and analyse data. Which would help me more, Neo4j or Solr?
Neo4j uses Lucene as part of its legacy indexing. Right now, Neo4j supports several kinds of indexes, such as labels on nodes and indexes on node properties.
But before Neo4j supported those newer features, it primarily used (and still uses) Lucene for indexing. Most developers would create Lucene indexes on particular node properties, to enable them to use Lucene's query syntax to find nodes within a Cypher query.
For example, if you created an index according to the documentation, you could then search the index for particular values like this:
IndexHits<Node> hits = actors.get( "name", "Keanu Reeves" );
Node reeves = hits.getSingle();
It's lucene behind the scenes that's actually doing that finding.
In cypher, it might look like this:
start n=node:node_auto_index('name:M* OR name:N*')
return n;
In this case, you're searching a particular index for all nodes whose name property starts with either an "M" or an "N". What's inside that single-quoted expression is just a query in Lucene's query syntax.
OK, so that's how Neo4j uses Lucene. In recent versions, I only use these "legacy indexes" for full-text indexing, which is where Lucene's strength lies. If I just want fast equality checks (where name="Neo"), I use regular Neo4j schema indexes.
As for Solr, I haven't seen it used in conjunction with Neo4j. Maybe someone will jump in and provide a counter-example, but usually I think of Solr as running on top of a big Lucene index; in Neo4j's case Lucene sits in the middle of the stack, and I'm not sure running Solr on top would be a good fit.
As for your need to write an application to search and analyze data, I can't give you a firm recommendation: either Neo4j or Solr might help, depending on your application and what you want to do. In general, use Neo4j when you need to express and search graphs; use Solr when you need to organize and search large volumes of text documents.

Cassandra's secondary index vs. DSE Solr indexing

I would like to know the performance difference for Cassandra's secondary index vs. DSE's solr indexing placed on CF's.
We have a few CF's that we did not place secondary indices on because we were under the impression that secondary indices would (eventually) cause significant performance issues for heavy read/write CF's. We are trying to turn to Solr to allow for searching these CF's but it looks like loading an index schema modifies the CF's to have secondary indices on the columns of interest.
I would like to know whether Solr indexing is different from Cassandra's secondary indexing. And will it eventually cause slow queries (inserts/reads) for CFs with large data sets and heavy read/writes? If so, would you advise custom indexing (which we wanted to avoid)? By the way, we're also trying to use Solr for its spatial searching.
Thanks for any advice/links you can give.
UPDATE: To better understand why I’m asking these questions and to see if I am asking the right question(s) – description of our use case:
We’re collecting sensor events – many! We are storing them in both a time series CF (EventTL) and skinny CF (Event). Because we are writing (inserting and updating) heavily in the Event CF, we are not placing any secondary indices. Our queries right now are limited to single events via Event or time range of events through EventTL (unless we create additional fat CF’s to allow range queries on other properties of the events).
That’s where DSE (Solr+Cassandra) might help us. We thought that leveraging Solr searching would allow us to avoid creating extra fat CF’s to allow searches on other properties of the events AND allow us to search on multiple properties at once (location + text/properties). However, looking at how the definition of the Event CF changes after adding an index schema for Event via Solr shows that secondary indices were created. This leads to the question of whether these indices will create issues for inserting/updating rows in Event (eventually). We require being able to insert new events ‘quickly’ – because events can potentially come in at 1000+ per sec.
Would like to know if Solr indexing is different than Cassandra's secondary indexing?
DSE Search uses the Cassandra secondary indexing API.
And will it eventually cause slow queries (inserts/reads) for CFs with large data sets and heavy read/writes?
Lucene and Solr capacity planning is a good idea prior to exceeding the optimal performance threshold of a given server cluster.
If so, would you advise custom indexing (which we wanted to avoid)? Btw -- we're also trying to use Solr for its spatial searching.
DSE Search queries are as fast as Apache Solr queries.
Since your use case is spatial search, I don't think Cassandra's secondary index feature will work for you. Here's a fairly concise article on secondary indexes that you may find useful: http://www.datastax.com/docs/1.1/ddl/indexes
You should be able to do this with Solr.
Here's a post that should be relevant for you:
http://digbigdata.com/geospatial-search-cassandra-datastax-enterprise/
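To illustrate the kind of spatial query the linked article covers, here is a sketch of building a Solr `geofilt` request for DSE Search. The field name `location` is a hypothetical spatial field from your schema; `{!geofilt}`, `pt`, and `d` are standard Solr spatial parameters (point and distance in kilometers):

```python
from urllib.parse import urlencode

# Hypothetical schema: "location" is an indexed spatial field on the Event core.
params = {
    "q": "*:*",
    "fq": "{!geofilt sfield=location}",  # Solr's spatial filter query parser
    "pt": "37.7749,-122.4194",           # center point (lat,lon)
    "d": 10,                             # search radius in km
}

query_string = urlencode(params)
print(query_string)
```

Appended to the core's `/select` handler URL, this returns only events within 10 km of the given point, and it can be combined with further `fq` clauses on other event properties for the multi-property searches described above.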

What to be aware of when querying an index with Elasticsearch when indexing with SOLR?

As part of a refactoring project I'm moving our querying end to Elasticsearch. The goal is to eventually refactor the indexing end to ES as well, but this is pretty involved, and the indexing part is running stably, so it has lower priority.
This leads to a situation where a Lucene index is created/indexed using Solr and queried using Elasticsearch. To my understanding this should be possible, since ES and Solr both create Lucene-compatible indexes.
Just to be sure: besides some housekeeping in ES to point to the correct index, is there any unforeseen trouble I should be aware of when doing this?
You are correct: a Lucene index is part of an elasticsearch index. However, you need to consider that an elasticsearch index also contains elasticsearch-specific metadata, which will have to be recreated. The trickiest part of the metadata is the mapping, which will have to match the Solr schema precisely for all fields that you care about, and that might not be easy for some data types. Moreover, elasticsearch expects to find certain internal fields in the index. For example, it wouldn't be able to function without the _uid field indexed and stored for every record.
In the end, even if you overcome all these hurdles, you might end up with a fairly brittle solution, and you will not be able to take advantage of many advanced elasticsearch features. I would suggest looking into migrating the indexing portion first.
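To give a feel for the mapping-matching work described above, here is a hedged sketch that hand-builds an elasticsearch mapping mirroring a hypothetical Solr schema (a `text_general` title field and a `float` price field). The field names, Solr types, and the translation rules are all assumptions for illustration; the key point is that analyzers and types must line up with what Solr used at index time, or queries will tokenize differently:

```python
import json

# Hypothetical Solr schema fragment we want to mirror in ES.
solr_fields = {
    "title": {"solr_type": "text_general", "stored": True},
    "price": {"solr_type": "float", "stored": True},
}

def to_es_mapping(fields):
    """Translate the toy Solr field specs into an ES-style mapping dict."""
    props = {}
    for name, spec in fields.items():
        if spec["solr_type"].startswith("text"):
            # Analyzed text: the analyzer must match Solr's index-time analysis.
            props[name] = {"type": "string", "analyzer": "standard"}
        elif spec["solr_type"] == "float":
            props[name] = {"type": "float"}
    return {"properties": props}

mapping = to_es_mapping(solr_fields)
print(json.dumps(mapping, indent=2))
```

Doing this by hand for every field and data type is exactly the brittle, error-prone step the answer warns about, which is why migrating the indexing side first is the safer path.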
Have you seen ElasticSearch Mock Solr Plugin? I think it might help you in the migration process.

Is it advisable to use Lucene for this?

I have a huge XML file, about 2 GB in size, containing resumes. There are thousands of resumes in this file, tagged properly. Right now I am using XPath to query it. Would it be advisable to use Lucene for this instead of XPath?
It depends on what your requirements are. If you need full-text searching and all the other great features of a full-blown search engine, Lucene is the way to go. I would recommend Solr, which builds on top of Lucene and provides a much better API and abstraction.
Like everything else technology related, it depends.
What Lucene gives you that you're not getting with XPath is the power of a full-text engine that supports, among other things, ranking, phrase queries, wildcard queries, etc.
Based on your use case I would say that a full-text search engine makes sense. That's not to say that vanilla Lucene is the best way to go (there are, for example, alternatives that build on Lucene).
2 GB seems small enough that I would construct my own (minimal) inverted index :) However, there is no problem in using Lucene/Solr; go ahead. It will help you once your records start doubling. That said, at this scale (2 GB), or even at much larger scales, many real-life systems work on a database's full-text search using the SQL LIKE keyword.
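The minimal inverted index this answer mentions can be sketched in a few lines. The XML layout below (a `<resumes>` root with `<resume id>` children holding a `<skills>` element) is a guess at the questioner's file, used only for illustration:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny stand-in for the 2 GB resume file; the element names are assumptions.
xml_data = """
<resumes>
  <resume id="r1"><skills>java lucene solr</skills></resume>
  <resume id="r2"><skills>python xpath xml</skills></resume>
</resumes>
"""

# Build the inverted index once: term -> set of resume ids.
index = defaultdict(set)
root = ET.fromstring(xml_data)
for resume in root.findall("resume"):
    rid = resume.get("id")
    for term in resume.findtext("skills", "").split():
        index[term].add(rid)

# Each lookup is now a dict hit instead of an XPath scan over the whole file.
print(sorted(index["lucene"]))  # -> ['r1']
```

For the real file you would parse incrementally (e.g. `ET.iterparse`) rather than load 2 GB at once, and Lucene/Solr adds tokenization, ranking, and phrase/wildcard queries on top of this basic idea.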
