How to best search against a DB with Lucene? - database

I am looking into mechanisms for better search capabilities against our database. It is currently a huge bottleneck (causing long-lasting queries that are hurting our database performance).
My boss wanted me to look into Solr, but on closer inspection, it seems we actually want some kind of DB integration mechanism with Lucene itself.
From the Lucene FAQ, they recommend Hibernate Search, Compass, and DBSight.
As a background of our current technology stack, we are using straight JSPs on Tomcat, no Hibernate, no other frameworks on top of it... just straight Java, JSP, and JDBC against a DB2 database.
Given that, it seems Hibernate Search might be a bit more difficult to integrate into our system, though it might be nice to have the option of using Hibernate after such an integration.
Does anyone have any experiences they can share with using one of these tools (or other similar Lucene based solutions) that might help in picking the right tool?
It needs to be a FOSS solution, and ideally will manage updating Lucene with changes from the database automagicly (though efficiently), without extra effort to notify the tool when changes have been made (otherwise, it seems rolling my own Lucene solution would be just as good). Also, we have multiple application servers with just 1 database (+failover), so it would be nice if it is easy to use the solution from all application servers seamlessly.
I am continuing to inspect the options now, but it would be really helpful to utilize other people's experiences.

When you say "search against a DB", what do you mean?
Relational databases and information retrieval systems use very different approaches for good reason. What kind of data are you searching? What kind of queries do you perform?
If I were going to implement an inverted index on top of a database, as Compass does, I would not use their approach, which is to implement Lucene's Directory abstraction with BLOBs. Rather, I'd implement Lucene's IndexReader abstraction.
Relational databases are quite capable of maintaining indexes. The value that Lucene brings in this context is its analysis capabilities, which are most useful for unstructured text records. A good approach would leverage the strengths of each tool.
As updates are made to the index, Lucene creates more segments (additional files or BLOBs), which degrade performance until a costly "optimize" procedure is used. Most databases will amortize this cost over each index update, giving you more stable performance.

I have had good experiences with Compass. It has really good integration with hibernate and can mirror data changes made through hibernate and jdbc directly to the Lucene indexes though its GPS devices http://www.compass-project.org/docs/1.2.2/reference/html/gps-jdbc.html.
Maintaining the Lucene indexes on all your application servers may be an issue. If you have multiple App servers updating the db, then you may hit some issues with keeping the index in sync with all the changes. Compass may have an alternate mechanism for handling this now.
The Alfresco Project (CMS) also uses Lucene and have a mechanism for replicating Lucene index changes between servers that may be useful in handling these issues.
We started using Compass before Hibernate Search was really off the ground so I cannot offer any comparison with it.

LuSql http://code.google.com/p/lusql/ allows you to load the contents of a JDBC-accessible database into Lucene, making it searchable. It is highly optimized and multi-threaded. I am the author of LuSql and will be coming out with a new version (re-architected with a new plugable architecture) in the next month.

For a pure performance boost with searching Lucene will certainly help out a lot. Only index what you care about/need and you should be good. You could use Hibernate or some other piece if you like but I don't think it is required.

Well, it seems DBSight doesn't meet the FOSS requirement, so unless it is an absolutely stellar solution, it is not an option for me right now...

Related

Using Neo4j and Lucene in a distributed system

I am looking into Neo4j as a stripped-down document store. A key aspect of document storage is search, and I know Neo4j includes full text search via legacy indices provided by Lucene.
I would be very interested in hearing the limitations of Neo4j search capabilities in a distributed environment. Does it provide a distributed index? In what ways is it inferior to Solr or ElasticSearch? How far can I take it before I must install Solr?
-- EDIT --
We are trying to integrate two distinct search efforts. The first is standard text content search. For instance, using the Enron emails, we want to search for every email that matches "bananas" or "going to the store" and get those document bodies in response. This is where people often turn to Solr.
The second case is more complicated, we have attached a great deal of meta-data to each document. We may have decided that "these" emails were the result of late-night drunk-dialing. Now I want to search for all emails that may have been the result of late-night drunk-dialing. For this kind of meta-data, we believe a graph database is in order.
In a perfect world, I can use one platform to perform both queries. I appreciate that Neo4j (nor OrientDB, Arango, etc) are designed as full text search databases, but I'm trying to understand the limitations thereof.
In terms of volume, we are dealing at a very large scale with batch-style nightly updates. The data is content heavy, with some documents running into hundreds of pages of text, but mostly on the order of a page or two.
I once worked on a health social network where we needed some sort of search and connection search functionalities we first went on neo4j we were very impressed by the cypher query language we could get and express any request however when you throw there billion of nodes you start to pay the price and we started considering another graph db, this time we've made a lot of research, tests and OrientDB was clearly the winner, OrientDB is highly scalable but the thing is that you have to code by yourself, your "search algorithm" if you want to do some advanced things (what is the common point between this two nodes) otherwise you have the SQL like query language (i don't know/remember if he has a name) but you can do some interesting stuff with it
So in conclusion i would definitely go on OrientDB
Neo4j can provide a "distributed index" in the sense that the high availability cluster can make your index available on more than one machine, but I'm pretty sure that's not what you're after. Related to this issue is a different answer I wrote about graph partitioning, and what it takes to distribute a really large number of nodes/relationships across multiple machines. (It's not terribly simple)
Solr and Lucene do two different things (although Solr is built on top of Lucene). I think solr and neo4j are not comparable because they're trying to do completely different things. This site isn't about software recommendations so I can't tell you what you should use other than to say you should read up on solr and neo4j, and figure out which set of functionality you want. As far as I know, this is an exclusive decision as I'm not aware of people integrating solr with neo4j.
Your question is very difficult to answer, I'd recommend expanding on what you are trying to do and what you have tried, you'll probably get better responses.

Recommended Setup for BigData Application

I am currently working on a long term project that will need to support:
Lots of fast Read/Write operations via RESTful Services
An Analytics Engine continually reading and making sense of data
It is vital that the performance of the Analytics Engine not be affected by the volume of Reads/Writes coming from the API calls.
Because of that, I'm thinking that I may have to use a "front-end" database and some sort of "back-end" data warehouse. I would also need to have something like Elastic Search or Solr indexing the data stored in the data warehouse.
The Questions:
Is this a Recommended Setup? What would the alternative be?
If so...
I'm considering either Hive or Pig for the data-warehousing, and Elastic Search or Solr as a Search Engine. Which combination is known to work better together?
And finally...
I'm seriously considering Cassandra as the "fron-end" database. What is the relation between Cassandra and Hadoop, and when/why should they be put to work together instead of having just Cassandra?
Please note, my intention is NOT to start a debate about which of these is better, but to understand how can they be put to work better more efficiently. If it makes any difference, the main code is being written in Scala and Java.
I truly appreciate your help. I'm basically learning as I go and all comments will be very helpful.
Thank you.
First let's talk about Cassandra
This is a NoSQL database with eventual consistency which basically means for you that different nodes into a Cassandra cluster may have different 'snapshots' of data in the case that there is an inter cluster communication/availability problem. The data eventually will be consistent however.
Since you consider it as a 'frontend' database what you need to understand is how you will model your data. Cassandra can take advantage of indexes however you still need to defined upfront your access pattern.
Normally there is no relation between Cassandra and Hadoop (except that both are written in Java) however the Datastax distribution (enterprise version) has Hadoop support directly from Cassandra.
As a general workflow you will read/write most current data (let's say - last 24 hours) from your 'small' database that enough performance (Cassandra has excellent support for it) and you would move anything older than X (older than 24 hours) to a 'long term storage' such as Hadoop where you can run all sort of Map Reduce etc.
In regards to the text search it really depends what you need - Elastic Search is sort of competition to Solr and reverse. You can see yourself how they compare here http://solr-vs-elasticsearch.com/
As for your third question,
I think Cassandra is more like a database to save data.
Hadoop is responsible to provide a compution model to let you analyze your large data in
Cassandra.
So it is very helpful to combine Cassandra with Hadoop.
Also have other ways you can consider, such as combine with mongo and hadoop,
for mongo has support mongo-connector between hadoop and it's data.
Also if you have some search requirements , you can also use solr, directly generated index from mongo.

Which database for good, standard full-text searching abilities

I've been using MySQL happily for many years but have now come across an issue using it and wondered where I should go from here. My issue is full-text indexing. MySQL just doesn't perform well using this feature with large tables, unless you use a third-party plugin like Lucene etc.
I don't mind paying for the database but would prefer a free service. I don't have a DB administration team that can maintain it, it's just me, so it has to be simple to maintain, develop and scale. I develop on a Windows IIS7 environment, usually in ASP.NET and Classic ASP. My application will probably have a maximum of 10 million rows in the full-text table, so not huge but fairly hefty.
I could quite easily grab Lucene and use that with MySQL, but I would really like to know which DB performs best using full-text indexing, straight out of the box, so-to-speak?
Any suggestions or experiences would be marvelous.
cant help a lot other than saying that full text on sql server is amazing. It seems complicated at first (because of catalogs, indexes end everything) but once you give it a go you'll see thats quite simple to implement. This website shows an example with screens
you also have several functions to manipulate (search) the data (the saurus, stoplists, etc..)
Postgres.
In addition, there are Ruby gems to take advantage of it easily.
I worked at a place that used Oracle for full-text search, and they were happy with that until they found Lucene -- now they are switching to Lucene.
I've heard good things about Postgres' full-text search, but I've never seen it in action.
Lucene.NET is a straight .NET port of Lucene, and performs well.

Backend for Web Development using Clojure/ClojureScript

I'm familiar with developing desktop apps in Clojure (written a multithreaded interactive visualization system). However, I'm fairly new to Web development using Clojure.
I plan to use Clojure on the server for handling logic; and ClojureScript for handing client side work. However, I don't know what to use for my database server. Should I use something like Monogodb? or Hadoop? Or .... ?
The app is something very simple; a basic forum. Total number of concurrent users will be < 100 at a given time. One thing that is important to me is the ability to easily backup / data consistency -- it's very very important to me that I can easily make daily backups (and not lose all the data.)
Thanks!
You can use many databases; if the database has an API for Java, you should be good to go. MySQL, MongoDB, Postgres, Hadoop… and more.
For a nice overview of the webstack in Clojure, check out brehaut's article on the matter.
For getting up and running quickly with Clojure and ClojureScript, try ClojureScriptOne.
There are many ways to write what you want to write; if you're already familiar with Clojure, it shouldn't be too hard to get going.
Haven't used it myself, but Datomic ( http://datomic.com/ ) looks great for anyone coming from Clojure.
Datomic is an amazing database, and I'd highly recommend it. It has many features which set it apart from other database systems:
Like Clojure's data structures, it's persistent, meaning that by default, adding new facts to the database doesn't delete old facts, allowing you to query the state of the database at a previous point in time, enhancing audit-ability and assistance in debugging.
The underlying Entity Attribute Value (EAV/triple) data model (at least partly inspired by RDF & the Semantic Web), is extremely flexible, allowing you to express arbitrary graph structures and effortlessly deal with polymorphism.
The query language is flavor of Datalog, a sort of pattern matching based query language strictly more expressive than SQL and the like in that it can do recursive queries, making it particularly well suited for dealing with graph data/queries.
In addition to Datalog queries, there's a pull api, which let's you pull data out of the database more simply using a GraphQL like expression which specifies the shape of a document-like structure you'd like to pull out of the database. These queries can even be used from within the :find clause of a Datalog query.
You can use Clojure functions from within your queries.
The indexing system is very smart and more or less automatic, in stark contrast with the work that typically goes into tuning SQL databases for performance.
Transactions go through a different API/function call than queries, meaning that the number one security risk identified by OWASP (SQL injection) is literally impossible in Datomic.
The transactor/read-replica design makes it super easy to scale reads/queries, while keeping pressure off the transactor.
It's fun as hell.
One of the things worth pointing out here is that by embracing the EAV data model and datalog/pull queries, Datomic ends up having structural flexibility closer to that of a NoSQL database, while still being fundamentally relational, and even more expressive in it's relational queries than SQL.
It's amazing and you should absolutely give it a shot. It will melt your brain a little. In the good way.
It's also worth noting that it's popularity has inspired a number of successful open source projects, so the underlying approach is not going anywhere any time soon:
DataScript: In memory clj/cljs partial implementation
Datahike: Fork of DataScript which queries over on disk indices, meaning you don't have to keep everything in memory to query
Mentat: Mozilla project trying to make a Datomic-alike for a Mozilla project

Database Sharding Support in Propel

Just wonder how good is Propel's support for database sharding? I am thinking about creating my application in PHP, using MySQL as the database server and Propel as the ORM.
I figure out that it may be good to keep the architecture scalable right from the start, just in case my application takes off.
What's your take?
I think that's a very bad idea. Assuming that you need to shard your data is not a good assumption. You don't know, in advance, how you're going to want to scale. Sharding is a very complicated business and needs to be avoided if at all possible. This is an obscene case of premature optimisation.
I agree with MarkR that it's too early to be worrying about sharding, but I disagree that it should be avoided if at all possible. I'd say go with the ORM that seems to fit your style and language choice -- and Propel is probably the right one in your case. Even if your application takes off in a big way, sharding probably won't be necessary -- you can easily pull off 25 million records with a MySQL-based DBMS and some decent caching techniques, so just focus on making your queries fast and design for easy memcache-integration, and you'll be a happy camper even when your app takes off.
Good luck with it!
Propel supports sharding out of the box through connections. check an example here:
http://groups.google.com/group/propel-users/browse_thread/thread/4d19c0668aa17452
Some database sharding middleware can help you.
For example: Apache ShardingSphere (https://shardingsphere.apache.org/document/current/en/features/sharding/)
There are 2 adaptors of ShardingSphere, JDBC and Proxy. JDBC is the java driver to connect database which no fit for PHP, but Proxy is just like the database (MySQL or Postgres).
ShardingSphere-Proxy can isolate the data sharding and business logic. It is better to use third party middleware to handle common problems.

Resources