facet like functionality in NoSQL systems - solr

Does anyone know / can someone point to an nosql db, which would support faceting, like in Apache SOLR, off the shelf?
I have read, that in Sphinx they don't support facet search out of the box, but one can implement it in a form of plugin.
Upd: I'm only interested in enterprise level systems.

CouchDB (Eearlang) and RavenDB (.NET) both are based on Lucene, so it should be possible to make them both support Faceted Search. RavenDB already partially supports that.
And Sphinx isn't a NoSQL DB.

Turnes out, that sphinx does have facetted search feature. It has similar features of a "linguistic" pipeline, distributed search, tokenization etc as the ones in Apache SOLR/Lucene. It is an interesting option to Apache SOLR in the sense, that it is written in C++, yet is language independent on client side, same way as SOLR. It is OS as SOLR/Lucene, so customizing code is possible.

Related

Alternative Solandra for cassandra

Anyone knows an alternative to Solandra in Cassandra?
I can't use "like" clause, and in my case i'll use always.
Thanks.
Datastax provides a "tweaked" version of Apache Solr (which saves data directly into Cassandra instead of flat files) to do real-time full-text search. It's called Datastax Enterprise Solution. Of course is not free.
As an alternative, you can couple Cassandra with an Elastic Search cluster but it's kind of heavy just for text search.
Last but not least, try to implement yourself a full text search using Lucene as engine and some hand-made Cassandra tables for storage, good luck though.
You have 3 options to bring advanced search capabilities with Cassandra:
Datastax Solr as already mentioned
Elassandra = ElasticSearch on Cassandra. https://github.com/strapdata/elassandra and http://www.strapdata.com/ It's a good product, we use it in my company. The community edition is free and the latest release combine Cassandra 3.11 with Elasticsearch 5.5. You will see on their website that there is some free trial hosted solution you could use to test.
Stratio lucene plugin
https://github.com/Stratio/cassandra-lucene-index It's free, it works, we also use it in my company. It's just a jar to drop in the Cassandra lib directory.
Of course, for very basic search needs, you can have a look at SASI too.

How to use Lucene with SQLite Database?

How can i use Lucene to index my SQLite Database for Full-Text Search ?
You can use Hibernate-Lucene bridge, but this will require you to use Hibernate.
Lucene is "nothing more" than java libraries. That means, you have to use those libraries "with something".
One way is the Hibernate-Lucene bridge, as mindas wrote.
An other way (which I'm using) is solr.
You can use solr to index your SQLite Database.
But: you have to send the (full-text-)search-request to solr to run such a search. As far as i know, there is nor lucene integration for SQLite Databases.
You will need an proxy (like solr) and an application, which merges the world of SQLite and solr (or an other lucene "proxy") together.

Is there a project that integrates CouchDb and Solr?

I would like to be able to search a CouchDB database using Solr. Are there any projects that provide such an integration?
I am also aware of CouchDB-Lucene. Is there a way to hook Solr into that?
Thanks!
It would make more sense to roll your own, given how wasy it easy. First you need to decide what kind of SOLR schema to use and how to map your CouchDB documents onto that schema. Then simple iterate through all the documents in a db Pagination in CouchDB? and generate SOLR <add> documents.
People do this all the time with all kinds of data sources. Since SOLR is essentially searching a single table, the hard work is often figuring out how to map your database format onto a single table. Read up on what you can do with the SOLR schema, and you may be surprised at how easy this is.
There is a CouchDB integration for ElasticSearch available, apart from feeding ElasticSearch with JSON on your own. Both work with schema-less JSON, so it's very easy to integrate them.
In terms of features, ElasticSearch would offer a comparable set to Solr (in addition to some unique features, of course.)
According to this
http://wiki.apache.org/couchdb/Related_Projects
there was a CouchDB-Solr2 project (scroll down to the end), which is no longer maintained.

Hadoop to create an Index and Add() it to distributed SOLR... is this possible? Should I use Nutch? ..Cloudera?

Can I use a MapReduce framework to create an index and somehow add it to a distributed Solr?
I have a burst of information (logfiles and documents) that will be transported over the internet and stored in my datacenter (or Amazon). It needs to be parsed, indexed, and finally searchable by our replicated Solr installation.
Here is my proposed architecture:
Use a MapReduce framework (Cloudera, Hadoop, Nutch, even DryadLinq) to prepare those documents for indexing
Index those documents into a Lucene.NET / Lucene (java) compatible file format
Deploy that file to all my Solr instances
Activate that replicated index
If that above is possible, I need to choose a MapReduce framework. Since Cloudera is vendor supported and has a ton of patches not included in the Hadoop install, I think it may be worth looking at.
Once I choose the MatpReduce framework, I need to tokenize the documents (PDF, DOCx, DOC, OLE, etc...), index them, copy the index to my Solr instances, and somehow "activate" them so they are searchable in the running instance. I believe this methodolgy is better that submitting documents via the REST interface to Solr.
The reason I bring .NET into the picture is because we are mostly a .NET shop. The only Unix / Java we will have is Solr and have a front end that leverages the REST interface via Solrnet.
Based on your experience, how does
this architecture look? Do you see
any issues/problems? What advice can
you give?
What should I not do to lose faceting search? After reading the Nutch documentation, I believe it said that it does not do faceting, but I may not have enough background in this software to understand what it's saying.
Generally, you what you've described is almost exactly how Nutch works. Nutch is an crawling, indexing, index merging and query answering toolkit that's based on Hadoop core.
You shouldn't mix Cloudera, Hadoop, Nutch and Lucene. You'll most likely end up using all of them:
Nutch is the name of indexing / answering (like Solr) machinery.
Nutch itself runs using a Hadoop cluster (which heavily uses it's own distributed file system, HDFS)
Nutch uses Lucene format of indexes
Nutch includes a query answering frontend, which you can use, or you can attach a Solr frontend and use Lucene indexes from there.
Finally, Cloudera Hadoop Distribution (or CDH) is just a Hadoop distribution with several dozens of patches applied to it, to make it more stable and backport some useful features from development branches. Yeah, you'd most likely want to use it, unless you have a reason not to (for example, if you want a bleeding edge Hadoop 0.22 trunk).
Generally, if you're just looking into a ready-made crawling / search engine solution, then Nutch is a way to go. Nutch already includes a lot of plugins to parse and index various crazy types of documents, include MS Word documents, PDFs, etc, etc.
I personally don't see much point in using .NET technologies here, but if you feel comfortable with it, you can do front-ends in .NET. However, working with Unix technologies might feel fairly awkward for Windows-centric team, so if I'd managed such a project, I'd considered alternatives, especially if your task of crawling & indexing is limited (i.e. you don't want to crawl the whole internet for some purpose).
Have you looked at Lucandra https://github.com/tjake/Lucandra a Cassandra based back end for Lucense/Solr which you can use Hadoop to populate the Cassandra store with the index of your data.

How to best search against a DB with Lucene?

I am looking into mechanisms for better search capabilities against our database. It is currently a huge bottleneck (causing long-lasting queries that are hurting our database performance).
My boss wanted me to look into Solr, but on closer inspection, it seems we actually want some kind of DB integration mechanism with Lucene itself.
From the Lucene FAQ, they recommend Hibernate Search, Compass, and DBSight.
As a background of our current technology stack, we are using straight JSPs on Tomcat, no Hibernate, no other frameworks on top of it... just straight Java, JSP, and JDBC against a DB2 database.
Given that, it seems Hibernate Search might be a bit more difficult to integrate into our system, though it might be nice to have the option of using Hibernate after such an integration.
Does anyone have any experiences they can share with using one of these tools (or other similar Lucene based solutions) that might help in picking the right tool?
It needs to be a FOSS solution, and ideally will manage updating Lucene with changes from the database automagicly (though efficiently), without extra effort to notify the tool when changes have been made (otherwise, it seems rolling my own Lucene solution would be just as good). Also, we have multiple application servers with just 1 database (+failover), so it would be nice if it is easy to use the solution from all application servers seamlessly.
I am continuing to inspect the options now, but it would be really helpful to utilize other people's experiences.
When you say "search against a DB", what do you mean?
Relational databases and information retrieval systems use very different approaches for good reason. What kind of data are you searching? What kind of queries do you perform?
If I were going to implement an inverted index on top of a database, as Compass does, I would not use their approach, which is to implement Lucene's Directory abstraction with BLOBs. Rather, I'd implement Lucene's IndexReader abstraction.
Relational databases are quite capable of maintaining indexes. The value that Lucene brings in this context is its analysis capabilities, which are most useful for unstructured text records. A good approach would leverage the strengths of each tool.
As updates are made to the index, Lucene creates more segments (additional files or BLOBs), which degrade performance until a costly "optimize" procedure is used. Most databases will amortize this cost over each index update, giving you more stable performance.
I have had good experiences with Compass. It has really good integration with hibernate and can mirror data changes made through hibernate and jdbc directly to the Lucene indexes though its GPS devices http://www.compass-project.org/docs/1.2.2/reference/html/gps-jdbc.html.
Maintaining the Lucene indexes on all your application servers may be an issue. If you have multiple App servers updating the db, then you may hit some issues with keeping the index in sync with all the changes. Compass may have an alternate mechanism for handling this now.
The Alfresco Project (CMS) also uses Lucene and have a mechanism for replicating Lucene index changes between servers that may be useful in handling these issues.
We started using Compass before Hibernate Search was really off the ground so I cannot offer any comparison with it.
LuSql http://code.google.com/p/lusql/ allows you to load the contents of a JDBC-accessible database into Lucene, making it searchable. It is highly optimized and multi-threaded. I am the author of LuSql and will be coming out with a new version (re-architected with a new plugable architecture) in the next month.
For a pure performance boost with searching Lucene will certainly help out a lot. Only index what you care about/need and you should be good. You could use Hibernate or some other piece if you like but I don't think it is required.
Well, it seems DBSight doesn't meet the FOSS requirement, so unless it is an absolutely stellar solution, it is not an option for me right now...

Resources