Document search in Lucene/Solr, Whoosh, Sphinx, Xapian

I am comparing Lucene/Solr, Whoosh, Sphinx, and Xapian for searching documents in DOC, DOCX, HTML, and PDF. Only Solr is documented to have a document parser (Tika) that indexes documents directly, so it seems a clear winner.
But to level the playing field, I would like to consider the alternatives. Do the others have direct document indexing that I may have missed? If not, can it be implemented easily? Or is Solr the overwhelming choice?

On Sphinx you can convert files using a PHP script through the xmlpipe_command option. Since PHP has a Tika wrapper, writing the script and the setup itself aren't hard.
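A minimal sketch of that setup, assuming a hypothetical extract.php that feeds each file through the Tika wrapper (paths and names are placeholders):

    # sphinx.conf (fragment): an xmlpipe2 source that shells out to PHP
    source docpipe
    {
        type            = xmlpipe2
        # extract.php walks the document folder, extracts text with Tika,
        # and prints an xmlpipe2 docset like the one below to stdout
        xmlpipe_command = php /path/to/extract.php
    }

The script's output would look roughly like:

    <?xml version="1.0" encoding="utf-8"?>
    <sphinx:docset>
      <sphinx:schema>
        <sphinx:field name="content"/>
      </sphinx:schema>
      <sphinx:document id="1">
        <content>...text extracted by Tika...</content>
      </sphinx:document>
    </sphinx:docset>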

Related

Does SOLR support percolation

ElasticSearch has percolator for prospective search. Does SOLR have a similar feature where you define your query upfront? If not, is there an effective way of implementing this myself on top of the existing SOLR features?
Besides what BunkerMentality said, it is not hard to build your own percolator. What you need:
Are the queries you want to run easy to model in Lucene-only syntax? If so, you are good; if not, you need to convert them to Lucene syntax. Build them, and keep them in memory as Lucene queries.
When a doc arrives:
build a MemoryIndex containing only that single doc
run all your queries on the index
I have done this for a system ingesting millions of docs a day and it worked fine.
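A minimal sketch of that approach using Lucene's MemoryIndex (the field name, analyzer choice, and the query map are assumptions for illustration):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.Query;

    public class Percolator {
        private final Analyzer analyzer = new StandardAnalyzer();
        private final Map<String, Query> queries; // pre-parsed Lucene queries, kept in memory

        public Percolator(Map<String, Query> queries) {
            this.queries = queries;
        }

        // Returns the ids of all registered queries that match the incoming doc.
        public List<String> percolate(String body) {
            // A throwaway one-document index, rebuilt for every incoming doc
            MemoryIndex index = new MemoryIndex();
            index.addField("body", body, analyzer);

            List<String> matches = new ArrayList<>();
            for (Map.Entry<String, Query> e : queries.entrySet()) {
                if (index.search(e.getValue()) > 0.0f) { // score > 0 means a hit
                    matches.add(e.getKey());
                }
            }
            return matches;
        }
    }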
It's listed as an open feature request, SOLR-4587, in the Solr JIRA, but it doesn't seem like any work has started on it yet.
There is a link in the comments there to a separate project called Luwak that seems to implement some features similar to the percolator.
If it is still relevant, you can use this: a Solr update processor based on Luwak.

Any good guides for writing custom Riak SOLR search analyzers?

In short, I need to search against my Riak buckets via Solr. The only problem is that, by default, Solr searches are case-sensitive. After some digging, I see that I need to write a custom Solr text analyzer schema. Does anyone have any good references for writing search analyzer schemas?
And finally, when installing a new schema for an index, is re-indexing all objects in a bucket necessary for prior results to show up in a search (using the new schema)?
RTFM fail... I swear, though, getting to this page was not easy:
http://docs.basho.com/riak/latest/dev/advanced/search-schema/#Defining-a-Schema
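For reference, a case-insensitive text field type in a Solr schema generally looks like the fragment below (the type name is a placeholder; the Basho page above covers installing the schema into Riak):

    <fieldType name="text_lower" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- LowerCaseFilterFactory lower-cases terms at both index and query time -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

As for the second question: an index-time analyzer only applies to documents indexed after the change, so existing objects do need to be re-indexed to be searchable under the new schema.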

How to show the contents of files when searching in alfresco

When I search for particular content, the result shows the file that contains it. How can I show the line in which that content appears?
I know Alfresco uses Lucene; can I use the Lucene highlighter? If yes, how do I use it in Alfresco?
What about Solr, can I use that?
Alfresco 4.2.e without modifications means that you're using Solr.
AFAIK there is no add-on that adds hit-highlighting to Alfresco's Solr search subsystem.
It's on the roadmap.
There are quite a few posts regarding hit-highlighting in Alfresco based on Lucene.
Alfresco 5.2 seems to have this feature: the searched-for string is highlighted with context in the search results.
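For reference, outside Alfresco the plain Lucene Highlighter is used roughly like this (a standalone sketch; the field name "content" and the <em> markup are illustrative choices):

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
    import org.apache.lucene.search.highlight.QueryScorer;
    import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

    public class HighlightDemo {
        // Returns the best-scoring fragment of 'text' for 'query', with
        // matches wrapped in <em> tags, or null if nothing matched.
        static String snippet(Query query, String text)
                throws IOException, InvalidTokenOffsetsException {
            Analyzer analyzer = new StandardAnalyzer();
            Highlighter highlighter = new Highlighter(
                    new SimpleHTMLFormatter("<em>", "</em>"),
                    new QueryScorer(query));
            return highlighter.getBestFragment(analyzer, "content", text);
        }
    }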

How to extract solr index docs

I need to do some transformations on docs before indexing them in Solr, but the texts come from various sources and it's difficult to do the transformations beforehand, because I would have to adapt several programs to parse the files. I'm thinking of indexing them in Solr, extracting the text fields, doing the transformations, and reindexing.
I tried :
curl 'http://localhost:8983/solr/collection1/select?q=*&rows=20000&wt=xml&indent=true'
But the output is a results XML file, while I'm looking for a way to extract the docs with their fields, as in the posting format. Is this possible? How should I do it?
Thanks
I would recommend using one of the Solr Clients listed on the Integrating Solr page. This will allow you to use your programming language of choice to extract and transform the Solr documents and then reload them into the index.
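A minimal SolrJ round trip along those lines (a sketch, assuming the collection1 core from the question and placeholder field names; the hypothetical transform() stands in for your own logic):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class ReindexDemo {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/collection1").build();

            // Pull the stored fields of every doc (paging omitted for brevity)
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(20000);
            for (SolrDocument doc : solr.query(q).getResults()) {
                SolrInputDocument out = new SolrInputDocument();
                out.addField("id", doc.getFieldValue("id"));
                // transform() is a placeholder for your own transformations
                out.addField("text", transform((String) doc.getFieldValue("text")));
                solr.add(out);
            }
            solr.commit();
            solr.close();
        }

        static String transform(String text) { return text; } // your logic here
    }

Note that only stored fields can be read back out of the index this way, and for large result sets you would page through the results (start/rows or a cursor) rather than fetch them in one query.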

How to go about indexing 300,000 text files for search?

I have a static collection of over 300,000 text and html files. I want to be able to search them for words, exact phrases, and ideally regex patterns. I want the searches to be fast.
I think searching for words and phrases can be done by looking up a dictionary of unique words that references the files containing each word, but is there a way to get reasonably fast regex matching?
I don't mind using existing software if such exists.
Consider Lucene: http://lucene.apache.org/java/docs/index.html
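Lucene also covers the regex part: RegexpQuery runs a regular expression against the term dictionary, which keeps it reasonably fast. A minimal sketch against a recent (8.x+) Lucene API; the field name and sample text are made up:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.RegexpQuery;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class RegexSearchDemo {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory(); // use FSDirectory for 300,000 files
            try (IndexWriter writer = new IndexWriter(dir,
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new TextField("body",
                        "indexing three hundred thousand files", Field.Store.YES));
                writer.addDocument(doc);
            }
            // The pattern is matched against indexed terms (not whole lines),
            // so anchored prefixes like this stay cheap
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
            int hits = searcher.count(new RegexpQuery(new Term("body", "hundre.*")));
            System.out.println(hits);
        }
    }

Note that RegexpQuery matches individual indexed terms, not spans across whole lines or phrases.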
There are quite a few options on the market that will help you achieve what you want; some are open source and some are commercial:
Open source:
Elasticsearch - based on Lucene
Constellio - based on Lucene
Sphinx - written in C++
Solr - built on top of Lucene
You can have a look at Microsoft Search Server Express 2010: http://www.microsoft.com/enterprisesearch/searchserverexpress/en/us/technical-resources.aspx
http://blog.webdistortion.com/2011/05/29/open-source-search-engines/
