How to go about indexing 300,000 text files for search?

I have a static collection of over 300,000 text and HTML files. I want to be able to search them for words, exact phrases, and ideally regex patterns, and I want the searches to be fast.
I think word and phrase searches can be handled by looking up a dictionary of unique words that maps each word to the files containing it, but is there a way to get reasonably fast regex matching as well?
I don't mind using existing software if it exists.
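A minimal sketch of the dictionary approach the question describes (hypothetical Java; the class and field names are made up for illustration):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.*;

    // Inverted index: each unique word maps to the set of files containing it.
    public class InvertedIndex {
        private final Map<String, Set<Path>> postings = new HashMap<>();

        public void addFile(Path file) throws IOException {
            for (String word : Files.readString(file).toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    postings.computeIfAbsent(word, k -> new HashSet<>()).add(file);
                }
            }
        }

        // Files containing every query word; a phrase search would then re-scan
        // only these candidate files for the exact word sequence.
        public Set<Path> search(String... words) {
            Set<Path> result = null;
            for (String w : words) {
                Set<Path> hits = postings.getOrDefault(w.toLowerCase(), Collections.emptySet());
                if (result == null) result = new HashSet<>(hits);
                else result.retainAll(hits);
            }
            return result == null ? Collections.emptySet() : result;
        }
    }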

Consider Lucene http://lucene.apache.org/java/docs/index.html
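A minimal sketch of indexing and querying with Lucene (assuming a recent Lucene version; the "index" and "corpus" paths and the "path"/"body" field names are arbitrary). Note that Lucene's RegexpQuery matches a regular expression against indexed terms, which covers the regex requirement for single-term patterns:

    import java.nio.file.*;
    import java.util.List;
    import java.util.stream.Collectors;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.FSDirectory;

    public class LuceneExample {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(Paths.get("index"));

            // Index every file: store the path, index (but don't store) the text.
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(dir, cfg)) {
                List<Path> files = Files.walk(Paths.get("corpus"))
                                        .filter(Files::isRegularFile)
                                        .collect(Collectors.toList());
                for (Path file : files) {
                    Document doc = new Document();
                    doc.add(new StringField("path", file.toString(), Field.Store.YES));
                    doc.add(new TextField("body", Files.readString(file), Field.Store.NO));
                    writer.addDocument(doc);
                }
            }

            // Phrase query, plus a regex query evaluated against the term dictionary.
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
            TopDocs phraseHits = searcher.search(new PhraseQuery("body", "exact", "phrase"), 10);
            TopDocs regexHits = searcher.search(new RegexpQuery(new Term("body", "colou?r")), 10);
        }
    }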

There are quite a few tools on the market that will help you achieve what you want; some are open source and some are commercial:
Open source:
elasticsearch - based on Lucene
constellio - based on Lucene
Sphinx - written in C++
Solr - built on top of Lucene

You can have a look at Microsoft Search Server Express 2010: http://www.microsoft.com/enterprisesearch/searchserverexpress/en/us/technical-resources.aspx

http://blog.webdistortion.com/2011/05/29/open-source-search-engines/

Related

Does SOLR support percolation

ElasticSearch has a percolator for prospective search. Does Solr have a similar feature where you define your query upfront? If not, is there an effective way of implementing this myself on top of the existing Solr features?
Besides what BunkerMentality said, it is not hard to build your own percolator. What you need:
Are the queries you want to run easy to model in Lucene-only syntax? If so, you are good; if not, you need to convert them to Lucene syntax first. Build them, and keep them in memory as Lucene queries.
When a doc arrives:
build a MemoryIndex containing only that single doc
run all your queries on the index
I have done this for a system ingesting millions of docs a day, and it worked fine.
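A minimal sketch of that loop (assuming a recent Lucene version; the "registered" map of stored queries and the "body" field name are made up for illustration):

    import java.util.*;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.Query;

    public class Percolator {
        private final Map<String, Query> registered = new HashMap<>(); // stored queries, keyed by subscriber id
        private final Analyzer analyzer = new StandardAnalyzer();

        // Returns the id of every registered query that matches the incoming doc.
        public List<String> percolate(String docText) {
            MemoryIndex index = new MemoryIndex();      // index containing only this single doc
            index.addField("body", docText, analyzer);
            List<String> matches = new ArrayList<>();
            for (Map.Entry<String, Query> e : registered.entrySet()) {
                if (index.search(e.getValue()) > 0.0f) { // score > 0 means the query matched
                    matches.add(e.getKey());
                }
            }
            return matches;
        }
    }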
It's listed as an open feature request, SOLR-4587, in the Solr JIRA, but it doesn't seem like any work has started on it yet.
There is a link in the comments there to a separate project called Luwak that seems to implement some features similar to percolator.
If it is still relevant, you can use this: a Solr update processor based on Luwak.

Any good guides for writing custom Riak SOLR search analyzers?

In short, I need to search against my Riak buckets via Solr. The only problem is that, by default, Solr searches are case-sensitive. After some digging, I see that I need to write a custom Solr text analyzer schema. Does anyone have any good references for writing search analyzer schemas?
And finally, when installing a new schema for an index, is re-indexing all objects in a bucket necessary for prior results to show up in a search (using the new schema)?
RTFM fail... I swear, though, getting to this page was not easy:
http://docs.basho.com/riak/latest/dev/advanced/search-schema/#Defining-a-Schema
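For the case-sensitivity part, the usual fix is a field type whose analyzer lowercases tokens at both index and query time; a sketch of such a schema entry (the "text_ci" and "body" names are made up):

    <fieldType name="text_ci" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <field name="body" type="text_ci" indexed="true" stored="true"/>

And on the second question: analysis happens at index time, so objects written before the schema change do need to be re-indexed before they show up in searches under the new schema.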

How to get synonyms.txt to be used in solr for different languages

Do you know if I can get synonyms.txt files for all the languages supported by Solr?
Thanks for your help.
We were previously using Verity, which provides a dictionary of synonyms for each supported language, but we may move to Solr/Lucene.
I know that we can provide a custom synonym list, but that is not what I want. I am looking for a default dictionary of synonyms for each language supported by Lucene.
There is no 'out of the box' synonym resource provided for all the languages.
At least for some, there is WordNet (which is free); see the Solr wiki for WordNet usage.
A list of synonyms for any language is going to be very specific to the use cases around a given set of indexed items. For that reason, it would not be practical to ship prebuilt language-specific versions of these files. Even the synonyms.txt that comes with the Solr distribution is only built out enough to show examples of how synonyms can be constructed.
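For reference, the synonyms.txt format itself is simple; the entries below are made-up examples:

    # comma-separated groups are treated as mutual synonyms
    GB, gigabyte, gigabytes
    sofa, couch, divan
    # "=>" maps the left-hand terms to the right-hand replacements
    small => tiny, little, mini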

Document search in Lucene/Solr, Whoosh, Sphinx, Xapian

I am comparing Lucene/Solr, Whoosh, Sphinx, and Xapian for searching documents in DOC, DOCX, HTML, and PDF. Only Solr is documented to have a document parser (Tika) that directly indexes documents, so it seems a clear winner.
But to level the playing field, I'd like to consider the alternatives. Do the others have direct document indexing (which I may have missed)? If not, can it be implemented easily? Or is Solr the overwhelming choice?
With Sphinx you can convert files using a PHP script through the xmlpipe_command option. Since PHP has a Tika wrapper, writing the script and the setup itself aren't hard.
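A sketch of the Sphinx side of that setup (the script path is hypothetical; the PHP script would run each file through a Tika wrapper and print xmlpipe2 XML to stdout):

    source documents
    {
        type            = xmlpipe2
        xmlpipe_command = php /path/to/tika_to_xmlpipe.php
    }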

Sunspot / Solr / Lucene : Find similar article

Let's say we have a list of articles that are indexed by Sunspot/Solr/Lucene (or any other search engine).
How can this be used to find articles similar to a given article?
Should this be done with a summarizing tool, like http://www.wordsfinder.com/api_Keyword_Extractor.php, termextract from http://developer.yahoo.com/yql/console, or http://www.alchemyapi.com/api/demo.html?
It seems you're looking for the MoreLikeThis feature.
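In Solr, MoreLikeThis can be enabled as a search component on a normal query; a request along these lines (the host, core URL, and field names are hypothetical) returns similar documents for each hit:

    http://localhost:8983/solr/select?q=id:12345&mlt=true&mlt.fl=title,body&mlt.mindf=1&mlt.mintf=1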
What you are trying to do is very similar to the task I outlined in this answer.
In brief, you need to generate a summary for each document that you can use as the query to compare it with every other document. A document summary could be as simple as the top N terms in that document (excluding stop words). You can generate the top N terms from a Lucene document pretty easily without any 3rd-party tools; there are plenty of examples on SO and the web for doing this.
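If you are at the Lucene level, the MoreLikeThis class automates exactly that extract-top-terms-and-query approach; a minimal sketch, assuming a recent Lucene version and an index with a "body" field:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class SimilarArticles {
        public static void main(String[] args) throws Exception {
            IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")));
            IndexSearcher searcher = new IndexSearcher(reader);

            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setAnalyzer(new StandardAnalyzer()); // needed when the fields lack term vectors
            mlt.setFieldNames(new String[] {"body"}); // fields to mine for interesting terms
            mlt.setMinTermFreq(1);
            mlt.setMinDocFreq(1);

            int seedDoc = 42; // internal Lucene doc number of the source article
            Query query = mlt.like(seedDoc);
            TopDocs similar = searcher.search(query, 10); // 10 most similar articles
        }
    }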
