Full-text search engine in Hebrew - Solr

I want to try and use Elasticsearch as a full text search engine for a website in Hebrew.
I wanted to know whether Elasticsearch can produce good results for Hebrew, and whether there are any big websites in Israel that use it as their search engine.
If not Elasticsearch - maybe Apache Solr?
By the way - I'm using Ruby, but can work with Java as well.
Thanks!

Have a look at the ICU plugin for Elasticsearch.
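For example, once the analysis-icu plugin is installed, you could define a custom analyzer from its ICU tokenizer and folding filter in the index settings. This is only a minimal sketch; the index name and the "hebrew_icu" analyzer name are placeholders:

    curl -XPUT 'http://localhost:9200/hebrew_site' -d '{
      "settings": {
        "analysis": {
          "analyzer": {
            "hebrew_icu": {
              "type": "custom",
              "tokenizer": "icu_tokenizer",
              "filter": ["icu_folding"]
            }
          }
        }
      }
    }'

Fields that should be searchable in Hebrew would then point at this analyzer in their mapping ("analyzer": "hebrew_icu").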
David.

Solr seems to support Hebrew; see the links to language analysers below:
Solr language analysis in Hebrew
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory
Although I am not certain what the options for Elasticsearch are.
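For Solr, a minimal schema.xml field type using that tokenizer factory might look like the sketch below (the field type and field names are just examples; note that the ICU factories ship in the analysis-extras contrib, so the ICU jars have to be on Solr's classpath):

    <fieldType name="text_he" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="body" type="text_he" indexed="true" stored="true"/>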

Look at HebMorph - http://www.code972.com/blog/hebmorph/
It's a Lucene plugin, and we've been using it on http://alpha.gov.il and http://www.guidestar.org.il/

Take a look at Algolia
By design the Algolia engine is language agnostic. Out of the box, it supports all languages / alphabets, including symbol-based languages such as Chinese, Japanese and Korean.
Additionally, Algolia handles multiple languages on the same website/app, meaning some users could search in French and some in English, using the same Algolia account in the background.
The purpose of this guide is to explain how to organize your indices to enable multi-language search.
Taken from here

Related

Choose Lucene or Solr

We need to integrate a search engine into our catalog management software platform in SharePoint. The information is stored in multiple databases and in a file store (doc, ppt, pdf, ...). Our dev platform is ASP.NET and we have done some preliminary work with Lucene and found it to be good. However, we just came to know of Solr.
We would like to continue using Lucene, but we need to defend that choice against Solr.
Any help is appreciated.
And sorry for my English.
Lucene is a full-text search library used to provide search functionality to an application; it can't be used as an application by itself. Solr is a complete search engine built around Lucene, exposing Lucene's search functionality plus additional features. Solr is a web application that can be used on its own without any development around it.
If you need a search engine that your application can call, I recommend using Solr.
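To illustrate the difference: with raw Lucene your application code opens and queries the index itself, while with Solr it just talks to a server over HTTP. A rough SolrJ sketch of the latter (the URL, core name and field are placeholders, and the exact client classes vary by SolrJ version; from .NET a client such as SolrNet plays the same role):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SolrSearchExample {
        public static void main(String[] args) throws Exception {
            // Query a running Solr core over HTTP; no index management inside the application
            HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/catalog").build();
            SolrQuery query = new SolrQuery("title:sharepoint");
            query.setRows(10);
            QueryResponse response = solr.query(query);
            response.getResults().forEach(doc -> System.out.println(doc.getFieldValue("title")));
            solr.close();
        }
    }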

How to get synonyms.txt to be used in Solr for different languages

Do you know if I can get synonyms.txt files for all the languages supported by Solr?
Thanks for your help.
We were previously using Verity, which provides a dictionary of synonyms for each supported language, but we may want to move to Solr/Lucene.
I know that we can provide a custom synonym list, but that is not what I want. I am looking for a way to have a default dictionary of synonyms for each language supported by Lucene.
There is no 'out of the box' synonym resource provided for all the languages.
At least for some languages you have WordNet (which is free); see the Solr wiki for WordNet usage.
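For English, for instance, Solr's synonym filter can read the WordNet prolog file (wn_s.pl from the WordNet prolog download) directly. A hedged sketch of the relevant analyzer chain; whether you expand synonyms at index or query time is a separate decision:

    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="wn_s.pl"
              format="wordnet" ignoreCase="true" expand="true"/>
    </analyzer>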
A list of synonyms for any language is going to be very specific to the use cases related to a given set of indexed items. For that reason it would not be practical to have prebuilt language-specific versions of these files. Even the synonyms.txt that comes with the Solr distribution is only built out enough to show examples of how synonyms can be constructed.

Document search in Lucene/Solr, Whoosh, Sphinx, Xapian

I am comparing Lucene/Solr, Whoosh, Sphinx and Xapian for searching documents in DOC, DOCX, HTML and PDF. Only Solr is documented to have a document parser (Tika) which directly indexes documents. So it seems a clear winner.
But to level the playing field, I'd like to consider the alternatives. Do the others have direct document indexing (which I may have missed)? If not, can it be implemented easily? Or is Solr the overwhelming choice?
With Sphinx you're able to convert files using a PHP script through the xmlpipe_command option. Since PHP has a Tika wrapper, writing the script and the setup itself aren't hard.
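As a rough sketch (the script path and name are hypothetical), the Sphinx side is just a source block whose xmlpipe_command runs the PHP script, and the script prints an xmlpipe2 document stream built from the text Tika extracts:

    source documents
    {
        type            = xmlpipe2
        xmlpipe_command = php /path/to/tika_extract.php
    }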

How to go about indexing 300,000 text files for search?

I have a static collection of over 300,000 text and html files. I want to be able to search them for words, exact phrases, and ideally regex patterns. I want the searches to be fast.
I think searching for words and phrases can be done by looking up a dictionary of unique words referencing the files that contain each word, but is there a way to have reasonably fast regex matching?
I don't mind using existing software if such exists.
Consider Lucene http://lucene.apache.org/java/docs/index.html
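A rough Lucene sketch for a corpus like this (paths and the "content"/"path" field names are placeholders, and exact class names vary a little between Lucene versions). One caveat: Lucene's RegexpQuery matches a regular expression against individual indexed terms, not whole documents, so it covers word-level patterns rather than arbitrary multi-word regexes:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;

    public class CorpusIndexer {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(Paths.get("/tmp/corpus-index"));
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
            // One Lucene document per file: full text in "content", file location in "path"
            try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get("/data/corpus"))) {
                for (Path p : files) {
                    String text = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
                    Document doc = new Document();
                    doc.add(new TextField("content", text, Field.Store.NO));
                    doc.add(new StringField("path", p.toString(), Field.Store.YES));
                    writer.addDocument(doc);
                }
            }
            writer.close();

            // Search: the regex is applied term-by-term against the inverted index
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
            TopDocs hits = searcher.search(new RegexpQuery(new Term("content", "colou?r")), 10);
            System.out.println("matching files: " + hits.totalHits);
        }
    }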
There are quite a few options on the market that will help you achieve what you want; some are open source and some are commercial:
Open source:
Elasticsearch - built on Lucene
Constellio - built on Lucene
Sphinx - written in C++
Solr - built on top of Lucene
You can have a look at Microsoft Search Server Express 2010: http://www.microsoft.com/enterprisesearch/searchserverexpress/en/us/technical-resources.aspx
http://blog.webdistortion.com/2011/05/29/open-source-search-engines/

Crawler/parser for Xapian

I would like to implement a search engine which should crawl a set of web sites, extract specific information from the pages and create full-text index of that specific information.
It seems to me that Xapian could be a good choice for the search engine library.
What are the options for a crawler/parser to integrate with Xapian?
Would Solr be a better choice than Xapian to integrate with open source crawlers/parsers?
Here's a little comparison between Xapian and Solr.
But if you want to build a crawler, take a look at Nutch. It's extensible with plugins, so you could write a plugin that analyzes the information that you're looking for.
Flax may provide some of what you're looking for.
