HebMorph with solr: how to use stopwords - solr

I am developing an application that supports indexing and searching of multi-language texts, including Hebrew, using the Solr engine.
After a lot of searching, I found that HebMorph is the best plugin to use for the Hebrew language.
My problem is that the behavior of HebMorph with Hebrew stopwords seems to be different from Solr's:
With Solr (any language): when I search for a stopword, the returned results don't include any of the stopwords existing in the query.
Whereas when I search for Hebrew terms (after plugging HebMorph into Solr following this link), the returned results include all stopwords existing in the query.
1) Is this the normal behavior for HebMorph? If yes, how can I alter it? If no, what should I change?
2) Since HebMorph doesn't support synonyms (their documentation lists this as future work), is there a way to use synonyms for Hebrew the way Solr supports them for other languages (i.e., by adding the proper filter to the schema and pointing it at the synonyms file)?
Thanks in advance for your help.

I'm the author of HebMorph.
Stopwords are indeed supported, but you need to filter them out before the lemmatizer kicks in. Assuming a recent version of HebMorph, your stop-words filter needs to come right after the tokenizer, which means it also needs to take care of בחל"מ letters attached to the stop-words.
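If you do want a stop filter, a minimal sketch of such a chain in schema.xml might look like the following. Note that the HebMorph factory class names below are placeholders, not the real class names; take the actual ones from the HebMorph distribution you're using.

```xml
<fieldType name="text_he" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- HebMorph tokenizer; actual factory class name depends on your HebMorph version -->
    <tokenizer class="com.code972.hebmorph.HebrewTokenizerFactory"/>
    <!-- Stop filter placed directly after the tokenizer, before lemmatization.
         The stopwords file would also need entries for stopwords with attached
         prefix letters, since they haven't been stripped yet at this point. -->
    <filter class="solr.StopFilterFactory" words="stopwords_he.txt" ignoreCase="true"/>
    <!-- HebMorph lemmatizer runs only after stopwords are removed (placeholder name) -->
    <filter class="com.code972.hebmorph.HebrewLemmatizerFilterFactory"/>
  </analyzer>
</fieldType>
```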
The general advice nowadays, for all languages, is NOT to drop stopwords, at least not at indexing time, so I'd recommend not applying a stop-words filter here either.
With regard to synonyms, the root issue is that the HebMorph lemmatizer sometimes expands a word to multiple lemmas, which makes applying synonyms a bit more challenging. With the (relatively) new graph-based analyzers this is now possible, so we will likely implement that too, and Lucene's synonym filters will be supported out of the box.
In the commercial version there is already a way to customize word lists and override dictionary definitions, which is useful in an ambiguous language like Hebrew. Many use this as their way of creating synonyms.
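For reference, the standard Solr synonym setup the question alludes to (for an ordinary, non-HebMorph field) is configured in schema.xml roughly like this; the field type name and synonyms file name are just illustrative:

```xml
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- expands query terms using the groups listed in synonyms.txt;
         SynonymGraphFilterFactory requires Solr 6.4+, older versions
         use solr.SynonymFilterFactory instead -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```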

Related

Singular/plural keyword search not working

I am facing a problem with singular and plural keyword search.
For example, if I search for men, it should return "men" and also "man". However, it is not working.
The easiest way is to use a SynonymFilter with the terms you're aware of - the hard part is thinking of every alternative.
While you usually use stemming to reduce words to a common stem, this problem is known as lemmatization, where you're interested in the different forms of a word rather than the common stem.
For Solr, your best bet is probably something like the Solr Lemmatizer by Nicholas Ding.
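A minimal sketch of the SynonymFilter approach: list the irregular pairs you know about in a synonyms file referenced from the field type's query analyzer.

```
# synonyms.txt - each line is a group of terms treated as equivalent
man,men
woman,women
child,children
foot,feet
```

With solr.SynonymGraphFilterFactory (or the older SynonymFilterFactory) pointed at this file at query time, a search for "men" would also match documents containing "man".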

Solr multilingual search

I'm currently working on a project where we have indexed text content in Solr. Every piece of content is written in one specific language (we have four different European languages), but we would like to add a feature so that, if the primary search (the text entered by the user) doesn't return many results, we try to look for documents in other languages. Thus we would somehow need to translate the query.
Our starting point is that we can have a mapping list of translated words commonly used in the field of the project.
One solution that came to me was to use the synonym search feature. But this might not provide the best results.
Do people have pointers to existing modules that could help us achieve this multilingual search feature? Or design ideas we could try to investigate?
Thanks
It seems like multi-lingual search is not a unique problem.
Please take a look at
http://lucene.472066.n3.nabble.com/Multilingual-Search-td484201.html
and
Solr index and search multilingual data
Those two links suggest having dedicated fields for each language, but you can also have a field that states the language, and add a filter query (&fq=) for the language you have detected (from the user query). This is a more scalable solution, I think.
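As a sketch, assuming a `language` field that stores a detected language code for each document (the field name is an assumption), such a request would look like:

```
/select?q=text:hello&fq=language:en
```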
One option would be to translate your terms at index time. This could probably be done at the Solr level, or even before Solr at the application level. You would then store the translated texts in different fields, so you would have fields like:
text_en: "Hello",
text_fi: "Hei"
Then you can just query text_en:Hello and it would match.
And if you want to score primary-language matches higher, you could have a primary_language field and then boost documents where it matches the search language.
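A hedged sketch of that boost, assuming the edismax query parser and the primary_language field idea from above (both assumptions):

```
/select?defType=edismax&q=hello&qf=text_en text_fi&bq=primary_language:en^10
```

Here bq adds a boost query so documents whose primary_language matches the detected query language score higher, without excluding the rest.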

Solr queries stored within Solr field

I have a set of keywords defined by client requirements stored in a SOLR field. I also have a never ending stream of sentences entering the system.
By using the sentence as the query against the keywords I am able to find those sentences that match the keywords. This is working well and I am pleased. What I have essentially done is reverse the way in which SOLR is normally used by storing the query in Solr and passing the text in as the query.
Now I would like to be able to extend the idea of having just a keyword in a field to having a more fully formed SOLR query in a field. Doing so would allow proximity searching etc. But, of course, this is where life becomes awkward. Placing SOLR query operators into a field will not work as they need to be escaped.
Does anyone know if it might be possible to use the SOLR "query" function or perhaps write a java class that would enable such functionality? Or is the idea blowing just a bit too much against the SOLR winds?
Thanks in advance.
ES has percolate for this - for Solr you'll usually index the document as a single document in a memory based core / index and then run the queries against that (which is what ES at least used to do internally, IIRC).
I would check out the percolate API in Elasticsearch. It would surely be easier to use this API than to write your own equivalent in Solr.

lucene Fields vs. DocValues

I'm using and playing with Lucene to index our data and I've come across some strange behaviors concerning DocValues Fields.
So, could anyone please explain the difference between a regular document field (like StringField, TextField, IntField, etc.) and DocValues fields
(like IntDocValuesField, SortedDocValuesField - the types seem to have changed in Lucene 5.0)?
First, why can't I access DocValues using document.get(fieldname)? And if I can't, how can I access them?
Second, I've seen that in Lucene 5.0 some features are changed, for example sorting can only be done on DocValues... why is that?
Third, DocValues can be updated but regular fields cannot (you have to delete and add the whole document)...
Also, and perhaps most important, when should I use DocValues and when regular fields?
Joseph
Most of these questions are quickly answered by referring to the Solr wiki or to a web search, but to get the gist of DocValues: they're useful for all the other stuff associated with a modern search service except the actual searching. From the Solr Community Wiki:
DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
...
DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
This should also answer why Lucene 5 requires DocValues for sorting - it's a lot more efficient than the previous approach.
The reason for this is that the storage format is turned around from the standard format for these operations: where the application previously had to go through each document to find its values, it can now look up each document's value directly. This is very useful when you already have a list of documents that you need to perform an intersection on.
If I remember correctly, updating a DocValues-based field involves yanking the document out of its previous value's list and re-inserting it at the new location, compared to the previous approach, where a change would touch loads of dependencies (so reindexing the whole document was the only viable strategy).
Use DocValues for fields that need any of the properties mentioned above, such as sorting / faceting / etc.
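In Solr, this choice is just attributes on the field definition in schema.xml; a minimal sketch (the field names are illustrative):

```xml
<!-- sorted/faceted on: enable docValues; no need to store or return it -->
<field name="category" type="string" indexed="true" stored="false" docValues="true"/>
<!-- full-text searched and shown to the user: regular indexed + stored field -->
<field name="body" type="text_general" indexed="true" stored="true"/>
```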

Solr Spell Checker Language Support

Does Solr spell checker gives suggestion for other languages?
The Solr spellchecker is based solely on what you have indexed, not on some dictionary of "correct" words.
So yes, it supports whatever language you index your stuff in.
Solr's best practice for handling multiple languages per index is to have a separate set of fields per language. So you'd have fields named text_en, title_en, etc. for English and text_de, title_de, etc. for German. A different spellchecker instance must be used for each field. (Usually, the *_en fields will be combined into one field, say textSpell_en, using the copyField directive.) Now the question is: does Solr allow multiple instances of the spellcheck component? I think it does, but I don't know for sure. Has anyone done this?
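For what it's worth, a single SpellCheckComponent does accept multiple named dictionaries; a sketch in solrconfig.xml, reusing the field names from above:

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <!-- one dictionary per language field -->
  <lst name="spellchecker">
    <str name="name">en</str>
    <str name="field">textSpell_en</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">de</str>
    <str name="field">textSpell_de</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
</searchComponent>
```

A request then selects the dictionary for the detected language with spellcheck.dictionary=en (or de).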
