Does the Solr spell checker give suggestions for other languages?
The Solr spellchecker is based solely on what you have indexed, not on some dictionary of "correct" words.
So yes, it supports whatever language you index your content in.
Solr's best practice for handling multiple languages in one index is to have a separate set of fields per language. So you'd have fields named text_en, title_en, etc. for English and text_de, title_de, etc. for German. A different spellchecker component instance must be used per field. (Usually, the *_en fields will be combined into one field, say textSpell_en, using a copyField directive.) Now the question is: does Solr allow multiple instances of the spellcheck component? I think it does, but I don't know for sure. Has anyone done this?
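As a sketch, declaring one spellcheck component per language in solrconfig.xml might look like this (the component names, field names, and index directories here are hypothetical):

    <searchComponent name="spellcheck_en" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="name">default</str>
        <!-- spell-check against the English copyField target -->
        <str name="field">textSpell_en</str>
        <str name="spellcheckIndexDir">./spellchecker_en</str>
      </lst>
    </searchComponent>

    <searchComponent name="spellcheck_de" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="name">default</str>
        <!-- spell-check against the German copyField target -->
        <str name="field">textSpell_de</str>
        <str name="spellcheckIndexDir">./spellchecker_de</str>
      </lst>
    </searchComponent>

Each language's request handler would then reference the matching component in its last-components section.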
I am developing an application that supports indexing and searching of multi-language texts, including Hebrew, using the Solr engine.
After a lot of searching, I found that HebMorph is the best plugin to use for the Hebrew language.
My problem is that the behavior of HebMorph with Hebrew stopwords seems to be different from Solr's:
With Solr (any language): when I search for a stopword, the returned results don't include any of the stopwords existing in the query.
Whereas when I search for Hebrew terms (after plugging HebMorph into Solr following this link), the returned results include all stopwords existing in the query.
1) Is this the normal behavior for HebMorph? If yes, how can I alter it? If no, what should I change?
2) Since HebMorph doesn't support synonyms (their documentation lists it as future work), is there a way to use synonyms for Hebrew the way Solr supports them for other languages (i.e., by adding the proper filter in the schema and pointing it at the synonyms file)?
Thanks in advance for your help.
I'm the author of HebMorph.
Stopwords are indeed supported, but you need to filter them out before the lemmatizer kicks in. Assuming a recent version of HebMorph, your stopwords filter needs to come right after the tokenizer, which means it also needs to handle the בחל"מ prefix letters attached to the stopwords.
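A minimal sketch of that ordering in a schema field type (the HebMorph factory class names below are placeholders; check the HebMorph documentation for the exact ones):

    <fieldType name="text_he" class="solr.TextField">
      <analyzer>
        <!-- HebMorph's Hebrew tokenizer (placeholder class name) -->
        <tokenizer class="com.code972.hebmorph.HebrewTokenizerFactory"/>
        <!-- stopword removal must run BEFORE lemmatization -->
        <filter class="solr.StopFilterFactory" words="stopwords_he.txt" ignoreCase="true"/>
        <!-- HebMorph's lemmatizing filter (placeholder class name) -->
        <filter class="com.code972.hebmorph.HebrewLemmatizerFilterFactory"/>
      </analyzer>
    </fieldType>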
The general advice nowadays, for all languages, is NOT to drop stopwords, at least not at indexing time, so I'd recommend not applying a stopwords filter here either.
With regard to synonyms: the root issue is that the HebMorph lemmatizer sometimes expands a word into multiple lemmas, which makes applying synonyms a bit more challenging. With the (relatively) new graph-based analyzers this is now possible, so we will likely implement it, and Lucene's synonym filters will then be supported out of the box.
In the commercial version there is already a way to customize word lists and override dictionary definitions, which is useful in an ambiguous language like Hebrew. Many use this as their way of creating synonyms.
I'm currently working on a project where we have indexed text content in Solr. Each piece of content is written in one specific language (we have four different European languages), but we would like to add a feature: if the primary search (the text entered by the user) doesn't return many results, we try to look for documents in other languages. We would therefore somehow need to translate the query.
Our starting point is that we can have a mapping list of translated words commonly used in the field of the project.
One solution that came to mind was to use the synonym search feature, but this might not provide the best results.
Does anyone have pointers to existing modules that could help us achieve this multilingual search feature? Or design ideas we could try to investigate?
Thanks
It seems like multilingual search is not a unique problem.
Please take a look at
http://lucene.472066.n3.nabble.com/Multilingual-Search-td484201.html
and
Solr index and search multilingual data
Those two links suggest having dedicated fields for each language, but you can also have a field that stores the language and add a filter query (&fq=) for the language you have detected from the user query. This is a more scalable solution, I think.
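For example, assuming a language field holding a detected language code (both the field name and values are assumptions), the filtered request could look like:

    q=fahrrad&fq=language:de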
One option would be to translate your terms at index time. This could probably be done at the Solr level, or even before Solr at the application level, and then the translated texts would be stored in different fields, so you would have fields like:
text_en: "Hello",
text_fi: "Hei"
Then you can just query text_en:Hello and it would match.
And if you want to score primary-language matches higher, you could have a primary_language field and boost documents where it matches the search language.
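As a sketch, assuming an edismax handler and a primary_language field (both names are assumptions), a query boosting the detected language might look like:

    q=Hello&defType=edismax&qf=text_en text_fi&bq=primary_language:en^2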
I have a set of keywords, defined by client requirements, stored in a Solr field. I also have a never-ending stream of sentences entering the system.
By using each sentence as the query against the keywords, I am able to find the sentences that match the keywords. This is working well and I am pleased. What I have essentially done is reverse the way Solr is normally used, by storing the query in Solr and passing the text in as the query.
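For readers unfamiliar with this reversed setup, a minimal sketch (the field names and values are invented): index one document per keyword set, then send each incoming sentence as the query text:

    Stored "query" document:  { "id": "kw-1", "keywords": "solar energy" }
    Incoming sentence as the query:
        q=keywords:(Solar energy prices dropped again today)

With the default OR operator, kw-1 is returned because the sentence contains terms from its keywords field.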
Now I would like to extend the idea from having just a keyword in a field to having a more fully formed Solr query in a field. Doing so would allow proximity searching, etc. But of course, this is where life becomes awkward: placing Solr query operators into a field will not work, as they need to be escaped.
Does anyone know if it might be possible to use the Solr "query" function, or perhaps write a Java class that would enable such functionality? Or is the idea blowing just a bit too much against the Solr winds?
Thanks in advance.
ES has percolate for this. For Solr, you'll usually index the document as a single document in a memory-based core/index and then run the queries against that (which is what ES at least used to do internally, IIRC).
I would check out the percolate API in Elasticsearch. It would surely be easier to use this API than to write your own equivalent in Solr.
I have a website with multiple languages and synonyms. Synonyms are defined in a txt file like "xxx, yyy, zzz".
Now in one language xxx and yyy mean the same thing but in another language they mean totally different things. So in the other language I get a mix of results.
How to tell solr that this "xxx, yyy, zzz" relationship exists only for products with language value of "1" and "xxx, www, qqq" relationship exists for products with value "2"?
This could of course be done if I put the products on different servers. But maybe there are alternative methods?
At the moment we use Solr 3.5, but we want to change that in the future anyway, so if it can't be done in 3.5, can it be done in later versions?
You could have one field (or set of fields) per language (product_name_en, product_name_fr, product_name_es, ...), define one field type per language, and configure a specific SynonymFilterFactory, with a different synonym file, for each field type. The query generation then takes the language into account to choose which fields to query against.
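A sketch of that setup in schema.xml (the type names and synonym file names are invented):

    <fieldType name="text_synonyms_lang1" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- synonym file for language 1, e.g. "xxx, yyy, zzz" -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_lang1.txt"
                ignoreCase="true" expand="true"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_synonyms_lang2" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- synonym file for language 2, e.g. "xxx, www, qqq" -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_lang2.txt"
                ignoreCase="true" expand="true"/>
      </analyzer>
    </fieldType>

    <field name="product_name_lang1" type="text_synonyms_lang1" indexed="true" stored="true"/>
    <field name="product_name_lang2" type="text_synonyms_lang2" indexed="true" stored="true"/>

SynonymFilterFactory is available in Solr 3.5, so this approach should work without upgrading.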
ElasticSearch has Mapping Types which, according to the docs:
Mapping types are a way to divide the documents in an index into logical groups. Think of it as tables in a database.
Is there an equivalent in Solr for this?
I have seen that some people include a new field in the documents and later use this field to limit the search to a certain type of document. But as I understand it, those documents have to share the schema, whereas (I believe) ElasticSearch Mapping Types don't. So, is there an equivalent?
Or, maybe a better question,
If I have multiple document types and I want to limit searches to a certain document type, which one offers a better solution?
I hope this question makes sense, since I'm new to both of them.
Thanks!
You can configure multicore Solr:
http://wiki.apache.org/solr/CoreAdmin
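For the Solr 4.x-era solr.xml format, a two-core setup looked roughly like this (the core names are examples):

    <solr persistent="true">
      <cores adminPath="/admin/cores">
        <core name="documents_type_a" instanceDir="documents_type_a"/>
        <core name="documents_type_b" instanceDir="documents_type_b"/>
      </cores>
    </solr>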
Maybe something has changed since Solr 4.0 and it's easier now; I didn't look into it since I switched to Elasticsearch. Personally, I find Elasticsearch's indexes/types system much better than that.
In Solr 4+:
If you are planning to do faceting or any other calculations across multiple types, then create a single schema with a differentiator field. Then, on your business/mapping/client layer, define only the fields you actually want to look at. Use custom search handlers with the 'fl' parameter to return only the fields relevant to that object (see the sketch below). Of course, that means all those single-type-only fields cannot be compulsory.
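A sketch of such a handler in solrconfig.xml (the handler name, field names, and type value are invented):

    <requestHandler name="/select_books" class="solr.SearchHandler">
      <lst name="appends">
        <!-- always restrict this handler to one document type -->
        <str name="fq">doc_type:book</str>
      </lst>
      <lst name="defaults">
        <!-- return only the fields relevant to books -->
        <str name="fl">id,title,author</str>
      </lst>
    </requestHandler>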
If your document types are completely disjoint, you can create a core/collection per type, each with its own definition file. You have full separation, but still have only one Solr server to maintain.
I have seen that some people include a new field in the documents and later on they use this new field to limit the search to a certain type of documents, but as I understand it, they have to share the schema and (I believe) ElasticSearch Mapping Type doesn't.
You can do exactly this in Solr. Add a field and use it to filter.
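For instance, with a string field acting as the differentiator (the field name and value are assumptions):

    <field name="doc_type" type="string" indexed="true" stored="true"/>

    q=some search terms&fq=doc_type:article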
It is correct that Mapping Types in ElasticSearch do not have to share the same schema, but under the hood ElasticSearch uses only ONE schema for all Mapping Types. So technically it makes no difference. In fact, the Mapping Type is mapped to an internal schema field.