I have a website with multiple languages and synonyms. Synonyms are defined in a txt file like "xxx, yyy, zzz".
Now in one language xxx and yyy mean the same thing but in another language they mean totally different things. So in the other language I get a mix of results.
How do I tell Solr that the "xxx, yyy, zzz" relationship exists only for products with a language value of "1", and that the "xxx, www, qqq" relationship exists only for products with a language value of "2"?
This could of course be done by putting the products on different servers, but maybe there are alternative methods?
At the moment we use Solr 3.5, but we want to upgrade in the future anyway, so if it can't be done in 3.5, can it be done in later versions?
You could have one field (or set of fields) per language (product_name_en, product_name_fr, product_name_es...), define one field type per language, and define one specific SynonymFilterFactory, with a different synonym file, for each field type. Then the query generation takes the language into account to choose which fields to query against.
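A minimal schema.xml sketch of this setup (the type names, synonym file names like synonyms_en.txt, and the exact analyzer chain are illustrative assumptions):

```xml
<!-- schema.xml: one field type per language, each with its own synonym file -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_en.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_fr.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

<field name="product_name_en" type="text_en" indexed="true" stored="true"/>
<field name="product_name_fr" type="text_fr" indexed="true" stored="true"/>
```

A query for a French product would then target product_name_fr, whose analyzer applies only the French synonym file.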
Related
I'm currently working on a project where we have indexed text content in Solr. Every piece of content is written in one specific language (we have four different European languages), but we would like to add a feature whereby, if the primary search (the text entered by the user) doesn't return many results, we try to look for documents in the other languages. Thus we would somehow need to translate the query.
Our starting point is that we can build a mapping list of translated words commonly used in the field of the project.
One solution that came to mind was to use the synonym search feature, but this might not provide the best results.
Do people have pointers to existing modules that could help us achieve this multilingual search feature? Or design ideas we could try to investigate?
Thanks
It seems like multi-lingual search is not a unique problem.
Please take a look at
http://lucene.472066.n3.nabble.com/Multilingual-Search-td484201.html
and "Solr index and search multilingual data".
Those two links suggest having dedicated fields for each language, but you can also have a field that states the language and add a filter query (&fq=) for the language you have detected from the user query. This is a more scalable solution, I think.
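A sketch of what that could look like, assuming a string field named language that holds a detected language code:

```xml
<!-- schema.xml: a plain string field holding the document's language -->
<field name="language" type="string" indexed="true" stored="true"/>

<!-- At query time, restrict results to the detected language, e.g.
     /select?q=text:hello&fq=language:en -->
```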
One option would be to translate your terms at index time. This could probably be done at the Solr level, or even before Solr at the application level, and then you would store the translated texts in different fields, so you would have fields like:
text_en: "Hello",
text_fi: "Hei"
Then you can just query text_en:Hello and it would match.
And if you want to score primary-language matches higher, you could have a primary_language field and then boost documents where it matches the search language.
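A sketch of that boosting idea, assuming the edismax query parser and illustrative field names; the bq parameter boosts documents whose primary_language matches the detected query language:

```xml
<!-- schema.xml: one stored field per translation, plus the language marker -->
<field name="text_en" type="text_general" indexed="true" stored="true"/>
<field name="text_fi" type="text_general" indexed="true" stored="true"/>
<field name="primary_language" type="string" indexed="true" stored="true"/>

<!-- At query time, for an English query:
     /select?defType=edismax&q=text_en:Hello&bq=primary_language:en^2.0 -->
```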
Currently we use Apache Solr to index English-language data. There are over 60 million documents that we index. In addition to English, we are in the process of indexing data in 20 additional languages. The main requirement is to search across all languages, not just one. The search field name should remain the same.
We have come up with two main designs:
Option 1: Index each language's data into its own collection, e.g. collection_1_en, collection_1_de, and then search across collections. Here we have control over the analyzers used.
Option 2: Use a single collection, declare new fields in schema.xml, say name_en, name_de, etc., and then use copyField to copy the value, or programmatically determine the language (using the language code) and add it to the appropriate field.
Which one would be the best approach with respect to performance? Or is there a better approach to handle this scenario?
EDIT: The data here is not translated, i.e. the field_en data is not a translation of the field_de data (e.g. names of people, companies, etc.).
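For Option 2, one possible schema.xml sketch (field names are illustrative) uses a dynamic field per language plus a copyField into a single catch-all field, so the query-side field name stays the same; note the shared field can only use one generic analyzer:

```xml
<!-- One indexed field per language, matched by a dynamic field rule -->
<dynamicField name="name_*" type="text_general" indexed="true" stored="true"/>

<!-- One catch-all field searched under a single, fixed name -->
<field name="name_all" type="text_general" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="name_*" dest="name_all"/>
```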
I googled and searched for the title; there were a lot of results on how to create a QUERY for hierarchical/nested fields, but no clear answer as to how they would be defined in schema.xml.
Let me be very specific. Say I have JSON records of the following format (very simplified version):
Office
    city    string
    zipcode string
Home
    city    string
    zipcode string
City string
If I just want to index/store home.city, then how would I define that "field" in schema.xml?
The schema has to be the union of all the fields as one collection has only one real definition which includes everything.
So: city, zipcode, and probably type to differentiate. Plus whatever Solr requires for parent/child relationship management (id, _root_, _version_).
If the fields are different, then you need to make sure that the fields that only happen in one type and not another are optional.
That's assuming you are indexing child-records as separate documents. If you want to merge them all in one parent document, then you need to do some folding of the content on the client. ElasticSearch gives you a slightly better interface for that, though - under the covers - the issues of a single real definition are still the same (they come from Lucene, which both use).
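A minimal schema.xml sketch of that union, assuming Office and Home rows are indexed as separate child documents with a differentiator (the type field name is an assumption):

```xml
<!-- Fields shared by Office and Home child documents -->
<field name="id"      type="string" indexed="true" stored="true" required="true"/>
<field name="type"    type="string" indexed="true" stored="true"/>
<field name="city"    type="string" indexed="true" stored="true"/>
<field name="zipcode" type="string" indexed="true" stored="true"/>

<!-- Needed by Solr for parent/child (block join) bookkeeping -->
<field name="_root_"    type="string" indexed="true" stored="false"/>
<field name="_version_" type="long"   indexed="true" stored="true"/>
```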
Solr does not support nested fields. If you are looking for a search engine with that feature, you can try out Elasticsearch. Elasticsearch also has Lucene at its core, and it offers a lot more than Solr as far as scalability, full-text search features, auto-sharding, and easy import/export of data are concerned.
ElasticSearch has Mapping Types which, according to the docs:
"Mapping types are a way to divide the documents in an index into logical groups. Think of it as tables in a database."
Is there an equivalent in Solr for this?
I have seen that some people include a new field in the documents and later use this new field to limit the search to a certain type of document, but as I understand it, they have to share the schema, and (I believe) ElasticSearch Mapping Types don't. So, is there an equivalent?
Or, maybe a better question,
If I have multiple document types and I want to limit searches to a certain document type, which one offers the better solution?
I hope this question makes sense, since I'm new to both of them.
Thanks!
You can configure multicore solr:
http://wiki.apache.org/solr/CoreAdmin
Maybe something has changed since Solr 4.0 and it's easier now; I haven't looked at it since I switched to Elasticsearch. Personally, I find the Elasticsearch indexes/types system much better than that.
In Solr 4+.
If you are planning to do faceting or any other calculations across multiple types, then create a single schema with a differentiator field. Then, in your business/mapping/client layer, define only the fields you actually want to look at. Use custom search handlers with the 'fl' parameter to return only the fields relevant to that object. Of course, that means all those single-type-only fields cannot be compulsory.
If your document types are completely disjoint, you can create a core/collection per type, each with its own definition file. You have full separation, but still have only one Solr server to maintain.
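The custom-handler idea could be sketched in solrconfig.xml like this (the handler name, the differentiator field doc_type, and the field list are all illustrative):

```xml
<!-- One handler per document type: returns only that type's fields
     and always appends a filter on the differentiator field -->
<requestHandler name="/selectBooks" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="fl">id,title,author</str>
  </lst>
  <lst name="appends">
    <str name="fq">doc_type:book</str>
  </lst>
</requestHandler>
```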
I have seen that some people include a new field in the documents and later on they use this new field to limit the search to a certain type of documents, but as I understand it, they have to share the schema and (I believe) ElasticSearch Mapping Type doesn't.
You can do exactly this in Solr: add a field and use it to filter.
It is correct that Mapping Types in ElasticSearch do not have to share the same schema, but under the hood ElasticSearch uses only ONE schema for all Mapping Types, so technically it makes no difference. In fact, the Mapping Type is mapped to an internal schema field.
Does the Solr spell checker give suggestions for other languages?
The solr spellchecker is based solely on what you have indexed, not based on some dictionary of "correct" words.
So yes, it supports whatever language you index your stuff in.
Solr's best practice for handling multiple languages per index is to have a separate set of fields per language. So you'd have fields named text_en, title_en, etc. for English and text_de, title_de, etc. for German. A different instance of the spellchecker must be used for each field. (Usually, the *_en fields will be combined into one field, say textSpell_en, using a copyField directive.) Now the question is: does Solr allow multiple instances of the spellcheck component? I think it does, but I don't know for sure. Has anyone done this?
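For what it's worth, the SpellCheckComponent can declare several dictionaries side by side, one per language field; a solrconfig.xml sketch (field and dictionary names are assumptions):

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">spell_en</str>
    <str name="field">textSpell_en</str>
    <str name="spellcheckIndexDir">./spellchecker_en</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">spell_de</str>
    <str name="field">textSpell_de</str>
    <str name="spellcheckIndexDir">./spellchecker_de</str>
  </lst>
</searchComponent>

<!-- At query time, select the dictionary for the query's language, e.g.
     /select?q=...&spellcheck=true&spellcheck.dictionary=spell_de -->
```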