Currently we use Apache Solr to index English-language data; we index over 60 million documents. In addition to English, we are in the process of indexing data in 20 additional languages. The main requirement is to search across all languages, not just one, and the search field name should remain the same.
We have come up with 2 main designs -
Option 1: Index each language's data into its own collection, e.g. collection_1_en, collection_1_de, and then search across collections. Here we have control over the analyzers used.
Option 2: Use a single collection. Declare a new field in schema.xml per language, say name_en, name_de, etc., and then either use copyField to copy the value, or programmatically determine the language (using the language code) and add the value to the appropriate field.
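For illustration, a rough sketch of what Option 2 could look like in schema.xml (the field and type names below are only placeholders, not our actual schema):

<field name="name_en" type="text_en" indexed="true" stored="true"/>
<field name="name_de" type="text_de" indexed="true" stored="true"/>
<copyField source="name" dest="name_en"/>
<copyField source="name" dest="name_de"/>

The alternative within Option 2 is to skip copyField and have the indexing code put the value directly into the matching name_<lang> field based on the detected language code.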
Which one would be the best approach with respect to performance? Or is there a better approach to handle this scenario?
EDIT: The data here is not translated, i.e. field_en data is not a translation of field_de, e.g. names of people, companies, etc.
Related
Is there any feature to get same document in different languages?
Here is my use case: if I am in the USA, I should get data in English, and if I am in China, I should get data in Chinese.
I don't want to feed different documents for different languages.
If you have N translations of the very same document and you want to index each translation, the simplest approach is to index each translation as a separate Vespa document. Each language requires different tokenization/language handling (see https://docs.vespa.ai/documentation/linguistics.html). You could do this per field, but it becomes complex to manage.
Your question does not really say whether you just want to store the data or search it, but if you don't need to index the data and only want to display it in the summary, you could store the different translations in the same document, e.g. a map where the key is the language and the value is the actual content.
I'm currently working on a project where we have indexed text content in Solr. Each piece of content is written in one specific language (we have 4 different European languages), but we would like to add a feature where, if the primary search (the text entered by the user) doesn't return many results, we also look for documents in other languages. Thus we would somehow need to translate the query.
Our starting point is that we can have a mapping list of translated words commonly used in the project's domain.
One solution that came to mind was to use the synonym search feature, but this might not provide the best results.
Does anyone have pointers to existing modules that could help us achieve this multilingual search feature? Or design ideas we could try to investigate?
Thanks
It seems like multi-lingual search is not a unique problem.
Please take a look
http://lucene.472066.n3.nabble.com/Multilingual-Search-td484201.html
and
Solr index and search multilingual data
Those two links suggest having dedicated fields for each language, but you can also have a field that states the language and add a filter query (&fq=) for the language you have detected (from the user query). This is a more scalable solution, I think.
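For example, assuming each document carries a language field (the field name and values here are just illustrative), the detected language can be applied as a filter query:

q=name:smith&fq=language:en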
One option would be to translate your terms at index time. This could probably be done at the Solr level, or even before Solr at the application level, and then you store the translated texts in different fields, so you would have fields like:
text_en: "Hello",
text_fi: "Hei"
Then you can just query text_en:Hello and it would match.
And if you want to score primary-language matches higher, you could have a primary_language field and then boost documents where it matches the search language.
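A sketch of such a query, assuming the edismax parser and the primary_language field described above (the boost factor is arbitrary):

q=text_en:Hello&defType=edismax&bq=primary_language:en^2

The bq parameter adds a boost on top of the normal relevance score rather than filtering anything out, so documents in other languages still match, just lower in the ranking.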
I have a website with multiple languages and synonyms. Synonyms are defined in a txt file like "xxx, yyy, zzz".
Now in one language xxx and yyy mean the same thing but in another language they mean totally different things. So in the other language I get a mix of results.
How to tell solr that this "xxx, yyy, zzz" relationship exists only for products with language value of "1" and "xxx, www, qqq" relationship exists for products with value "2"?
This could of course be done if I put the products on different servers, but maybe there are alternative methods?
At the moment we use Solr 3.5, but we want to upgrade in the future anyway, so if it can't be done on 3.5, can it be done in later versions?
You could have one field (or set of fields) per language (product_name_en, product_name_fr, product_name_es, ...), define one field type per language, and define a specific SynonymFilterFactory, with a different synonym file, for each field type. The query generation then takes the language into account to choose which fields to query against.
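A sketch of what that could look like in schema.xml (the type, field, and file names are illustrative, and the exact analyzer chain is up to you):

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_en.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_fr.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
<field name="product_name_en" type="text_en" indexed="true" stored="true"/>
<field name="product_name_fr" type="text_fr" indexed="true" stored="true"/>

The application then queries product_name_en or product_name_fr depending on the language of the product or the user.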
I have documents in Solr which consist of fields whose values come from different source systems. The reason I am doing this is that this document is what I want returned from the Solr search, including functionality like hit highlighting. As far as I know, if I use a join across multiple Solr documents, there is no way to get what matched in the related documents. My document has fields like:
id => unique entity id
type => entity type
name => entity name
field_1_s => dynamic field from system A
field_2_s => dynamic field from system B
...
Now, my problem comes when data is updated in one of the source systems. I need to update or remove only the fields that correspond to that source system and keep the other fields untouched. My thought is to encode the dynamic field name so that the first part of the field name is an 8-character hash representing the source system. This way the systems can share common field names apart from the unique source hash, and I can easily clear out all fields that start with the source prefix if needed.
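As a sketch of that naming scheme (the hash prefixes below are made up), a single catch-all dynamicField in schema.xml would cover all of the prefixed names:

<dynamicField name="*_s" type="string" indexed="true" stored="true"/>

a1b2c3d4_field_1_s => dynamic field from system A
f9e8d7c6_field_2_s => dynamic field from system B

Clearing out a source system's data is then a matter of finding the stored fields whose names start with its hash and removing them.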
Does this sound like something I should be doing, or is there some other way that others have attempted?
In our experience, the easiest and least error-prone way of implementing something like this is to have a straightforward way to build the resulting document, and then reindex the complete document with data from both subsystems retrieved at reindexing time. Tracking field names and field removal tends to involve a lot of business rules that live outside of where you'd normally work with them.
By focusing on making the task of indexing a specific document easy and performant, you'll make the system more flexible regarding other issues in the future as well (retrieving all documents with a certain value from Solr, then triggering a reindex for those documents from a utility script, etc.).
That way you'll also have the same indexing flow for your application and primary indexing code, so that you don't have to maintain several sets of indexing code to do different stuff.
If the systems you're querying aren't able to keep up when retrieving the number of documents you need, you can add a local cache (in SQL, memcached, or something similar) to speed up the process, but that code can be kept specific to the indexing process. Usually the subsystems will be performant enough (at least when doing batch retrieval based on the documents being updated).
I googled and searched for the title; there were a lot of results on how to create a query for hierarchical/nested fields, but no clear answer as to how they would be defined in schema.xml.
Let me be very specific. Say I have JSON records of the following format (very simplified version):
Office
    city string
    zipcode string
Home
    city string
    zipcode string
City string
If I just want to index/store home.city then how would I define that in the "field" in schema.xml?
The schema has to be the union of all the fields as one collection has only one real definition which includes everything.
So: city, zipcode, and probably type to differentiate. Plus whatever Solr requires for parent/child relationship management (id, _root_, _version_).
If the fields are different, then you need to make sure that the fields that only happen in one type and not another are optional.
That's assuming you are indexing child records as separate documents. If you want to merge them all into one parent document, then you need to do some folding of the content on the client. Elasticsearch gives you a slightly better interface for that, though under the covers the issues of a single real definition are still the same (they come from Lucene, which both use).
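As a sketch, the union schema for the example above might look roughly like this in schema.xml (using the older field syntax the question refers to, and assuming string and long fieldTypes are already defined):

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="type" type="string" indexed="true" stored="true"/>
<field name="city" type="string" indexed="true" stored="true"/>
<field name="zipcode" type="string" indexed="true" stored="true"/>

The parent document fills in its own fields (e.g. City and a type value), while the Office and Home child documents fill in city and zipcode, leaving the optional fields empty.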
Solr does not support nested fields. If you are looking for a search engine with the above feature, you can try out Elasticsearch. Elasticsearch also has Lucene at its core, and it offers a lot more than Solr as far as scalability, full-text search features, auto-sharding, and easy import/export of data are concerned.