Support of different languages in vespa

Support of different languages in vespa - vespa

Is there any feature to get same document in different languages?
Here is my use case : If I am in USA then I should get data in english language and if I am in China I should get data in chinese language.
I don't want to feed different documents for different languages.

So if you got N translations of the very same document and you want to index each translation the simplest approach is to index each translation in a separate vespa document. Each language requires different tokenization/language handling (see https://docs.vespa.ai/documentation/linguistics.html). You could do this per field but becomes complex to manage.
Your question does not really tell if you just want to store the data or search it but if you don't really index the data but only want to display the summary you could store the different translations in the same document e.g map where key is language and value is the actual contents.

Related

Solr multilingual search

I'm currently working on a project where we have indexed text content in SOLR. Every content is writen in one specific language (we have 4 differents
european languages) but we would like to add a feature that if the primary search (search text entered by the user) doesn't return much result then we try too look for document in other languages. Thus we would somehow need to translate the query.
Our base is that we can have a mapping list of translated words commonly used in the field of the project.
One solution that came to me was to use synonym search feature. But this might not provide the best results.
Does people have pointers on existing modules that could help us achieving this multilingual search feature? Or conception ideas we cold try to investigate?
Thanks

It seems like multi-lingual search is not a unique problem.
Please take a look
http://lucene.472066.n3.nabble.com/Multilingual-Search-td484201.html
and
Solr index and search multilingual data
those two links suggest to have dedicated fields for each language, but you can also have a field that states language, and you can add filter query (&fq=) for the language you have detected (from user query). This is more scalable solution, I think.

One option would be for you to translate your terms at index time, this could probably be done at Solr level or even before Solr at the application level, and then store the translated texts in different fields so you would have fields like:
text_en: "Hello",
text_fi: "Hei"
Then you can just query text_en:Hello and it would match.
And if you want to score primary language matches higher, you could have a primary_language field and then boost documents where it matches the search language higher.

solr internationalization for performance

Currently we use Apache Solr to index English language data. There are over 60 million documents that we index. In addition to English, we will are in process to index data in 20 additional languages. The main requirement here being search across all languages and not just one language. The search field name should remain the same.
We have come up with 2 main designs -
Option 1 : Index language data into it's own collection. for e.g. collection_1_en, collection_1_de . Then search across collections . Here we have control over the analyzers used.
Option 2 : Use a single collection, Declare an new field in schema.xml say name_en, name_de etc and then use copyfield to copy the value OR. programmatically determine the language(using language code) and add it to appropriate field
Which one would be the best approach w.r.t Performance? or is there a better approach to handle this scenario.
EDIT : Here data is not translated. i.e field_en data is not a translation of field_de e.g. names of people, company etc.

Need advice on multilingual data storage

This is more of a question for experienced people who've worked a lot with multilingual websites and e-shops. This is NOT a database structure question or anything like that. This is a question on how to store a multilingual website: NOT how to store translations. A multilingual website can not only be translated into multiple languages, but also can have language-specific content. For instance an english version of the website can have a completely different structure than the same website in russian or any other language. I've thought up of 2 storage schemas for such cases:
// NUMBER ONE
table contents // to store some HYPOTHETICAL content
id // content id
table contents_loc // to translate the content
content, // ID of content to translate
lang, // language to translate to
value, // translated content
online // availability flag, VERY IMPORTANT
ADVANTAGES:
- Content can be stored in multiple languages. This schema is pretty common, except maybe for the "online" flag in the "_loc" tables. About that below.
- Every content can not only be translated into multiple languages, but also you could mark online=false for a single language and disable the content from appearing in that language. Alternatively, that record could be removed from "_loc" table to achieve the same functionality as online=false, but this time it would be permanent and couldn't be easily undone. For instance we could create some sort of a menu, but we don't want one or more items to appear in english - so we use online=false on those "translations".
DISADVANTAGES:
- Quickly gets pretty ugly with more complex table relations.
- More difficult queries.
// NUMBER 2
table contents // to store some HYPOTHETICAL content
id, // content id
online // content availability (not the same as in first example)
lang, // language of the content
value, // translated content
ADVANTAGES:
1. Less painful to implement
2. Shorter queries
DISADVANTAGES:
2. Every multilingual record would now have 3 different IDs. It would be bad for eg. products in an e-shop, since the first version would allow us to store different languages under the same ID and this one would require 3 separate records to represent the same product.
First storage option would seem like a great solution, since you could easily use it instead of the second one as well, but you couldn't easily do it the other way around.
The only problem is ... the first structure seems a bit like an overkill (except in cases like product storage)
So my question to you is:
Is it logical to implement the first storage option? In your experience, would anyone ever need such a solution?

The question we ask ourselves is always:
Is the content the same for multiple languages and do they need a relation?
Translatable models
If the answer is yes you need a translatable model. So a model with multiple versions of the same record. So you need a language flag for each record.
PROS: It gives you a structure in which you can see for example which content has not yet been translated.
Separate records per language
But many times we see a different solution as the better one: Just seperate both languages totally. We mostly see this in CMS solutions. The story is not only translated but also different. For example in country 1 they have a different menu structure, other news items, other products and other pages.
PROS: Total flexibility and no unexpected records from other languages.
Example
We see it like writing a magazine: You can write one, then translate to another language. Yes that's possible but in real world we see more and more that the content is structurally different. People don't like to be surprised so you need lots of steps to make sure content is not visible in wrong languages, pages don't get created in duplicate etc.
Sharing logic
So what we do is most time: Share the views, make the buttons, inputs etc. translatable but keep the content seperated. So that every admin can just work in his area. If we need to confirm that some records are available in all languages we can always trick that by creating a link (nicely relational) between them but it is not the standard we use most of the time.
Really translatable records like products
Because we are flexible in creating models etc. we can just use decide how to work with them based on the requirements. I would not try to look for a general solution which works for all because there is none. You need a solution based on your data.

Assuming that you need a translatable model, as it is described by Luc, I would suggest coming up with some sort of special-character-delimited key-value pair format for the value column of the content table. Example:
#en=English Term#de=German Term
You may use UDFs (User Defined Functions in T-SQL) to set/get the appropriate term based on the specified language.
For selecting :
select id, dbo.GetContentInLang(value, #lang)
from content
For updating:
update content
set value = dbo.SetContentInLang(value, #lang, new_content)
where id = #id
The UDFs:
a. do have a performance hit but this also the case for join that you will have to do between the content and content_loc tables
and
b. are somehow difficult to implement but are reusable practically throughout your database.
You can also do the above on the application/UI layer.

Solr - Indexing products with attributes as Key / Value pair

I’m currently looking at developing a solr application to index products on our e-commerce website.
Some example fields in the schema are:
ProductID
ProductName
Description
Price
Categories (multi-value)
Attributes
Attributes are a list of key-value pairs.
For example:
Type = Rose
Position = Full Sun
Position = Shade
Colour = Red
I am going to store the fields, so that my pages can be generated from a search result.
How is it best to represent these?
I was thinking of maybe having some dynamic fields for indexing:
attribute_* for example (attribute_position)
And then “attribute” for stored value (For returning, for displaying) - storing multiple fields
The value of an “attribute” field could be (for example) Position|Full Sun - then let the client handle the displaying?
Are there any better ways of doing this?
As a footnote- I will be using Solrnet as a client for querying (probably not relevant)

First, I would not recommend storing your entire document in your search engine. The only thing you should store in Solr is those things that you wish to search on. Yes, it supports storing more, however, taking advantage of this can cause issues down the road with index size, master/slave replication time, etc. Ideally, the only thing in Solr is things you wish to search/sort on and a document ID that is unique enough to fetch document data with from another source that is optimized for storing .... documents.
However, if you decide to ignore this advice, then you can easily store your name value pairs in a single field. If your name value pairs have a limited character set, you can easily concatenate the name value pairs into a single string. Then, parse them on the way out when you are forming your web page for display. There's no need to come up with a more complex schema to support this. Multiple fields for storing these will only increase your index overhead, which buys you nothing.

Sorting and i18n in Database

I have the following data in my db:
ID Name Region
100 Sam North
101 Dam South
102 Wesson East
...
...
...
Now, the region will be different in different languages. I want the Sorting to happen right, based on the display value rather than the internal value.
Any Ideas? (And yeah, sorting in memory using Java is not an option!)

Unfortunately, if you want to do that in the database, you will have to store the internationalized versions as well in the database (or at least their order). Otherwise, how could the database engine possibly know how to sort?
You will have to create a table with three columns: the English version, the language code and the translated version (with the English version and the language code together being the primary key). Then join to this table in in you query by using the English word and a language code and then sort on the internationalized version.

The best approach to use internationalization is to remove the internationalization from your application and keep it in separated i18n databases. In your application you keep a key that can be used to access those separated databases, normally xml or yml.
As rule of thumb i would suggest:
keep all database in one format, one place
extract internationalization strings from your application
lets your application to pull i18n strings from your i18n from your internationalization database.
You can check the RAILS approach to i18n. It's simply, clean and easy to use.

What do you mean "Display Value"? I take it you are somehow converting that into a different language in java itself, and don't store that value (the localised one, I assume) in the db? If so, you're kind of screwed. You need to get that data into the DB to be able to sort with it there.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight