Tagging content system - with i18n - database

The Idea is to have a tagging system between Users and Content(images, videos, posts)
Kind of like the tagging system here on SO with questions.
I like the achievements system on SO, meaning that after a certain amount of points
a user can start making his/her own tags. Same Idea for my system
My current table design looks like
Tag UserTag User
--- ------- ----
tag_id user_id user_id
tag_name tag_id username
usage_count ....
It brings me to this question.
Q How can you have a tagging system for content in different languages.
Yet at the same time be able to search for the same content with tags in different languages.
Have auto-complete with different languages for the same tag
When i use autocomplete I search for tag names like the characters the user is typing.
E.g. I have a tag named "nightclub" in English
yet in French if they were tagging that the translation would be "discothèque"
Or is there no way of doing this, and just let people make tags in different languages.

Yes you can. But be aware that some words in one language may have several translations in others.
You may have a languages table, a tags table with only a tag_id, and a many to many table with language_id, tag_id, tag_name.
Like I said previously, you might run into problems when people want to make refinements that their own language allows, but other languages can't. To stay in the french example, talking about bread, you may have 'baguette', 'flûte', 'recuit', 'demi-recuit', etc. tags, whereas the english would merely have a 'bread' tag. The mapping between the tags in then significantly complicated. but that's a general translation problem, not only in programming realm.
Regarding your comment : a compromise would be to add a "tag_related_to_tag" table, allowing to make couplings between tags. Users could tell which tag is related to which other in a different language. This would allow the maximum flexibility with the minimum of complexity, but would need some administration (otherwise you might have evil users making very unexpected relationships between tags, breaking the usefulness of the system).
That's something I actually was thinking to implement for a website which has a very narrow field (stoic philosophy) and target public. If the field is too broad, it might be very ineffective.

Interesting question! Just some thoughts (not intended a a complete solution, it's more a set of questions):
A straightforward approach is having an internal tag ID and for each language a localized name.
If no localized name was created yet, you may need to fall back to the tag name in a 'primary' language - usually english - or the language the tag was created in.
Translation needs to be done by a user ho knows both languages, automatic translations are IMO to imprecise. So probably a user right (bound to rewards?) to rename tags.
Are all languages equal, or are tags only created in a "primary language" understood by most users, and translations added separately? (The latter looks less fair, but would probably make some things easier)
You need an ability to merge tags - e.g. when users independently created "discothèque" and "nightclub".
Do I see only the tags that are available in my language, or can I see tags available in other languages that don't have translations to my language? Can I search for tags in other languages?
Is the tag name included in a query string? Will my german query link work when I send it to a friend in the US?
How to resolve disputes regarding the tag meaning? Example: The closest translation to german is "Nachtclub" for night club, and "Diskothek" for "discothèque". But in german, a "Nachtclub" is quite different from a "Diskothek" (though there is some overlap).

Related

What language code should I use to support multiple languages?

I want to go with translation tables as described here in the third example.
This is not hard to implement but what I wonder is basically how do I want to encode those languages? I looked up the ISO 639-3 files that contain the latest languages and their codes but is it a good idea to include them all?
The Language table is supposed to provide all kinds of languages for Stores. Those stores are allowed to decide themselves which languages they want to support. However, my database is going to tell them how many languages there are.
So, is there a "common used" list of languages? I don't think that Facebook and/or Google really support 7866 languages which is the number of languages listed in ISO 639-3.
Or would I use the Language Culture Names codes like en-UK, en-US, de-AT, etc.?
If I understand your question, you're asking if you should make a reference table with 7866 rows, one for each language in ISO 639-3. I don't really see the downside as it's not going to incur much storage or performance cost, compared to just storing a subset.
The real question is which languages you want to actually translate and whether you want to support dialects. If you don't, you can just use languages like en, fr, etc, and you could just store those in the reference table if you wanted to (though the savings are minimal).
If you want dialects, then you would need en-us, en-gb etc. Since these should fall back to the non-dialect version (ie just en), it would be a good idea for your columns to distinguish between the language and the dialect. So you can store "en | us", "en | gb" (possibly "en | null" too) and it will be easy to provide translations as overrides, but with most of them falling back to a default dialect.

Need advice on multilingual data storage

This is more of a question for experienced people who've worked a lot with multilingual websites and e-shops. This is NOT a database structure question or anything like that. This is a question on how to store a multilingual website: NOT how to store translations. A multilingual website can not only be translated into multiple languages, but also can have language-specific content. For instance an english version of the website can have a completely different structure than the same website in russian or any other language. I've thought up of 2 storage schemas for such cases:
// NUMBER ONE
table contents // to store some HYPOTHETICAL content
id // content id
table contents_loc // to translate the content
content, // ID of content to translate
lang, // language to translate to
value, // translated content
online // availability flag, VERY IMPORTANT
ADVANTAGES:
- Content can be stored in multiple languages. This schema is pretty common, except maybe for the "online" flag in the "_loc" tables. About that below.
- Every content can not only be translated into multiple languages, but also you could mark online=false for a single language and disable the content from appearing in that language. Alternatively, that record could be removed from "_loc" table to achieve the same functionality as online=false, but this time it would be permanent and couldn't be easily undone. For instance we could create some sort of a menu, but we don't want one or more items to appear in english - so we use online=false on those "translations".
DISADVANTAGES:
- Quickly gets pretty ugly with more complex table relations.
- More difficult queries.
// NUMBER 2
table contents // to store some HYPOTHETICAL content
id, // content id
online // content availability (not the same as in first example)
lang, // language of the content
value, // translated content
ADVANTAGES:
1. Less painful to implement
2. Shorter queries
DISADVANTAGES:
2. Every multilingual record would now have 3 different IDs. It would be bad for eg. products in an e-shop, since the first version would allow us to store different languages under the same ID and this one would require 3 separate records to represent the same product.
First storage option would seem like a great solution, since you could easily use it instead of the second one as well, but you couldn't easily do it the other way around.
The only problem is ... the first structure seems a bit like an overkill (except in cases like product storage)
So my question to you is:
Is it logical to implement the first storage option? In your experience, would anyone ever need such a solution?
The question we ask ourselves is always:
Is the content the same for multiple languages and do they need a relation?
Translatable models
If the answer is yes you need a translatable model. So a model with multiple versions of the same record. So you need a language flag for each record.
PROS: It gives you a structure in which you can see for example which content has not yet been translated.
Separate records per language
But many times we see a different solution as the better one: Just seperate both languages totally. We mostly see this in CMS solutions. The story is not only translated but also different. For example in country 1 they have a different menu structure, other news items, other products and other pages.
PROS: Total flexibility and no unexpected records from other languages.
Example
We see it like writing a magazine: You can write one, then translate to another language. Yes that's possible but in real world we see more and more that the content is structurally different. People don't like to be surprised so you need lots of steps to make sure content is not visible in wrong languages, pages don't get created in duplicate etc.
Sharing logic
So what we do is most time: Share the views, make the buttons, inputs etc. translatable but keep the content seperated. So that every admin can just work in his area. If we need to confirm that some records are available in all languages we can always trick that by creating a link (nicely relational) between them but it is not the standard we use most of the time.
Really translatable records like products
Because we are flexible in creating models etc. we can just use decide how to work with them based on the requirements. I would not try to look for a general solution which works for all because there is none. You need a solution based on your data.
Assuming that you need a translatable model, as it is described by Luc, I would suggest coming up with some sort of special-character-delimited key-value pair format for the value column of the content table. Example:
#en=English Term#de=German Term
You may use UDFs (User Defined Functions in T-SQL) to set/get the appropriate term based on the specified language.
For selecting :
select id, dbo.GetContentInLang(value, #lang)
from content
For updating:
update content
set value = dbo.SetContentInLang(value, #lang, new_content)
where id = #id
The UDFs:
a. do have a performance hit but this also the case for join that you will have to do between the content and content_loc tables
and
b. are somehow difficult to implement but are reusable practically throughout your database.
You can also do the above on the application/UI layer.

Refactoring a database to add support for internationalization / multiple languages

Are there any proven ways of refactoring a database into supporting multiple versions of entries?
I've got a pretty straight forward database with some tables like:
article(id, title, contents, ...)
...
This obviously works like a charm if you're only going to store one version of each article. I remember asking my client really clearly whether the system should be able to store articles in different languages, really stressing that it would be expensive to add this support later on. You can probably guess what the client said back then..
My current approach will be to create a couple of new tables like:
language(id, code, name)
article_index(id, original_title) <- just to be able to group articles
And then add a foreign key into the original article table:
article(id, title, contents, article_index_id, ...)
I would love to hear your comments to this approach and your experiences on the topic.
This is an approach I've used successfully in the past. Another is to replace all text fields with an identifier (int, guid, whatever you want), and then store translations for all the text fields in a single table, keyed on this identifier plus a language id.
Personally, I have had more success with the first approach (i.e. yours), and have, for instance, found it easier to deal with via an ORM. With an NHibernate ORM on my current project, for instance, I've created what amounts to a language-aware session, that returns the correct set of translations for each object automatically. Consistency in the approach obviously helps here.

Entity names vs table names

We are going to develop a new system over a legacy database(using .NET C#).There are approximately 500 tables. We have choosen to use an ORM tool for our data access layer. The problem is with namig convention for the entities.
The tables in the database have names like TB_CUST - table with customer data, or TP_COMP_CARS - company cars. The first letter of prefix defines the modul and the second letter defines its relations to other tables.
I would like to name the entities more meaningful. Like TB_CUST just Customer or CustomerEntity. Of course there would be a comment pointing to its table name.
But the DBA and programer in one person, dont want names like this. He wants to have the entities names exactly the same to the table names. He is saying that he would have to remember two names and it would be difficult and messy. I have to say his not really familiar with the principles of OOP.
But in case of an entity name like TP_COMP_CARS there should be methods names like Get TP_COMP_CARS or SaveTP_COMP_CARS..I think this is unreadable and ugly.
So please tell me your opinion. Who is right and why.
Thank you in advance
The whole idea of ORM tools, is to avoid the need of remembering database objects.
We usually create a database class with all the table and column names, so no one needs to remember anything, and the ORM should map database "names" to normal entities.
Although it is subjective, in my opinion you are right and he is wrong....
Who is going to work mostly with the new code? That person should decide the naming convention IMHO.
Personally of course I would go for your solution because as has already been mentioned, if you use ORM you don't need to hit the DB directly often.
As a compromise you could use names like TB_CUST where act directly with the database but then use names like Customer for your Data Transfer Objects. Writing good code involves creating an abstraction of any datasources you might be working with. Have GetTB_CUST() throughout your code is a little like having GetTB_CUSTFromThatSQLDatabaseWeHave() dotted around the place.
I personally hate table names like that, but it's a legacy system and I'm sure the DBA doesn't feel like renaming the tables. Renaming the tables could be an option. You would just have to create views representing the old table names so that your legacy system keeps running while you develop your new system. If this is not an option you can use the ORM to map table names to entity names. Or you can abstract your ORM away in a data access layer and define nice entity names in your domain model, having your DAL do the name conversion.
The naming conventions used in two different domains simply don't align. Java, for example, hasa very well defined rules/conventions for Class names and field names, where capitalisation is significant. In general, your application may be ported to a completely different Database with a different naming standard, it's not reasonable to demand alignment of names in Business Logic with names in Database. Consider a slightly more complex mapping, one Entity may not correspond to one Table.
And, really, come on ...
Customer == TB_CUST
That is just not so hard! I'm with you, makes the names meaningful in the code and map in the ORM. The learning for the DBA/Programmer should not be that painful, my guess is that it's one of those things that feels much worse in the anticipation than the reality.
If there are 500 tables in the database - you've already got a challenge keeping those names straight. Hopefully, you've got metadata and some graphical models that describe them more meaningfully.
When you create the next 500 ORM objects - you'll have another challenge. Even if you give them meaningful names it's still too many to really hope that all will be obvious. So, now you've got 2 problems.
If there's no way to link those two sets of 500 tables together - then you've got 3 problems. Think about debugging performance of queries in the ORM, and going to the DBA with names that he doesn't recognize. Think about your carefully crafted names - that then must be ignored when you create reports that hit the database directly.
So, I'd try very hard to use the database names in the ORM. But I would tweak a few things:
if a name is too cryptic to understand - I'd work with the DBA to improve its name. An easy way to transition to better names is through views. Ideally you get rid of the original confusing name eventually tho.
changing from underscores to camelcase, etc shouldn't be considered a change to the name - it's just a change to the separator. So, use the appropriate name for your language.
the database prefixes can probably be dropped - they're actually just attributes of the table name that have been "denormalized" and grafted onto the name. They may be necessary to avoid duplication if they indicate a subsection of the model, but in general may be be as important.
"I have to say his not really familiar with the principles of OOP.
But in case of an entity name like TP_COMP_CARS there should be methods names like Get TP_COMP_CARS or SaveTP_COMP_CARS..I think this is unreadable and ugly.
So please tell me your opinion. Who is right and why."
Which names are given to the things your IT systems manages has absolutely nothing to do with "the principles of OOP".
The same holds for which names are given to "standard" "getter and setter" methods : those are just agreements and conventions, not "principles of OOP".
The issue is a certain kind of "ergonomics" (and thus also the self-documenting value) of the code.
It is true that getTP_COMP_CARS looks ugly (though not, as you claim, "unreadable"). It is also true that if you start adhering to "more modern" naming conventions, then there will have to be someone somewhere who will have to maintain a mapping between the names that are synonymous. (And it is untrue that names such as TP_COMP_CARS are less self-documenting than full "natural-language-words" names : usually such names are constructed from a VERY SMALL set of mnemonic words that are used over and over again with the same meaning, making it more than easy enough for anyone to remember them.)
There is no right and wrong about this. Names like that were the usual convention in the days before the ones we live in now. At least, those names usually had the benefit of being case-insensitive, as opposed to the braindead (because case-sensitive) naming rules that are imposed upon us by so-called "more modern" systems.
Twenty years from now, people will call the naming conventions we use these days "braindead" too.

Internationalization on database level

Could anyone point me to some patterns addressing internationalization on the database level tasks?
The simplest way would be to add a text column for every language for every text column, but that is somehow smelly - really i want to have ability to add supported languages dynamically.
The solution i'm coming to is one main language that is saved in the model and a dictionary entity that gets queried for translations and saved translations to.
All i want is to hear from other people who have done this.
You could create a table that has three columns: target language code, original string, translated string. The index on the table would be on the first two columns, but I wouldn't bind this table to other tables with foreign keys. You'd need to add a join (possibly a left join to account for missing translations) for each of the terms you need to translate in each query you run. However, this will make all your queries very hairy and possibly kill performance as well.
Another thing you need to be aware of is actually translating the terms and maintaining an up-to-date translation table. This is very inconvenient to do directly against the database and is often done by non-technical people.
Usually when localizing an application you'd use something like gettext. The idea behind this suite of tools is to parse can parse source code to extract strings for translation and then create translation files from them. Since this suite has been around for a long time, there are a lot of different utilities based on it that help with the translation task, one of which is Poedit, a nice GUI editor for translating strings into different languages. It might be simpler to generate the unique list of terms as they appear in the database in a format gettext could parse, and do the translation in the application code. This way you'd be able to do the translation of the hard coded strings in the application and the database values using the same technique.

Resources