Detecting text language in LibreOffice Calc - multilingual

I want to automate text language detection in LibreOffice Calc.
I have only 4 languages, each language has its own character set.
Languages are rarely, if ever, mixed within a single cell.
The languages are: English, Hebrew, Arabic, Russian.
As depicted in the picture below:
I want to write a formula in column C cell, that will indicate the text language in the corresponding A cell.
I failed to identify any style indicator I can use.
I looked around and found a solution for Microsoft Office VBA.
I hope I do not need to write a macro using the API function getStringType(...).
Thanks.

Assuming all the text in a given cell is using the same script and that all text starts with a letter, testing the first character should be enough. This can be done with:
=UNICODE(A2)
If the number returned is between 65 and 122, the text is in English; strictly, 65–90 and 97–122 cover A–Z and a–z, and the test would need to be extended if you need to include characters with diacritical marks (e.g. é, à, ñ, ø).
The same can be done with the other alphabets. A Unicode character chart can be used to determine the range for each script; such lists are easy to find online.
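The first-character test can be sketched as follows. This is my own illustration, not part of the original answer; the function name and the exact block ranges are my choices, and in Calc itself the same logic would be a nested IF over UNICODE(A2).

```python
# Sketch: classify a cell's text by the Unicode block of its first
# character, mirroring the =UNICODE(A2) idea from the answer above.

def detect_language(text):
    """Guess the language from the first character's code point."""
    if not text:
        return "unknown"
    cp = ord(text[0])
    if 0x0590 <= cp <= 0x05FF:             # Hebrew block
        return "Hebrew"
    if 0x0600 <= cp <= 0x06FF:             # basic Arabic block
        return "Arabic"
    if 0x0400 <= cp <= 0x04FF:             # Cyrillic block
        return "Russian"
    if 65 <= cp <= 90 or 97 <= cp <= 122:  # A-Z, a-z
        return "English"
    return "unknown"

print(detect_language("hello"))   # English
print(detect_language("שלום"))    # Hebrew
print(detect_language("مرحبا"))   # Arabic
print(detect_language("привет"))  # Russian
```

As in the formula version, this assumes every cell starts with a letter of the script in question; leading digits or punctuation would need extra handling.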

Related

Background c program for keyboard mapping

I have installed a Bramma TTF file on my Windows 8 system. Using the Windows Character Map, I was able to find each individual character code. A screenshot of the map is attached below: at the bottom right, we can see that the character code for "!" is 0x21. Similarly, I can find the character codes of all the other letters.
Now I have defined a character mapping for this font against my US keyboard layout. For example, I mapped the physical character 'a' on the keyboard to the character shown in the 3rd row and 1st column. [Whenever I hit 'a' on the keyboard, the corresponding character has to be displayed.]
I would like to write a background C program that listens for keystrokes and, per my previously defined character mapping, outputs the mapped character. I.e., when I hit 'a' on the keyboard, it should return the mapped character.
Can anyone help me solve this problem, or at least give me a lead toward a solution?
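The mapping step itself (as opposed to the OS-level keyboard hook, which the answers address) amounts to a translation table. A minimal sketch, in which the code points on the right are hypothetical placeholders and not the real Bramma font layout:

```python
# Sketch of the mapping step only. The target code points below are
# hypothetical placeholders, not the actual Bramma glyph positions.

KEYMAP = str.maketrans({
    "a": "\u0D05",  # hypothetical: 'a' -> some glyph slot in the font
    "b": "\u0D06",  # hypothetical: 'b' -> another glyph slot
    "!": "\u0021",  # '!' stays 0x21, as in the character map screenshot
})

def remap(typed):
    """Translate typed ASCII into the font's own layout."""
    return typed.translate(KEYMAP)

print(remap("ab!"))
```

Characters absent from the table pass through unchanged, which matches the usual behaviour of a partial keyboard remap.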
I'm somewhat familiar with this kind of font; they have popped up in other questions at SO, the kind asked by users trying to deal with the consequences of using such a font. Those consequences are rather grave.
The biggest problem is that this font is not Unicode compatible. The actual string that underlies the text that's rendered to the screen is very different, containing characters from the ANSI character set. What goes horribly wrong is when the program that displays these strings saves the data. The data file contains the original strings, a good example is an Excel spreadsheet. This spreadsheet just contains gibberish when it is read by any other program. Especially bad when read by a program on another machine that doesn't have the same font installed. Very, very painful.
You are in fact making this problem worse by also destroying the normal mapping between the keyboard and ANSI characters. The 1st character in the 3rd row is produced by typing a capital I (eye) on the keyboard.
The message is clear: don't do this. Windows supports Unicode compatible fonts with Indic scripts well, fonts like Sylfaen, Mangal, Latha. All of which are available on my Windows 8 machine, about ten thousand miles away from where they are normally used. It also has Indic keyboard layouts available under the Language applet, I just picked one as an example:
Well, it is your funeral. You don't have to write a C program to translate these keystrokes, you need a custom keyboard layout. It is a DLL. You normally need the DDK to build them, but there is simple tooling available to auto-generate them. It doesn't get any easier than with MKLC, the Microsoft Keyboard Layout Creator. Web page and download link are here.
You should probably use AutoHotkey.
With this application, you can listen for one set of keys and then send a different set of keys.
This can be used as an implementation of "autocorrect".
e.g.
:*:btw::By the way
will autocorrect btw to By the way.
AutoHotkey supports quite complicated scripts, and many are already available online.
On another note, if you only want an English keyboard to print Malayalam Unicode characters, you might also consider a popular piece of software called Baraha.
Google's Virtual Keyboard (also works with your physical keyboard)
https://code.google.com/apis/ajax/playground/#virtual_keyboard
http://www.tavultesoft.com/ allows you to create keyboards for MSWindows and the web. Over 1000 keyboards are readily provided. There is a developer and a user version. With the developer version you may create installation programs which install fonts, keyboards, keymaps and documentation.

Removing diacritic symbols from UTF8 string in C

I am writing a C program to search a large number of UTF-8 strings in a database. Some of these strings contain English characters with diacritics, such as accents. The search string is entered by the user, so it will most likely not contain such characters. Is there a way (function, library, etc.) to remove these characters from a string, or simply to perform a diacritic-insensitive search? For example, if the user enters the search string "motor", it should match the string "motörhead".
My first attempt was to manually strip out the combining diacritical marks described here:
http://en.wikipedia.org/wiki/Combining_character
This worked in some cases, but it turns out many of these characters also have specific Unicode values. For example, the character "ö" above can be represented by an "o" followed by the combining diacritic U+0308, but it can also be represented by the single Unicode character U+00F6, and my method only filters the former.
I have also looked into iconv, which can convert from UTF-8 to ASCII. However, I may want to localize my program at a future date, and this would no doubt cause problems for languages with non-English characters. Is there a way I can simply strip/convert these accented characters?
Edit: removed typo in question title.
Convert to one of the decomposed normalization forms -- probably NFD, though you might even want NFKD -- which turns all diacritics into combining characters that can then be stripped.
You will want a library for this. I hear good things about ICU.
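The decompose-and-strip approach can be sketched in a few lines; this uses Python's standard `unicodedata` module rather than ICU's C API, but the steps are the same ones the C code would perform:

```python
# Decompose with NFD, then drop combining marks. This handles both the
# precomposed form U+00F6 and the sequence o + U+0308 uniformly.
import unicodedata

def strip_diacritics(s):
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))

print(strip_diacritics("motörhead"))  # motorhead
```

In C with ICU, the corresponding pieces are the normalizer (to NFD) followed by a filter on the combining class of each code point.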
Use ICU: create a collator over the "root" locale with a strength of PRIMARY (L1), which only uses base letters (it cares about 'o' and ignores 'ö'), then use ICU's search functions to match. There is newer functionality, a search collator, that provides special collators designed for this case, but primary strength will handle this specific case.
Example: "motor == mötor" in the 'collated' section.

How to show arbitrary characters in C?

warning C4566: character represented by universal-character-name '\u2E81' cannot be represented in the current code page (936)
Sometimes we need to display text in various languages, such as Russian, Japanese, and so on.
But it seems a single code page can only show the characters of a single language. How can I show characters from various languages at the same time?
Since you're (apparently) using VC++, you probably want to switch to the UTF-8 code page. You'll also need to set the font to one that has glyphs for all the code points you care about (many have few if any beyond the first 256).
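The underlying issue can be demonstrated with any Unicode-aware language: UTF-8 can encode every code point, while a legacy code page covers only one script's repertoire. A small illustration (using cp1251, a Cyrillic code page, as the stand-in legacy encoding):

```python
# UTF-8 encodes any code point; a single-script code page cannot.
mixed = "Привет שלום"          # Russian + Hebrew in one string

utf8 = mixed.encode("utf-8")    # always succeeds
print(len(utf8), "UTF-8 bytes")

try:
    mixed.encode("cp1251")      # Cyrillic code page: no Hebrew slots
except UnicodeEncodeError as e:
    print("cp1251 cannot represent:", e.object[e.start])
```

This is exactly what warning C4566 is reporting: the source character exists in Unicode but has no slot in the active code page.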

Non-english alpha-numerics in a text file

C# WinForm application
EDIT: It appears there's concern about foreign language compatibility.
This is a non-issue.
The card game I'm making this utility for is primarily in English. In the future I may support other languages, but everything will still be keyed off the English names, which are a primary key in both the program and the rules of the game.
I can simply add additional tables with the English name, followed by the translated text, and everything should be fine.
.
Part of my program reads input from a text file containing names, and compares it to another list of names.
Sometimes these names have non-English letters, particularly an accented "o" and the Latin AE, in the input file.
When this text input is compared to the names, those non-English characters cause problems.
I'd like to find a way to replace these characters with their English counterparts in most cases, such as "[accented o]" -> "o".
.
I'm perfectly content to code a find/replace table (I only expect 12-30 problem characters), but I've got some roadblocks.
1) Hardcoding the find/replace table (in the ".cs" file) gives me errors, because the compiler doesn't like the characters.
Anyone know a trick to fix this, or do I just have to create a Find/Replace text file that would be read before this process?
2) Identifying the letters is frustrating, but I'll only reach the replace logic if a match isn't found.
This occurs when the non-english characters cause a mismatch, or it isn't in the list yet.
I'm not too worried about the inefficiency of a char-by-char check of each unmatched string, as this is a manual update process triggered every three months.
Presumably getting down to the binary level of a single character should work, but I haven't gotten this to work.
3) The aforementioned [AE] character is used often, and it would be nice to at least allow the use of this character within the program, as I don't intend to replace it like I do the others.
I've loaded [AE] characters into my database with no problems, and searches using "Ae," "AE," and "[AE]" have posed no problem at the SQL-level, so I'm fine with that functionality.
It's just that searching for other non-english characters is less intuitive.
.
So there's my problem, which is actually more of a nuisance than anything serious. Still, any help or advice would be greatly appreciated.
Are you sure these names aren't meant to be different? Are you sure that you want all of "è", "é", "ê", and "ë" to mean the same thing?
Especially in "foreign" names, characters with different diacritical marks are likely intended to be different. After all, to the people whose names those are, these characters are not foreign.
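If, with that caveat in mind, folding is still wanted, the find/replace table from the question can be kept very small: canonical decomposition handles the accented letters, and only characters like AE (U+00C6), which have no canonical decomposition, need explicit entries. A sketch (in Python rather than C#; the table contents are illustrative):

```python
# Fold accented letters via NFD stripping; map ligature-like letters
# (no canonical decomposition) through a small explicit table.
import unicodedata

EXTRA = {"\u00C6": "AE", "\u00E6": "ae",   # AE / ae
         "\u00D8": "O",  "\u00F8": "o"}    # slashed O / o

def fold(s):
    s = "".join(EXTRA.get(ch, ch) for ch in s)
    nfd = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))

print(fold("Æther"))    # AEther
print(fold("Pokémon"))  # Pokemon
```

Writing the table with \uXXXX escapes (also available in C# string literals) sidesteps problem (1) above, where the compiler rejects the raw characters in the .cs file.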

Hebrew support in a C ncurses application

We have a C ncurses-based application (it runs on most flavours of Unix, but we favour RHEL). We've got Unicode support in there, but now we have to provide a Hebrew version of the application. Does anyone know a process we could go through to convert the program? It mainly gets and stores data from Oracle, which can support Hebrew, so there should not be a problem there.
It really is just the display of the text that is the issue.
It is important to know what terminal they are using, because that defines how you should write the code. Some terminals support BiDi (i.e., bidirectional text), which means they automatically turn Hebrew/Arabic text backwards.
This has its own problems; you can check what your app would look like using mlterm.
Basically, it reverses the lines that contain Hebrew text while keeping what it interprets as English characters LTR. A Hebrew character printed at 10,70 will appear at 10,10. You can use the Unicode LTR/RTL direction marks to try to force direction where this breaks your formatting, but at least on mlterm, while they work, they print garbage characters.
If they use regular terminals with unicode support, however, you should roll the characters yourself.
Then of course if it is run on bidirectional terminals the text would be backwards again and the format lost.
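"Rolling the characters yourself" means emitting RTL text in visual order before it reaches the terminal. A much-simplified sketch of that step, which only reverses contiguous right-to-left runs (real BiDi per UAX #9 also handles digits, mirroring, and nested embeddings):

```python
# Very simplified visual reordering: reverse each contiguous run of
# right-to-left characters. Full BiDi (UAX #9) is far more involved.
import unicodedata

def visual_order(line):
    out, run = [], []
    for ch in line:
        if unicodedata.bidirectional(ch) in ("R", "AL"):
            run.append(ch)              # accumulate the RTL run
        else:
            out.extend(reversed(run))   # flush run in visual order
            run = []
            out.append(ch)
    out.extend(reversed(run))
    return "".join(out)

print(visual_order("abc שלום"))
```

On a BiDi-capable terminal, this preprocessing must be skipped, or the text comes out backwards again, which is exactly the trap described above.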
