Removing diacritic symbols from a UTF-8 string in C

I am writing a C program to search a large number of UTF-8 strings in a database. Some of these strings contain English characters with diacritics, such as accents, etc. The search string is entered by the user, so it will most likely not contain such characters. Is there a way (function, library, etc.) to remove these characters from a string, or simply to perform a diacritic-insensitive search? For example, if the user enters the search string "motor", it should match the string "motörhead".
My first attempt was to manually strip out the combining diacritical marks described here:
http://en.wikipedia.org/wiki/Combining_character
This worked in some cases, but it turns out many of these characters also exist as precomposed code points. For example, the character "ö" above can be represented by an "o" followed by the combining diaeresis U+0308, but it can also be represented by the single precomposed code point U+00F6, and my method only filters the former.
I have also looked into iconv, which can convert from UTF-8 to ASCII. However, I may want to localize my program at a future date, and this would no doubt cause problems for languages with non-English characters. Is there a way I can simply strip/convert these accented characters?
Edit: removed typo in question title.

Convert to one of the decomposed normalization forms -- probably NFD, but you might even want NFKD -- which turns all diacritics into combining characters that can then be stripped.
You will want a library for this. I hear good things about ICU.
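A rough sketch of that approach in C with ICU4C, assuming ICU is available (the function name, fixed buffer sizes, and the minimal error handling are illustrative, not production-ready): decompose to NFD, then drop every code point whose general category is "nonspacing mark".

#include <stdio.h>
#include <unicode/uchar.h>
#include <unicode/unorm2.h>
#include <unicode/ustring.h>
#include <unicode/utf16.h>

/* Convert UTF-8 to UTF-16, normalize to NFD, drop combining marks,
   and convert back to UTF-8. */
static void strip_diacritics_utf8(const char *in, char *out, int32_t outCap)
{
    UErrorCode status = U_ZERO_ERROR;
    UChar u16[256], nfd[256], kept[256];
    int32_t u16Len = 0, keptLen = 0;

    u_strFromUTF8(u16, 256, &u16Len, in, -1, &status);

    const UNormalizer2 *norm = unorm2_getNFDInstance(&status);
    int32_t nfdLen = unorm2_normalize(norm, u16, u16Len, nfd, 256, &status);

    for (int32_t i = 0; i < nfdLen && U_SUCCESS(status); ) {
        UChar32 c;
        U16_NEXT(nfd, i, nfdLen, c);
        if (u_charType(c) == U_NON_SPACING_MARK)
            continue;                        /* drop the combining diacritic */
        U16_APPEND_UNSAFE(kept, keptLen, c); /* keep the base character */
    }

    u_strToUTF8(out, outCap, NULL, kept, keptLen, &status);
    if (U_FAILURE(status))
        out[0] = '\0';
}

int main(void)
{
    char plain[256];
    strip_diacritics_utf8("mot\xC3\xB6rhead", plain, (int32_t)sizeof plain);
    printf("%s\n", plain);                   /* prints "motorhead" */
    return 0;
}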

Use ICU: create a collator over the "root" locale with a strength of PRIMARY (L1), which compares only base letters (it cares about 'o' and ignores the umlaut in 'ö'), and then use ICU's search functions to match. There is also newer search-collator functionality that provides collators designed specifically for searching, but primary strength will handle this particular case.
Example: "motor == mötor" in the 'collated' section.

Related

Clarification on Winapi Paths and Filename (W functions and A functions)

I have tried to check the importance of, and the reason for, using the W WinAPI functions vs the A ones (W meaning wide char, A meaning ASCII, right?).
I made a simple example: I retrieve the temp path for the current user like this:
CHAR pszUserTempPathA[MAX_PATH] = { 0 };
WCHAR pwszUserTempPathW[MAX_PATH] = { 0 };
GetTempPathA(MAX_PATH - 1, pszUserTempPathA);
GetTempPathW(MAX_PATH - 1, pwszUserTempPathW);
printf("pathA=%s\r\npathW=%ws\r\n",pszUserTempPathA,pwszUserTempPathW);
My current user has a Russian name, so it is written in Cyrillic. printf outputs this:
pathA=C:\users\Пыщь\Local\Temp
pathW=C:\users\Пыщь\Local\Temp
So both paths are all right. I thought I would get an error, or a mess of symbols, from GetTempPathA since the current user name is Unicode, but I figured out that the Cyrillic characters are actually included in the extended ASCII character set. So my question is: if someone were to use my software and it extracted data into the temp folder of a current user who is Chinese (assuming they have Chinese symbols in their user name), would I get a mess or an error using the GetTempPathA version? Should I always use the W-prefixed functions for production software that works with WinAPI directly?
First, the -A suffix stands for ANSI, not ASCII. ASCII is a 7-bit character set. ANSI, as Microsoft uses the term, is for an encoding using 8-bit code units (chars) and code pages.
Some people use the terms "extended ASCII" or "high ASCII," but that's not actually a standard and, in some cases, isn't quite the same as ANSI. Extended ASCII is the ASCII character set plus (at most) 128 additional characters. For many ANSI code pages this is identical to extended ASCII, but some code pages accommodate variable length characters (which Microsoft calls multi-byte). Some people consider "extended ASCII" to just mean ISO-Latin-1 (which is nearly identical to Windows-1252).
Anyway, with an ANSI function, your string can include any characters from your current code page. If you need characters that aren't part of your current code page, you're out-of-luck. You'll have to use the wide -W versions.
In modern versions of Windows, you can generally think of the -A functions as wrappers around the -W functions that use MultiByteToWideChar and/or WideCharToMultiByte to convert any strings passing through the API. But the latter conversion can be lossy, since wide character strings might include characters that your multibyte strings cannot represent.
Portable, cross-platform code often stores all text in UTF-8, which uses 8-bit code units (chars) but can represent any Unicode code point, and anytime text needs to go through a Windows API, you'd explicitly convert to/from wide chars and then call the -W version of the API.
UTF-8 is roughly comparable to what Microsoft calls a multibyte ANSI code page, except that Windows does not completely support a UTF-8 code page. There is CP_UTF8, but it works only with certain APIs (like WideCharToMultiByte and MultiByteToWideChar). You cannot set your code page to CP_UTF8 and expect the general -A APIs to do the right thing.
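For example, here is a minimal sketch of that pattern (UTF-8 inside the program, UTF-16 only at the API boundary); the buffer sizes and the GetFileAttributesW call are just illustrative:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Get the temp path as UTF-16, the native format of the -W APIs. */
    WCHAR wideTemp[MAX_PATH] = { 0 };
    if (GetTempPathW(MAX_PATH, wideTemp) == 0)
        return 1;

    /* Convert to UTF-8 for internal use (CP_UTF8 works here). */
    char utf8Temp[MAX_PATH * 4] = { 0 };
    WideCharToMultiByte(CP_UTF8, 0, wideTemp, -1,
                        utf8Temp, (int)sizeof utf8Temp, NULL, NULL);
    printf("temp path as UTF-8 bytes: %s\n", utf8Temp);

    /* Going the other way: UTF-8 back to UTF-16 before calling a -W API. */
    WCHAR wideAgain[MAX_PATH] = { 0 };
    MultiByteToWideChar(CP_UTF8, 0, utf8Temp, -1, wideAgain, MAX_PATH);
    DWORD attrs = GetFileAttributesW(wideAgain);   /* any -W call works the same way */
    return (attrs == INVALID_FILE_ATTRIBUTES) ? 1 : 0;
}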
As you try to test things, be aware that it's difficult (and sometimes impossible) to get the CMD console window to display characters outside the current code page. If you want to display multi-script strings, you probably should write a GUI application and/or use the debugger to inspect the actual content of the strings.
Of course you need the wide version. The ANSI (-A) version technically can't even handle more than 256 distinct characters. Cyrillic is included in the extended ASCII set (if that's your localization), while Chinese isn't and can't be, due to the much larger set of characters needed to represent it. Moreover, you can get a mess with Cyrillic as well: it will only work properly if the executing machine has a matching localization. On a machine with a non-Cyrillic localization the text will be displayed according to whatever the localization settings define.

C Removing Newlines in a Portable and International Friendly Way

Simple question here with a potentially tricky answer: I am looking for a portable and localization friendly way to remove trailing newlines in C, preferably something standards-based.
I am already aware of the following solutions:
Parsing for some combination of \r and \n. Really not pretty when dealing with Windows, *nix and Mac, all of which use different sequences to represent a new line. Also, do other languages even use the same escape sequence for a new line? I expect this will blow up in languages that use different glyphs from English (say, Japanese or the like).
Removing trailing n bytes and replacing final \0. Seems like a more brittle way of doing the above.
isspace looks tempting but I need to only match newlines. Other whitespace is considered valid token text.
C++ has a class to do this but it is of little help to me in a pure-C world.
locale.h seems like what I am after but I cannot see anything pertinent to extracting newline tokens.
So, with that, is this an instance that I will have to "roll my own" functionality or is there something that I have missed? Thanks!
Solution
I ended up combining both answers, from Weather Vane and Loic respectively, for my final solution. What worked was to use the handy strcspn function to break on the first newline character, chosen from the links Loic provided. That way I can select delimiters based on a number of supported locales. It is a good point that there are too many to support generically at this level; I didn't even know there were several competing encodings for Cyrillic.
In this way, I can achieve "good enough" multinational support while still using standard library functions.
Since I can only accept one answer, I am selecting Weather Vane's as his was the final invocation I used. That being said, it was really the two answers together that worked for me.
The best one I know is
buffer [ strcspn(buffer, "\r\n") ] = 0;
which is a safe way of dealing with all the combinations of \r and \n - both, one or none.
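For example, a typical use right after fgets (the buffer name and size are arbitrary):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buffer[256];
    while (fgets(buffer, sizeof buffer, stdin) != NULL) {
        buffer[strcspn(buffer, "\r\n")] = 0;   /* trims "\n", "\r\n", "\r" or nothing */
        printf("[%s]\n", buffer);
    }
    return 0;
}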
I suggest replacing one or more whitespace characters with one standard space (US-ASCII 0x20). Considering only ISO-8859-1 characters (https://en.wikipedia.org/wiki/ISO/IEC_8859-1), whitespace consists of any byte in 0x00..0x20 (C0 control characters and space) and 0x7F..0xA0 (delete, C1 control characters and no-break space). Note that US-ASCII is a subset of ISO-8859-1.
But take into account that Windows-1251 (https://en.wikipedia.org/wiki/Windows-1251) assigns different, visible (non-control) characters to the range 0x80..0x9F. In that case, those bytes cannot be replaced by spaces without loss of textual information.
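A small sketch of that suggestion, assuming the buffer really is ISO-8859-1 text (the function name is mine):

/* Replace ISO-8859-1 whitespace and control bytes with a plain space:
   0x00..0x20 (C0 controls, space) and 0x7F..0xA0 (DEL, C1 controls, NBSP).
   The terminating 0x00 is never touched because the loop stops there. */
static void flatten_whitespace_latin1(char *s)
{
    for (; *s; ++s) {
        unsigned char b = (unsigned char)*s;
        if (b <= 0x20 || (b >= 0x7F && b <= 0xA0))
            *s = ' ';
    }
}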
Resources for an extensive definition of whitespace characters:
https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace
http://unicode.org/reports/tr23/
http://www.unicode.org/Public/8.0.0/charts/CodeCharts.pdf
Also take into account that different encodings may be used, most commonly:
ISO-8859-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-1)
UTF-8 (https://en.wikipedia.org/wiki/UTF-8)
Windows 1251 (https://en.wikipedia.org/wiki/Windows-1251)
But in non-Western countries (for instance Russia or Japan), other character encodings are also common. Numerous encodings exist, but it probably does not make sense to try to support each and every known encoding.
Thus try to define and restrict your use-cases, because implementing it in full generality means a lot of work.
This answer is for C++ users with the same problem.
Matching a newline character for any locale and character type can be done like this:
#include <locale>
template<class Char>
bool is_newline(Char c, std::locale const & loc = std::locale())
{
    // Translate character into default locale and character type.
    // Then, test against '\n', which is the only newline character there.
    return std::use_facet<std::ctype<Char>>(loc).narrow(c, ' ') == '\n';
}
Now, removing all trailing newlines can be done like this:
void remove_trailing_newlines(std::string & str) {
    while (!str.empty() && is_newline(*str.rbegin()))
        str.pop_back();
}
This should be absolutely portable, as it relies only on standard C++ functions.

UTF-8 and ISO 8859-9

I have been reading about UTF-8 and Unicode for the last couple of days, and just when I thought I had it all figured out, I got confused when I read that UTF-8 and ISO 8859-9 are not compatible.
I have a database that stores data as UTF-8. I have a requirement from a customer to support various ISO 8859-x code pages (i.e. 8859-3, 8859-2, and also ISO 6937). My questions are:
Since my data ingest and database engine type is UTF-8, would it be correct to assume that I am using Unicode?
I understand that Unicode can support all characters and is the way to go. However, my customer is a European entity that wants us to use ISO code pages, so my question is: how can I support multiple client use cases using the existing UTF-8 data? Since ISO 8859-x is not a subset of Unicode, do I have to write code to send the appropriate ISO 8859-x character set depending on my use case? Is that all I need to do, or is there more to it?
By the way, my understanding is that UTF-8 is merely an encoding algorithm to get a numeric value from binary data. If so, how is the character set applied? Do I have to write code to return an 8859-x response, or is all that's needed to set an appropriate character-set value in the response header?
The topic is pretty vast, so let me simplify (a lot, maybe even too much) and answer point by point.
Since my data ingest and database engine type is UTF-8, would it be correct to assume that I am using Unicode?
Yes, you're using Unicode and you're storing Unicode characters (formally called code points) using the UTF-8 encoding. Please note that Unicode defines rules and sets of characters (even though the same word is often used as a synonym of the UTF-16 encoding); the way you encode those characters in a byte stream is another matter.
... However, my customer is a European entity that wants us to use ISO code pages, so my question is: how can I support multiple client use cases using the existing UTF-8 data?
Of course, if you store Unicode characters (it doesn't matter with which encoding) then you can always convert them to a specific 8-bit code page (or to any other encoding). OK, this isn't formally always true (because Unicode doesn't define every possible character in use now or used in the past), but I would ignore this point...
... Since ISO 8859-x is not a subset of Unicode, do I have to write code to send the appropriate ISO 8859-x character set depending on my use case?
All characters from the ISO 8859 code pages are also available in Unicode, so (from this point of view) it is a subset. Of course the encoded values are different, so they need to be converted. If you know the code page needed for each customer then you can always convert UTF-8 encoded Unicode text into text in the right code page.
Is that all I need to do, or is there more to it?
Just that. The code can be pretty short, but you didn't tag your question with any language so I won't provide links/examples. For a rudimentary example, take a look at this post.
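For instance, in C the conversion can be done with POSIX iconv. This is only a sketch; the "//TRANSLIT" suffix, which approximates unmappable characters, is a glibc extension and not guaranteed elsewhere:

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *utf8 = "Ankara \xC4\xB0stanbul";  /* "İ" (U+0130) encoded in UTF-8 */
    char out[128];

    iconv_t cd = iconv_open("ISO-8859-9//TRANSLIT", "UTF-8");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    char *src = (char *)utf8;
    size_t srcLeft = strlen(utf8);
    char *dst = out;
    size_t dstLeft = sizeof out - 1;

    if (iconv(cd, &src, &srcLeft, &dst, &dstLeft) == (size_t)-1)
        perror("iconv");
    *dst = '\0';
    iconv_close(cd);

    /* 'out' now holds the ISO 8859-9 bytes to send to that client. */
    return 0;
}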
Let me also say one important thing: if they want to consume your data as 8-bit text in their code page, then you have to perform a conversion. If they can consume UTF-8 data directly (or you present it somehow in your own application), then you don't have to worry about code pages (that's why we use Unicode), because - no matter the encoding - the Unicode character set contains all the characters they may need.
By the way, my understanding is that UTF-8 is merely an encoding algorithm to get a numeric value from binary data.
Not exactly. You have a table of characters, right? For example, A. Now you have to store a numeric value that will be interpreted as A. In ASCII they arbitrarily decided that 65 is the numeric value that represents that character. Unicode is a long list of characters (and rules for combining them); the UTF-x encodings are the ways those numeric values are actually stored as bytes.
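A tiny C illustration of that distinction (the string literal is simply the two UTF-8 bytes for U+00F6):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *o_umlaut = "\xC3\xB6";            /* UTF-8 encoding of U+00F6 "ö" */
    printf("'A' stored as the number %d\n", 'A'); /* 65, as in ASCII */
    printf("U+00F6 stored by UTF-8 as bytes:");
    for (size_t i = 0; i < strlen(o_umlaut); ++i)
        printf(" %02X", (unsigned char)o_umlaut[i]);
    printf("\n");
    return 0;
}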
If so, how is the character set applied?
"Character set" is a pretty vague term. By the Unicode character set you mean all the characters available in Unicode. If you mean a code page, then (simplifying) it represents a subset of the available characters. Imagine you have an 8-bit extension of ASCII (so up to 256 symbols): you can't accommodate all the characters used across Europe, right? Code pages solve this problem: half of the symbols are always the same and the other half represent different characters depending on the code page (each country uses a specific code page with its preferred characters).
For an introductory overview about this topic: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

UTF-8 tuple storage using lowest common technological denominator, append-only

EDIT: Note that due to the way hard drives actually write data, none of the schemes in this list work reliably. Do not use them. Just use a database. SQLite is a good simple one.
What's the most low-tech but reliable way of storing tuples of UTF-8 strings on disk? Storage should be append-only for reliability.
As part of a document storage system I'm experimenting with I have to store UTF-8 tuple data on disk. Obviously, for a full-blown implementation, I want to use something like Amazon S3, Project Voldemort, or CouchDB.
However, at the moment, I'm experimenting and haven't even firmly settled on a programming language yet. I have been using CSV, but CSV tends to become brittle when you try to store outlandish Unicode and unexpected whitespace (e.g. vertical tabs).
I could use XML or JSON for storage, but they don't play nice with append-only files. My best guess so far is a rather idiosyncratic format where each string is preceded by a 4-byte signed integer indicating the number of bytes it contains, and an integer value of -1 indicates that this tuple is complete - the equivalent of a CSV newline. The main source of headaches there is having to decide on the endianness of the integer on disk.
Edit: actually, this won't work. If the program exits while writing a string, the data becomes irrevocably misaligned. Some sort of out-of-band signalling is needed to ensure alignment can be regained after an aborted tuple.
Edit 2: Turns out that guaranteeing atomicity when appending to text files is possible, but the parser is quite non-trivial. Writing said parser now.
Edit 3: You can view the end result at http://github.com/MetalBeetle/Fruitbat/tree/master/src/com/metalbeetle/fruitbat/atrio/ .
I would recommend tab delimiting each field and carriage-return delimiting each record.
Within each string, replace all characters that would affect the field and record interpretation and rendering. This would include control characters (U+0000–U+001F, U+007F–U+009F), non-graphical line and paragraph separators (U+2028, U+2029), directional control characters (U+202A–U+202E), and the byte order mark (U+FEFF).
They should be replaced with escape sequences of constant length. The escape sequences should begin with a rare (for your application) character. The escape character itself should also be escaped.
This would allow you to append new records easily. It has the additional advantage of being able to load the file for visual inspection and modification into any spreadsheet or word processing program, which could be useful for debugging purposes.
This would also be easy to code, since the file will be a valid UTF-8 document, so standard text reading and writing routines may be used. This also allows you to convert easily to UTF-16BE or UTF-16LE if desired, without complications.
Example:
U+0009 CHARACTER TABULATION becomes ~TB
U+000A LINE FEED becomes ~LF
U+000D CARRIAGE RETURN becomes ~CR
U+007E TILDE becomes ~~~
etc.
There are a couple of reasons why tabs would be better than commas as field delimiters. Commas appear more commonly within normal text strings (such as English text), and would have to be replaced more frequently. And spreadsheet programs (such as Microsoft Excel) tend to handle tab-delimited files much more naturally.
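A sketch of that scheme in C; only a few of the mappings listed above are implemented, and the escape character and function names are mine:

#include <stdio.h>

/* Escape a single field so it contains no tabs, newlines or carriage
   returns; '~' is the escape character and is itself escaped. */
static void write_field(FILE *f, const char *s)
{
    for (; *s; ++s) {
        switch ((unsigned char)*s) {
        case '\t': fputs("~TB", f); break;
        case '\n': fputs("~LF", f); break;
        case '\r': fputs("~CR", f); break;
        case '~':  fputs("~~~", f); break;
        default:   fputc(*s, f);    break;
        }
    }
}

/* Append one tab-delimited, carriage-return-terminated record. */
static void write_record(FILE *f, const char **fields, int n)
{
    for (int i = 0; i < n; ++i) {
        if (i) fputc('\t', f);
        write_field(f, fields[i]);
    }
    fputc('\r', f);
    fflush(f);   /* append-only log: push each record out promptly */
}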
Mostly thinking out loud here...
Really low tech would be to use (for example) null bytes as separators, and just "quote" all null bytes appearing in the output with an additional null.
Perhaps one could use SCSU along with that.
Or it might be worth looking at the gzip format, and maybe aping it, if not using it:
A gzip file consists of a series of "members" (compressed data sets).
[...]
The members simply appear one after another in the file, with no additional information before, between, or after them.
Each of these members can have an optional "filename", comment, or the like, and I believe you can just keep appending members.
Or you could use bencode, used in torrent-files. Or BSON.
See also Wikipedia's Comparison of data serialization formats.
Otherwise I think your idea of preceding each string with its length is probably the simplest one.

Non-english alpha-numerics in a text file

C# WinForm application
EDIT: It appears there's concern about foreign language compatibility.
This is a non-issue.
The card game I'm making this utility for is primarily in English. In the future I may support other languages, but everything will still be keyed off the English names, which are a primary key in both the program and the rules of the game.
I can simply add additional tables with the English name, followed by the translated text, and everything should be fine.
.
Part of my program reads input from a text file containing names, and compares it to another list of names.
Sometimes these names have non-English letters in the input file, particularly an accented "o" and the Latin AE.
When this text input is compared to names, those non-English characters cause problems.
I'd like to find a way to overlay these characters with their English counterparts in most cases, such as "[accented o]" -> "o".
.
I'm perfectly content to code a find/replace table (I only expect 12-30 problem characters), but I've got some roadblocks.
1) Hardcoding the find/replace table (in the ".cs" file) gives me errors, because the compiler doesn't like the characters.
Anyone know a trick to fix this, or do I just have to create a Find/Replace text file that would be read before this process?
2) Identifying the letters is frustrating, but I'll only reach the replace logic if a match isn't found.
This occurs when the non-english characters cause a mismatch, or it isn't in the list yet.
I'm not too worried about the inefficiency of a char-by-char check of each unmatched string, as this is a manual update process triggered every three months.
Presumably getting down to the binary-code level of a single character should work, but I haven't gotten this to work.
3) The aforementioned [AE] character is used often, and it would be nice to at least allow the use of this character within the program, as I don't intend to replace it like I do the others.
I've loaded [AE] characters into my database with no problems, and searches using "Ae," "AE," and "[AE]" have posed no problem at the SQL-level, so I'm fine with that functionality.
It's just that searching for other non-english characters is less intuitive.
.
So there's my problem, which is actually more of a nuisance than anything serious. Still, any help or advice would be greatly appreciated.
Are you sure these names aren't meant to be different? Are you sure that you want all of "è", "é", "ê", and "ë" to mean the same thing?
Especially in "foreign" names, characters with different diacritical marks are likely intended to be different. After all, to the people whose names those are, these characters are not foreign.
