UTF-8 -> ASCII in C language

I have a simple question that I can't find anywhere over the internet, how can I convert UTF-8 to ASCII (mostly accented characters to the same character without accent) in C using only the standard lib? I found solutions to most of the languages out there, but not for C particularly.
Thanks!
EDIT: Some of the kind guys that commented made me double check what I needed and I exaggerated. I only need an idea on how to make a function that does: char with accent -> char without accent. :)

Take a look at libiconv. Even if you insist on doing it without libraries, you might find an inspiration there.

In general, you can't. UTF-8 covers much more than accented characters.

There's no built-in way of doing that. UTF-8 and ASCII are byte-for-byte identical for code points 0 to 127; they differ only for characters above that range, which cannot be represented in ASCII anyway.
If you have a specific mapping you want (such as a with accent -> a), you should probably just handle that as a string replace operation.

Every decent Unicode support library (not the standard library, of course) has a way to decompose a string into normalization form D or KD (NFD or NFKD), which separates the diacritics from the letters and gives you a shot at filtering them out. I'm not so sure this is worth pursuing: the result is just gibberish to a native-language reader, and not every letter is decomposable. In other words, junk with question marks.

Since this is homework, I'm guessing your teacher is clueless and doesn't know anything about UTF-8, and probably is stuck in the 1980s with "code pages" and "extended ASCII" (words you should erase from your vocabulary if you haven't already). Your teacher probably wants you to write a 128-byte lookup table that maps CP437 or Windows-1252 bytes in the range 128-255 to similar-looking ASCII letters. It would go something like...
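void strip_accents(unsigned char *dest, const unsigned char *src)
{
    /* lut[i] holds the ASCII replacement for byte value 128 + i */
    static const unsigned char lut[128] = { /* mapping here */ };
    do {
        /* subtract 128 so bytes 128..255 index entries 0..127 */
        *dest++ = *src < 128 ? *src : lut[*src - 128];
    } while (*src++);
}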
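For the UTF-8 input the question actually asks about, the same lookup-table idea can be applied after decoding two-byte sequences. The following is only a minimal sketch: it assumes that just the Latin-1 Supplement letters U+00C0..U+00FF need handling, and the function name utf8_strip_accents and the table contents are illustrative choices, not something taken from the answers above. Any other non-ASCII sequence is replaced by a single '?'.

```c
#include <stddef.h>

/* Hypothetical helper: replaces accented Latin-1 Supplement letters
 * (U+00C0..U+00FF, encoded in UTF-8 as 0xC3 0x80..0xBF) with an
 * unaccented ASCII letter. Every other non-ASCII sequence becomes '?'.
 * dest must have room for strlen(src) + 1 bytes. */
void utf8_strip_accents(char *dest, const char *src)
{
    /* One entry per code point U+00C0..U+00FF; '?' where no single
     * ASCII letter is a reasonable stand-in (assumed mapping). */
    static const char lut[64 + 1] =
        "AAAAAA?CEEEEIIII"   /* U+00C0..U+00CF */
        "DNOOOOO?OUUUUY??"   /* U+00D0..U+00DF */
        "aaaaaa?ceeeeiiii"   /* U+00E0..U+00EF */
        "dnooooo?ouuuuy?y";  /* U+00F0..U+00FF */
    const unsigned char *s = (const unsigned char *)src;

    while (*s) {
        if (*s < 0x80) {
            /* plain ASCII: copy through */
            *dest++ = (char)*s++;
        } else if (*s == 0xC3 && s[1] >= 0x80 && s[1] <= 0xBF) {
            /* two-byte sequence for U+00C0..U+00FF */
            *dest++ = lut[s[1] - 0x80];
            s += 2;
        } else {
            /* anything else: emit '?' and skip the whole sequence */
            *dest++ = '?';
            s++;
            while ((*s & 0xC0) == 0x80)
                s++;
        }
    }
    *dest = '\0';
}
```

Since every output character consumes at least one input byte, a destination buffer as large as the source is always enough.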

Related

C Removing Newlines in a Portable and International Friendly Way

Simple question here with a potentially tricky answer: I am looking for a portable and localization friendly way to remove trailing newlines in C, preferably something standards-based.
I am already aware of the following solutions:
Parsing for some combination of \r and \n. Really not pretty when dealing with Windows, *nix and Mac, all which use different sequences to represent a new line. Also, do other languages even use the same escape sequence for a new line? I expect this will blow up in languages that use different glyphs from English (say, Japanese or the like).
Removing trailing n bytes and replacing final \0. Seems like a more brittle way of doing the above.
isspace looks tempting but I need to only match newlines. Other whitespace is considered valid token text.
C++ has a class to do this but it is of little help to me in a pure-C world.
locale.h seems like what I am after but I cannot see anything pertinent to extracting newline tokens.
So, with that, is this an instance that I will have to "roll my own" functionality or is there something that I have missed? Thanks!
Solution
I ended up combining both answers from Weather Vane and Loic, respectively, for my final solution. What worked was to use the handy strcspn function to break on the first newline character as selected from Loic's provided links. Thus, I can select delimiters based on a number of supported locales. It is a good point that there are too many to support generically at this level; I didn't even know that there were several competing encodings for Cyrillic.
In this way, I can achieve "good enough" multinational support while still using standard library functions.
Since I can only accept one answer, I am selecting Weather Vane's as his was the final invocation I used. That being said, it was really the two answers together that worked for me.
The best one I know is
buffer [ strcspn(buffer, "\r\n") ] = 0;
which is a safe way of dealing with all the combinations of \r and \n - both, one or none.
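Wrapped in a function, the idiom might look like the sketch below (the name strip_newline is just an illustration). strcspn returns the length of the initial segment that contains none of the listed characters, so the index is always inside the string: it lands on the first \r or \n, or on the terminating \0 when there is neither.

```c
#include <string.h>

/* Truncate buf at the first '\r' or '\n', if any.
 * The function name is illustrative. */
char *strip_newline(char *buf)
{
    buf[strcspn(buf, "\r\n")] = '\0';
    return buf;
}
```

Note that this cuts at the first line terminator, so it is meant for buffers holding a single line, such as the result of fgets.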
I suggest replacing one or more whitespace characters with one standard space (US-ASCII 0x20). Considering only ISO-8859-1 characters (https://en.wikipedia.org/wiki/ISO/IEC_8859-1), whitespace consists of any byte in 0x00..0x20 (C0 control characters and space) and 0x7F..0xA0 (delete, C1 control characters and no-break space). Notice that US-ASCII is a subset of ISO-8859-1.
But take into account that Windows 1251 (https://en.wikipedia.org/wiki/Windows-1251) assigns different, visible (non-control) characters to the range 0x80..0x9F. In this case, those bytes cannot be replaced by spaces without loss of textual information.
Resources for an extensive definition of whitespace characters:
https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace
http://unicode.org/reports/tr23/
http://www.unicode.org/Public/8.0.0/charts/CodeCharts.pdf
Also take into account that different encodings may be used, most commonly:
ISO-8859-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-1)
UTF-8 (https://en.wikipedia.org/wiki/UTF-8)
Windows 1251 (https://en.wikipedia.org/wiki/Windows-1251)
But in non-western countries (for instance Russia, Japan), further character encodings are also usual. Numerous encodings exist, but it probably does not make sense to try to support each and every known encoding.
Thus try to define and restrict your use-cases, because implementing it in full generality means a lot of work.
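As one restricted use case, the ISO-8859-1 whitespace rule described earlier in this answer could be sketched as follows. This assumes the input really is ISO-8859-1; it would destroy text in Windows-1251 or UTF-8. The names is_latin1_space and collapse_whitespace are made up for this example.

```c
/* Whitespace as defined above for ISO-8859-1: bytes up to 0x20
 * and bytes 0x7F..0xA0. (0x00 never reaches this test because it
 * terminates the string.) */
static int is_latin1_space(unsigned char c)
{
    return c <= 0x20 || (c >= 0x7F && c <= 0xA0);
}

/* Collapse each run of whitespace into one 0x20 space, in place.
 * Assumes ISO-8859-1 input; the name is illustrative. */
void collapse_whitespace(char *s)
{
    unsigned char *in = (unsigned char *)s;
    unsigned char *out = in;
    int in_run = 0;

    for (; *in; in++) {
        if (is_latin1_space(*in)) {
            if (!in_run)
                *out++ = ' ';   /* first whitespace byte of a run */
            in_run = 1;
        } else {
            *out++ = *in;
            in_run = 0;
        }
    }
    *out = '\0';
}
```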
This answer is for C++ users with the same problem.
Matching a newline character for any locale and character type can be done like this:
#include <locale>
#include <string>
template<class Char>
bool is_newline(Char c, std::locale const & loc = std::locale())
{
    // Translate character into default locale and character type.
    // Then, test against '\n', which is the only newline character there.
    return std::use_facet<std::ctype<Char>>(loc).narrow(c, ' ') == '\n';
}
Now, removing all trailing newlines can be done like this:
void remove_trailing_newlines(std::string & str) {
    while (!str.empty() && is_newline(*str.rbegin()))
        str.pop_back();
}
This should be absolutely portable, as it relies only on standard C++ functions.

Diacritic characters in C char arrays or strings

Background
I'm working on an embedded project and I'm trying to handle non-standard characters and fonts.
I have a raw bitmap font in a 600+ element array. Every 5 elements of this array contain one character: character 32 (space) is in the first 5 elements, character 33 (!) in elements 6-10, and so on.
I have to handle national diacritic characters ("ę" for example). I placed them after character 122. Now I'm trying to remap characters so that the proper character is printed when I type print("Test ę"); in C source.
Problem
So I want to type like this in source:
print("Test diactric ę");
// warning: (228) illegal character (0xC4)
When I try this (I tried to see what code C will put for "ę"):
int a = 'ę';
// error: (226) char const too long
How can I work around this?
I'm using the XC8 compiler (gcc based?).
I found in compiler manual, that it uses 7-bit character encoding, but maybe there is some way? My source file is encoded in UTF-8.
EDIT
Looks like wchar.h suggested by Emilien could work for me, but unfortunately there is no wchar.h for my compiler.
Maybe some preprocessor trick? I really want to avoid hardcore text preparation like this:
print("abcde");
print_diactric(123); // 123 code used for ę
print("fgh");
// to get the "word" "abcdeęfgh"
You need to think about the difference between the source encoding (what it sounds like, the character encoding used by your C source files on the system where the compiler runs) and the target encoding, which is the encoding that the compiler assumes for the system where the code will be running.
If your compiler's target encoding is "7-bit", then there's no standard way to express a character like ę, it's simply not part of the target charset. You're going to have to work around that, perhaps by implementing the encoding by yourself from some other format.
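One way to "implement the encoding yourself" is to keep the source file 7-bit clean by writing the UTF-8 bytes as hex escapes, and to decode them at run time into the character code the font uses. The sketch below is a rough illustration under assumptions: slot 123 for ę is taken from the question, the function name glyph_index is hypothetical, and whether XC8 accepts \x escapes above 0x7F in string literals would need to be checked against the compiler manual.

```c
/* Hypothetical decoder: reads one character at *p, advances *p past it,
 * and returns the character code used by the bitmap font.
 * ASCII bytes map to themselves; the two-byte UTF-8 sequence for
 * U+0119 (ę, bytes 0xC4 0x99) maps to slot 123 as in the question.
 * The caller must stop when **p is 0 (end of string). */
unsigned char glyph_index(const unsigned char **p)
{
    const unsigned char *s = *p;
    unsigned char code;

    if (s[0] == 0xC4 && s[1] == 0x99) {
        code = 123;          /* ę */
        *p = s + 2;
    } else if (s[0] < 0x80) {
        code = s[0];         /* plain ASCII */
        *p = s + 1;
    } else {
        code = '?';          /* unsupported byte */
        *p = s + 1;
    }
    return code;
}
```

The string would then be written as print("Test \xC4\x99"), with print calling glyph_index in a loop. With the layout from the question (character 32 in the first 5 elements), the glyph data for a returned code starts at offset (code - 32) * 5 in the font array. More diacritics would mean more branches or a small table of byte pairs.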
As unwind explained, you'll need more than 7 bits in order to encode these characters; maybe you can use the wide character type?
#include <wchar.h>
#include <stdio.h>
int main(void) {
    /* With a UTF-8 source file, gcc stores these literals as UTF-8 bytes. */
    printf("%s\n", "漢語");
    printf("%s\n", "ę");
    return 0;
}
output:
~$ gcc wcharexample.c -o wcharexample && ./wcharexample
漢語
ę

unicode string manipulation in c

I am using gcc in linux mint 15 and my terminal understands unicode. I will be dealing with UTF-8. I am trying to obtain the base word of a more complex unicode string. Sort of like trimming down the word 'alternative' to 'alternat' but in a different language. Hence I will be required to test the ending of each word.
In c and ASCII, I can do something like this
if(string[last_char]=='e')
    last_char-=1; //Throws away the last character
Can I do something similar with unicode? That is, something like this :
if(string[last_char]=='ഒ')
    last_char-=1;
EDIT:
Sorry, as @chux said, I just noticed you are asking about C. Anyway, the same principle holds.
In C you can use wscanf and wprintf to do I/O with wide char strings. If your characters are inside BMP you'll be fine. Just replace char * with wchar_t * and do all kinds of things as usual.
For serious development I'd recommend convert all strings to char32_t for processing. Or use a library like ICU.
If what you need is just remove some given characters in the string, then maybe you don't need the complex unicode character handling. Treat your unicode characters as a raw char * string and do whatever string operations over it.
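Treating UTF-8 as raw bytes works for suffix tests because a complete UTF-8 character always starts with a byte that cannot appear in the middle of another character, so a byte-wise match at the end of a valid string is also a character-wise match. A minimal sketch follows; strip_suffix is an illustrative name, and in the usage note the suffix is written with hex escapes for U+0D12 (ഒ), whose UTF-8 encoding is E0 B4 92.

```c
#include <string.h>

/* If str ends with the bytes of suffix, cut them off and return 1;
 * otherwise leave str unchanged and return 0.
 * Illustrative helper; both arguments are assumed to be valid UTF-8. */
int strip_suffix(char *str, const char *suffix)
{
    size_t len = strlen(str);
    size_t slen = strlen(suffix);

    if (slen == 0 || slen > len)
        return 0;
    if (memcmp(str + len - slen, suffix, slen) != 0)
        return 0;
    str[len - slen] = '\0';
    return 1;
}
```

A call such as strip_suffix(word, "\xe0\xb4\x92") then plays the role of the string[last_char]=='e' test from the question, without needing wchar_t.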
The old C++ oriented answer is reproduced below, for reference.
The easy way
Use std::wstring
It's basically an std::string but individual characters are typed wchar_t.
And for IO you should use std::wcin and std::wcout. For example:
std::wstring str;
std::wcin >> str;
std::wcout << str << std::endl;
However, in some platforms wchar_t is 2-byte wide, which means characters outside BMP will not work. This should be okay for you I think, but should not be used in serious development. For more text on this topic, read this.
The hard way
Use a better unicode-aware string processing library like ICU.
The C++11 way
Use some mechanisms to convert your input string to std::u32string and you're done. The conversion routines can be hand-crafted or using an existing library like ICU.
As std::u32string is formed using char32_t, you can safely assume you're dealing with Unicode correctly.

Style to write code in c (UTF-8)

In my code I use names of people. For example one of them is:
const char *translators[] = {"Jörgen Adam <adam#***.de>", NULL};
which contains ö, 'LATIN SMALL LETTER O WITH DIAERESIS'.
When I write code, which format is the right one to use?
UTF-8:
Jörgen Adam
or
UTF-8(hex):
J\xc3\xb6rgen Adam
UPDATE:
The text with the names will be printed in a GTK About dialog (names of translators).
The answer depends a lot on whether this is in a comment or a string.
If it's in a comment, there's no question: you should use raw UTF-8, so it should appear as:
/* Jörgen Adam */
If the user reading the file has a misconfigured/legacy system that treats text as something other than UTF-8, it will appear in some other way, but this is just a comment so it won't affect code generation, and the ugliness is their problem.
If on the other hand the UTF-8 is in a string, you probably want the code to be interpreted correctly even if the compile-time character set is not UTF-8. In that case, your safest bet is probably to use:
"J\xc3\xb6rgen Adam"
It might actually be safe to use the UTF-8 literal there too; I'm not 100% clear on C's specification of the handling of non-wide string literals and compile-time character set. Unless you can convince yourself that it's formally safe and not broken on a compiler you care to support, though, I would just stick with the hex.
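One pitfall with the hex form is worth knowing: a \x escape consumes every hexadecimal digit that follows it, so an accented letter followed by a digit or one of the letters a-f needs the literal to be split. A small sketch; the second name is made up purely for illustration.

```c
#include <stddef.h>

/* "\xb6" is followed by 'r', which is not a hex digit, so this is safe. */
const char *translators[] = { "J\xc3\xb6rgen Adam", NULL };

/* Hypothetical name: "é" (0xC3 0xA9) followed by 'e'. Writing
 * "\xc3\xa9e" would be read as the single escape \xa9e, so the literal
 * is split and the compiler concatenates the adjacent pieces. */
const char *example = "Ren\xc3\xa9" "e";
```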

How to print "box drawers" Unicode characters in C (Linux utf8 terminal)?

I'm trying to display Unicode characters from (Box Drawing Range: 2500–257F). It's supposed to be standard utf8 (The Unicode Standard, Version 6.2). I'm simply unable to do it.
I first tried to use the good old ASCII characters, but the Linux terminal displays UTF-8 and no conversion takes place (a ? symbol is displayed instead).
Could anyone answer these questions:
How to encode a unicode character in a C variable (style wchar_t)?
How to use the escape sequence such as 0x or 0o (hex, oct) for Unicode?
I know U+ but it seems it didn't work.
setlocale(LC_ALL,"");
short a = 0x2500, b = 0x2501;
wchar_t ac = a;
wchar_t bc = b;
wprintf(L"%c%c\n", ac, bc);
exit(0);
I know that the results are related to the font used, but I use a utf8 font (http://www.unicode.org/charts/fonts.html) and codes from 2500 to 257F must be displayed... Actually they aren't.
Thanks for your help in advance...
[LATE EDIT]
The issue has since been solved... I found out how to use wprintf() with %lc instead of %c... and deeper.
Now those box drawers are part of my student "tools" library to make learning console programming a little more colourful.
Use a C string containing the bytes for the UTF-8 versions of those characters. If You print that C string, it will print that character.
example for Your two characters:
#include <stdio.h>
int main(void)
{
    /* UTF-8 encodings of U+2500 and U+2501, written as hex escapes */
    const char block1[] = "\xe2\x94\x80";
    const char block2[] = "\xe2\x94\x81";
    printf("%s%s\n", block1, block2);
    return 0;
}
prints ─━ for me.
Also, if You print a C string containing UTF-8 character bytes somewhere in it, it will print those characters without problems.
/* assuming You use gcc */
And IIRC gcc uses utf-8 internally anyway.
EDIT: Your question changed a bit while I was writing this. And my answer is less relevant now.
But from Your symptoms - if You see one ? for each character You expect, I'd say Your terminal font might be missing the glyphs required for those characters.
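If the code point is only known at run time (for example when looping over the range 2500-257F), the UTF-8 bytes can be computed rather than hard-coded. The following is a sketch under the assumption that the terminal expects UTF-8; utf8_encode is an illustrative name, not a standard function, and surrogate code points are not rejected.

```c
#include <stddef.h>

/* Encode code point cp as UTF-8 into out (at least 5 bytes) and
 * NUL-terminate it. Returns the number of bytes written, not counting
 * the terminator, or 0 if cp is above U+10FFFF. Surrogate code points
 * are not rejected in this sketch. */
size_t utf8_encode(unsigned long cp, char *out)
{
    size_t n;

    if (cp < 0x80) {
        out[0] = (char)cp;
        n = 1;
    } else if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFF) {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        n = 4;
    } else {
        n = 0;
    }
    out[n] = '\0';
    return n;
}
```

After utf8_encode(0x2500, buf), the buffer holds the same three bytes as block1 above and can be printed with printf("%s", buf).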
That depends on what you call "terminal".
The linux console uses various hacks to display unicode but in reality its font is limited to 512 symbols IIRC so it can't really display the whole unicode range and what it can display depends on the font loaded (this may change in the future).
Windows terminals used to access Linux are usually braindamaged one way or another unicode-wise.
Physical terminals are usually worse and only operate in ascii-land
Linux GUI terminals (such as gnome-terminal) can pretty much display everything as long as you have the corresponding fonts.
Are you sure you don't want to use ncurses instead of writing your own terminal widgets?
