Style to write code in C (UTF-8)

In my code I use names of people. For example one of them is:
const char *translators[] = {"Jörgen Adam <adam#***.de>", NULL};
and it contains ö (LATIN SMALL LETTER O WITH DIAERESIS).
When I write the code, which format is the right one to use?
UTF-8:
Jörgen Adam
or
UTF-8(hex):
J\xc3\xb6rgen Adam
UPDATE:
The text with the name will be printed in a GTK About dialog (names of translators).

The answer depends a lot on whether this is in a comment or a string.
If it's in a comment, there's no question: you should use raw UTF-8, so it should appear as:
/* Jörgen Adam */
If the user reading the file has a misconfigured/legacy system that treats text as something other than UTF-8, it will appear in some other way, but this is just a comment so it won't affect code generation, and the ugliness is their problem.
If on the other hand the UTF-8 is in a string, you probably want the code to be interpreted correctly even if the compile-time character set is not UTF-8. In that case, your safest bet is probably to use:
"J\xc3\xb6rgen Adam"
It might actually be safe to use the UTF-8 literal there too; I'm not 100% clear on C's specification of the handling of non-wide string literals and compile-time character set. Unless you can convince yourself that it's formally safe and not broken on a compiler you care to support, though, I would just stick with the hex.
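A minimal sketch of the hex-escaped form, assuming the consumer of the string (here, the GTK About dialog) expects UTF-8 bytes. The helper name first_translator_bytes and the second literal are made up for illustration. One thing to watch for: a hex escape swallows every hex digit that follows it, so the literal has to be split when the next character happens to be one of 0-9, a-f or A-F.

```c
#include <stddef.h>
#include <string.h>

/* U+00F6 (o with diaeresis) spelled as its two UTF-8 bytes, 0xC3 0xB6.
 * This works without splitting because the next character, 'r', is not
 * a hex digit. */
const char *const translators[] = { "J\xc3\xb6rgen Adam", NULL };

/* Illustrative second case: "n", U+00E9, "e". The trailing 'e' IS a hex
 * digit, so the literal is split to end the escape before it. */
const char split_example[] = "n\xc3\xa9" "e";

/* Hypothetical helper: size in bytes of the first entry. */
size_t first_translator_bytes(void)
{
    return strlen(translators[0]);
}

/* Hypothetical helper: size in bytes of the split literal. */
size_t split_example_bytes(void)
{
    return strlen(split_example);
}
```

The name has 11 visible characters but occupies 12 bytes, because the ö takes two.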

Related

Reading non-ascii characters from a file in C

I have a file, foo.txt, which is just:
” ’
char x = fgetc(myfile);
When I use fgetc on the file, I get a constant value of 226 on both characters. Why is this? How can I fix this?
Here is my code:
FILE* f = fopen("./debate.txt", "rb");
int x = fgetc(f);
char y = (char)x;
For normal (portable) software, character encodings are a whole world of pain. The problems (and potential solutions) are:
A) The text file may be in any random/"text editor defined" encoding.
To deal with this there are four options:
1. Expect input in a specific encoding (e.g. UTF-8) and refuse to support anything else (and generate an error message if the data in the file isn't valid for the encoding you chose). This will annoy some users (e.g. where the national standard is something incompatible like CNS 11643).
2. Support many encodings, and let the user choose which encoding to expect (e.g. based on a command line argument). This is a little inconvenient for users and very painful for you.
3. Support many encodings, and try to auto-detect which encoding the file used. This is a little more convenient for users until it guesses wrong and becomes a major annoyance (and you can't reduce the chance of guessing the wrong encoding to zero).
4. Support many encodings, let the user choose the encoding if they want, and auto-detect if the user didn't specify. This is the best possible option for users (and the worst possible option for software developers).
For these options I'd use the first (I would say "input file must be UTF-8", partly because UTF-8 has become very common and well supported, and partly because every other encoding is provably worse for technical reasons). Note that (based on your results) it's extremely likely that your input file is in UTF-8.
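The reason both characters give 226: 226 is 0xE2, and the UTF-8 encodings of ” (U+201D, bytes E2 80 9D) and ’ (U+2019, bytes E2 80 99) both start with that byte. A rough sketch of how the first byte tells you how many bytes to read, assuming the input really is UTF-8 (the function name is made up):

```c
/* Number of bytes in a UTF-8 sequence, judged from its first byte.
 * Returns 0 for a continuation byte (10xxxxxx) or a byte that can
 * never start a valid sequence. */
int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)                  return 1;  /* plain ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}
```

So after fgetc returns 226, two more fgetc calls are needed to collect the rest of the character.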
B) Whatever the compiler uses for char is implementation defined (could be ASCII, could be EBCDIC, could be anything else), and may be either signed or unsigned.
In this case it's "very safe" (for portability) to assume ASCII. Assuming UTF-8 is the 2nd best choice but it creates problems with any code that does any maths (e.g. right shift, etc) on "possibly signed" char values.
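To illustrate the "possibly signed" problem: if char is signed, a UTF-8 byte such as 0xE2 is stored as a negative value, and right-shifting a negative value is implementation defined. A small sketch of the usual workaround, converting to unsigned char before doing any arithmetic (the function name is hypothetical):

```c
/* Top four bits of a byte held in a plain char. Converting to
 * unsigned char first keeps the result in 0..15 whether or not
 * plain char is signed on this compiler. */
unsigned high_nibble(char c)
{
    return (unsigned)((unsigned char)c >> 4);
}
```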
C) The stdin, stdout, stderr pipes are random/implementation defined too.
This is similar to the previous problem, except that the best solution ("assume ASCII") is significantly harder (especially when you want to output error messages, etc that contain pieces of text from the input file). For this I'd be tempted to use ASCII as much as possible, but to cheat and output UTF-8 if I have to. If the OS (or shell) can't handle UTF-8 it'll create a mess, but most users would understand (and can work around it by piping your output to a file). The best alternative (for user output) is using a GUI and not using stdout, but that creates a large set of extra problems (and leads to a second large set of extra problems - internationalization for things like error messages, etc).
D) Whatever the compiler assumes for wchar_t is random/implementation defined (maybe UTF-16, maybe UTF-32, maybe anything else; and it may even be an 8-bit encoding that isn't "wide" at all).
The only sane choice here is to recognize that wchar_t is an unusable failure that should never (under any circumstances) be used for anything.
To be more specific, wchar_t is a historical mistake based on previous historical mistakes. Essentially, in the early days, Microsoft and Sun decided to adopt UCS-2 (an "all Unicode codepoints fit in 16 bits" assumption) which quickly became broken. To work around that problem Microsoft and Sun switched to UTF-16, but Microsoft was primarily running on little-endian machines and chose UTF-16LE and Sun (Java) was aiming for big-endian machines and chose UTF-16BE. The wchar_t extension was added to C in 1995 at the same time that companies (Microsoft, Sun) were doing everything wrong and weren't doing anything that was compatible with each other; so wchar_t ended up being a "we don't know what the standard is so our standard is no standard at all" joke. For C (and C++) this was fixed in 2011 with the introduction of char16_t (UTF-16) and char32_t (UTF-32) in <uchar.h>, but adoption is slow (e.g. Microsoft is still too lazy to bother with C99).
Note that an additional part of the problem is that people want to assume that one wchar_t is one whole printable character, and that is almost never the case (e.g. even for UTF-32 where one wchar_t is one whole Unicode codepoint there are combining codepoints); and this ruins any benefit of any "wide char" implementation (even if your code is not portable at all and you know what wchar_t actually is).
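A small illustration of the combining-codepoint point, written against UTF-8 rather than wchar_t and assuming the input is valid UTF-8 (the function name is made up). The string "e" followed by U+0301 COMBINING ACUTE ACCENT displays as one character, yet it is two codepoints and three bytes:

```c
#include <stddef.h>

/* Count codepoints in a valid UTF-8 string by counting every byte
 * that is not a continuation byte (10xxxxxx). */
size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

Passing "e\xcc\x81" gives 2, even though a user would call it one character, so a codepoint count is no more a "character count" than a byte count is.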
The best solution (especially if you chose "expect that the input file is using UTF-8" to solve the first problem) is to use UTF-8 stored in uint8_t (so that nobody confuses it for whatever char is).
In that case; "converting the input from the file into your internal character encoding" can become "converting UTF-8 to UTF-8 by doing nothing"; and "converting your internal character encoding into whatever stdout wants" becomes "converting UTF-8 to ASCII (or UTF-8) by doing almost nothing (casting from uint8_t to char)". In other words, it can be extremely close to "use the same encoding for everything".
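For the places where you do need individual codepoints, here is a sketch of a decoder that works on a uint8_t buffer. The name utf8_decode and its signature are my own invention, not a standard function, and it assumes the caller handles an error return however it sees fit:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one codepoint from buf[0..len). Returns the number of bytes
 * consumed (1 to 4), or 0 if the sequence is truncated or invalid
 * (bad continuation byte, overlong form, surrogate, above U+10FFFF). */
size_t utf8_decode(const uint8_t *buf, size_t len, uint32_t *out)
{
    size_t need;
    uint32_t cp, min;

    if (len == 0)
        return 0;
    if (buf[0] < 0x80) {                     /* single-byte ASCII */
        *out = buf[0];
        return 1;
    } else if ((buf[0] & 0xE0) == 0xC0) {
        need = 2; cp = buf[0] & 0x1F; min = 0x80;
    } else if ((buf[0] & 0xF0) == 0xE0) {
        need = 3; cp = buf[0] & 0x0F; min = 0x800;
    } else if ((buf[0] & 0xF8) == 0xF0) {
        need = 4; cp = buf[0] & 0x07; min = 0x10000;
    } else {
        return 0;                            /* stray continuation byte */
    }
    if (len < need)
        return 0;
    for (size_t i = 1; i < need; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (uint32_t)(buf[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    *out = cp;
    return need;
}
```

Fed the three bytes E2 80 9D from the question's file, it consumes 3 bytes and yields U+201D.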

C Removing Newlines in a Portable and International Friendly Way

Simple question here with a potentially tricky answer: I am looking for a portable and localization friendly way to remove trailing newlines in C, preferably something standards-based.
I am already aware of the following solutions:
Parsing for some combination of \r and \n. Really not pretty when dealing with Windows, *nix and Mac, all of which use different sequences to represent a new line. Also, do other languages even use the same escape sequence for a new line? I expect this will blow up in languages that use different glyphs from English (say, Japanese or the like).
Removing trailing n bytes and replacing final \0. Seems like a more brittle way of doing the above.
isspace looks tempting but I need to only match newlines. Other whitespace is considered valid token text.
C++ has a class to do this but it is of little help to me in a pure-C world.
locale.h seems like what I am after but I cannot see anything pertinent to extracting newline tokens.
So, with that, is this an instance that I will have to "roll my own" functionality or is there something that I have missed? Thanks!
Solution
I ended up combining both answers from Weather Vane and Loic, respectively, for my final solution. What worked was to use the handy strcspn function to break on the first newline character as selected from Loic's provided links. Thus, I can select delimiters based on a number of supported locales. It is a good point that there are too many to support generically at this level; I didn't even know that there were several competing encodings for Cyrillic.
In this way, I can achieve "good enough" multinational support while still using standard library functions.
Since I can only accept one answer, I am selecting Weather Vane's as his was the final invocation I used. That being said, it was really the two answers together that worked for me.
The best one I know is
buffer [ strcspn(buffer, "\r\n") ] = 0;
which is a safe way of dealing with all the combinations of \r and \n - both, one or none.
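Wrapped in a function it might look like this (the name chomp is just borrowed from Perl, nothing standard). It relies on strcspn returning the length of the whole string when neither character is found, in which case the terminator is simply overwritten with itself:

```c
#include <string.h>

/* Cut the string at the first '\r' or '\n', if there is one.
 * Returns the buffer for convenience. */
char *chomp(char *buffer)
{
    buffer[strcspn(buffer, "\r\n")] = '\0';
    return buffer;
}
```

Note that this cuts at the first \r or \n, so anything after an embedded newline is dropped too. For a line read with fgets that is exactly what you want.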
I suggest replacing one or more whitespace characters with one standard space (US-ASCII 0x20). Considering only ISO-8859-1 characters (https://en.wikipedia.org/wiki/ISO/IEC_8859-1), whitespace consists of any byte in 0x00..0x20 (C0 control characters and space) and 0x7F..0xA0 (delete, C1 control characters and no-break space). Notice that US-ASCII is a subset of ISO-8859-1.
But take into account that Windows 1251 (https://en.wikipedia.org/wiki/Windows-1251) assigns different, visible (non-control) characters to the range 0x80..0x9F. In this case, those bytes cannot be replaced by spaces without loss of textual information.
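A sketch of that replacement for the ISO-8859-1 case only. The function names are made up, and as noted above the byte ranges are wrong for Windows 1251 (and for UTF-8, where 0x80..0xA0 occur inside multibyte characters), so treat it as an illustration of the idea rather than something to apply blindly:

```c
#include <stddef.h>

/* Bytes treated as whitespace under ISO-8859-1: C0 controls and space,
 * then DEL, C1 controls and no-break space. */
static int is_latin1_blank(unsigned char b)
{
    return b <= 0x20 || (b >= 0x7F && b <= 0xA0);
}

/* Collapse each run of such bytes into one 0x20, in place.
 * Returns the new length of the string. */
size_t collapse_latin1_whitespace(char *s)
{
    size_t out = 0;
    int in_run = 0;

    for (size_t i = 0; s[i] != '\0'; i++) {
        if (is_latin1_blank((unsigned char)s[i])) {
            if (!in_run) {
                s[out++] = ' ';
                in_run = 1;
            }
        } else {
            s[out++] = s[i];
            in_run = 0;
        }
    }
    s[out] = '\0';
    return out;
}
```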
Resources for an extensive definition of whitespace characters:
https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace
http://unicode.org/reports/tr23/
http://www.unicode.org/Public/8.0.0/charts/CodeCharts.pdf
Also take into account that different encodings may be used, most commonly:
ISO-8859-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-1)
UTF-8 (https://en.wikipedia.org/wiki/UTF-8)
Windows 1251 (https://en.wikipedia.org/wiki/Windows-1251)
But in non-western countries (for instance Russia, Japan), further character encodings are also common. Numerous encodings exist, but it probably does not make sense to try to support each and every known encoding.
Thus try to define and restrict your use-cases, because implementing it in full generality means a lot of work.
This answer is for C++ users with the same problem.
Matching a newline character for any locale and character type can be done like this:
#include <locale>
#include <string>

template<class Char>
bool is_newline(Char c, std::locale const & loc = std::locale())
{
    // Narrow the character to plain char using the locale's ctype facet
    // (falling back to ' ' if it has no char equivalent).
    // Then, test against '\n', which is the only newline character there.
    return std::use_facet<std::ctype<Char>>(loc).narrow(c, ' ') == '\n';
}
Now, removing all trailing newlines can be done like this:
void remove_trailing_newlines(std::string & str) {
    // Keep dropping the last character while it is a newline.
    while (!str.empty() && is_newline(*str.rbegin()))
        str.pop_back();
}
This should be absolutely portable, as it relies only on standard C++ functions.
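For readers staying in C, where there is no locale facet to lean on, a rough counterpart that strips only trailing line terminators could look like this. It assumes an ASCII-compatible encoding (UTF-8, ISO-8859-x and similar) in which the bytes for \r and \n never appear inside another character; that assumption is mine, not something the standard guarantees for every encoding:

```c
#include <string.h>

/* Remove every trailing '\r' and '\n' from s, in place.
 * Returns the new length of the string. */
size_t remove_trailing_newlines(char *s)
{
    size_t n = strlen(s);

    while (n > 0 && (s[n - 1] == '\n' || s[n - 1] == '\r'))
        s[--n] = '\0';
    return n;
}
```

Unlike the strcspn approach, this leaves embedded newlines alone and only trims the end.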

unicode string manipulation in c

I am using gcc on Linux Mint 15 and my terminal understands Unicode. I will be dealing with UTF-8. I am trying to obtain the base word of a more complex Unicode string. Sort of like trimming down the word 'alternative' to 'alternat' but in a different language. Hence I will be required to test the ending of each word.
In c and ASCII, I can do something like this
if(string[last_char]=='e')
last_char-=1; //Throws away the last character
Can I do something similar with unicode? That is, something like this :
if(string[last_char]=='ഒ')
    last_char-=1;
EDIT:
Sorry, as #chux said, I just noticed you are asking in C. Anyway the same principle holds.
In C you can use wscanf and wprintf to do I/O with wide char strings. If your characters are inside the BMP you'll be fine. Just replace char * with wchar_t * and do all kinds of things as usual.
For serious development I'd recommend converting all strings to char32_t for processing. Or use a library like ICU.
If all you need is to remove some given characters from the string, then maybe you don't need the complex Unicode character handling. Treat your Unicode characters as a raw char * string and do whatever string operations you need on it.
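A sketch of that raw char * approach for the "test the ending of a word" case, assuming both strings are valid UTF-8 (the function name is made up). In UTF-8, 'ഒ' (U+0D12) is the three bytes E0 B4 92, so it cannot be compared against a single string[last_char]; compare the whole byte sequence instead:

```c
#include <string.h>

/* If s ends with the bytes of suffix, cut them off and return 1;
 * otherwise leave s alone and return 0. Because a UTF-8 lead byte
 * never looks like a continuation byte, a match on a valid suffix
 * always falls on a character boundary. */
int strip_suffix(char *s, const char *suffix)
{
    size_t n = strlen(s);
    size_t m = strlen(suffix);

    if (m == 0 || m > n || memcmp(s + n - m, suffix, m) != 0)
        return 0;
    s[n - m] = '\0';
    return 1;
}
```

A call would look like strip_suffix(word, "\xe0\xb4\x92"). This does no normalization, so a suffix written with combining marks will not match its precomposed form; that is where a library like ICU comes in.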
The old C++ oriented answer is reproduced below, for reference.
The easy way
Use std::wstring
It's basically an std::string but individual characters are typed wchar_t.
And for IO you should use std::wcin and std::wcout. For example:
std::wstring str;
std::wcin >> str;
std::wcout << str << std::endl;
However, on some platforms wchar_t is 2 bytes wide, which means characters outside the BMP will not work. This should be okay for you I think, but should not be used in serious development. For more text on this topic, read this.
The hard way
Use a better unicode-aware string processing library like ICU.
The C++11 way
Use some mechanisms to convert your input string to std::u32string and you're done. The conversion routines can be hand-crafted or using an existing library like ICU.
As std::u32string is formed using char32_t, you can safely assume you're dealing with Unicode correctly.

Portable literal strings in C source files

Ok, I have this:
AllocConsole();
SetConsoleOutputCP(CP_UTF8);
HANDLE consoleHandle = GetStdHandle(STD_OUTPUT_HANDLE);
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 10, NULL, NULL);
WriteConsoleW(consoleHandle, L"wΕλληνικά\n", 10, NULL, NULL);
printf("aΕλληνικά\n");
wprintf(L"wΕλληνικά\n");
Now, the issue is that depending on the encoding the file was saved as, only some of these work. wprintf never works, but I already know why (broken Microsoft stdout implementation, which only accepts narrow characters). Yet I have issues with the three others. If I save the file as UTF-8 without signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version working I need to increase the character(?) count to 18:
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 18, NULL, NULL);
WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide-char (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16 encoded string, not UTF-8.
If I save it in UTF-8 with BOM (as it should be), then WriteConsoleW starts to work somehow (???) and everything else stops (I get ? instead of a character). I need to decrease character count in WriteConsoleA back to 10 to keep formatting the same (otherwise i get 8 additional rectangles). Basically, WTF?
Now, let's go to UTF-16 (Unicode - Codepage 1200). Only WriteConsoleW works. The character count in WriteConsoleA should be 10 to keep the formatting precise.
Saving in UTF-16 Big Endian mode (Unicode - Codepage 1201) does not change anything. Again, WTF? Shouldn't byte order inside the strings be inverted when stored to file?
The conclusion is that the way strings are compiled into binary form depends on the encoding used. Therefore, what is the portable and compiler-independent way to store strings? Is there a preprocessor which would convert one string representation into another before compilation, so I could store the file in UTF-8 and only preprocess the strings which I need to have in UTF-16 by wrapping them in some macro?
I think you've got at least a few assumptions here which are either wrong or not 100% correct as far as I know:
Now, the issue is that depending on the encoding the file was saved as, only some of these work.
Of course, because the encoding determines how to interpret the string literals.
wprintf never works, but I already know why (broken Microsoft stdout implementation, which only accepts narrow characters).
I've never heard of that one, but I'm rather sure this depends on the locale set for your program. I've got a few work projects where a locale is set and the output is just fine using German umlauts etc.
If I save file as UTF-8 without signature (BOM) and use MS Visual C++ compiler, only last printf works. If I want ANSI version working I need to increase character(?) count to 18:
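That's because the ANSI version wants an ANSI string, while you're passing a UTF-8 encoded string (based on the file's encoding). The count of 18 is the string's length in bytes: 'a' and the newline take one byte each, and each of the eight Greek letters takes two. The output still works, because the console handles the UTF-8 conversion for you - you're essentially printing raw UTF-8 here.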
WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide-char (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16 encoded string, not UTF-8.
I don't think so (although I'm not sure why it isn't working either). Have you tried setting some easy-to-find string and looking for it in the resulting binary? I'm rather sure it's indeed encoded using UTF-16. I assume that due to the missing BOM the compiler might interpret the whole thing as a narrow string and therefore convert the UTF-8 stuff wrongly.
If I save it in UTF-8 with BOM (as it should be), then WriteConsoleW starts to work somehow (???) and everything else stops (I get ? instead of a character). I need to decrease character count in WriteConsoleA back to 10 to keep formatting the same (otherwise i get 8 additional rectangles). Basically, WTF?
This is exactly what I described above. Now the wide string is encoded properly, because the compiler now knows the file is in UTF-8, not ANSI (or some codepage). The narrow string is properly converted to the locale being used as well.
Overall, there's no encoding-independent way to do it, unless you escape everything using the proper codepage and/or UTF codes in advance. I'd just stick to UTF-8 with BOM, because I think all current compilers will be able to properly read and interpret the file (besides Microsoft's Resource Compiler; although I haven't tried feeding the 2012 version with UTF-8).
Edit:
To use an analogy:
You're essentially saving a raw image to a file and you expect it to work properly, no matter whether other programs try to read it as a grayscale, palettized, or full color image. This won't work (despite differences being smaller).
The answer is here.
Quoting:
It is impossible for the compiler to intermix UTF-8 and UTF-16 strings into the compiled output! So you have to decide for one source code file:
either use UTF-8 with BOM and generate UTF-16 strings only (i.e. always use L prefix),
or UTF-8 without BOM and generate UTF-8 strings only (i.e. never use L prefix),
7-bit ASCII characters are not involved and can be used with or without L prefix
The only portable and compiler-independent way is to use the ASCII charset and escape sequences, because there are no guarantees that any compiler would accept a UTF-8 encoded file, and a compiler's treatment of those multibyte sequences might vary.
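As a sketch of what "ASCII charset and escape sequences" can look like for the strings from the question: the narrow string carries its UTF-8 bytes as hex escapes, and the wide string uses universal character names (\uXXXX, available since C99). The variable and function names are mine, and I am assuming a compiler that accepts universal character names in wide literals:

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* 'a', the eight Greek letters as explicit UTF-8 bytes, newline.
 * The source file itself stays pure ASCII. */
const char greek_utf8[] =
    "a\xce\x95\xce\xbb\xce\xbb\xce\xb7\xce\xbd\xce\xb9\xce\xba\xce\xac\n";

/* 'w', the same eight letters as universal character names, newline.
 * The compiler picks the wchar_t representation. */
const wchar_t greek_wide[] =
    L"w\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac\n";

/* Byte count to pass to a narrow (byte-oriented) output call. */
size_t greek_utf8_bytes(void)
{
    return strlen(greek_utf8);
}

/* Element count to pass to a wide output call. */
size_t greek_wide_units(void)
{
    return wcslen(greek_wide);
}
```

The two counts come out as 18 and 10, which matches the lengths the question ended up needing for WriteConsoleA and WriteConsoleW.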

C CSV API for unicode

I need a C API for manipulating CSV data that can work with Unicode. I am aware of libcsv (sourceforge.net/projects/libcsv), but I don't think that will work for Unicode (please correct me if I'm wrong) because I don't see wchar_t being used.
Please advise.
It looks like libcsv does not use the C string functions to do its work, so it almost works out of the box, in spite of its mbcs/ws ignorance. It treats the string as an array of bytes with an explicit length. This might mostly work for certain wide character encodings that pad out ASCII bytes to fill the width (so newline might be encoded as "\0\n" and space as "\0 "). You could also encode your wide data as UTF-8, which should make things a bit easier. But both approaches might founder on the way libcsv identifies space and line terminator tokens: it expects you to tell it on a byte-to-byte basis whether it's looking at a space or terminator, which doesn't allow for multibyte space/term encodings. You could fix this by modifying the library to pass a pointer into the string and the length left in the string to its space/term test functions, which would be pretty straightforward.
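To make the last suggestion concrete, here is a sketch of what a pointer-plus-length terminator test could look like for UTF-8 data. This is not libcsv's actual API; the function name and signature are hypothetical, and the choice of which Unicode line terminators to recognise is my own:

```c
#include <stddef.h>

/* Hypothetical replacement for a byte-at-a-time terminator test.
 * Looks at up to `remaining` bytes starting at p and returns the
 * length in bytes of the line terminator found there, or 0 if the
 * data at p does not start with one. */
size_t utf8_line_terminator_length(const unsigned char *p, size_t remaining)
{
    if (remaining >= 1 && (p[0] == '\n' || p[0] == '\r'))
        return 1;                               /* LF or CR */
    if (remaining >= 2 && p[0] == 0xC2 && p[1] == 0x85)
        return 2;                               /* U+0085 NEXT LINE */
    if (remaining >= 3 && p[0] == 0xE2 && p[1] == 0x80 &&
        (p[2] == 0xA8 || p[2] == 0xA9))
        return 3;                               /* U+2028, U+2029 */
    return 0;
}
```

Returning a length instead of a yes/no lets the parser skip the whole terminator, which a single-byte callback cannot express.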
