What must I know to handle UTF-8 in my C program? - c

I have a C program that now needs to support UTF-8 characters. What must I know in order to do that? I've always heard how problematic it is to handle UTF-8 in a C/C++ environment. Why exactly is it problematic? How does it differ from a usual C character, including in size? Can I do it without any operating system help, in pure C, and still make it portable? What else should I have asked but didn't? What I'm looking to implement is this: the characters are names with accents (like the French word "résumé") that I need to read, put into a symbol table, and then search for and print from a file. It's part of my configuration file parsing (very much .ini-like).

There's an awesome article written by Joel Spolsky, one of the Stack Overflow creators.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Apart from that, you might want to query some other Q&A's regarding this subject, like Handling special characters in C (UTF-8 encoding).
As cited in the aforementioned Q&A, Tips on Using Unicode with C/C++ might give you the basics.

Two good links that I have used in the past:
The-Basics-of-UTF8
reading-unicode-utf-8-by-hand-in-c
valter
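For the use case described in the question (accented names such as "résumé" in an .ini-like file), it is worth knowing that UTF-8 can usually be treated as opaque bytes in C: as long as the delimiters you split on are ASCII (=, [, ], newline), the multibyte sequences pass through fgets, strcmp and a hash table untouched. A minimal sketch under that assumption; the file name and the parsing details are hypothetical:

```c
/* Minimal sketch: parse "key=value" lines where values may contain UTF-8.
   Assumes the file is UTF-8 encoded and the delimiters ('=', '\n') are ASCII,
   so the multibyte sequences can be stored and compared as plain bytes. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("config.ini", "r");          /* hypothetical file name */
    if (!f) { perror("config.ini"); return 1; }

    char line[512];
    while (fgets(line, sizeof line, f)) {
        line[strcspn(line, "\r\n")] = '\0';      /* strip the line ending  */
        char *eq = strchr(line, '=');            /* split on ASCII '='     */
        if (!eq)
            continue;
        *eq = '\0';
        const char *key = line, *value = eq + 1; /* value may be "résumé"  */
        printf("key=[%s] value=[%s]\n", key, value);
        /* A real parser would insert (key, value) into its symbol table here;
           byte-wise strcmp() is enough for exact lookups of UTF-8 keys. */
    }
    fclose(f);
    return 0;
}
```

Byte-wise handling only breaks down when you need case-insensitive comparison, collation or column counting; those require genuinely Unicode-aware processing, as the questions below illustrate.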

Related

Display-width of multibyte character in C standard library – how accurate is the database?

The wcwidth call in the standard C library returns 2 for Asian characters. Then there are Unicode symbols, like arrows, for which it returns 1. It is often the case that such a character is wider than a single column, yet the library isn't wrong, because terminals print them in a single column and allow visual overlapping, sometimes with acceptable results, as for the ndash "–".
Are there characters that plainly suffer? I wonder how Asian users and users from other regions work with terminals, and what solutions they have developed. For example, displaying a shell prompt that spans the whole line and contains the current directory name can be a serious problem. Can wcwidth be patched to obtain better results? Using github/wcwidth.c as a starting point, for example.
There are differences with the ambiguous-width characters. xterm has both Markus Kuhn's original (the link you show appears to be his, with the comment header removed) and an alternate version with adjustments to accommodate CJK (East Asian) usage. Besides that, it checks at startup for usable system locale tables. Some are good enough; others are not. No one has done a systematic (unbiased) survey of what's actually good (you may see some opinions on that aspect, offered as answers).
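To see what a given system reports, a minimal sketch using wcswidth (note that wcwidth/wcswidth are POSIX/XSI functions, not ISO C, hence the _XOPEN_SOURCE request; the sample strings are arbitrary):

```c
/* Minimal sketch: print the terminal column width the C library assigns
   to a few wide strings. wcwidth()/wcswidth() are POSIX (XSI), not ISO C. */
#define _XOPEN_SOURCE 700
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");   /* pick up the user's (typically UTF-8) locale */

    const wchar_t *samples[] = { L"abc", L"日本語", L"–" };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        int cols = wcswidth(samples[i], wcslen(samples[i]));
        /* -1 means the string contains a non-printable character */
        printf("%ls -> %d column(s)\n", samples[i], cols);
    }
    return 0;
}
```

Ambiguous-width characters such as the ndash are exactly where implementations disagree, so the reported widths can differ between libc versions and locale tables.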

Standard (or convenient) method to read and write tabular data to a text file in c

This might sound rather awkward, but I want to ask if there is a commonly practiced way of storing tabular data in a text file to be read and written in C.
In Python, for example, you can load a full text file into an array with f.readlines(), then go through all the lines and split each line on a specific character or sequence of characters (a delimiter).
How do you approach this problem in C?
Pretty much the same way you would in any other language. Pick a field separator (e.g., the tab character), open the text file for reading, and parse each line.
Of course, in C it will never be as easy as it is in Python, but the approach is similar.
Whoa. I am a bit baffled by the other answers, which make me feel like I'm on Mainframes.stackexchange.com instead of stackoverflow.com.
Why don't you pick a modern data format like JSON or XML and follow best practices for the data format of your choice?
If you want a good JSON reader/writer for C, I've used Jansson, and it's very easy and fast.
If you want a good XML reader/writer for C, I've used miniXML and it's also easy and fast. It also has SAX and DOM support, depending on how you want to read in the XML.
Obviously there is a wealth of other libraries available as well.
Please don't give the next guy to come along and support your program some wacky custom file format to deal with.
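A minimal sketch of the Jansson route, assuming a hypothetical config.json whose top level is an object with a string field called "name" (link with -ljansson):

```c
/* Minimal sketch: read one string field from a JSON file with Jansson.
   "config.json" and the "name" key are hypothetical; adjust to your schema. */
#include <stdio.h>
#include <jansson.h>

int main(void)
{
    json_error_t error;
    json_t *root = json_load_file("config.json", 0, &error);
    if (!root) {
        fprintf(stderr, "parse error on line %d: %s\n", error.line, error.text);
        return 1;
    }

    json_t *name = json_object_get(root, "name");
    if (json_is_string(name))
        printf("name = %s\n", json_string_value(name));

    json_decref(root);   /* release the whole document */
    return 0;
}
```

Tabular data maps naturally onto a JSON array of arrays or an array of objects, and the library takes care of quoting, escaping and UTF-8 for you.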
I find getline() and strtok() to be quite convenient (getline was a GNU extension, standardized in POSIX.1-2008); a sketch follows below.
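A minimal sketch of that combination, assuming tab-separated fields in a hypothetical data.tsv:

```c
/* Minimal sketch: read a tab-separated file line by line with getline()
   and split each line into fields with strtok().
   Note: strtok() collapses runs of delimiters, so empty fields are skipped. */
#define _POSIX_C_SOURCE 200809L   /* for getline() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>            /* ssize_t */

int main(void)
{
    FILE *f = fopen("data.tsv", "r");   /* hypothetical file name */
    if (!f) { perror("data.tsv"); return 1; }

    char *line = NULL;
    size_t cap = 0;
    ssize_t len;
    while ((len = getline(&line, &cap, f)) != -1) {
        line[strcspn(line, "\r\n")] = '\0';          /* strip the newline */
        for (char *field = strtok(line, "\t"); field; field = strtok(NULL, "\t"))
            printf("[%s] ", field);
        putchar('\n');
    }
    free(line);
    fclose(f);
    return 0;
}
```

Because strtok() merges adjacent delimiters, two consecutive tabs look like one; the strchr() approach in the next answer preserves empty fields.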
There's a handful of mechanisms, but there's a reason why scripting languages have become so popular over the last twenty years -- some of the tasks that seem simple in scripting languages are ponderous in C.
You could use flex and bison to write a parser for your tables. This really only works if the format is very well defined and "static". They're amazing tools that can do more than you might suspect, but it is very heavy machinery for what could be done simply with a split() in a scripting language.
You could read individual fields using getdelim(3). However, this was only standardized with POSIX.1-2008, so it is far from ubiquitous. (Every Linux machine with glibc should have it.)
You could read lines with fgets(3) and discover the split locations using strchr(3); see the sketch after this list.
You could read lines with fgets(3) and use strtok(3) to tokenize strings.
You can use scanf(3) to perform input and scanning in one go; it seems from the questions here that scanf(3) is difficult to use correctly.
You could use a character-at-a-time parsing approach: read characters using getc(3), inspect each one, do something with it, and iterate until there are no more characters.
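A minimal sketch of the fgets(3)/strchr(3) option from the list above, again assuming tab-separated fields in a hypothetical data.tsv; unlike strtok(3), it preserves empty fields:

```c
/* Minimal sketch: split tab-separated lines with fgets() and strchr().
   Empty fields are preserved because every '\t' is treated as a boundary. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("data.tsv", "r");   /* hypothetical file name */
    if (!f) { perror("data.tsv"); return 1; }

    char line[1024];
    while (fgets(line, sizeof line, f)) {
        line[strcspn(line, "\r\n")] = '\0';
        char *start = line;
        for (;;) {
            char *tab = strchr(start, '\t');
            if (tab)
                *tab = '\0';
            printf("[%s] ", start);      /* one field, possibly empty */
            if (!tab)
                break;
            start = tab + 1;
        }
        putchar('\n');
    }
    fclose(f);
    return 0;
}
```

None of these sketches handle quoted fields that contain the delimiter; if you need that, a small state machine over getc(3), or an existing CSV library, is the next step.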

Parsing a string including alphabetic characters and regional characters (French, Russian, Chinese) in C/C++

As the title says, I don't know how to parse a string containing both alphabetic characters and special characters from other languages in C. Can anyone please help me distinguish them in C? Do I need to install some optional components to help C accept the characters? (I'm in a Linux environment.) Thanks very much for your reply.
At a minimum you need to decide what character encoding(s) you are going to use or support. After that you will need to decide if you will keep the international strings in their native forms, or convert them using something like libiconv into a single encoding in your application.
So first, as Laurent pointed out in a comment, you need to understand what you are trying to do (which is not going to be very easy--fair warning). And take a look at what Joel Spolsky (co-founder of Stack Overflow) wrote many years ago: http://www.joelonsoftware.com/articles/Unicode.html
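If "distinguish them" means classifying characters (letter or not) regardless of language, a minimal sketch using the locale machinery that already ships with glibc, so no extra components are needed beyond an installed UTF-8 locale; the sample string is an assumption:

```c
/* Minimal sketch: convert a UTF-8 string to wide characters and classify
   each one with iswalpha(). Requires a UTF-8 locale, which is the default
   on most Linux distributions. */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

int main(void)
{
    setlocale(LC_ALL, "");               /* use the environment's locale */

    const char *text = "abc déjà 中文 123";        /* hypothetical input */
    wchar_t wbuf[128];
    size_t n = mbstowcs(wbuf, text, sizeof wbuf / sizeof wbuf[0]);
    if (n == (size_t)-1) {
        fprintf(stderr, "invalid multibyte sequence\n");
        return 1;
    }

    for (size_t i = 0; i < n; i++)
        printf("U+%04lX %s\n", (unsigned long)wbuf[i],
               iswalpha((wint_t)wbuf[i]) ? "letter" : "not a letter");
    return 0;
}
```

iswalpha() consults the locale's character classification, so it recognizes French, Russian and Chinese letters alike once the program runs under a UTF-8 locale.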

Is there a simple example of iconv transliteration from one language to another for C?

Say we have a simple scenario: a string in some language, say French.
And we want that French to be converted to ASCII in a transliterated form.
How can it be done in C in the simplest way?
Also, is there a completely different way, unrelated to iconv and ideally multiplatform?
If you want multiplatform, iconv is not the right tool. Transliteration is a GNU-specific extension. In general, transliteration is a hard problem, and the GNU iconv implementation is only sufficient for trivial cases. How a non-ASCII character gets transliterated is not a property of the character but of the language of the text and how it's being used. For instance, should "日" become "ri" or "ni" or something else entirely? Or if you want to stick with Latin-based languages, should "ö" become "o" or "oe"? Expanding to other non-Latin scripts, transliterating most Indic languages is fairly straightforward, but transliterating Thai requires some reordering of characters and transliterating Tibetan requires parsing whole syllables and identifying which letters are in root/prefix/suffix/etc. roles.
In my opinion, the best answer to "How do I transliterate to ASCII?" for most software programs is: don't. Instead fix whatever bugs or intentionally-English-centric policies made you want ASCII in the first place. The only software that should really be doing transliteration is highly-linguistically-aware software facilitating search or interpretation of texts not in the user's own native language.
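For the trivial cases where depending on the GNU extension is acceptable anyway, a minimal sketch of the //TRANSLIT route (glibc or GNU libiconv only; the input string and its expected output are assumptions):

```c
/* Minimal sketch: UTF-8 -> ASCII using GNU iconv's "//TRANSLIT" extension.
   Works with glibc / GNU libiconv; other iconv implementations may reject
   the target encoding or transliterate differently. */
#include <iconv.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    setlocale(LC_ALL, "");   /* glibc consults the locale when transliterating */

    iconv_t cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    char in[] = "déjà vu – résumé";              /* hypothetical input */
    char out[256];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out - 1;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        perror("iconv");
        iconv_close(cd);
        return 1;
    }
    *outp = '\0';
    printf("%s\n", out);                         /* e.g. "deja vu - resume" */

    iconv_close(cd);
    return 0;
}
```

As the answer explains, this only covers mechanical fallbacks; anything language-sensitive (CJK readings, "ö" vs "oe", script reordering) is outside what iconv can do.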

Is there a fast implementation for converting a multibyte character string to a Unicode wstring?

In my project, where I adopted the Aho-Corasick algorithm to implement a message-filtering mode on the server side, the messages the server receives are multibyte character strings. But after several tests I found that the bottleneck is the conversion between multibyte strings and Unicode wstrings. What I use now is the pair mbstowcs_s and wcstombs_s, which accounts for nearly 95% of the time spent in the whole mode. I have also tried MultiByteToWideChar/WideCharToMultiByte, and got just the same result.
So I wonder if there is some more efficient way to do the job? My project is built in VS2005, and the strings to be converted will contain Chinese characters.
Many thanks.
There are a number of possibilities.
Firstly, what do you mean by "multi-byte character"? Do you mean UTF8 or an ISO DBCS system?
If you look at the definitions of UTF-8 and UTF-16, there is scope to do a highly optimised conversion, ripping out the "x" bits and reformatting them. See for example http://www.faqs.org/rfcs/rfc2044.html, which talks about UTF-8 <==> UTF-32. Adjusting for UTF-16 would be simple.
The second option might be to work entirely in UTF16. Render your Web page (or UI Dialog or whatever) in UTF16 and get the user input that way.
If all else fails, there are other string-matching algorithms than Aho-Corasick. Possibly look for an algorithm that works with your original encoding.
[Added 29-Jan-2010]
See http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt for more on conversions, including two C implementations of mbtowc() and wctomb(). These are designed to work with arbitrarily large wchar_ts. If you just have 16-bit wchar_ts then you can simplify it a lot.
These would be much faster than the generic (code-page-sensitive) versions in the standard library.
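As an illustration of the "rip out the x bits" idea, and assuming the server's multibyte encoding is in fact UTF-8 (see the first question above), a minimal sketch of a UTF-8 to UTF-16 converter; it expects well-formed input, and a production version would also validate continuation bytes, overlong forms and surrogate-range code points:

```c
/* Minimal sketch: decode well-formed UTF-8 into UTF-16 code units by bit
   manipulation, avoiding any locale/code-page machinery.
   Error handling is deliberately minimal; real input must be validated. */
#include <stddef.h>
#include <stdint.h>

/* Returns the number of UTF-16 code units written to 'out' (capacity 'cap'),
   or (size_t)-1 if the input is truncated or the output buffer is too small. */
size_t utf8_to_utf16(const unsigned char *s, size_t len,
                     uint16_t *out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < len; ) {
        uint32_t cp;
        unsigned char b = s[i];

        if (b < 0x80) {                                /* 1 byte: 0xxxxxxx  */
            cp = b;                                                   i += 1;
        } else if ((b & 0xE0) == 0xC0 && i + 1 < len) {/* 2 bytes: 110xxxxx */
            cp = ((uint32_t)(b & 0x1F) << 6) | (s[i+1] & 0x3F);       i += 2;
        } else if ((b & 0xF0) == 0xE0 && i + 2 < len) {/* 3 bytes: 1110xxxx */
            cp = ((uint32_t)(b & 0x0F) << 12)
               | ((uint32_t)(s[i+1] & 0x3F) << 6) | (s[i+2] & 0x3F);  i += 3;
        } else if ((b & 0xF8) == 0xF0 && i + 3 < len) {/* 4 bytes: 11110xxx */
            cp = ((uint32_t)(b & 0x07) << 18) | ((uint32_t)(s[i+1] & 0x3F) << 12)
               | ((uint32_t)(s[i+2] & 0x3F) << 6) | (s[i+3] & 0x3F);  i += 4;
        } else {
            return (size_t)-1;            /* truncated or invalid lead byte */
        }

        if (cp <= 0xFFFF) {               /* BMP: one UTF-16 code unit      */
            if (n + 1 > cap) return (size_t)-1;
            out[n++] = (uint16_t)cp;
        } else {                          /* above BMP: surrogate pair      */
            if (n + 2 > cap) return (size_t)-1;
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return n;
}
```

On Windows, wchar_t is 16 bits, so the uint16_t output can be copied straight into a wstring buffer; a tight loop like this avoids the locale/code-page machinery that, as noted above, is where the generic conversions spend their time.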
Deprecated (I believe), but you could always use the non-safe versions (mbstowcs and wcstombs). Not sure if this will yield a marked improvement, though. Alternatively, if your character set is limited (a-z, 0-9, for instance), you could always do it manually with a lookup table?
Perhaps you can reduce the amount of calls to MultiByteToWideChar?
You could also probably adapt Aho-Corasick to work directly on multibyte strings.
