I did some research on this subject (Unicode input in C89), but I didn't find everything I wanted to know.
Can someone explain how to handle the whole keyboard (UTF-8) with basic operations (looking only at the binary values)? I couldn't find out how to tell the difference between character keys and function keys.
Thanks a lot.
Those aren't Unicode, but most likely ANSI escape sequences: VT100/VT52 and similar conventions from 1980s devices (the Atari ST used VT52, for instance).
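As a rough sketch, assuming a UTF-8 locale and a VT100-style terminal (reading the terminal in raw mode is a separate topic), the first byte you receive already distinguishes the cases:

```c
/* Classify a byte read from the terminal, assuming UTF-8 input and
   VT100-style escape sequences. This is only an illustration of the
   byte-level distinction, not a full input handler. */
#include <stdio.h>

static const char *classify(unsigned char b)
{
    if (b == 0x1B)  return "ESC: start of an escape sequence (function/arrow key)";
    if (b < 0x80)   return "single ASCII character";
    if (b >= 0xC0)  return "first byte of a multi-byte UTF-8 character";
    return "UTF-8 continuation byte (middle of a character)";
}

int main(void)
{
    unsigned char samples[] = { 0x1B, 'a', 0xC3, 0xA9 };   /* ESC, 'a', 'é' */
    size_t i;

    for (i = 0; i < sizeof samples; ++i)
        printf("0x%02X: %s\n", (unsigned)samples[i], classify(samples[i]));
    return 0;
}
```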
I have a C program that now needs support for UTF-8 characters. What must I know in order to do that? I've always heard how problematic it is to handle UTF-8 in a C/C++ environment. Why exactly is it problematic? How does a UTF-8 character differ from a usual C character, including in size? Can I do it without any operating system help, in pure C, and still keep it portable? What else should I have asked but didn't? What I'm looking to implement is this: the characters are names with accents (like the French word "résumé") that I need to read, put into a symbol table, and then search for and print from a file. It's part of my configuration-file parsing (very much .ini-like).
There's an awesome article written by Joel Spolsky, one of the Stack Overflow creators.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Apart from that, you might want to look at some other Q&As on this subject, like Handling special characters in C (UTF-8 encoding).
As cited in the aforementioned Q&A, Tips on Using Unicode with C/C++ might give you the basics.
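To illustrate the practical upshot for the asker's .ini-style use case (a minimal sketch only; the key name comes from the question, the value "John Smith" is invented): because UTF-8 never reuses ASCII byte values inside a multi-byte sequence, accented names can be read, stored in a symbol table and compared as ordinary NUL-terminated byte strings, and splitting on '=' or '\n' still works.

```c
/* Treat UTF-8 text as opaque byte strings: read a "key=value" line,
   split it at '=', and compare the key with the standard byte-wise
   string functions. The bytes C3 A9 are "é" in UTF-8. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *key = "r\xC3\xA9sum\xC3\xA9";              /* "résumé" */
    char line[] = "r\xC3\xA9sum\xC3\xA9=John Smith";       /* pretend: read from file */
    char *eq;

    eq = strchr(line, '=');
    if (eq != NULL) {
        *eq = '\0';                       /* split "key=value" at the '=' */
        if (strcmp(line, key) == 0)
            printf("value for %s: %s\n", key, eq + 1);
    }
    return 0;
}
```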
Two good links that I have used in the past:
The-Basics-of-UTF8
reading-unicode-utf-8-by-hand-in-c
valter
Apologies for the vagueness; I barely know how to pose this question.
Can anyone tell me the name of that family of 3-character constructs that represent another character or characters?
I think they were used in the old VT100 terminal days.
I know C supports them.
They are called trigraphs. There are also two-character codes called digraphs.
They are called trigraph sequences. E.g. ??/ maps to \. You have to take care to remember this when building regular expression-type parsers for C code.
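For illustration, a small example of trigraphs in action; note that modern compilers only substitute them in conforming modes (e.g. gcc -std=c89 or with -trigraphs), and they were removed entirely in C23:

```c
/* Trigraph demo: each ??X sequence is replaced very early in translation,
   even inside string literals. Build with trigraph support enabled,
   e.g. gcc -std=c89 trigraph.c */
??=include <stdio.h>                 /* ??= becomes #            */

int main(void)
??<                                  /* ??< becomes {            */
    printf("brackets: ??(??)\n");    /* prints "brackets: []"    */
    return 0;
??>                                  /* ??> becomes }            */
```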
As the title says, I don't know how to parse a string containing both alphabetic characters and special characters from other languages in C. Can anyone please help me distinguish them in C? Do I need to install some optional components so that C accepts the characters? (I'm in a Linux environment.) Thanks very much for your reply.
At a minimum you need to decide what character encoding(s) you are going to use or support. After that you will need to decide if you will keep the international strings in their native forms, or convert them using something like libiconv into a single encoding in your application.
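As a minimal sketch of the libiconv route on Linux (the source encoding ISO-8859-1 is just an example; substitute whatever your input actually uses, and note the error handling is abbreviated):

```c
/* Convert an ISO-8859-1 string to UTF-8 with iconv(3). */
#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(void)
{
    char in[] = "r\xE9sum\xE9";                 /* "résumé" in ISO-8859-1 */
    char out[64];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out - 1;
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");

    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        perror("iconv");
    *outp = '\0';
    printf("%s\n", out);                        /* prints the UTF-8 bytes */

    iconv_close(cd);
    return 0;
}
```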
So first, as Laurent pointed out in a comment, you need to understand what you are trying to do (which is not going to be very easy--fair warning). And take a look at what Joel Spolsky (co-founder of Stack Overflow) wrote many years ago: http://www.joelonsoftware.com/articles/Unicode.html
We have a C curses-based application (runs on most flavours of Unix, but we favour RHEL). We've got Unicode support in there, but now we have to provide a Hebrew version of the application. Does anyone know a process we could go through to convert the program? It mainly gets and stores data from Oracle, which can support Hebrew, so there should not be a problem there.
It really is just the display of the text that is the issue.
It is important to know what terminal they are using, because that defines how you should write the code. Some terminals support BiDi (i.e. bidirectional text), which means they automatically reverse Hebrew/Arabic text.
That has its own problems; you can check what your app would look like using mlterm.
Basically it reverses the lines that contain Hebrew text while keeping what it interprets as English characters LTR. A Hebrew character printed at 10,70 will appear at 10,10. You can use the Unicode LTR/RTL directional controls to try to force direction for things that break your formatting, but on mlterm, at least, while they work, they also print garbage characters.
If they use regular terminals with Unicode support, however, you have to reverse the character order yourself.
Then, of course, if the app is run on a bidirectional terminal, the text comes out backwards again and the formatting is lost.
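If you want to experiment with the directional controls mentioned above, the sketch below just prints the raw UTF-8 bytes of RIGHT-TO-LEFT OVERRIDE (U+202E) and POP DIRECTIONAL FORMATTING (U+202C) around a Hebrew word (here "shalom", chosen only as an example); whether the terminal honours them, ignores them, or shows garbage depends entirely on the terminal:

```c
/* Emit Unicode directional controls around a Hebrew word, assuming the
   terminal is in a UTF-8 locale. U+202E forces right-to-left rendering,
   U+202C pops back to the surrounding direction. */
#include <stdio.h>

int main(void)
{
    const char *rlo    = "\xE2\x80\xAE";                      /* U+202E RLO */
    const char *pdf    = "\xE2\x80\xAC";                      /* U+202C PDF */
    const char *shalom = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D";  /* "shalom"   */

    printf("before %s%s%s after\n", rlo, shalom, pdf);
    return 0;
}
```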
In my project, where I adopted the Aho-Corasick algorithm for a message-filtering mode on the server side, the messages the server receives are multibyte strings. After several tests I found the bottleneck is the conversion between the multibyte strings and Unicode wstrings. What I use now is the mbstowcs_s/wcstombs_s pair, which takes nearly 95% of the total time of the whole mode. I have also tried MultiByteToWideChar/WideCharToMultiByte, with just the same result.
So I wonder if there is a more efficient way to do the job. My project is built in VS2005, and the strings to convert contain Chinese characters.
Many thanks.
There are a number of possibilities.
Firstly, what do you mean by "multi-byte character"? Do you mean UTF8 or an ISO DBCS system?
If you look at the definitions of UTF-8 and UTF-16, there is scope for a highly optimised conversion, ripping out the "x" bits and repacking them. See for example http://www.faqs.org/rfcs/rfc2044.html, which describes the UTF-8 <=> UTF-32 mapping; adjusting for UTF-16 would be simple.
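A rough sketch of that kind of bit-ripping converter is below. The function name is made up, there is no validation of malformed input, and it assumes a 16-bit wchar_t (as on Windows/VS2005), emitting surrogate pairs for code points above U+FFFF:

```c
/* Decode each UTF-8 sequence by masking out the payload bits, then emit
   UTF-16 code units. Returns the number of code units written to dst. */
#include <stddef.h>
#include <wchar.h>

size_t utf8_to_utf16(const unsigned char *src, size_t n, wchar_t *dst)
{
    size_t i = 0, out = 0;

    while (i < n) {
        unsigned long cp;
        unsigned char b = src[i];

        if (b < 0x80) {                               /* 0xxxxxxx           */
            cp = b;                                   i += 1;
        } else if (b < 0xE0) {                        /* 110xxxxx 10xxxxxx  */
            cp = ((unsigned long)(b & 0x1F) << 6)
               |  (src[i + 1] & 0x3F);                i += 2;
        } else if (b < 0xF0) {                        /* 1110xxxx + 2 cont. */
            cp = ((unsigned long)(b & 0x0F) << 12)
               | ((unsigned long)(src[i + 1] & 0x3F) << 6)
               |  (src[i + 2] & 0x3F);                i += 3;
        } else {                                      /* 11110xxx + 3 cont. */
            cp = ((unsigned long)(b & 0x07) << 18)
               | ((unsigned long)(src[i + 1] & 0x3F) << 12)
               | ((unsigned long)(src[i + 2] & 0x3F) << 6)
               |  (src[i + 3] & 0x3F);                i += 4;
        }

        if (cp < 0x10000UL) {
            dst[out++] = (wchar_t)cp;
        } else {                                      /* surrogate pair     */
            cp -= 0x10000UL;
            dst[out++] = (wchar_t)(0xD800 | (cp >> 10));
            dst[out++] = (wchar_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```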
The second option might be to work entirely in UTF16. Render your Web page (or UI Dialog or whatever) in UTF16 and get the user input that way.
If all else fails, there are other string-matching algorithms than Aho-Corasick. Possibly look for an algorithm that works with your original encoding.
[Added 29-Jan-2010]
See http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt for more on conversions, including two C implementations of mbtowc() and wctomb(). These are designed to work with arbitrarily large wchar_ts. If you just have 16-bit wchar_ts then you can simplify it a lot.
These would be much faster than the generic (code-page-sensitive) versions in the standard library.
Deprecated (I believe), but you could always use the non-safe versions (mbstowcs and wcstombs). Not sure that will give a marked improvement, though. Alternatively, if your character set is limited (a-z, 0-9, for instance), you could always do it manually with a lookup table.
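For what it's worth, a sketch of the lookup-table idea for a genuinely single-byte character set (the table below is only filled with the identity mapping for printable ASCII; a real table would be filled from the code page definition):

```c
/* One 256-entry table maps each input byte straight to a wide character,
   so the conversion is a single indexed load per byte. */
#include <stddef.h>
#include <wchar.h>

static wchar_t byte_to_wide[256];

void init_table(void)
{
    int c;
    for (c = 0x20; c < 0x7F; ++c)      /* printable ASCII maps to itself */
        byte_to_wide[c] = (wchar_t)c;
}

void convert(const unsigned char *src, size_t n, wchar_t *dst)
{
    size_t i;
    for (i = 0; i < n; ++i)
        dst[i] = byte_to_wide[src[i]];
}
```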
Perhaps you can reduce the number of calls to MultiByteToWideChar?
You could also probably adapt Aho-Corasick to work directly on multibyte strings.
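As a sketch of what "working directly on multibyte strings" buys you (plain strstr stands in for the automaton here): for UTF-8, byte-level matching is safe because one character's bytes never occur inside another character's encoding; for DBCS code pages such as GBK, a match can straddle a character boundary, so boundaries would need checking.

```c
/* Search for a UTF-8 pattern in UTF-8 text by raw byte comparison,
   with no wide-character conversion at all. The bytes below are the
   UTF-8 encodings of two Chinese characters. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *text    = "\xE4\xBD\xA0\xE5\xA5\xBD, world";  /* "ni hao, world" */
    const char *pattern = "\xE5\xA5\xBD";                     /* "hao"           */

    if (strstr(text, pattern) != NULL)
        printf("pattern found without any wide-char conversion\n");
    return 0;
}
```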