I'm writing a little wrapper for an application that uses files as arguments.
The wrapper needs to be in Unicode, so I'm using wchar_t for the characters and strings I have. Now I've run into a problem: I need the program's arguments as an array of wchar_t's and as a wchar_t string.
Is it possible? I'm defining the main function as
int main(int argc, char *argv[])
Should I use wchar_t's for argv?
Thank you very much; I can't seem to find useful info on how to use Unicode properly in C.
In general, no. It will depend on the O/S, but the C standard says that the arguments to 'main()' must be 'main(int argc, char **argv)' or equivalent, so unless char and wchar_t are the same basic type, you can't do it.
Having said that, you could get UTF-8 argument strings into the program, convert them to UTF-16 or UTF-32, and then get on with life.
On a Mac (10.5.8, Leopard), I got:
Osiris JL: echo "ï€" | odx
0x0000: C3 AF E2 82 AC 0A ......
0x0006:
Osiris JL:
That's all UTF-8 encoded. (odx is a hex dump program).
See also: Why is it that UTF-8 encoding is used when interacting with a UNIX/Linux environment
Portable code doesn't support it. Windows (for example) supports using wmain instead of main, in which case argv is passed as wide characters.
On Windows, you can use GetCommandLineW() and CommandLineToArgvW() to produce an argv-style wchar_t[] array, even if the app is not compiled for Unicode.
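For illustration, a minimal sketch of that approach (Windows-only; link against Shell32, error handling reduced to the essentials):
#include <windows.h>
#include <shellapi.h>  /* CommandLineToArgvW */
#include <stdio.h>

int main(void)
{
    int wargc, i;
    /* Re-parse the raw wide command line into a wide argv. */
    wchar_t **wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
    if (wargv == NULL)
        return 1;
    for (i = 0; i < wargc; i++)
        wprintf(L"arg %d: %ls\n", i, wargv[i]);
    LocalFree(wargv);  /* the whole array comes back in a single LocalAlloc block */
    return 0;
}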
On Windows anyway, you can have a wmain() for UNICODE builds. Not portable though. I dunno if GCC or Unix/Linux platforms provide anything similar.
Assuming that your Linux environment uses UTF-8 encoding, the following code will prepare your program for easy Unicode treatment in C++:
#include <clocale>

int main(int argc, char *argv[]) {
    std::setlocale(LC_CTYPE, "");
    // ...
}
Next, the wchar_t type is 32-bit on Linux, which means it can hold individual Unicode code points and you can safely use the wstring type for classical string processing in C++ (character by character). With the setlocale call above, inserting into wcout will automatically translate your output into UTF-8, and extracting from wcin will automatically translate UTF-8 input into UTF-32 (1 character = 1 code point). The only problem that remains is that the argv[i] strings are still UTF-8 encoded.
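For illustration, a small sketch of that wcin/wcout behaviour (it assumes a UTF-8 locale such as en_US.UTF-8 in the environment, and the glibc behaviour described above):
#include <clocale>
#include <iostream>
#include <string>

int main() {
    std::setlocale(LC_CTYPE, "");      // pick up the UTF-8 locale from the environment
    std::wstring line;
    std::getline(std::wcin, line);     // UTF-8 bytes in -> one wchar_t per code point
    std::wcout << L"read " << line.size() << L" code points\n";
}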
You can use the following function to decode UTF-8 into UTF-32. If the input string is corrupted it will return properly converted characters until the place where the UTF-8 rules were broken. You could improve it if you need more error reporting. But for argv data one can safely assume that it is correct UTF-8:
#include <string>
using std::wstring;

#define ARR_LEN(x) (sizeof(x)/sizeof(x[0]))

wstring Convert(const char * s) {
    typedef unsigned char byte;
    struct Level {
        byte Head, Data, Null;
        Level(byte h, byte d) {
            Head = h;      // the head shifted to the right
            Data = d;      // number of data bits
            Null = h << d; // encoded byte with zero data bits
        }
        bool encoded(byte b) { return b >> Data == Head; }
    }; // struct Level
    Level lev[] = {
        Level(2, 6),   // 10xxxxxx - trailing byte
        Level(6, 5),   // 110xxxxx - lead byte of a 2-byte sequence
        Level(14, 4),  // 1110xxxx - 3-byte sequence
        Level(30, 3),  // 11110xxx - 4-byte sequence
        Level(62, 2),  // 111110xx - 5-byte sequence (not in standard UTF-8)
        Level(126, 1)  // 1111110x - 6-byte sequence (not in standard UTF-8)
    };
    wchar_t wc = 0;
    const char * p = s;
    wstring result;
    while (*p != 0) {
        byte b = *p++;
        if (b >> 7 == 0) { // deal with ASCII
            wc = b;
            result.push_back(wc);
            continue;
        } // ASCII
        bool found = false;
        for (int i = 1; i < ARR_LEN(lev); ++i) {
            if (lev[i].encoded(b)) {
                wc = b ^ lev[i].Null; // remove the head
                wc <<= lev[0].Data * i;
                for (int j = i; j > 0; --j) { // trailing bytes
                    if (*p == 0) return result; // unexpected
                    b = *p++;
                    if (!lev[0].encoded(b)) // encoding corrupted
                        return result;
                    wchar_t tmp = b ^ lev[0].Null;
                    wc |= tmp << lev[0].Data * (j - 1);
                } // trailing bytes
                result.push_back(wc);
                found = true;
                break;
            } // lev[i]
        } // for lev
        if (!found) return result; // encoding incorrect
    } // while
    return result;
} // wstring Convert
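Hypothetically, you would then decode each argument up front, e.g. (a sketch, reusing the setlocale call and the Convert function from above):
#include <clocale>
#include <iostream>

int main(int argc, char *argv[]) {
    std::setlocale(LC_CTYPE, "");
    for (int i = 1; i < argc; ++i) {
        std::wstring arg = Convert(argv[i]);   // argv[i] is UTF-8, arg is UTF-32
        std::wcout << arg << L'\n';
    }
}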
On Windows, you can use tchar.h and _tmain, which will be turned into wmain if the _UNICODE symbol is defined at compile time, or main otherwise. TCHAR *argv[] will similarly expand to WCHAR *argv[] if _UNICODE is defined, and char *argv[] if not.
If you want to have your main method work cross platform, you can define your own macros to the same effect.
TCHAR.h contains a number of convenience macros for conversion between wchar_t and char.
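For instance, a minimal sketch of such an entry point (MSVC-specific; with _UNICODE defined it compiles as wmain, otherwise as main):
#include <tchar.h>
#include <stdio.h>

int _tmain(int argc, TCHAR *argv[])
{
    int i;
    for (i = 0; i < argc; i++)
        _tprintf(_T("arg %d: %s\n"), i, argv[i]);  /* with MSVC, %s follows the TCHAR width */
    return 0;
}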
Related
I have a function that expects a wchar_t array as a parameter. I don't know of a standard library function to convert from char to wchar_t, so I wrote a quick, dirty function, but I want a reliable solution free from bugs and undefined behavior. Does the standard library have a function that makes this conversion?
My code:
#include <string.h>

wchar_t *ctow(const char *buf, wchar_t *output)
{
    const char ANSI_arr[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789`~!##$%^&*()-_=+[]{}\\|;:'\",<.>/? \t\n\r\f";
    const wchar_t WIDE_arr[] = L"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789`~!##$%^&*()-_=+[]{}\\|;:'\",<.>/? \t\n\r\f";
    size_t n = 0, len = strlen(ANSI_arr);

    while (*buf) {
        for (size_t x = 0; x < len; x++) {
            if (*buf == ANSI_arr[x]) {
                output[n++] = WIDE_arr[x];
                break;
            }
        }
        buf++;
    }
    output[n] = L'\0';
    return output;
}
Well, conversion functions are declared in stdlib.h (*). But you should know that for any character in Latin-1, a.k.a. the ISO-8859-1 charset, the conversion to a wide character is a mere assignment, because the Unicode code points below 256 are the Latin-1 characters.
So if your initial charset is ISO-8859-1, the conversion is simply:
wchar_t *ctow(const char *buf, wchar_t *output) {
    wchar_t *cr = output;                   /* keep the start so we can return it */
    while (*buf) {
        *output++ = (unsigned char)*buf++;  /* cast avoids sign extension for bytes >= 0x80 */
    }
    *output = 0;
    return cr;
}
provided the caller passed a pointer to an array big enough to store all the converted characters.
If you are using any other charset, you will have to use a well-known library like ICU, or build one by hand, which is simple for single-byte charsets (the ISO-8859-x series) and much trickier for multibyte ones like UTF-8.
But without knowing the charsets you want to be able to process, I cannot say more...
BTW, plain ASCII is a subset of the ISO-8859-1 charset.
(*) From cplusplus.com
int mbtowc (wchar_t* pwc, const char* pmb, size_t max);
Convert multibyte sequence to wide character
The multibyte character pointed by pmb is converted to a value of type wchar_t and stored at the location pointed by pwc. The function returns the length in bytes of the multibyte character.
mbtowc has its own internal shift state, which is altered as necessary only by calls to this function. A call to the function with a null pointer as pmb resets the state (and returns whether multibyte characters are state-dependent).
The behavior of this function depends on the LC_CTYPE category of the selected C locale.
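For whole strings there is also mbstowcs (declared in stdlib.h as well), which applies the same locale-dependent conversion in one call. A minimal sketch, assuming the environment supplies a UTF-8 locale:
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "");                 /* use the environment's charset */
    const char *mb = "h\xc3\xa9llo";         /* "héllo" as UTF-8 bytes */
    wchar_t wide[64];
    size_t n = mbstowcs(wide, mb, 64);       /* converts and null-terminates */
    if (n == (size_t)-1) {
        perror("mbstowcs");
        return EXIT_FAILURE;
    }
    wprintf(L"%zu wide characters: %ls\n", n, wide);
    return EXIT_SUCCESS;
}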
It does, in the header wchar.h. It is called btowc:
The btowc function returns WEOF if c has the value EOF or if (unsigned char)c
does not constitute a valid single-byte character in the initial shift state. Otherwise, it
returns the wide character representation of that character.
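For completeness, a quick sketch of btowc in use (it handles single-byte characters only; anything multibyte needs mbtowc or mbstowcs as above):
#include <locale.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "");
    wint_t wc = btowc('A');     /* one byte in, one wide character out */
    if (wc == WEOF)
        wprintf(L"not a valid single-byte character\n");
    else
        wprintf(L"U+%04X\n", (unsigned)wc);
    return 0;
}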
That isn't a conversion from char to wchar_t; it's a function for destroying data outside of ISO-646. No function in the C library will make that conversion for you. You can look at the ICU4C library. If you are only on Windows, you can look at the relevant functions in the Win32 API (MultiByteToWideChar, WideCharToMultiByte, etc.).
I'm trying to use iconv(3) to convert a wide-character string to UTF-8 using the code below. When I run it, the iconv call fails with E2BIG, as if there were not enough bytes of space available in the output buffer. This occurs despite the fact that (I think) I have sized the output buffer to admit the worst-case expansion for UTF-8. In fact, given that the input is a simple ASCII 'A' encoded as a wchar_t followed by a zero wchar_t terminator, the output should be exactly two bytes/chars: an 'A' followed by a '\0'.
'man utf-8' on my Linux system says that the maximum length of a UTF-8 byte sequence is 6 bytes, so I believe that for an input buffer of 2 wchar_ts (a character followed by the null terminator), making (on my system) 8 bytes total (since sizeof(wchar_t) == 4), a buffer of 12 bytes (2 * UTF8_SEQUENCE_MAXLEN) should be sufficient.
By experiment, if I increase UTF8_SEQUENCE_MAXLEN to 16, iconv's return value indicates success (15 still fails). But I cannot see any way that any wchar_t value would occupy so many bytes when encoded in UTF-8.
Have I gone wrong in my calculations? Are 16-byte UTF-8 sequences possible? What have I done wrong?
#include <stdio.h>
#include <stdlib.h>
#include <iconv.h>
#include <wchar.h>

#define UTF8_SEQUENCE_MAXLEN 6
/* #define UTF8_SEQUENCE_MAXLEN 16 */

int
main(int argc, char **argv)
{
    wchar_t *wcs = L"A";
    signed char utf8[(1 /* wcslen(wcs) */ + 1 /* L'\0' */) * UTF8_SEQUENCE_MAXLEN];
    char *iconv_in = (char *) wcs;
    char *iconv_out = (char *) &utf8[0];
    size_t iconv_in_bytes = (wcslen(wcs) + 1 /* L'\0' */) * sizeof(wchar_t);
    size_t iconv_out_bytes = sizeof(utf8);
    size_t ret;
    iconv_t cd;

    cd = iconv_open("WCHAR_T", "UTF-8");
    if ((iconv_t) -1 == cd) {
        perror("iconv_open");
        return EXIT_FAILURE;
    }

    ret = iconv(cd, &iconv_in, &iconv_in_bytes, &iconv_out, &iconv_out_bytes);
    if ((size_t) -1 == ret) {
        perror("iconv");
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}
The arguments to iconv_open are the wrong way around.
The order of arguments is (to, from), not (from, to), as is clearly stated in the manpage.
Consequently, changing
iconv_open("WCHAR_T", "UTF-8");
to
iconv_open("UTF-8", "WCHAR_T");
causes the (otherwise unchanged) code above to work as expected.
D'oh. Need to read manpages more closely.
With MSVC 2010 I try to compile this in C or C++ mode (it needs to be compilable in both) and it does not work. Why? I thought, and found in the documentation, that '\x' takes the next two characters as hex digits and no more (4 characters when using "\X").
I also learned that there is no portable way to use character codes outside ASCII in C source code anyway, so how can I specify some German ISO-8859-1 characters?
int main() {
    char* x = "\xBCd"; // Why is this not char(188) + 'd'?
}
// MSVC: test.c(2) : error C2022: '3021' : too big for character
// GCC: a warning
Unfortunately you've stumbled upon the fact that \x will consume every subsequent character that appears to be a hex digit; instead you'll need to break this up:
const char *x = "\xBC" "d"; /* const added to satisfy literal assignment probs */
Consider the output from this program:
/* wide.c */
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *x = "\x000000000000021"; /* still a single escape: value 0x21, i.e. '!' */
    return printf("%s\n", x);
}
Compiled and executed:
C:\temp>cl /nologo wide.c
wide.c
C:\temp>wide
!
Tested on Microsoft's C++ compiler shipped with VS 2k12, 2k10, 2k8, and 2k5
Tested on gcc 4.3.4.
I have a version number returned as a string which looks something like "6.4.12.9", four numbers, each separated by a "."
What I would like to do is to parse the string into 4 distinct integers. Giving me
int1 = 6
int2 = 4
int3 = 12
int4 = 9
I'd normally use a regex for this but that option isn't available to me using C.
You can use sscanf:
int a, b, c, d;
const char *version = "1.6.3.1";

if (sscanf(version, "%d.%d.%d.%d", &a, &b, &c, &d) != 4) {
    // error parsing
} else {
    // ok, use the integers a, b, c, d
}
If you're on a POSIX system, and limiting yourself to POSIX is okay, you can use the POSIX standard regular expression library by doing:
#include <regex.h>
then read the relevant manual page for the API. I would not recommend a regexp solution for this problem to begin with, but I wanted to point out for clarity that regular expressions are often available in C. Do note that this is not "standard C", so you can't use it everywhere, only on POSIX (i.e. "Unix-like") systems.
You could use strtok() for this (followed by strtol()); just make sure you're aware of the semantics of strtok(), they're slightly unusual.
You could also use sscanf().
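A quick sketch of the strtok()/strtol() route (remember that strtok() modifies its argument, so the version string must be a writable array, not a string literal):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char ver[] = "6.4.12.9";    /* writable copy: strtok() inserts NULs */
    int v[4] = {0};
    int i = 0;
    char *tok;

    for (tok = strtok(ver, "."); tok != NULL && i < 4; tok = strtok(NULL, "."))
        v[i++] = (int)strtol(tok, NULL, 10);

    printf("%d %d %d %d\n", v[0], v[1], v[2], v[3]);
    return 0;
}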
One solution using strtoul:
#include <stdlib.h>

int main(int argc, char const *argv[])
{
    char ver[] = "6.4.12.9";
    char *next = ver;
    int v[4], i;

    for (i = 0; i < 4; i++, next++)   /* next++ steps over each '.' */
        v[i] = strtoul(next, &next, 10);
    return 0;
}
You can use strtoul() to parse the string and get a pointer to the first non-numeric character. Another solution would be tokenizing the string using strtok() and then using strtoul() or atoi() to get an integer.
If none of them will exceed 255, inet_pton will parse it nicely for you. :-)
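For illustration, a sketch of that trick (POSIX-only, and it only works because each component fits in one byte of an IPv4 address):
#include <arpa/inet.h>
#include <stdio.h>

int main(void)
{
    unsigned char v[4];   /* inet_pton writes 4 bytes in network order */
    if (inet_pton(AF_INET, "6.4.12.9", v) == 1)
        printf("%u %u %u %u\n", v[0], v[1], v[2], v[3]);
    else
        puts("not a valid dotted quad");
    return 0;
}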
I am having problems with converting UTF-8 to Unicode.
Below is the code:
int charset_convert(char *string, char *to_string, char *charset_from, char *charset_to)
{
    char *from_buf, *to_buf, *pointer;
    size_t inbytesleft, outbytesleft, ret;
    iconv_t cd;

    if (!charset_from || !charset_to || !string) /* sanity check */
        return -1;

    if (strlen(string) < 1)
        return 0; /* we are done, nothing to convert */

    cd = iconv_open(charset_to, charset_from);
    /* Did I succeed in getting a conversion descriptor ? */
    if (cd == (iconv_t)(-1)) {
        /* I guess not */
        printf("Failed to convert string from %s to %s ",
               charset_from, charset_to);
        return -1;
    }

    from_buf = string;
    inbytesleft = strlen(string);

    /* allocate max sized buffer,
       assuming target encoding may be 4 byte unicode */
    outbytesleft = inbytesleft * 4;
    pointer = to_buf = (char *)malloc(outbytesleft);
    memset(to_buf, 0, outbytesleft);
    memset(pointer, 0, outbytesleft);

    ret = iconv(cd, &from_buf, &inbytesleft, &pointer, &outbytesleft);
    memcpy(to_string, to_buf, (pointer - to_buf));
}
main():
int main()
{
    char UTF[] = {'A', 'B'};
    char Unicode[1024] = {0};
    char *ptr;

    charset_convert(UTF, Unicode, "UTF-8", "UNICODE");

    ptr = Unicode;
    while (*ptr != '\0')
    {
        printf("Unicode %x \n", *ptr);
        ptr++;
    }
    return 0;
}
It should give A and B, but I am getting:
ffffffff
fffffffe
41
Thanks,
Sandeep
It looks like you are getting UTF-16 out in a little endian format:
ff fe 41 00 ...
Which is U+FEFF (ZWNBSP aka byte order mark), U+0041 (latin capital letter A), ...
You then stop printing because your while loop has terminated on the first null byte. The following bytes should be: 42 00.
You should either return a length from your function or make sure that the output is terminated with a null character (U+0000) and loop until you find this.
UTF-8 is Unicode.
You do not need to convert unless you need some other type of Unicode encoding, like UTF-16 or UTF-32.
UTF is not Unicode. UTF is an encoding of the integers in the Unicode standard. The question, as is, makes no sense. If you mean you want to convert from (any) UTF to the Unicode code point (i.e. the integer that stands for an assigned code point, roughly a character), then you need to do a bit of reading; it involves bit-shifting the values of the 1, 2, 3 or 4 bytes of a UTF-8 byte sequence (see Wikipedia; Markus Kuhn's UTF-8 text is also excellent).
Unless I am missing something (nobody has pointed it out yet), "UNICODE" isn't a valid encoding name in libiconv, as it is the name of a family of encodings.
http://www.gnu.org/software/libiconv/
(edit) Actually, iconv -l shows UNICODE as a listed entry but gives no details; in the source code it's listed in the notes as an alias for UNICODE-LITTLE, but in the subnotes it mentions:
* UNICODE (big endian), UNICODEFEFF (little endian)
We DON'T implement these because they are stupid and not standardized.
In the aliases header files UNICODELITTLE (no hyphen) resolves as follows:
lib/aliases.gperf:UNICODELITTLE, ei_ucs2le
i.e. UCS2-LE (UTF-16 Little Endian), which should match Windows' internal "Unicode" encoding.
http://en.wikipedia.org/wiki/UTF-16/UCS-2
However, you are clearly recommended to specify UCS2-LE or UCS2-BE explicitly, unless the first bytes are a byte-order mark (BOM, value 0xFEFF) indicating the byte-order scheme.
=> You are seeing the BOM as the first bytes of the output because that is what the "UNICODE" encoding name means: UCS2 with a header indicating the byte-order scheme.
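Applying that advice to the question's code, a sketch that requests an explicit byte order (so no BOM is prepended):
#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    iconv_t cd = iconv_open("UCS-2LE", "UTF-8"); /* explicit byte order: no BOM */
    char in[] = "AB";
    char out[16];
    char *pin = in, *pout = out;
    size_t inleft = strlen(in), outleft = sizeof(out);
    char *p;

    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return 1;
    }
    if (iconv(cd, &pin, &inleft, &pout, &outleft) == (size_t)-1)
        perror("iconv");
    for (p = out; p < pout; p++)   /* expect: 41 00 42 00 */
        printf("%02x ", (unsigned char)*p);
    printf("\n");
    iconv_close(cd);
    return 0;
}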