I'm reading text from a website with curl. All the raw data is returned character by character with
return memEof(mp) ? EOF : (int)(*(unsigned char *)(mp->readptr++));
My problem is that all the special characters such as ÄÖÜäöüß etc. come out wrong and look very cryptic. I'm currently correcting them manually by adjusting their values using this table:
http://www.barcoderesource.com/barcodeasciicharacters.shtml
I was wondering if there is a more elegant way to do this and how others approach these kinds of issues.
This is an encoding issue. If you read data byte by byte, you can handle only single-byte encodings (like the ISO-8859 "family" and many others) easily and correctly, provided you have a way to convert them to a target encoding if needed. With encodings like UTF-8 you are less lucky, since to get one code point you may need to read 1 byte, or maybe 2, or maybe 3 or 4. However, if you stream the bytes into a string, print the string as a whole, and the output device's expected encoding is the same as the input encoding, you get the right characters shown anyway.
If that does not happen, and you are definitely not printing each byte as if it were a whole character, then the output device's expected encoding does not match the one the string is written in.
If the output looks ok once you print the string as a whole, then the problem is that you are interpreting each byte as a single character when it is not (you have a multibyte encoding for characters like the special ones you mentioned; it is likely UTF-8, but it might not be).
If you get equal results in both cases (printing each byte one by one, and outputting a string that keeps the whole byte sequence), then the output device's expected encoding is a single-byte encoding like the input encoding, but the two do not match.
To say more, one would need to know how you collect the bytes you read before printing them and concluding that they look cryptic.
An example.
#include <stdio.h>

int main(void)
{
    const char *string = "\xc3\xa8\xc3\xb2\xc3\xa0"; // "èòà" encoded as UTF-8
    int i;
    for(i = 0; string[i] != 0; i++)
    {
        printf("%c\n", string[i]);
        // using \n is important; if you "sequence" the chars and the output
        // encoding is UTF-8, you obtain the right output
    }
    printf("%s", string);
    return 0;
}
You obtain different results from the two prints if the output device's encoding is UTF-8; if it is a single-byte encoding, you obtain the same output (newlines apart), but it is "wrong" with respect to what I intended to write, i.e. èòà.
The "same" text in Latin-1 is "\xe8\xf2\xe0". Latin-1 is a single-byte encoding, so the reasoning above applies. If it is printed on a terminal that expects UTF-8, you can obtain something like �� ...
So encodings matter, the output device/format encoding matters too, and you must be aware of both in order to handle and display the text properly. (Regarding formats, an example is HTML, where you can specify the encoding of the content... if you keep them consistent, everything shows up fine.)
I guess you have to use an external library like iconv to convert the data (for example into a wchar_t string, or into the encoding your output expects). Which conversion you need depends on the character encoding being used.
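As an illustration, here is a minimal sketch using iconv, assuming the page is served as ISO-8859-1 and the terminal expects UTF-8; the source charset is an assumption and would normally come from the HTTP Content-Type header:

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char in[] = "\xc4\xd6\xdc\xe4\xf6\xfc\xdf";   /* "ÄÖÜäöüß" in ISO-8859-1 */
    char out[64];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out - 1;

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        perror("iconv");
    *outp = '\0';
    iconv_close(cd);

    printf("%s\n", out);   /* prints ÄÖÜäöüß on a UTF-8 terminal */
    return 0;
}

(On macOS you may need to link with -liconv; with glibc on Linux iconv is part of the C library.)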
I'm using libxml/xmlwriter to generate an XML file within a program.
const char *s = someCharactersFromSomewhere();
xmlTextWriterWriteAttribute (writer, _xml ("value"), _xml (s));
In general I don't have much control over the contents of s, so I can't guarantee that it will be valid UTF-8. Mostly it is, but if not, the XML which is generated will be malformed.
What I'd like to find is a way to convert s to valid UTF-8, with any invalid character sequences in s replaced with escapes or removed.
Alternatively, if there is an alternative to xmlTextWriterWriteAttribute, or some option I can pass in when initializing the XML writer, such that it guarantees that it will always write valid UTF-8, that would be even better.
One more thing to mention is that the solution must work with both Linux and OSX. Ideally writing as little of my own code as possible! :P
If the string is encoded in ASCII, it will always be a valid UTF-8 string as well.
This is because UTF-8 is backwards compatible with ASCII.
See the second paragraph on Wikipedia here.
Windows primarily works with UTF-16, which means that on that platform you would have to convert from UTF-16 to UTF-8 before you pass the string to the XML library.
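On Windows that conversion can be done with WideCharToMultiByte; a minimal sketch (the helper name utf16_to_utf8 is made up for illustration, and none of this is needed on Linux/OSX, where narrow strings are usually UTF-8 already):

#include <windows.h>
#include <stdlib.h>

char *utf16_to_utf8(const wchar_t *ws)
{
    /* first call asks for the required buffer size, including the terminator */
    int n = WideCharToMultiByte(CP_UTF8, 0, ws, -1, NULL, 0, NULL, NULL);
    if (n <= 0)
        return NULL;
    char *s = malloc(n);
    if (s != NULL)
        WideCharToMultiByte(CP_UTF8, 0, ws, -1, s, n, NULL, NULL);
    return s;   /* caller frees */
}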
If you have 8-bit "extended ASCII" input, then you can simply junk any byte with a character code > 127.
If you have some dodgy UTF-8, it is quite easy to parse, but the wide-character code that you generate might be out of the Unicode range. You can use mbrlen() to validate each character individually.
I am describing this using unsigned chars. If you must use signed chars, then >= 128 means < 0.
At its simplest:
Loop until the end of the input:
1. If the next byte is 0, end the loop.
2. If the next byte is < 128, it is ASCII, so keep it.
3. If the next byte is >= 128 and < 128+64, it is invalid on its own - discard it.
4. If the next byte is >= 128+64, it is probably a proper UTF-8 lead byte: call
   size_t mbrlen(const char *s, size_t n, mbstate_t *ps);
   to see how many bytes to keep. If mbrlen says the code is bad (either the lead byte or the trail bytes), skip 1 byte; rule 3 will skip the rest.
Even simpler logic just calls mbrlen repeatedly, as it can accept the low ASCII range.
You can assume that all the "furniture" of the file (e.g. XML <>/ symbols, spaces, quotes and newlines) won't be altered by this edit, as they are all valid 7-bit ASCII codes.
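A rough sketch of the loop described above, assuming the environment locale uses UTF-8 (that is an assumption, as is the helper name sanitize_utf8); invalid bytes are simply dropped:

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* strips invalid multibyte sequences in place */
void sanitize_utf8(char *s)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);
    char *out = s;
    size_t left = strlen(s);

    while (left > 0) {
        unsigned char c = (unsigned char)*s;
        size_t n = (c < 128) ? 1 : mbrlen(s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2) {   /* invalid or truncated sequence */
            memset(&st, 0, sizeof st);              /* resync and skip one byte */
            s++; left--;
        } else {
            memmove(out, s, n);                     /* keep the valid character */
            out += n; s += n; left -= n;
        }
    }
    *out = '\0';
}

int main(void)
{
    setlocale(LC_ALL, "");                 /* assumes the user's locale is UTF-8 */
    char buf[] = "ok \xc3\xa4 bad \xff\xfe end";
    sanitize_utf8(buf);
    printf("%s\n", buf);                   /* prints: ok ä bad  end */
    return 0;
}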
char is a single-byte type, while Unicode code points range from 0 to 0x10FFFF, so how do you represent a Unicode character in only one byte?
First of all you need a wchar_t character. Those are used with the wprintf(3) versions of the normal printf(3) routines. If you dig a little into this, you'll see that mapping your code points into a valid UTF-8 encoding is straightforward, based on your setlocale(3) settings. Look at the manual pages referenced and you'll get an idea of the task you are facing.
There's full support for wide character sets in the C standard... but you have to use it through the internationalization libraries and the locales available.
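As a minimal sketch of that machinery (the locale request is left to the environment, which is assumed to use a multibyte encoding such as UTF-8):

#include <limits.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");            /* use the user's locale, e.g. a UTF-8 one */

    wchar_t wc = L'\u00e8';           /* è as a wide character */
    char buf[MB_LEN_MAX + 1];
    mbstate_t st = {0};

    size_t n = wcrtomb(buf, wc, &st); /* encode according to the current locale */
    if (n != (size_t)-1) {
        buf[n] = '\0';
        printf("%zu byte(s): %s\n", n, buf);   /* 2 bytes in a UTF-8 locale */
    }
    return 0;
}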
Ok, I have this:
AllocConsole();
SetConsoleOutputCP(CP_UTF8);
HANDLE consoleHandle = GetStdHandle(STD_OUTPUT_HANDLE);
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 10, NULL, NULL);
WriteConsoleW(consoleHandle, L"wΕλληνικά\n", 10, NULL, NULL);
printf("aΕλληνικά\n");
wprintf(L"wΕλληνικά\n");
Now, the issue is that depending on the encoding the file is saved with, only some of these work. wprintf never works, but I already know why (a broken Microsoft stdout implementation, which only accepts narrow characters). Yet, I have issues with the three others. If I save the file as UTF-8 without a signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version to work, I need to increase the character(?) count to 18:
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 18, NULL, NULL);
WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide-char (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16 encoded string, not UTF-8.
If I save the file in UTF-8 with BOM (as it should be), then WriteConsoleW somehow starts to work (???) and everything else stops working (I get ? instead of the characters). I need to decrease the character count in WriteConsoleA back to 10 to keep the formatting the same (otherwise I get 8 additional rectangles). Basically, WTF?
Now, let's go to UTF-16 (Unicode - Codepage 1200). Only WriteConsoleW works. The character count in WriteConsoleA should be 10 to keep the formatting precise.
Saving in UTF-16 Big Endian mode (Unicode - Codepage 1201) does not change anything. Again, WTF? Shouldn't the byte order inside the strings be inverted when they are stored to the file?
The conclusion is that the way strings are compiled into binary form depends on the encoding used. Therefore, what is a portable and compiler-independent way to store strings? Is there a preprocessor which would convert one string representation into another before compilation, so that I could store the file in UTF-8 and only preprocess the strings which I need to have in UTF-16 by wrapping them in some macro?
I think you've got at least a few assumptions here which are either wrong or not 100% correct as far as I know:
Now, the issue is that depending on the encoding the file is saved with, only some of these work.
Of course, because the encoding determines how to interpret the string literals.
wprintf never works, but I already know why (a broken Microsoft stdout implementation, which only accepts narrow characters).
I've never heard of that one, but I'm rather sure this depends on the locale set for your program. I've got a few work projects where a locale is set and the output is just fine, using German umlauts etc.
If I save the file as UTF-8 without a signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version to work, I need to increase the character(?) count to 18:
That's because the ANSI version wants an ANSI string, while you're passing a UTF-8 encoded string (based on the file's encoding). The output still works, because the console handles the UTF-8 conversion for you - you're essentially printing raw UTF-8 here.
WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide-char (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16 encoded string, not UTF-8.
I don't think so (although I'm not sure why it isn't working either). Have you tried setting some easy-to-find string and looking for it in the resulting binary? I'm rather sure it's indeed encoded as UTF-16. I assume that, due to the missing BOM, the compiler interprets the whole thing as a narrow string and therefore converts the UTF-8 bytes wrongly.
If I save the file in UTF-8 with BOM (as it should be), then WriteConsoleW somehow starts to work (???) and everything else stops working (I get ? instead of the characters). I need to decrease the character count in WriteConsoleA back to 10 to keep the formatting the same (otherwise I get 8 additional rectangles). Basically, WTF?
This is exactly what I described above. Now the wide string is encoded properly, because the compiler knows the file is in UTF-8, not ANSI (or some codepage). The narrow string is properly converted to the locale being used as well.
Overall, there's no encoding-independent way to do it, unless you escape everything using the proper codepage and/or UTF codes in advance. I'd just stick to UTF-8 with BOM, because I think all current compilers will be able to read and interpret the file properly (besides Microsoft's Resource Compiler; although I haven't tried feeding the 2012 version with UTF-8).
Edit:
To use an analogy:
You're essentially saving a raw image to a file and expecting it to be read correctly no matter whether another program tries to read it as a grayscale, palettized, or full-color image. This won't work (even though the differences here are smaller).
The answer is here.
Quoting:
It is impossible for the compiler to intermix UTF-8 and UTF-16 strings into the compiled output! So you have to decide for one source code file:
either use UTF-8 with BOM and generate UTF-16 strings only (i.e. always use the L prefix),
or UTF-8 without BOM and generate UTF-8 strings only (i.e. never use the L prefix);
7-bit ASCII characters are not involved and can be used with or without the L prefix.
The only portable and compiler-independent way is to use the ASCII charset and escape sequences, because there is no guarantee that every compiler will accept a UTF-8 encoded file, and the treatment of those multibyte sequences might vary from compiler to compiler.
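For illustration, a sketch of that approach with the Greek text from the question; the source file then contains only 7-bit ASCII, the wide string spells the letters with universal character names, and the narrow string spells out the UTF-8 bytes explicitly, so neither depends on the encoding the file happens to be saved in:

const wchar_t *wide_text = L"w\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac\n";                /* wΕλληνικά */
const char    *utf8_text = "a\xce\x95\xce\xbb\xce\xbb\xce\xb7\xce\xbd\xce\xb9\xce\xba\xce\xac\n"; /* aΕλληνικά as raw UTF-8 bytes */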
I saw the other questions about this subject, but all of them were missing important details:
I want to convert \u00252F\u00252F\u05de\u05e8\u05db\u05d6 to UTF-8. I understand that you look through the stream for \u followed by four hex digits which you convert to bytes. The problems are as follows:
I heard that sometimes you look for 4 hex digits after the \u and sometimes 6; is this correct? If so, how do you determine which it is? E.g. is \u00252F a 4- or a 6-digit escape?
In the case of \u0025, this maps to one byte (0x25) instead of two. Why? Are the four hex digits supposed to represent UTF-16, which I am then supposed to convert to UTF-8?
How do I know whether the text is supposed to be the literal characters \u0025 or the Unicode escape sequence? Does that mean that all backslashes must be escaped in the stream?
Lastly, am I being stupid in doing this by hand when I can use iconv to do it for me?
If you have the iconv interfaces at your disposal, you can simply convert the \u0123\uABCD etc. sequences to an array of bytes 01 23 AB CD ..., replacing any unescaped ASCII character with a 00 byte followed by the ASCII byte, and then run the array through iconv with a conversion descriptor obtained by iconv_open("UTF-8", "UTF-16BE").
Of course you can also do it much more efficiently by working directly with the input yourself, but that requires reading and understanding the Unicode specifications of UTF-16 and UTF-8.
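A rough sketch of that approach (it assumes well-formed \uXXXX escapes with exactly four hex digits and does minimal error handling):

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *in = "\\u05de\\u05e8\\u05db\\u05d6";   /* the escaped input text */
    unsigned char utf16[256];
    size_t n = 0;

    for (const char *p = in; *p != '\0'; ) {
        unsigned code;
        if (p[0] == '\\' && p[1] == 'u' && sscanf(p + 2, "%4x", &code) == 1) {
            utf16[n++] = (unsigned char)(code >> 8);    /* big-endian high byte */
            utf16[n++] = (unsigned char)(code & 0xFF);  /* low byte */
            p += 6;
        } else {                                        /* unescaped ASCII character */
            utf16[n++] = 0;
            utf16[n++] = (unsigned char)*p++;
        }
    }

    char out[512];
    char *inp = (char *)utf16, *outp = out;
    size_t inleft = n, outleft = sizeof out - 1;

    iconv_t cd = iconv_open("UTF-8", "UTF-16BE");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        perror("iconv");
    *outp = '\0';
    iconv_close(cd);

    printf("%s\n", out);   /* prints מרכז on a UTF-8 terminal */
    return 0;
}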
In some conventions (like C++11 string literals), you parse a specific number of hex digits, like four after \u and eight after \U. That may or may not be the convention of the input you have, but it seems a reasonable guess. Other styles, like C++'s \x, parse as many hex digits as they can find after the \x, which means that you have to jump through some hoops if you want to put a literal hex digit immediately after one of these escaped characters.
Once you have all the values, you need to know what encoding they're in (e.g., UTF-16 or UTF-32) and what encoding you want (e.g., UTF-8). You then use a function to create a new string in the new encoding. You can write such a function (if you know enough about both encoding formats), or you can use a library. Some operating systems may provide such a function, but you might want to use a third-party library for portability.
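If you do write such a function yourself, the UTF-8 side boils down to a few bit shifts. A sketch for a single code point up to U+10FFFF (the helper name is made up; UTF-16 surrogate pairs would have to be combined into one code point before calling it):

#include <stddef.h>

/* Encode one Unicode code point (<= 0x10FFFF) as UTF-8.
   Returns the number of bytes written to out (out must have room for 4). */
size_t encode_utf8(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {              /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = 0xC0 | (unsigned char)(cp >> 6);
        out[1] = 0x80 | (unsigned char)(cp & 0x3F);
        return 2;
    } else if (cp < 0x10000) {            /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xE0 | (unsigned char)(cp >> 12);
        out[1] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
        out[2] = 0x80 | (unsigned char)(cp & 0x3F);
        return 3;
    } else {                              /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xF0 | (unsigned char)(cp >> 18);
        out[1] = 0x80 | (unsigned char)((cp >> 12) & 0x3F);
        out[2] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
        out[3] = 0x80 | (unsigned char)(cp & 0x3F);
        return 4;
    }
}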
I'm trying to print out a wchar_t* string.
Code goes below:
#include <stdio.h>
#include <string.h>
#include <wchar.h>
char *ascii_ = "中日友好"; //line-1
wchar_t *wchar_ = L"中日友好"; //line-2
int main()
{
printf("ascii_: %s\n", ascii_); //line-3
wprintf(L"wchar_: %s\n", wchar_); //line-4
return 0;
}
//Output
ascii_: 中日友好
Question:
Apparently I should not assign CJK characters to a char* pointer in line-1, but I just did it, and the output of line-3 is correct. So why? How can printf() in line-3 give me the non-ASCII characters? Does it somehow know the encoding?
I assume the code in line-2 and line-4 is correct, but why don't I get any output from line-4?
First of all, it's usually not a good idea to use non-ASCII characters in source code. What's probably happening is that the Chinese characters are being encoded as UTF-8, which happens to be compatible with ASCII.
Now, as for why the wprintf() isn't working: this has to do with stream orientation. Each stream can only be either byte-oriented or wide-oriented. Once set, it cannot be changed. The orientation is set the first time the stream is used (which here is byte-oriented, because of the printf). After that, the wprintf does not work because of the incorrect orientation.
In other words, once you use printf() you need to keep on using printf(). Similarly, if you start with wprintf(), you need to keep using wprintf().
You cannot intermix printf() and wprintf(). (except on Windows)
EDIT:
To answer the question about why the wprintf line doesn't work even by itself: it's probably because the code is compiled so that the UTF-8 form of 中日友好 ends up stored in wchar_. However, wchar_t needs a 4-byte Unicode encoding (2 bytes on Windows).
So there's two options that I can think of:
Don't bother with wchar_t, and just stick with multi-byte chars. This is the easy way, but may break if the user's system is not set to the Chinese locale.
Use wchar_t, but you will need to write the Chinese characters using Unicode escape sequences. This will obviously make them unreadable in the source code, but it will work on any machine that can print Chinese character fonts, regardless of the locale (a sketch follows below).
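A sketch of that second option (the locale request is left to the environment; %ls is the conversion for a wide-string argument, and no printf is called first, so stdout becomes wide-oriented):

#include <locale.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");                            /* pick up the user's locale */
    const wchar_t *s = L"\u4e2d\u65e5\u53cb\u597d";   /* 中日友好 */
    wprintf(L"wchar_: %ls\n", s);
    return 0;
}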
Line 1 is not ASCII; it's whatever multibyte encoding is used by your compiler at compile time. On modern systems that's probably UTF-8. printf does not know the encoding; it's just sending bytes to stdout, and as long as the encodings match, everything is fine.
One problem you should be aware of is that lines 3 and 4 together invoke undefined behavior. You cannot mix character-based and wide-character I/O on the same FILE (stdout). After the first operation, the FILE has an "orientation" (either byte or wide), and any later attempt to perform an operation of the opposite orientation results in UB.
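If you want to check (or request) the orientation explicitly, fwide() is the standard hook; a minimal sketch:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* fwide(stream, 0) only queries: > 0 wide-oriented, < 0 byte-oriented, 0 undecided */
    int before = fwide(stdout, 0);          /* 0: no I/O yet, orientation undecided */
    printf("hello\n");                      /* first byte I/O makes stdout byte-oriented */
    int after = fwide(stdout, 0);           /* now negative: byte-oriented */
    printf("before=%d after=%d\n", before, after);

    /* fwide(stdout, 1) would fail to make it wide-oriented now;
       calling wprintf on this stream is undefined behavior. */
    return 0;
}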
You are omitting one step and are therefore thinking about it the wrong way.
You have a C file on disk, containing bytes. In it you have an "ASCII" string and a wide string.
The ASCII string takes the bytes exactly as they appear in line 1 and outputs them.
This works as long as the encoding of the user's side is the same as the one on the programmer's side.
The wide string is different: its bytes are first decoded into Unicode code points and stored that way in the program - maybe this is the step that goes wrong on your side. On output they are encoded again, according to the encoding on the user's side. This ensures that the characters are emitted as they were intended, not merely as they were entered.
Either your compiler assumes the wrong encoding, or your output terminal is set up the wrong way.
Is it possible to know if a file has Unicode (16-bit per char) or 8-bit ASCII content?
You may be able to read a byte-order mark, if the file has one present.
UTF-16 code units are all 16 bits, and some characters are encoded as two of them (surrogate pairs, which use the range 0xD800 to 0xDFFF), so simply scanning each byte to see whether it is less than 128 won't work. For example, the two bytes 0x20 0x20 encode two spaces in ASCII and UTF-8, but encode the single character 0x2020 (dagger) in UTF-16. If the text is known to be English with only the occasional non-ASCII character, then almost every other byte will be zero. But without some a priori knowledge about the text and/or its encoding, there is no reliable way to distinguish a general ASCII string from a general UTF-16 string.
Ditto to what Brian Agnew said about reading the byte order mark, a special two bytes that might appear at the beginning of the file.
You can also tell whether it is ASCII by scanning every byte in the file and checking that they are all less than 128. If they are all less than 128, then it's just an ASCII file. If any of them are 128 or greater, there is some other encoding in there.
First off, ASCII is 7-bit, so if any byte has its high bit set you know the file isn't ASCII.
The various "common" character sets such as ISO-8859-x, Windows-1252, etc., are 8-bit, so if every other byte is 0, you know that you're dealing with UTF-16 text that only uses the ISO-8859-1 characters.
You'll run into problems when you're trying to distinguish between UTF-16 and an 8-bit encoding such as UTF-8. In that case, almost every byte will have a value, so you can't make an easy decision. You can, as Pascal says, do some sort of statistical analysis of the content: Arabic and Ancient Greek probably won't be in the same file. However, this is probably more work than it's worth.
Edit in response to OP's comment:
I think that it will be sufficient to check for the presence of 0-value bytes (ASCII NUL) within your content, and make the choice based on that. The reason is that JavaScript keywords are ASCII, and ASCII is a subset of Unicode. Therefore any UTF-16 representation of those keywords will consist of one byte containing the ASCII character (the low byte) and another containing 0 (the high byte).
My one caveat is that you carefully read the documentation to ensure that their use of the word "Unicode" is correct (I looked at this page to understand the function, did not look any further).
If the file for which you have to solve this problem is long enough each time, and you have some idea what it's supposed to be (say, English text in UTF-16 or English text in ASCII), you can do a simple frequency analysis on the bytes and see whether the distribution looks like that of ASCII or of UTF-16.
Unicode is a character set, not an encoding; you probably mean UTF-16. There are lots of libraries around (python-chardet comes to mind instantly) to autodetect the encoding of text, though they all use heuristics.
To programmatically discern the type of a file -- including, but not limited to, the encoding -- the best bet is to use libmagic. BSD-licensed, it is part of just about every Unix system you are likely to encounter, but for the lesser ones you can bundle it with your application.
Detecting the mime-type from C, for example, is as simple as:
magic_t Magic = magic_open(MAGIC_MIME|MAGIC_ERROR);
magic_load(Magic, NULL);                          /* load the default magic database */
const char *mimetype = magic_buffer(Magic, buf, bufsize);
Other languages have their own modules wrapping this library.
Back to your question, here is what I get from file(1) (the command-line interface to libmagic(3)):
% file /tmp/*rdp
/tmp/meow.rdp: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
For your specific use case, it's very easy to tell. Just scan the file: if you find any NUL byte ("\0"), it must be UTF-16. JavaScript has to contain ASCII characters, and in UTF-16 each of them is represented with a 0 byte next to the ASCII byte.
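A minimal sketch of that heuristic (the function name is made up for illustration):

#include <stddef.h>

/* Returns 1 if the buffer contains a 0 byte (likely UTF-16), 0 otherwise. */
int looks_like_utf16(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] == 0)
            return 1;      /* ASCII letters in UTF-16 carry a 0 byte */
    return 0;              /* no NULs: plain ASCII / UTF-8 / other 8-bit text */
}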