Detect UTF-16 file content - file

Is it possible to know if a file has Unicode (16-bit per char) or 8-bit ASCII content?

You may be able to read a byte-order-mark, if the file has this present.

UTF-16 characters are all at least 16 bits, with some being 32 bits when encoded as a surrogate pair (lead units in the range 0xD800 to 0xDBFF). So simply scanning each byte to see whether it is less than 128 won't work. For example, the two bytes 0x20 0x20 encode two spaces in ASCII and UTF-8, but encode the single character U+2020 (dagger) in UTF-16. If the text is known to be English with the occasional non-ASCII character, then almost every other byte will be zero. But without some a priori knowledge about the text and/or its encoding, there is no reliable way to distinguish a general ASCII string from a general UTF-16 string.

Ditto to what Brian Agnew said about reading the byte order mark, a special two bytes that might appear at the beginning of the file.
You can also tell whether it is ASCII by scanning every byte in the file and checking that they are all less than 128. If they are all less than 128, then it's just an ASCII file. If any of them are 128 or greater, there is some other encoding in there.
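To make those two checks concrete, here is a minimal sketch assuming the whole file has already been read into a buffer (the function name and return strings are just for illustration):

#include <stddef.h>

/* Rough classification: BOM check first, then a scan for bytes >= 128.
   This is a heuristic sketch, not a definitive detector. */
const char *guess_encoding(const unsigned char *buf, size_t len)
{
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) return "UTF-16LE (BOM)";
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) return "UTF-16BE (BOM)";
    for (size_t i = 0; i < len; i++)
        if (buf[i] >= 128) return "not plain ASCII (UTF-8, UTF-16, or other)";
    return "plain ASCII";
}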

First off, ASCII is 7-bit, so if any byte has its high bit set you know the file isn't ASCII.
The various "common" character sets such as ISO-8859-x, Windows-1252, etc, are 8-bit, so if every other byte is 0, you know that you're dealing with Unicode that only uses the ISO-8859 characters.
You'll run into problems where you're trying to distinguish between Unicode and some encoding such as UTF-8. In this case, almost every byte will have a value, so you can't make an easy decision. You can, as Pascal says do some sort of statistical analysis of the content: Arabic and Ancient Greek probably won't be in the same file. However, this is probably more work than it's worth.
Edit in response to OP's comment:
I think that it will be sufficient to check for the presence of 0-value bytes (ASCII NUL) within your content, and make the choice based on that. The reason is that JavaScript keywords are ASCII, and ASCII is a subset of Unicode. Therefore any UTF-16 representation of those keywords will consist of one byte containing the ASCII character (the low byte) and another containing 0 (the high byte).
My one caveat is that you carefully read the documentation to ensure that its use of the word "Unicode" is correct (I looked at this page to understand the function, and did not look any further).

If the file for which you have to solve this problem is long enough each time, and you have some idea what it's supposed to be (say, English text in UTF-16 or English text in ASCII), you can do a simple frequency analysis on the bytes and see whether the distribution looks like ASCII or like UTF-16.

Unicode is a character set, not an encoding. You probably meant UTF-16. There are lots of libraries around (python-chardet comes to mind instantly) to autodetect the encoding of text, though they all use heuristics.

To programmatically discern the type of a file -- including, but not limited to, the encoding -- the best bet is to use libmagic. BSD-licensed, it is part of just about every Unix system you are likely to encounter, and for the lesser ones you can bundle it with your application.
Detecting the mime-type from C, for example, is as simple as:
magic_t Magic = magic_open(MAGIC_MIME | MAGIC_ERROR);   /* handle from <magic.h> */
magic_load(Magic, NULL);                                 /* load the default magic database */
const char *mimetype = magic_buffer(Magic, buf, bufsize);
Other languages have their own modules wrapping this library.
Back to your question, here is what I get from file(1) (the command-line interface to libmagic(3)):
% file /tmp/*rdp
/tmp/meow.rdp: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators

For your specific use-case, it's very easy to tell. Just scan the file: if you find any NUL byte ("\0"), it must be UTF-16. JavaScript source has to contain ASCII characters, and in UTF-16 each of those is represented with a zero high byte.
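A minimal sketch of that NUL-byte test, again assuming the file is already in memory (the function name is my own):

#include <stddef.h>

/* Returns 1 if the buffer contains a zero byte, which for a JavaScript
   source file strongly suggests UTF-16; 0 means ASCII/UTF-8 is likely. */
int looks_like_utf16(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] == 0)
            return 1;
    return 0;
}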

Related

Convert a `char *` to UTF-8 in C, or when using xmlwriter?

I'm using libxml/xmlwriter to generate an XML file within a program.
const char *s = someCharactersFromSomewhere();
xmlTextWriterWriteAttribute (writer, _xml ("value"), _xml (s));
In general I don't have much control over the contents of s, so I can't guarantee that it will be well-formed UTF-8. Mostly it is, but if not, the XML which is generated will be malformed.
What I'd like to find is a way to convert s to valid UTF-8, with any invalid character sequences in s replaced with escapes or removed.
Alternatively, if there is an alternative to xmlTextWriterWriteAttribute, or some option I can pass in when initializing the XML writer, such that it guarantees that it will always write valid UTF-8, that would be even better.
One more thing to mention is that the solution must work with both Linux and OSX. Ideally writing as little of my own code as possible! :P
If the string is encoded in ASCII, then it will always be a valid UTF-8 string.
This is because UTF-8 is backwards compatible with ASCII encoding.
See the second paragraph on Wikipedia here.
Windows primarily works with UTF-16; this means you will have to convert from UTF-16 to UTF-8 before you pass the string to the XML library.
If you have 8-bit ascii input then you can simply junk any character code > 127.
If you have some dodgy UTF-8 it is quite easy to parse, but the widechar symbol number that you generate might be out of the unicode range. You can use mbrlen() to individually validate each character.
I am describing this using unsigned chars. If you must use signed chars, then >= 128 means < 0.
At its simplest:
Until the terminating null byte:
1. If the next byte is 0, end the loop.
2. If the next byte is < 128, it is ASCII, so keep it.
3. If the next byte is >= 128 and < 128+64, it is a stray continuation byte and invalid on its own - discard it.
4. If the next byte is >= 128+64, it is probably a proper UTF-8 lead byte;
   call size_t mbrlen(const char *s, size_t n, mbstate_t *ps);
   to see how many bytes to keep.
   If mbrlen says the code is bad (either the lead byte or the trail bytes),
   skip 1 byte. Rule 3 will skip the rest.
Even simpler logic just calls mbrlen repeatedly, as it can accept the low ASCII range.
You can assume that all the "furniture" of the file (eg xml <>/ symbols, spaces, quotes and newlines) won't be altered by this edit, as they are all valid 7-bit ascii codes.
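A rough sketch of that loop, assuming a UTF-8 locale has already been selected with setlocale() and that dropping invalid bytes (rather than escaping them) is acceptable; the function name and buffer handling are my own:

#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Copies only the bytes of 'in' that form valid multibyte sequences into
   'out' (which must be at least strlen(in)+1 bytes); invalid bytes are dropped. */
void keep_valid_utf8(const char *in, char *out)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);
    size_t n = strlen(in), i = 0, o = 0;

    while (i < n) {
        size_t len = mbrlen(in + i, n - i, &st);
        if (len == (size_t)-1 || len == (size_t)-2) {
            memset(&st, 0, sizeof st);   /* resync after a bad or truncated byte */
            i++;                         /* skip one byte and try again */
        } else {
            if (len == 0) len = 1;       /* embedded NUL (can't happen here) */
            memcpy(out + o, in + i, len);
            o += len;
            i += len;
        }
    }
    out[o] = '\0';
}

Remember to call setlocale(LC_CTYPE, "") (or an explicit UTF-8 locale) once at startup; in the default "C" locale mbrlen treats every byte as a single-byte character and will never reject anything.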
char is a single-byte character, while Unicode codepoints range from 0 to 0x10FFFF, so how do you represent a UTF character in only one byte?
First of all you need a wchar_t character. Those are used with the wprintf(3) versions of the normal printf(3) routines. If you dig a little into this, you'll see that mapping your UTF codepoints into a valid UTF-8 encoding is straightforward, based on your setlocale(3) settings. Look at the manual pages referenced, and you'll get an idea of the task you are facing.
There's full support for wide character sets in the C standard... but you have to use it through the internationalization libraries and locales available.

Would a C compiler actually have an ASCII look-up table

I know there are a few similar questions around relating to this, but it's still not completely clear.
For example: If in my C source file I have lots of defined string literals, does the compiler, as it translates this source file, go through each character of the strings and use a look-up table to get the ASCII number for each character?
I'd guess that when entering characters dynamically into a running C program from standard input, it is the terminal that is translating actual characters to numbers, but then if we have in the code, for example:
if (ch == 'c'){//.. do something}
the compiler must have its own way of understanding and mapping the characters to numbers?
Thanks in advance for some help with my confusion.
The C standard talks about the source character set, which is the set of characters it expects to find in the source files, and the execution character set, which is the set of characters used natively by the target platform.
For most modern computers that you're likely to encounter, the source and execution character sets will be the same.
A line like if (ch == 'c') will be stored in the source file as a sequence of values from the source character set. For the 'c' part, the representation is likely 0x27 0x63 0x27, where the 0x27s represent the single quote marks and the 0x63 represents the letter c.
If the execution character set of the platform is the same as the source character set, then there's no need to translate the 0x63 to some other value. It can just use it directly.
If, however, the execution character set of the target is different (e.g., maybe you're cross-compiling for an IBM mainframe that still uses EBCDIC), then, yes, it will need a way to look up the 0x63 it finds in the source file to map it to the actual value for a c used in the target character set.
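If you want to see the execution-side value on your own machine, a trivial check (this is my own example; it prints 99, i.e. 0x63, on ASCII/UTF-8 platforms, and would print 131 on an EBCDIC system):

#include <stdio.h>

int main(void)
{
    printf("%d\n", 'c');   /* the execution character set's value for 'c' */
    return 0;
}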
Outside the scope of what's defined by the standard, there's the distinction between character set and encoding. While a character set tells you what characters can be represented (and what their values are), the encoding tells you how those values are stored in a file.
For "plain ASCII" text, the encoding is typically the identity function: A c has the value 0x63, and it's encoded in the file simply as a byte with the value of 0x63.
Once you get beyond ASCII, though, there can be more complex encodings. For example, if your character set is Unicode, the encoding might be UTF-8, UTF-16, or UTF-32, which represent different ways to store a sequence of Unicode values (code points) in a file.
So if your source file uses a non-trivial encoding, the compiler will have to have an algorithm and/or a lookup table to convert the values it reads from the source file into the source character set before it actually does any parsing.
On most modern systems, the source character set is typically Unicode (or a subset of Unicode). On Unix-derived systems, the source file encoding is typically UTF-8. On Windows, the source encoding might be based on a code page, UTF-8, or UTF-16, depending on the code editor used to create the source file.
On many modern systems, the execution character set is also Unicode, but, on an older or less powerful computer (e.g., an embedded system), it might be restricted to ASCII or the characters within a particular code page.
Edited to address follow-on question in the comments
Any tool that reads text files (e.g., an editor or a compiler) has three options: (1) assume the encoding, (2) take an educated guess, or (3) require the user to specify it.
Most unix utilities assume UTF-8 because UTF-8 is ubiquitous in that world.
Windows tools usually check for a Unicode byte-order mark (BOM), which can indicate UTF-16 or UTF-8. If there's no BOM, it might apply some heuristics (IsTextUnicode) to guess the encoding, or it might just assume the file is in the user's current code page.
For files that have only characters from ASCII, guessing wrong usually isn't fatal. UTF-8 was designed to be compatible with plain ASCII files. (In fact, every ASCII file is a valid UTF-8 file.) Also many common code pages are supersets of ASCII, so a plain ASCII file will be interpreted correctly. It would be bad to guess UTF-16 or UTF-32 for plain ASCII, but that's unlikely given how the heuristics work.
Regular compilers don't expend much code dealing with all of this. The host environment can handle many of the details. A cross-compiler (one that runs on one platform to make a binary that runs on a different platform) might have to deal with mapping between character sets and encodings.
Sort of. Except you can drop the ASCII bit, in full generality at least.
The mapping used between int literals like 'c' and the numeric equivalent is a function of the encoding used by the architecture that the compiler is targeting. ASCII is one such encoding, but there are others, and the C standard places only minimal requirements on the encoding, an important one being that '0' through to '9' must be consecutive, in one block, positive and able to fit into a char. Another requirement is that 'A' to 'Z' and 'a' to 'z' must be positive values that can fit into a char.
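That digit guarantee is what makes the usual digit-to-value idiom portable, whatever the execution character set happens to be; a tiny illustration (my own example):

#include <assert.h>

int digit_value(char c)
{
    /* Valid on every conforming C implementation, because '0'..'9' are
       required to be consecutive in the execution character set.
       No such guarantee exists for 'a'..'z' or 'A'..'Z'. */
    assert(c >= '0' && c <= '9');
    return c - '0';
}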
No, the compiler is not required to have such a thing. Think for a minute about a pre-C11 compiler, reading EBCDIC source and translating for an EBCDIC machine. What use would an ASCII look-up table be in such a compiler?
Also think for another minute about what such ASCII look-up table(s) would look like in such a compiler!

using regular expression with unicode string in C

I'm currently using regular expressions on Unicode strings, but I just need to match ASCII characters and thus effectively ignore all non-ASCII characters. Until now the functions in regex.h have worked fine (I'm on Linux so the encoding is UTF-8). But can someone confirm whether it's really OK to do so? Or do I need a regex library for Unicode (like ICU)?
UTF-8 is a variable-length encoding; some characters are 1 byte, some 2, others 3 or 4. You know how many bytes to read by the prefix of each character: a leading 0 bit for 1 byte, 110 for 2 bytes, 1110 for 3 bytes, 11110 for 4 bytes.
If you try to read a UTF-8 string as ASCII, or any other fixed-width encoding, things will go very wrong... unless that UTF-8 string contains nothing but 1 byte characters in which case it matches ASCII.
However, since no multi-byte UTF-8 sequence contains a zero byte, and none of the continuation bytes can be confused with ASCII, and if you really are only matching ASCII, you might be able to get away with it... but I wouldn't recommend it, because there are much better regex options than POSIX, they're easy to use, and why leave a hidden encoding bomb in your code for some sucker to deal with later? (Note: that sucker may be you.)
Instead, use a Unicode-aware regex library like Perl Compatible Regular Expressions (PCRE). PCRE2 is made Unicode-aware by passing the PCRE2_UTF flag to pcre2_compile. PCRE regex syntax is more powerful and more widely understood than POSIX regexes, and PCRE has more features. PCRE is also the engine behind GLib's GRegex, and GLib itself provides a feast of very handy C functions.
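A minimal PCRE2 sketch with UTF mode enabled (link with -lpcre2-8; the pattern and subject here are placeholders of my own, not from the question):

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <stdio.h>

int main(void)
{
    int errcode;
    PCRE2_SIZE erroffset;

    /* With PCRE2_UTF set, "." matches one whole character, so it will
       consume both bytes of the UTF-8 encoded "é" in the subject below. */
    pcre2_code *re = pcre2_compile((PCRE2_SPTR)"caf.", PCRE2_ZERO_TERMINATED,
                                   PCRE2_UTF, &errcode, &erroffset, NULL);
    if (re == NULL) return 1;

    pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
    PCRE2_SPTR subject = (PCRE2_SPTR)"un caf\xc3\xa9 noir";   /* "un café noir" in UTF-8 */
    int rc = pcre2_match(re, subject, PCRE2_ZERO_TERMINATED, 0, 0, md, NULL);
    printf("%s\n", rc > 0 ? "match" : "no match");            /* prints "match" */

    pcre2_match_data_free(md);
    pcre2_code_free(re);
    return 0;
}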
You need to be careful about your patterns and about the text you're going to match.
As an example, given the expression a.b:
"axb" matches
"aèb" does NOT match
The reason is that è is two bytes long when UTF-8 encoded but . would only match the first one.
So as long as you only match sequences of ASCII characters you're safe. If you mix ASCII and non-ASCII characters, you're in trouble.
You can try to match a single UTF-8 encoded "character" with something like:
([\xC0-\xDF].|[\xE0-\xEF]..|\xF0...|.)
but this assumes that the text is encoded correctly (and, frankly, I never tried it).
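For completeness, here is a small regex.h sketch of the original scenario (a pure-ASCII pattern run over UTF-8 text), which works because the bytes of a multi-byte UTF-8 character can never be mistaken for ASCII; the pattern and subject are made up for illustration:

#include <regex.h>
#include <stdio.h>

int main(void)
{
    regex_t re;
    /* Pure-ASCII pattern; the UTF-8 bytes of "é" can never be mistaken
       for the ASCII characters 'w', 'o', 'r', 'l' or 'd'. */
    if (regcomp(&re, "world", REG_EXTENDED | REG_NOSUB) != 0) return 1;

    const char *subject = "h\xc3\xa9llo world";   /* "héllo world" in UTF-8 */
    printf("%s\n", regexec(&re, subject, 0, NULL, 0) == 0 ? "match" : "no match");

    regfree(&re);
    return 0;
}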

How to convert Unicode escaped characters to utf8?

I saw the other questions about the subject but all of them were missing important details:
I want to convert \u00252F\u00252F\u05de\u05e8\u05db\u05d6 to utf8. I understand that you look through the stream for \u followed by four hex which you convert to bytes. The problems are as follows:
I heard that sometimes you look for 4 bytes after and sometimes 6 bytes after, is this correct? If so, then how do you determine which it is? E.g. is \u00252F 4 or 6 bytes?
In the case of \u0025 this maps to one byte instead of two (0x25), why? Is the four-hex group supposed to represent UTF-16, which I am supposed to convert to UTF-8?
How do I know whether the text is supposed to be the literal characters \u0025 or the Unicode escape sequence? Does that mean that all backslashes must be escaped in the stream?
Lastly, am I being stupid in doing this by hand when I can use iconv to do this for me?
If you have the iconv interfaces at your disposal, you can simply convert the \u0123\uABCD etc. sequences to an array of bytes 01 23 AB CD ..., replacing any unescaped ASCII characters with a 00 byte followed by the ASCII byte, then run the array through iconv with a conversion descriptor obtained by iconv_open("UTF-8", "UTF-16BE").
Of course you can also do it much more efficiently working directly with the input yourself, but that requires reading and understanding the Unicode specification of UTF-16 and UTF-8.
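Going back to the iconv route, a minimal sketch, assuming the escapes have already been turned into a big-endian UTF-16 byte array (error handling kept to a bare minimum; the sample bytes are an arbitrary UTF-16BE example: U+0025, U+002F, U+05DE, U+05E8):

#include <iconv.h>
#include <stdio.h>

int main(void)
{
    char in[] = { 0x00, 0x25, 0x00, 0x2F, 0x05, (char)0xDE, 0x05, (char)0xE8 };
    char out[64];

    iconv_t cd = iconv_open("UTF-8", "UTF-16BE");
    if (cd == (iconv_t)-1) return 1;

    char *inp = in, *outp = out;
    size_t inleft = sizeof in, outleft = sizeof out;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) return 1;
    iconv_close(cd);

    fwrite(out, 1, sizeof out - outleft, stdout);   /* prints "%/מר" */
    putchar('\n');
    return 0;
}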
In some conventions (like C++11 string literals), you parse a specific number of hex digits: four after \u and eight after \U. That may or may not be the convention for the input you provided, but it seems a reasonable guess. With other styles, like C++'s \x, you parse as many hex digits as you can find after the \x, which means that you have to jump through some hoops if you do want to put a literal hex digit immediately after one of these escaped characters.
Once you have all the values, you need to know what encoding they're in (e.g., UTF-16 or UTF-32) and what encoding you want (e.g., UTF-8). You then use a function to create a new string in the new encoding. You can write such a function (if you know enough about both encoding formats), or you can use a library. Some operating systems may provide such a function, but you might want to use a third-party library for portability.

ANSI C UTF-8 problem

First off, I am developing a platform-independent library using ANSI C (not C++ and no non-standard libs like the MS CRT or glibc, ...).
After a few searches, I found that one of the best ways to do internationalization in ANSI C is to use the UTF-8 encoding.
In utf-8:
strlen(s): always counts the number of bytes.
mbstowcs(NULL,s,0): counts the number of characters.
But I have some problems when I want random access to the elements (characters) of a UTF-8 string.
In ASCII encoding:
char get_char(char* ascii_str, int n)
{
    // It is very FAST.
    return ascii_str[n];
}
In UTF-16/32 encoding:
wchar_t get_char(wchar_t* wstr, int n)
{
    // It is very FAST.
    return wstr[n];
}
And here my problem in UTF-8 encoding:
// What is the return type?
// Because a UTF-8 character occupies 8, 16, 24 or 32 bits.
/*?*/ get_char(char* utf8str, int n)
{
    // I can find the Nth character of the string with a loop,
    // but it is too slow.
    // What is the best way?
}
Thanks.
Perhaps you're thinking about this a bit wrongly. UTF-8 is an encoding which is useful for serializing data, e.g. writing it to a file or the network. It is a very non-trivial encoding, though, and a raw string of Unicode codepoints can end up in any number of encoded bytes.
What you should probably do, if you want to handle text (given your description), is to store raw, fixed-width strings internally. If you're going for Unicode (which you should), then you need 21 bits per codepoint, so the nearest integral type is uint32_t. In short, store all your strings internally as arrays of integers. Then you can random-access each codepoint.
Only encode to UTF-8 when you are writing to a file or console, and decode from UTF-8 when reading.
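A rough hand-rolled decoder along those lines (it assumes the input is already valid UTF-8 and does not check for overlong forms or surrogates, so treat it as a sketch rather than production code):

#include <stddef.h>
#include <stdint.h>

/* Decodes UTF-8 'in' (len bytes) into 'out' codepoints; returns the number
   of codepoints written. 'out' must have room for at least 'len' entries. */
size_t utf8_to_codepoints(const unsigned char *in, size_t len, uint32_t *out)
{
    size_t n = 0, i = 0;
    while (i < len) {
        unsigned char b = in[i];
        if (b < 0x80) {                         /* 1-byte: plain ASCII */
            out[n++] = b;
            i += 1;
        } else if (b < 0xE0) {                  /* 2-byte sequence */
            out[n++] = ((uint32_t)(b & 0x1F) << 6) | (in[i+1] & 0x3F);
            i += 2;
        } else if (b < 0xF0) {                  /* 3-byte sequence */
            out[n++] = ((uint32_t)(b & 0x0F) << 12)
                     | ((uint32_t)(in[i+1] & 0x3F) << 6)
                     |  (in[i+2] & 0x3F);
            i += 3;
        } else {                                /* 4-byte sequence */
            out[n++] = ((uint32_t)(b & 0x07) << 18)
                     | ((uint32_t)(in[i+1] & 0x3F) << 12)
                     | ((uint32_t)(in[i+2] & 0x3F) << 6)
                     |  (in[i+3] & 0x3F);
            i += 4;
        }
    }
    return n;
}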
By the way, a Unicode codepoint is still a long way from a character. The concept of a character is just far too high-level to have a simple general mechanic. (E.g. "a" + "accent grave" -- two codepoints, how many characters?)
You simply can't. If you do need a lot of such queries, you can build an index for the UTF-8 string, or convert it to UTF-32 up front. UTF-32 is a better in-memory representation while UTF-8 is good on disk.
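If you do keep UTF-8 in memory without an index, stepping to the Nth codepoint means skipping continuation bytes (those of the form 10xxxxxx), which is why access is O(n); a small sketch (function name is my own):

#include <stddef.h>

/* Returns a pointer to the start of the Nth codepoint (0-based) of a
   NUL-terminated UTF-8 string, or NULL if the string has fewer codepoints.
   Continuation bytes have the form 10xxxxxx, i.e. (byte & 0xC0) == 0x80. */
const char *utf8_index(const char *s, size_t n)
{
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) == 0x80)
            continue;                 /* continuation byte: not a new codepoint */
        if (n == 0)
            return s;                 /* found the start of the Nth codepoint */
        n--;
    }
    return NULL;                      /* string is shorter than n+1 codepoints */
}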
By the way, the code you listed for UTF-16 is not correct either. You may want to take care of surrogate pairs.
What do you want to count? As Kerrek SB has noted, you can have decomposed glyphs, i.e. "é" can be represented as a single character (LATIN SMALL LETTER E WITH ACUTE, U+00E9), or as two characters (LATIN SMALL LETTER E, U+0065, followed by COMBINING ACUTE ACCENT, U+0301). Unicode has composed and decomposed normalization forms.
What you are probably interested in counting is not characters but grapheme clusters. You need some higher-level library to deal with this, and to deal with normalization forms, proper (locale-dependent) collation, proper line-breaking, proper case-folding (e.g. German ß -> SS), proper bidi support, etc... Real i18n is complex.
Contrary to what others have said, I don't really see a benefit in using UTF-32 instead of UTF-8: when processing text, grapheme clusters (or 'user-perceived characters') are far more useful than Unicode characters (i.e. raw codepoints), so even UTF-32 has to be treated as a variable-length encoding.
If you do not want to use a dedicated library, I suggest using UTF-8 as on-disk, endian-agnostic representation and modified UTF-8 (which differs from UTF-8 by encoding the zero character as a two-byte sequence) as in-memory representation compatible with ASCIIZ.
The necessary information for splitting strings into grapheme clusters can be found in Unicode Standard Annex #29 and the Unicode Character Database.
