C11 Unicode Support

I am writing some string conversion functions similar to atoi() or strtoll(). I wanted to include a version of my function that would accept a char16_t* or char32_t* instead of just a char* or wchar_t*.
My function works fine, but as I was writing it I realized that I do not understand what char16_t or char32_t are. I know that the standard only requires that they are integer types of at least 16 or 32 bits respectively, but the implication is that they are UTF-16 and UTF-32.
I also know that the standard defines a couple of functions, but they did not include any *get or *put functions (like they did when they added wchar.h in C99).
So I am wondering: what do they expect me to do with char16_t and char32_t?

That's a good question with no apparent answer.
The uchar.h types and functions added in C11 are largely useless. They only support conversions between the new type (char16_t or char32_t) and the locale-specific, implementation-defined multibyte encoding, mappings which are not going to be complete unless the locale is UTF-8 based. The useful conversions (to/from wchar_t, and to/from UTF-8) are not supported. Of course you can roll your own for conversions to/from UTF-8 since these conversions are 100% specified by the relevant RFCs/UCS/Unicode standards, but be careful: most people implement them wrong and have dangerous bugs.
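For illustration, here is a minimal sketch of what the C11 functions actually offer: mbrtoc32() converts from the locale's multibyte encoding to char32_t, so it only yields the expected result if a UTF-8 locale can be selected (the locale name below is an assumption and may differ per system):
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <uchar.h>

int main(void)
{
    /* Conversion goes through the locale's multibyte encoding, so this
       only works as expected under a UTF-8 based locale (name may vary). */
    setlocale(LC_ALL, "en_US.UTF-8");

    const char *mb = "\xC3\xA9";                 /* "é" encoded as UTF-8 */
    char32_t c32 = 0;
    mbstate_t state = {0};

    size_t rc = mbrtoc32(&c32, mb, strlen(mb), &state);
    if (rc != (size_t)-1 && rc != (size_t)-2)
        printf("U+%04lX\n", (unsigned long)c32); /* prints U+00E9 */
    return 0;
}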
Note that the new compiler-level features for UTF-8, UTF-16, and UTF-32 literals (u8, u, and U, respectively) are potentially useful; you can process the resulting strings with your own functions in meaningful ways that don't depend at all on locale. But the library-level support for Unicode in C11 is, in my opinion, basically useless.

Testing whether a UTF-16 or UTF-32 character in the ASCII range is one of the "usual" 10 digits, +, -, or "normal" white-space is easy, as is converting '0'-'9' to a digit value. Given that, atoi_utf16/32() proceeds like atoi(): simply inspect one character at a time (a sketch of this ASCII-only case follows the steps below).
Testing whether some other UTF-16/UTF-32 character is a digit or white-space is harder. The code would need an extended isspace() and isdigit(), which can be had by switching locales (setlocale()) if the needed locale is available. (Note: the locale likely needs to be restored when the function is done.)
Converting a character that passes isdigit() but is not one of the usual 10 digits to its value is problematic. In any case, that does not appear to even be allowed.
Conversion steps:
Set the locale to one corresponding to UTF-16/UTF-32.
Use isspace() for white-space detection.
Convert in a similar fashion for your_atof().
Restore the locale.
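Returning to the easy ASCII-range case described above, here is a minimal sketch of atoi_utf32() (error handling and overflow checks omitted; it deliberately recognizes only the ASCII digits, signs and white-space):
#include <uchar.h>

/* ASCII-only atoi for char32_t strings: only the "usual" digits, sign and
   white-space in the ASCII range are handled, as discussed above. */
long atoi_utf32(const char32_t *s)
{
    long sign = 1, value = 0;

    while (*s == U' ' || (*s >= U'\t' && *s <= U'\r'))   /* ASCII white-space */
        s++;
    if (*s == U'+' || *s == U'-') {
        if (*s == U'-')
            sign = -1;
        s++;
    }
    while (*s >= U'0' && *s <= U'9')                     /* '0'-'9' to digit value */
        value = value * 10 + (long)(*s++ - U'0');
    return sign * value;
}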

This question may be a bit old, but I'd like to touch on implementing your functions with char16_t and char32_t support.
The easiest way to do this is to write your strtoull function using the char32_t type (call it something like strtoull_c32). This makes parsing Unicode easier because every character in UTF-32 occupies exactly four bytes. Then implement strtoull_c16 and strtoull_c8 by internally converting the UTF-8 and UTF-16 input to UTF-32 and passing it to strtoull_c32.
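For the UTF-16 case the surrogate-pair rules are fully specified by Unicode, so a small hand-rolled decoder is enough. A rough sketch (the helper name is made up, and malformed sequences are mapped to U+FFFD):
#include <uchar.h>

/* Decode one UTF-16 sequence into a UTF-32 code point, advancing *s. */
char32_t next_c32_from_c16(const char16_t **s)
{
    char16_t lead = *(*s)++;
    if (lead >= 0xD800 && lead <= 0xDBFF) {              /* high surrogate */
        char16_t trail = **s;
        if (trail >= 0xDC00 && trail <= 0xDFFF) {        /* low surrogate */
            (*s)++;
            return 0x10000 + (((char32_t)(lead - 0xD800) << 10)
                              | (char32_t)(trail - 0xDC00));
        }
        return 0xFFFD;                                   /* unpaired high surrogate */
    }
    if (lead >= 0xDC00 && lead <= 0xDFFF)
        return 0xFFFD;                                   /* stray low surrogate */
    return (char32_t)lead;                               /* BMP code point */
}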
I honestly haven't looked at the Unicode facilities in the C11 standard library, but if they don't provide a suitable way for converting those types to UTF-32 then you can use a third party library to make the conversion for you.
There's ICU, which was started by IBM and then adopted by the Unicode Consortium. It's a very feature-rich and stable library that's been around for a long time.
I started a UTF library (UTFX) for C89 recently, that you could use for this too. It's pretty simple and lightweight, unit tested and documented. You could give that a go, or use it to learn more about how UTF conversions work.

Related

How can I convert a PCHAR* to a TCHAR*?

I am looking for ways to convert a PCHAR* variable to a TCHAR* without getting any warnings in Visual Studio (this is a requirement).
Looking online I can't find a function or a method to do so without warnings. Maybe somebody has come across something similar?
Thank you!
convert a PCHAR* variable to a TCHAR*
PCHAR is a typedef that resolves to char*, so PCHAR* means char**.
TCHAR is a macro #define'd to either the "wide" wchar_t or the "narrow" char.
In neither case can you (safely) convert between a char ** and a simple character pointer, so the following assumes the question is actually about converting a PCHAR to a TCHAR*.
PCHAR is the same as TCHAR* in ANSI builds, and no conversion would be necessary in that case, so it can be further assumed that the question is about Unicode builds.
The PCHAR comes from the function declaration (can't be changed) and the TCHAR comes from GetCurrentDirectory. I want to concatenate the two using _tcscat_s but I need to convert the PCHAR first.
The general question of converting between narrow and wide strings has been answered before, see for example Convert char * to LPWSTR or How to convert char* to LPCWSTR?. However, in this particular case, you could weigh the alternatives before choosing the general approaches.
Change your build settings to ANSI, instead of Unicode, then no conversion is necessary.
That's as easy as making sure neither UNICODE nor _UNICODE macros are defined when compiling, or changing in the IDE the project Configuration Properties / Advanced / Character Set from Use Unicode Character Set to either Not Set or Use Multi-Byte Character Set.
Disclaimer: it is retrograde nowadays to compile against an 8-bit Windows codepage. I am not advising it, and doing that means many international characters cannot be represented literally. However, a chain is only as strong as its weakest link, and if you are forced to use narrow strings returned by an external function that you cannot change, then that's limiting the usefulness of going full Unicode elsewhere.
Keep the build as Unicode, but change just the concatenation code to use ANSI strings.
This can be done by explicitly calling the ANSI version GetCurrentDirectoryA of the API, which returns a narrow string. Then you can strcat that directly with the other PCHAR string.
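A rough sketch of that alternative (build_path is a made-up wrapper; error checking omitted):
#include <windows.h>
#include <string.h>

void build_path(const char *pszFile)              /* narrow string from the external function */
{
    char szDir[MAX_PATH];
    GetCurrentDirectoryA(MAX_PATH, szDir);        /* ANSI variant returns a narrow path */
    strcat_s(szDir, MAX_PATH, "\\");
    strcat_s(szDir, MAX_PATH, pszFile);           /* szDir now holds "<dir>\<file>" */
}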
Keep it as is, but combine the narrow and wide strings using [w]printf instead of _tcscat_s.
#include <windows.h>                              // GetCurrentDirectoryW, wsprintf
#include <stdlib.h>                               // _MAX_PATH

char szFile[] = "test.txt";
PCHAR pszFile = szFile;                           // narrow string from ext function
wchar_t wszDir[_MAX_PATH];
GetCurrentDirectoryW(_MAX_PATH, wszDir);          // wide string from own code
wchar_t wszPath[_MAX_PATH];
wsprintf(wszPath, L"%ws\\%hs", wszDir, pszFile);  // combined into wide string

What exactly are char16_t and char32_t, and where can I find them?

I was looking for char16_t and char32_t, since I'm working with Unicode, and all I could find on the Web was that they are inside uchar.h. I found said header inside the iOS SDK (not the macOS one, for some reason), but there were no such types in it. I saw them in a different header, though, but I could not find where they're defined. Also, the info on the internet is scarce at best, so I'm kinda lost here; but I did read that wchar_t should not be used for Unicode, which is exactly what I've been doing so far, so please help :(
char16_t and char32_t are specified in the C standard. (Citations below are from the 2018 standard.)
Per clause 7.28, the header <uchar.h> declares them as unsigned integer types to be used for 16-bit and 32-bit characters, respectively. You should not have to hunt for them in any other header; #include <uchar.h> should suffice.
Also per clause 7.28, each of these types is the narrowest unsigned integer type with at least the required number of bits. (For example, on an implementation that supported only unsigned integers of 8, 18, 24, 36, and 50 bits, char16_t would have to be the 18-bit type; it could not be 24, and char32_t would have to be 36.)
Per clause 6.4.5, when a string literal is prefixed by u or U, as in u"abc" or U"abc", it is a wide string literal in which the elements have type char16_t or char32_t, respectively.
Per clause 6.10.8.2, if the C implementation defines the preprocessor macro __STDC_UTF_16__ to be 1, it indicates that char16_t values are UTF-16 encoded. Similarly, __STDC_UTF_32__ indicates char32_t values are UTF-32 encoded. In the absence of these macros, no assertion is made about the encodings.
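A small self-contained example pulling those pieces together (nothing here is locale-dependent):
#include <stdio.h>
#include <uchar.h>

int main(void)
{
    /* Elements of u"" and U"" literals have type char16_t and char32_t. */
    const char16_t *s16 = u"abc";
    const char32_t *s32 = U"abc";

#if defined(__STDC_UTF_16__) && defined(__STDC_UTF_32__)
    puts("char16_t literals are UTF-16, char32_t literals are UTF-32");
#endif
    printf("first code units: %u and %lu\n",
           (unsigned)s16[0], (unsigned long)s32[0]);
    return 0;
}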
Microsoft has a fair description: https://learn.microsoft.com/en-us/cpp/cpp/char-wchar-t-char16-t-char32-t?view=vs-2017
char is the original, typically 8-bit, character representation.
wchar_t is a "wide char", 16 bits, used by Windows. Microsoft was an early adopter of Unicode; unfortunately, that stuck them with this encoding, which is essentially only used on Windows.
char16_t and char32_t are used for UTF-16 and UTF-32, respectively.
Most non-Windows systems use UTF-8 for encoding (and even Windows 10 is adopting this, https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8). UTF-8 is by far the most common encoding used today on the web. (ref: https://en.wikipedia.org/wiki/UTF-8)
UTF-8 is stored in a series of chars. UTF-8 is likely the encoding you will find simplest to adopt, depending on your OS.
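For example, a C11 u8"" string literal is stored as ordinary chars holding UTF-8 bytes (the source file is assumed to be UTF-8 here):
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *s = u8"Grüße";                  /* 5 characters, stored as UTF-8 bytes */
    printf("%zu bytes\n", strlen(s));           /* 7: ü and ß take two bytes each */
    return 0;
}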

Starting from which version does the C standard support non-alphanumeric variable names? [duplicate]

I find in the new C++ Standard
2.11 Identifiers [lex.name]
identifier:
identifier-nondigit
identifier identifier-nondigit
identifier digit
identifier-nondigit:
nondigit
universal-character-name
other implementation-defined character
with the additional text
An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified
in E.1. [...]
I cannot quite comprehend what this means. From the old standard I am used to a "universal character name" being written \u89ab, for example. But using those in an identifier...? Really?
Is the new standard more open w.r.t to Unicode? And I do not refer to the new Literal Types "uHello \u89ab thing"u32, I think I understood those. But:
Can (portable) source code be in any unicode encoding, like UTF-8, UTF-16 or any (how-ever-defined) codepage?
Can I write an identifier with \u1234 in it myfu\u1234ntion (for whatever purpose)
Or can i use the "character names" that unicode defines like in the ICU, i.e.
const auto x = "German Braunb\U{LOWERCASE LETTER A WITH DIARESIS}r."u32;
or even in an identifier in the source itself? That would be a treat... cough...
I think the answer to all these questions is no, but I cannot map this reliably to the wording in the standard... :-)
Edit: I found "2.2 Phases of translation [lex.phases]", Phase 1:
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set [...] if necessary. The set of physical source file characters accepted is implementation-defined. [...] Any source file character not in the basic
source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
By reading this I now think that a compiler may choose to accept UTF-8, UTF-16 or any codepage it wishes (by meta information or user configuration). In Phase 1 it translates this into an ASCII form (the "basic source character set"), in which the Unicode characters are replaced by their \uNNNN notation (or the compiler can choose to continue to work in its Unicode representation, but then has to make sure it handles the other \uNNNN spellings the same way).
What do you think?
Is the new standard more open w.r.t to Unicode?
With respect to allowing universal character names in identifiers the answer is no; UCNs were allowed in identifiers back in C99 and C++98. However compilers did not implement that particular requirement until recently. Clang 3.3 I think introduces support for this and GCC has had an experimental feature for this for some time. Herb Sutter also mentioned during his Build 2013 talk "The Future of C++" that this feature would also be coming to VC++ at some point. (Although IIRC Herb refers to it as a C++11 feature; it is in fact a C++98 feature.)
It's not expected that identifiers will be written using UCNs. Instead the expected behavior is to write the desired character using the source encoding. E.g., source will look like:
long pörk;
not:
long p\u00F6rk;
However UCNs are also useful for another purpose; Compilers are not all required to accept the same source encodings, but modern compilers all support some encoding scheme where at least the basic source characters have the same encoding (that is, modern compilers all support some ASCII compatible encoding).
UCNs allow you to write source code with only the basic characters and yet still name extended characters. This is useful in, for example, writing a string literal "°" in source code that will be compiled both as CP1252 and as UTF-8:
char const *degree_sign = "\u00b0";
This string literal is encoded into the appropriate execution encoding on multiple compilers, even when the source encodings differ, as long as the compilers at least share the same encoding for basic characters.
Can (portable) source code be in any unicode encoding, like UTF-8, UTF-16 or any (how-ever-defined) codepage?
It's not required by the standard, but most compilers will accept UTF-8 source. Clang supports only UTF-8 source (although it has some compatibility for non-UTF-8 data in character and string literals), gcc allows the source encoding to be specified and includes support for UTF-8, and VC++ will guess at the encoding and can be made to guess UTF-8.
(Update: VS2015 now provides an option to force the source and execution character sets to be UTF-8.)
Can I write an identifier with \u1234 in it myfu\u1234ntion (for whatever purpose)
Yes, the specification mandates this, although as I said not all compilers implement this requirement yet.
Or can i use the "character names" that unicode defines like in the ICU, i.e.
const auto x = "German Braunb\U{LOWERCASE LETTER A WITH DIARESIS}r."u32;
No, you cannot use Unicode long names.
or even in an identifier in the source itself? That would be a treat... cough...
If the compiler supports a source code encoding that contains the extended character you want then that character written literally in the source must be treated exactly the same as the equivalent UCN. So yes, if you use a compiler that supports this requirement of the C++ spec then you may write any character in its source character set directly in the source without bothering with writing UCNs.
I think the intent is to allow Unicode characters in identifiers, such as:
long pöjk;
ostream* å;
I suggest using clang++ instead of g++. Clang is designed to be highly compatible with GCC (wikipedia-source), so you can most likely just substitute that command.
I wanted to use Greek symbols in my source code.
If code readability is the goal, then it seems reasonable to use (for example) α over alpha. Especially when used in larger mathematical formulas, they can be read more easily in the source code.
To achieve this, this is a minimal working example:
> cat /tmp/test.cpp
#include <iostream>
int main()
{
    int α = 10;
    std::cout << "α = " << α << std::endl;
    return 0;
}
> clang++ /tmp/test.cpp -o /tmp/test
> /tmp/test
α = 10
This article https://www.securecoding.cert.org/confluence/display/seccode/PRE30-C.+Do+not+create+a+universal+character+name+through+concatenation works with the idea that int \u0401; is compliant code, though it's based on C99, instead of C++0x.
Present versions of gcc (up to version 5.2 so far) only support ASCII and in some cases EBCDIC input files. Therefore, Unicode characters in identifiers have to be represented using \uXXXX and \UXXXXXXXX escape sequences in ASCII-encoded files. While it may be possible to represent Unicode characters as ??/uXXXX and ??/UXXXXXXXX in EBCDIC-encoded input files, I have not tested this. At any rate, a simple one-line patch to cpp allows direct reading of UTF-8 input provided a recent version of iconv is installed. Details are in
https://www.raspberrypi.org/forums/viewtopic.php?p=802657
and may be summarized by the patch
diff -cNr gcc-5.2.0/libcpp/charset.c gcc-5.2.0-ejo/libcpp/charset.c
*** gcc-5.2.0/libcpp/charset.c Mon Jan 5 04:33:28 2015
--- gcc-5.2.0-ejo/libcpp/charset.c Wed Aug 12 14:34:23 2015
***************
*** 1711,1717 ****
struct _cpp_strbuf to;
unsigned char *buffer;
! input_cset = init_iconv_desc (pfile, SOURCE_CHARSET, input_charset);
if (input_cset.func == convert_no_conversion)
{
to.text = input;
--- 1711,1717 ----
struct _cpp_strbuf to;
unsigned char *buffer;
! input_cset = init_iconv_desc (pfile, "C99", input_charset);
if (input_cset.func == convert_no_conversion)
{
to.text = input;

Converting a UTF-8 text to wchar_t

I know this question has been asked quite a few times here, and I did read some of the answers, but there are a few suggested solutions and I'm trying to figure out the best of them.
I'm writing a C99 app that basically receives XML text encoded in UTF-8.
Part of its job is to copy and manipulate that string (finding a substring, concatenating it, etc.).
As I would rather not use an outside non-standard library right now, I'm trying to implement it using wchar_t.
Currently, I'm using mbstowcs to convert it to wchar_t for easy manipulation, and for some input I tried in different languages it worked fine.
Thing is, I did read that some people out there had issues with UTF-8 and mbstowcs, so I would like to hear whether this use is permitted/acceptable.
Another option I faced was using iconv with the WCHAR_T parameter. Thing is, I'm working on a platform (not a PC) whose locale support is very limited, essentially only the ANSI C locale. How about that?
I did also encounter some C++ library which is very popular, but I'm limited to a C99 implementation.
Also, I would be compiling this code on another platform, where the size of wchar_t is different (2 bytes versus 4 bytes on my machine). How can I overcome that? Using fixed-size char containers? But then, which manipulation functions should I use instead?
Happy to hear some thoughts. Thanks.
C does not define what encoding the char and wchar_t types use, and the standard library only mandates some functions that translate between the two without saying how. If the implementation-defined multibyte encoding of char is not UTF-8, then calling mbstowcs on UTF-8 input will result in data corruption.
As noted in the rationale for the C99 standard:
However, the five functions are often too restrictive and too primitive to develop portable international programs that manage characters.
...
C90 deliberately chose not to invent a more complete multibyte- and wide-character library, choosing instead to await their natural development as the C community acquired more experience with wide characters.
Sourced from here.
So, if you have UTF-8 data in your chars there isn't a standard API way to convert that to wchar_ts.
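The closest thing to a standard route is the locale-dependent one: if a UTF-8 locale can be selected, mbstowcs will interpret the chars as UTF-8. A rough sketch (the locale names are assumptions and may not exist on a constrained platform like the one described):
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    /* mbstowcs only treats the input as UTF-8 if the locale's multibyte
       encoding is UTF-8; these locale names may not be available. */
    if (setlocale(LC_CTYPE, "C.UTF-8") == NULL &&
        setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
        return 1;

    const char *utf8 = "\xC3\xA9t\xC3\xA9";     /* "été" as UTF-8 bytes */
    wchar_t wbuf[16];
    size_t n = mbstowcs(wbuf, utf8, 16);
    if (n == (size_t)-1)
        return 1;
    printf("%zu wide characters\n", n);         /* 3 under a UTF-8 locale */
    return 0;
}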
In my opinion wchar_t should usually be avoided unless necessary - you might need it if you're using WIN32 APIs for example. I am not convinced it will simplify string manipulation. wchar_t is always UTF-16LE on Windows so you may still need to have more than one wchar_t to represent a single Unicode code point anyway.
I suggest you investigate the ICU project - at least from an educational standpoint.
Also, i would be compiling this code on another platform, which the
sizeof of wchar_t is different (2 bytes versus 4 bytes on my machine).
How can i overcome that? using fixed-size char containers?
You could do that with conditional typedefs like this:
#include <stdint.h>

#if defined(__STDC_UTF_16__)
#include <uchar.h>              /* in C, char16_t/char32_t come from <uchar.h> */
typedef char16_t CHAR16;
#elif defined(_WIN32)
typedef wchar_t CHAR16;         /* wchar_t is 16 bits wide on Windows */
#else
typedef uint16_t CHAR16;
#endif

#if defined(__STDC_UTF_32__)
#include <uchar.h>
typedef char32_t CHAR32;
#elif defined(__STDC_ISO_10646__)
typedef wchar_t CHAR32;         /* wchar_t values are ISO 10646 code points */
#else
typedef uint32_t CHAR32;
#endif
This will define the typedefs CHAR16 and CHAR32 to use the new char16_t and char32_t types where available, but otherwise fall back to wchar_t when it has a suitable width or encoding, and to fixed-width unsigned integers otherwise.

Why weren't new (bit-width-specific) printf() format option strings adopted as part of C99?

While researching how to do cross-platform printf() format strings in C (that is, taking into account the number of bits I expect each integer argument to printf() should be) I ran across this section of the Wikipedia article on printf(). The article discusses non-standard options that can be passed to printf() format strings, such as (what seems to be a Microsoft-specific extension):
printf("%I32d\n", my32bitInt);
It goes on to state that:
ISO C99 includes the inttypes.h header file that includes a number of macros for use in platform-independent printf coding.
... and then lists a set of macros that can be found in said header. Looking at the header file, to use them I would have to write:
printf("%"PRId32"\n", my32bitInt);
My question is: am I missing something? Is this really the standard C99 way to do it? If so, why? (Though I'm not surprised that I have never seen code that uses the format strings this way, since it seems so cumbersome...)
The C Rationale seems to imply that <inttypes.h> is standardizing existing practice:
<inttypes.h> was derived from the header of the same name found on several existing 64-bit systems.
but the remainder of the text doesn't say anything more about those macros, and I don't remember them being existing practice at the time.
What follows is just speculation, but educated by experience of how standardization committees work.
One advantage of the C99 macros over standardizing additional format specifiers for printf (note that C99 also did add some) is that, when you already have an implementation supporting the required features in an implementation-specific way, providing <inttypes.h> and <stdint.h> is just a matter of writing two files with the adequate typedefs and macros. That reduces the cost of making an existing implementation conformant, reduces the risk of breaking existing programs which relied on the implementation-specific features (the standard way doesn't interfere), and facilitates porting conformant programs to implementations that don't have these headers (they can be provided by the program). Additionally, since the implementation-specific ways already varied at the time, it doesn't favor one implementation over another.
Correct, this is how the C99 standard says you should use them. If you want truly portable code that is 100% standards-conformant to the letter, you should always print an int using "%d" and an int32_t using "%"PRId32.
Most people won't bother, though, since there are very few cases where failure to do so would matter. Unless you're porting your code to Win16 or DOS, you can assume that sizeof(int32_t) <= sizeof(int), so it's harmless to accidentally printf an int32_t as an int. Likewise, a long long is pretty much universally 64 bits (although it is not guaranteed to be so), so printing an int64_t as a long long (e.g. with a %llx specifier) is safe as well.
The types int_fast32_t, int_least32_t, et al are hardly ever used, so you can imagine that their corresponding format specifiers are used even more rarely.
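For completeness, a short sketch of what the "cumbersome" macro-based form looks like in practice:
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    int32_t a = 123;
    int64_t b = -456;
    uint64_t c = 789;

    /* Each PRI* macro expands to a string with the right length modifier
       for this platform, and string literal concatenation does the rest. */
    printf("%" PRId32 " %" PRId64 " %" PRIu64 "\n", a, b, c);
    return 0;
}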
You can always cast upwards and use %jd which is the intmax_t format specifier.
printf("%jd\n", (intmax_t)(-2));
I used intmax_t to show that any intXX_t can be used, but for the int32_t case simply casting to long is much better; then use %ld.
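That is, for an int32_t value (variable name made up):
int32_t n = -2;
printf("%ld\n", (long)n);   /* long is at least 32 bits, so the cast is always value-preserving */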
I can only speculate about why. I like AProgrammer's answer above, but there's one aspect overlooked: what are you going to add to printf as a format modifier? There are already two different ways that numbers are used in a printf format string (width and precision). Adding a third kind of number to say how many bits of precision are in the argument would be great, but where are you going to put it without confusing people? Unfortunately, one of the flaws in C is that printf was not designed to be extensible.
The macros are awful, but when you have to write code that is portable across 32-bit and 64-bit platforms, they are a godsend. Definitely saved my bacon.
I think the answer to your question why is either
Nobody could think of a better way to do it, or
The standards committee couldn't agree on anything they felt was clearly better.
Another possibility: backward compatibility. If you add more format specifiers to printf, or additional options, it is possible that a specifier in some pre-C99 code would have a format string interpreted differently.
With the C99 change, you're not changing the functionality of printf.

Resources