isLetter with accented characters in C

I'd like to create (or find) a C function to check if a char c is a letter...
I can do this for a-z and A-Z easily of course.
However, I get an error when testing c == á, ã, ô, ç, ë, etc.
Probably those special characters are stored in more than one char...
I'd like to know:
How these special characters are stored, which arguments my function needs to receive, and how to do it?
I'd also like to know if there is any standard function that already does this.

I think you're looking for the iswalpha() routine:
#include <wctype.h>
int iswalpha(wint_t wc);
DESCRIPTION
The iswalpha() function is the wide-character equivalent of
the isalpha(3) function. It tests whether wc is a wide
character belonging to the wide-character class "alpha".
It does depend upon the LC_CTYPE of the current locale(7), so its use in a program that is supposed to handle multiple types of input correctly simultaneously might not be ideal.
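A minimal sketch of how iswalpha() might be applied to a multibyte string: each character is decoded with mbrtowc() and then classified. The helper name count_letters is hypothetical, and the result depends entirely on the LC_CTYPE locale in effect when it is called.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Count the letters in a multibyte string, decoding it with the
 * encoding of the current LC_CTYPE locale. Returns -1 on an invalid
 * or truncated sequence. count_letters is a hypothetical helper. */
long count_letters(const char *s)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);   /* initial conversion state */
    size_t left = strlen(s);
    long letters = 0;

    while (left > 0) {
        wchar_t wc;
        size_t used = mbrtowc(&wc, s, left, &state);
        if (used == (size_t)-1 || used == (size_t)-2)
            return -1;                 /* invalid or incomplete sequence */
        if (used == 0)
            break;                     /* decoded a null character */
        if (iswalpha((wint_t)wc))
            letters++;
        s += used;
        left -= used;
    }
    return letters;
}
```

A caller would normally run setlocale(LC_CTYPE, "") once at startup so that the decoding matches the user's environment. In the default "C" locale, only ASCII input can be relied on to decode.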

If you are working with single-byte codesets such as ISO 8859-1 or 8859-15 (or any of the other 8859-x codesets), then the isalpha() function will do the job if you also remember to use setlocale(LC_ALL, ""); (or some other suitable invocation of setlocale()) in your program. Without this, the program runs in the C locale, which only classifies the ASCII characters (8859-x characters in the range 0x00..0x7F).
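As a rough illustration of the single-byte case, the sketch below counts the bytes that isalpha() accepts under a named locale. The helper name count_alpha_bytes is hypothetical, and any locale name other than "C" is an assumption about what is installed on the system.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Count bytes that isalpha() accepts under the named locale.
 * Returns -1 if that locale is not available.
 * Note: this changes the process-wide LC_CTYPE setting.
 * count_alpha_bytes is a hypothetical helper name. */
long count_alpha_bytes(const char *s, const char *locale_name)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    long n = 0;
    for (; *s != '\0'; s++) {
        /* Cast first: a plain char may be negative for bytes >= 0x80. */
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```

With "C" as the locale, the byte 0xE9 (é in ISO 8859-1) is not counted. With an installed 8859-1 locale it would be expected to count, but that depends on the system's locale data.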
If you are working with multibyte or wide character codesets (such as UTF8 or UTF16), then you need to look to the wide character functions found in <wchar.h> and <wctype.h>.

How these characters are stored is locale-dependent. On most UNIX systems, they'll be stored as UTF-8, whereas a Win32 machine will likely represent them as UTF-16. UTF-8 stores each character as a variable number of chars (one to four). UTF-16 stores each character as one or two 16-bit code units, and thus inside a wchar_t (or unsigned short) on Windows: characters in the Basic Multilingual Plane, which includes the accented Latin letters in the question, take a single unit, while characters outside it take a surrogate pair. Since sizeof(wchar_t) on Windows is only 2 (vs. typically 4 on *nix), you'll need 2 wchar_t values to store 1 character whenever a surrogate pair is required.
As was mentioned, the iswalpha() routine will do this for you, and is documented here. It should take care of locale-specific issues for you.
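To make the storage point concrete, here is a small sketch of how a code point maps to UTF-16 code units. The helper name utf16_encode is hypothetical; the arithmetic is the standard surrogate-pair formula.

```c
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 code units.
 * Returns the number of units written (1 or 2), or 0 if cp is not a
 * valid scalar value. utf16_encode is a hypothetical helper. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* reserved for surrogates */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP: a single unit */
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;                       /* beyond the Unicode range */
    cp -= 0x10000;                      /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For á (U+00E1) this yields one unit, so a single 16-bit wchar_t holds it. A character such as U+1F600 yields two units.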

You probably want http://site.icu-project.org/. It provides a portable library with APIs for this.

Related

Do strcmp and strstr test binary equivalence?

https://learn.microsoft.com/en-us/windows/win32/intl/security-considerations--international-features
This webpage makes me wonder.
Apparently some windows api may consider two strings equal when they are actually different byte sequences.
I want to know how C standard library behaves in this respect.
in other words, does strcmp(a,b)==0 imply strlen(a)==strlen(b)&&memcmp(a,b,strlen(a))==0?
and what about other string functions, including wide character versions?
edit:
for example, CompareStringW equates L"\x00C5" and L"\x212B"
printf("%d\n",CompareStringW(LOCALE_INVARIANT,0,L"\x00C5",-1,L"\x212B",-1)==CSTR_EQUAL); outputs 1
what I'm asking is whether C library functions never behave like this
Two strings using different encodings can represent the same text even if their byte representations are different.
The standard library strcmp compares plain "character" strings, and in this case strcmp(a,b)==0 does imply strlen(a)==strlen(b)&&memcmp(a,b,strlen(a))==0.
Functions like wcscmp require both strings to be encoded the same way, so their byte representation should be the same.
The regular string functions operate byte-by-byte. The specification says:
The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.
strcmp() and memcmp() do the same comparisons. The only difference is that strcmp() uses the null terminators in the strings as the limit, memcmp() uses a parameter for this, and strncmp() takes a limit parameter and uses whichever comes first.
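The equivalence asked about in the question can be written down directly. This is only a sketch for experimenting; same_bytes and strcmp_agrees are hypothetical helper names.

```c
#include <string.h>

/* The length-plus-memcmp test from the question.
 * same_bytes is a hypothetical helper name. */
int same_bytes(const char *a, const char *b)
{
    size_t len = strlen(a);
    return len == strlen(b) && memcmp(a, b, len) == 0;
}

/* Returns 1 when strcmp() and same_bytes() agree on equality. */
int strcmp_agrees(const char *a, const char *b)
{
    return (strcmp(a, b) == 0) == same_bytes(a, b);
}
```

Since strcmp() stops at the first differing byte or at the terminator, strcmp_agrees() should return 1 for any pair of valid C strings.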
The wide string function specification says:
Unless explicitly stated otherwise, the functions described in this subclause order two wide characters the same way as two integers of the underlying integer type designated by wchar_t.
wcscmp() doesn't say otherwise, so it's also comparing the wide characters numerically, not by converting their encodings to some common character representations. wcscmp() is to wmemcmp() as strcmp() is to memcmp().
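A short sketch using the pair from the question shows the numeric comparison. The function name wcscmp_sees_difference is hypothetical.

```c
#include <wchar.h>

/* U+00C5 (A with ring above) and U+212B (angstrom sign): the pair
 * from the question that CompareStringW reports as equal. */
int wcscmp_sees_difference(void)
{
    const wchar_t ring[]     = { 0x00C5, 0 };
    const wchar_t angstrom[] = { 0x212B, 0 };
    /* wcscmp() compares the element values as integers, so these two
     * one-character strings are different. */
    return wcscmp(ring, angstrom) != 0;
}
```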
On the other hand, wcscoll() compares the strings as interpreted according to the LC_COLLATE category of the current locale. So this may not be equivalent to memcmp().
For other functions you should check the documentation to see whether they reference the locale.
Apparently some windows api may consider two strings equal when they are actually different byte sequences.
Depending on context and where you got those strings from, that would actually be the semantically correct behavior.
There are multiple ways to encode certain characters. The German 'ä', for example. In Unicode, this could be U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, or it could be the sequence of U+0061 LATIN SMALL LETTER A followed by U+0308 COMBINING DIAERESIS. You could desire a comparison function that actually compares these equal. Or you could have them not compare equal, but have a standalone function that turns one representation into the other ("normalization").
You could want a comparison function that compares '6' (six) as equal to '๖' (also six, just in Thai). ("Canonicalization")
The byte string functions (strcmp() etc.) are not capable of any of that. They only deal in byte sequences, and are unaware of anything I wrote above.
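For illustration, here are the two spellings of 'ä' written out as UTF-8 bytes, with strcmp() applied to them. The names precomposed, decomposed, and forms_compare_equal are hypothetical.

```c
#include <string.h>

/* Two UTF-8 spellings of the same visible letter:
 *   precomposed: U+00E4          -> C3 A4
 *   decomposed:  U+0061 U+0308   -> 61 CC 88 */
static const char precomposed[] = "\xC3\xA4";
static const char decomposed[]  = "a\xCC\x88";

/* strcmp() sees only bytes, so this returns 0 (not equal). */
int forms_compare_equal(void)
{
    return strcmp(precomposed, decomposed) == 0;
}
```

Making these compare equal requires normalizing both strings first, which nothing in the C standard library does.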
As for the wide string functions (wcscmp() etc.), well... they are not that either, really.
in other words, does strcmp(a,b)==0 imply strlen(a)==strlen(b)&&memcmp(a,b,strlen(a))==0? and what about other string functions, including wide character versions?
Either will test for binary equivalence, as there are no mechanics in the C Standard Library to normalize or canonicalize strings.[1]
If you are actually dealing in processing strings (as opposed to just passing them through, for which C byte strings and wide strings are adequate), you should use the ICU library, the de facto standard for C/C++ Unicode handling. It looks daunting but actually needs to be to handle all these things correctly.
Basically, any C/C++ API that promises to do the same is either using the ICU library itself, or is very likely not doing what it advertises.
[1]: Actually, strcoll() / strxfrm() and wcscoll() / wcsxfrm() provide enough wiggle room to squeeze in proper Unicode mechanics for collation, but I don't know of an implementation that bothers to do so.

How does C uppercase letters?

I see this code in glibc-2.33/ctype/ctype.c:
// [...]
#define __ctype_toupper \
((int32_t *) _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128)
// [...]
int
toupper (int c)
{
  return c >= -128 && c < 256 ? __ctype_toupper[c] : c;
}
libc_hidden_def (toupper)
I understand that it's checking whether c is at least -128 and less than 256, and returns the character as-is if it's outside that range, but what does (int32_t *) _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128 mean, and where do I actually find the source code of how letters are uppercased? This seems to be looking up the current locale; I am only interested in en_US.UTF-8. Also, how can a character be negative?
I don't care about glibc specifically, I just want to know how all the ASCII characters (all as in from NUL to DEL) are uppercased in C.
"C" doesn't convert characters to upper case. The C standard only mandates that there be a function in the standard library which does so correctly according to the current locale, and that it does so in a particular way in the "C" locale (which is the only locale which is guaranteed to exist).
Library implementations are free to accomplish that task as the implementers see fit, and they all do it in different ways. Even radically different ways. Some C libraries don't support locales other than the "C" locale with an ASCII character set. An example of such a C library is musl and it is hard to beat the simplicity of its implementation:
int toupper(int c)
{
    if (islower(c)) return c & 0x5f;
    return c;
}
As you can see, the above code depends on islower. Here it is:
int islower(int c)
{
    return (unsigned)c-'a' < 26;
}
Because of the call to islower, toupper returns unchanged any argument outside of the range of lower case characters, even arguments not in the valid range for toupper. Since the standard doesn't define the behaviour of toupper for arguments outside of the valid range (essentially values which might be returned by fgetc), just returning invalid arguments unchanged is certainly as acceptable as any other behaviour. Glibc's toupper function will often segfault on invalid arguments, since it uses the argument as an index into an array (as you can see in the code you cite). That behaviour is also acceptable according to the standard.
The Glibc implementation is a lot more complicated. And behind the scenes it depends on the locale data which is compiled from locale definition files, a process which is completely outside of the C standard and somewhat defined by the Posix standard (although the GNU implementation diverges in some way from Posix).
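The "+ 128" in the quoted macro can be illustrated with a small table-driven sketch. This is not glibc's code or data: the table below is filled with ASCII-only mappings purely for demonstration, and all names (upper_table, build_table, sketch_toupper) are hypothetical. It assumes an ASCII-compatible execution character set.

```c
#include <stdint.h>

/* 384 slots cover the index range -128..255 once shifted by 128. */
static int32_t upper_table[384];
static int table_ready;

static void build_table(void)
{
    for (int i = 0; i < 384; i++) {
        int c = i - 128;                 /* value this slot stands for */
        upper_table[i] = (c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c;
    }
    table_ready = 1;
}

int sketch_toupper(int c)
{
    if (!table_ready)
        build_table();
    /* Pointing 128 entries into the array makes shifted[-128] the
     * first slot, so negative char values index it safely. */
    const int32_t *shifted = upper_table + 128;
    return (c >= -128 && c < 256) ? shifted[c] : c;
}
```

The offset is what lets a signed char value such as -96 be used as an index without reading before the start of the array.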
But here's the scoop: If you're using single byte characters in a UTF-8 locale, none of glibc's complicated code makes the slightest difference. The musl implementation works precisely as required in a UTF-8 locale, because the only alphabetic characters representable in a single byte UTF-8 representation are the 52 characters in the "Roman" alphabet. All the other Unicode characters are only representable in wide characters and multibyte sequences.
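A sketch of that claim in practice: an ASCII-only uppercase pass over a UTF-8 string leaves every multibyte sequence untouched. The helper name ascii_upper is hypothetical, and the code assumes an ASCII-compatible execution character set.

```c
/* Uppercase only ASCII a-z, in place. Every byte of a multibyte
 * UTF-8 sequence is >= 0x80, so those bytes are never touched.
 * ascii_upper is a hypothetical helper name. */
void ascii_upper(char *s)
{
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (c >= 'a' && c <= 'z')
            *s = (char)(c - ('a' - 'A'));
    }
}
```

Applied to the UTF-8 bytes of "héllo", this changes h, l, l, o and leaves the two bytes of é as they were. Uppercasing é itself needs wide or multibyte handling.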
Furthermore, environments which use a single-byte encoding other than UTF-8 are increasingly rare. There are certainly a lot of us who had to learn this stuff because our programs ran on a variety of platforms which used different ISO-8859-x code pages. Or different single-byte Windows codepages. But in the end, Unicode won out. (And many of us breathed huge sighs of relief.) So most of this apparatus is no longer really necessary except in legacy environments.
But that's not to say that Unicode magically solves all the complications involved in managing the huge variety of alphabets in use in the world. Far from it. What Unicode does do is two-fold: it clarifies what the complications are (most of which is not captured by C/Posix locales), and it provides some basic standards for implementations.
And, as a side effect, UTF-8 standardises single-byte codes to basically conform with the original ASCII 7-bit standard. So if you're only dealing with 7-bit characters (which, these days, is probably less than ideal), you don't need anything beyond musl-style implementations. And if you are dealing with "all the world's character sets", you'll be looking for a library which actually conforms to Unicode, and which uses something other than char to represent characters.
But one complication is going to remain forever, sadly: the fact that C does not standardise the signedness of char. On platforms on which char is signed (Unix x86 and Windows, for two major examples), (char)0xA0 is (a) implementation-defined and (b) probably -96, which is what a single-byte 0xA0 represents in 2's complement. So if you write code which uses the various functions in ctype.h and don't take care of negative char values, and then you try to use that code with a UTF-8 encoded string which includes characters outside of the single-byte domain, then you will end up passing negative numbers to functions which might not be expecting them.
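The usual way to take care of negative char values is to convert through unsigned char before calling anything from ctype.h. A minimal sketch, with hypothetical helper names ctype_arg and is_alpha_byte:

```c
#include <ctype.h>

/* The value a <ctype.h> function should receive for byte c:
 * always in 0..UCHAR_MAX, never negative. */
int ctype_arg(char c)
{
    return (unsigned char)c;
}

/* isalpha() wrapper that is safe for any char value. */
int is_alpha_byte(char c)
{
    return isalpha(ctype_arg(c)) != 0;
}
```

With this conversion the byte 0xA0 reaches isalpha() as 160, regardless of whether plain char is signed on the platform.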
If you go back to the root of the source tree and search for _NL_CTYPE_TOUPPER, you will find a commit whose message says:
[..] (ctype_output): Support for alternate locale format: Computation of
nelems changes. _NL_CTYPE_TOUPPER32 [...]
So basically _NL_CTYPE_TOUPPER is the 8-bit counterpart of _NL_CTYPE_TOUPPER32. The table is locale data because, for example, French has À as the uppercase version of à.
Following this link you will find the header file langinfo.h that has this enum starting at line 43 and with _NL_CTYPE_TOUPPER defined at line 259.
LC_CTYPE category: character classification.
This information is accessed by the functions in <ctype.h>.
LC_CTYPE is defined for each language, see for example for French:
fr_FR:2000"
Note that it doesn't make a lot of sense to call this function since characters with accent are not contained in the ASCII table, but since this function is the one handling both utf8 and ascii that's how it works.

Understanding and writing wchar_t in C

I'm currently rewriting (a part of) the printf() function for a school project.
Overall, we were required to reproduce the behaviour of the function with several flags, conversions, length modifiers ...
The only thing I have left to do and that gets me stuck are the flags %C / %S (or %lc / %ls).
So far, I've gathered that wchar_t is a type that can store characters on more than one byte, in order to accept more characters or symbols and therefore be compatible with pretty much every language, regardless of their alphabet and special characters.
However, I wasn't able to find any concrete information on what a wchar_t looks like to the machine, its actual length (which apparently varies based on several factors including the compiler, the OS ...) or how to actually write one.
Thank you in advance
Note that we are limited in the functions we are allowed to use. The only allowed functions are write(), malloc(), free(), and exit().
We must be able to code any other required function ourselves.
To sum this up, what I'm asking for here is some information on how to interpret and write "manually" any wchar_t character, with as little code as possible so that I can try to understand the whole process and code it myself.
A wchar_t is similar to a char in the sense that it is a number, but when displaying a char or wchar_t we don't want to see the number, but the drawn character corresponding to the number. The mapping from numbers to characters isn't defined by either char or wchar_t; it depends on the system. So there is no difference in the end usage between char and wchar_t except for their sizes.
Given the above, the most trivial implementation of printf("%ls") is one where you know what are the system encodings for use with char and wchar_t. For example, in my system, char has 8 bits, has encoding UTF-8, while wchar_t is 32 bits and has encoding UTF-32. So the printf implementation just converts from UTF-32 to UTF-8 and outputs the result.
A more general implementation must support different and configurable encodings and may need to inspect what's the current encoding. In this case functions like wcsnrtombs() or iconv() must be used.
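For the trivial case described above (wchar_t holding UTF-32, output in UTF-8), the conversion might be sketched as follows. The helper name utf8_encode is hypothetical, and the code assumes the input value is a Unicode code point, which is not guaranteed for wchar_t on every system.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (at least 4 bytes).
 * Returns the number of bytes written, or 0 for an invalid value.
 * utf8_encode is a hypothetical helper name. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                   /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                  /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                  /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {              /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                          /* beyond the Unicode range */
}
```

The bytes and the returned count can be passed straight to write(), which fits the restriction on allowed functions in the question.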

Behavior of extended bytes/characters in C/POSIX locale

C and POSIX both require only a very limited set of characters be present in the C/POSIX locale, but allow additional characters to exist. This leaves a great deal of freedom to the implementation; for instance, supporting all of Unicode (as UTF-8) in the C locale is conforming behavior. However, most historical implementations treat the C locale as having an "8-bit-clean" single-byte character encoding, either ISO-8859-1 (Latin-1) or a sort of "abstract 8-bit character set" where the non-ASCII bytes are abstract characters with no particular identity. (However, in the latter case, if the compiler defines __STDC_ISO_10646__, they normatively correspond to Unicode characters, usually the Latin-1 range.)
Another conforming option that seems much less popular is to treat all non-ASCII bytes as non-characters, i.e. respond to them with an EILSEQ error.
What I'm interested in knowing is whether there are implementations which take this or any other unusual options in implementing the C locale. Are there implementations where attempting to convert "high bytes" in the C locale results in EILSEQ or anything other than treating them as (abstract or Latin-1) single-byte characters or UTF-8?
From your comment to the previous answer:
The ways in which the assumption could be wrong are basically that bytes outside the portable character set could be illegal non-character bytes (EILSEQ) or make up some multibyte encoding (UTF-8 or a stateless legacy CJK encoding)
Here you can find one example.
Plan 9 only supports the "C" locale. As you can see in utf.c and rune.c, when it finds a rune outside the portable characters, it simply handles it as a character from a different encoding.
Other candidates could be Minix and the *BSD family (as far as they use citrus). In the Minix source code I've also found the file command looking for a new encoding when the character size is not 8 bits.
Amusingly, I just found that the most widely-used implementation, glibc, is an example of what I'm looking for. Consider this simple program:
#include <stdlib.h>
#include <stdio.h>
int main(void)
{
    wchar_t wc = 0;
    int n = mbtowc(&wc, "\x80", 1);
    /* %x expects an unsigned argument */
    printf("%d %.4x\n", n, (unsigned)wc);
}
On glibc, it prints -1 0000. If the byte 0x80 were an extended character in the implementation's C/POSIX locale, it would print 1 followed by some nonzero character number.
Thus, the "common knowledge" that the C/POSIX locale is "8-bit-clean" on glibc is simply false. What's going on is that there's a gross inconsistency; despite the fact that all the standard utilities, regular expression matching, etc. are specified to operate on (multibyte) characters as if read by mbrtowc, the implementations of these utilities/functions are taking a shortcut when they see MB_CUR_MAX==1 or LC_CTYPE containing "C" (or similar) and reading char values directly instead of processing input with mbrtowc or similar. This is leading to an inconsistency between the specified behavior (which, as their implementation of the C/POSIX locale is defined, would have to treat high bytes as illegal sequences) and the implementation behavior (which is bypassing the locale system entirely).
With all that said, I am still looking for other implementations with the properties requested in the question.
"What I'm interested in knowing is whether there are implementations which take this or any other unusual options in implementing the C locale."
This question is very difficult to answer because it mixes the "C Locale", which I'm assuming refers to the C Standard limited character set mentioned above, with "other unusual options", which I'm assuming refers to how the specific implementation handles characters outside the (limited) C locale. Every C Implementation must implement the C Locale; I don't think there are any unusual options surrounding that.
Let's assume for argument that the question is: "...unusual options in implementing additional/extended characters beyond the C locale." Now this becomes an implementation-dependent question, and as you have already mentioned, it "leaves a great deal of freedom to the implementation." So without knowing the target compiler/hardware, it would still be difficult to answer definitively.
Now the last part:
"...attempting to convert "high bytes" in the C locale results in EILSEQ or anything other than treating them as (abstract or Latin-1) single-byte characters or UTF-8?"
Instead of converting high bytes while in the C Locale, you might be able to set the Locale in your program as in this SO Question: Does the underlying character set depend only on the C implementation?
This way you can ensure that your characters will be treated in the Locale that you expect.
It is my understanding that the C Locale only concerns itself with the first 7-bits (of an 8-bit char type), based on the sources below:
http://www.cprogramming.com/tutorial/unicode.html
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
http://www.in-ulm.de/~mascheck/locale/
The terms "high bytes" and "Unicode" and "UTF-8" are in the class of multi-byte or wide-character encodings, and are very locale specific (and beyond the range of the minimal C Locale). I'm not clear on how it would be possible to "convert high bytes" in the (pure) C Locale. It's quite possible that implementations would pick a default (extended) locale if none was explicitly set (or pull it from the OS environment settings as stated in one of the links above).
The POSIX standard is quite clear in this regard.
The introduction to character sets in POSIX.1-2017 says:
6.2 Character Encoding
The POSIX locale shall contain 256 single-byte characters including the characters in Portable Character Set and Non-Portable Control Characters, which have the properties listed in LC_CTYPE. It is unspecified whether characters not listed in those two tables are classified as punct or cntrl, or neither. Other locales shall contain the characters in Portable Character Set and may contain any or all of the control characters identified in Non-Portable Control Characters; the presence, meaning, and representation of any additional characters are locale-specific.
(emphasis mine)
The page for mbtowc() says:
The mbtowc() function shall fail if:
[EILSEQ]
 An invalid character sequence is detected. In the POSIX locale an [EILSEQ] error cannot occur since all byte values are valid characters.
Note that the POSIX locale is defined to be identical to the C locale.
So if an operating system conforms to POSIX, mbtowc in the POSIX locale is a trivial conversion that maps each byte to one wide character and never fails with EILSEQ. Characters 128–255 are passed through just as characters 0–127 are. Implementations that operate differently are in violation of the standard.

Does wide character input/output in C always read from / write to the correct (system default) encoding?

I'm primarily interested in the Unix-like systems (e.g., portable POSIX) as it seems like Windows does strange things for wide characters.
Do the read and write wide character functions (like getwchar() and putwchar()) always "do the right thing", for example read from utf-8 and write to utf-8 when that is the set locale, or do I have to manually call wcrtomb() and print the string using e.g. fputs()? On my system (openSUSE 12.3) where $LANG is set to en_GB.UTF-8 they do seem to do the right thing (inspecting the output I see what looks like UTF-8 even though strings were stored using wchar_t and written using the wide character functions).
However I am unsure if this is guaranteed. For example cprogramming.com states that:
[wide characters] should not be used for output, since spurious zero
bytes and other low-ASCII characters with common meanings (such as '/'
and '\n') will likely be sprinkled throughout the data.
Which seems to indicate that outputting wide characters (presumably using the wide character output functions) can wreak havoc.
Since the C standard does not seem to mention coding at all I really have no idea who/when/how coding is applied when using wchar_t. So my question is basically if reading, writing and using wide characters exclusively is a proper thing to do when my application has no need to know about the encoding used. I only need string lengths and console widths (wcswidth()), so to me using wchar_t everywhere when dealing with text seems ideal.
The relevant text governing the behavior of the wide character stdio functions and their relationship to locale is from POSIX XSH 2.5.2 Stream Orientation and Encoding Rules:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_05_02
Basically, the wide character stdio functions always write in the encoding that's in effect (per the LC_CTYPE locale category) at the time the FILE stream becomes wide-oriented; this means the first time a wide stdio function is called on it, or fwide is used to set the orientation to wide. So as long as a proper LC_CTYPE locale is in effect matching the desired "system" encoding (e.g. UTF-8) when you start working with the stream, everything should be fine.
However, one important consideration you should not overlook is that you must not mix byte and wide oriented operations on the same FILE stream. Failure to observe this rule is not a reportable error; it simply results in undefined behavior. As a good deal of library code assumes stderr is byte oriented (and some even makes the same assumption about stdout), I would strongly discourage ever using wide-oriented functions on the standard streams. If you do, you need to be very careful about which library functions you use.
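A stream's orientation can be inspected with fwide(), which only queries when passed a mode of 0. The sketch below wraps that; the helper name stream_orientation is hypothetical.

```c
#include <stdio.h>
#include <wchar.h>

/* Report a stream's orientation without changing it:
 * 1 = wide, -1 = byte, 0 = not yet decided.
 * stream_orientation is a hypothetical helper name. */
int stream_orientation(FILE *fp)
{
    int mode = fwide(fp, 0);   /* mode 0 only queries */
    return (mode > 0) - (mode < 0);
}
```

Checking this before the first output call on a shared stream is one way to notice that some other code has already fixed the orientation.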
Really, I can't think of any reason at all to use wide-oriented functions. fprintf is perfectly capable of sending wide-character strings to byte-oriented FILE streams using the %ls specifier.
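A minimal sketch of that approach: the stream stays byte-oriented and the wide string is converted by the %ls conversion on the way out. The helper name print_wide is hypothetical.

```c
#include <stdio.h>
#include <wchar.h>

/* Print a wide string to a byte-oriented stream. The %ls conversion
 * turns it into the multibyte encoding of the current locale.
 * Returns fprintf's result: the number of bytes written, or a
 * negative value if a character cannot be converted.
 * print_wide is a hypothetical helper name. */
int print_wide(FILE *fp, const wchar_t *text)
{
    return fprintf(fp, "%ls\n", text);
}
```

In the default "C" locale only ASCII content can be relied on to convert, so a program would typically call setlocale(LC_CTYPE, "") first.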
So long as the locale is set correctly, there shouldn't be any issues processing UTF-8 files on a system using UTF-8, using the wide character functions. They'll be able to interpret things correctly, i.e. they'll treat a character as 1-4 bytes as necessary (in both input and output). You can test it out by something like this:
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
int main(void)
{
    setlocale(LC_CTYPE, "en_GB.UTF-8");
    // setlocale(LC_CTYPE, ""); // to use environment variable instead
    wchar_t *txt = L"£Δᗩ";
    // wcslen() returns size_t, so the matching conversion is %zu
    wprintf(L"The string %ls has %zu characters\n", txt, wcslen(txt));
}
$ gcc -o loc loc.c && ./loc
The string £Δᗩ has 3 characters
If you use the standard functions (in particular character functions) on multibyte strings carelessly, things will start to break, e.g. the equivalent:
char *txt = "£Δᗩ";
printf("The string %s has %zu characters\n", txt, strlen(txt));
$ gcc -o nloc nloc.c && ./nloc
The string £Δᗩ has 7 characters
The string still prints correctly here because it's essentially just a stream of bytes, and as the system is expecting UTF-8 sequences, they're translated perfectly. Of course strlen is reporting the number of bytes in the string, 7 (plus the \0), with no understanding that a character and a byte aren't equivalent.
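If only the character count is needed and the input is known to be UTF-8, the count can be taken directly from the bytes without any locale. A small sketch, with the hypothetical helper name utf8_length:

```c
#include <stddef.h>

/* Count code points in a UTF-8 string by skipping continuation
 * bytes, which always have the bit pattern 10xxxxxx. Assumes the
 * input is valid UTF-8. utf8_length is a hypothetical helper name. */
size_t utf8_length(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string £Δᗩ, whose UTF-8 form is the 7 bytes C2 A3 CE 94 E1 97 A9, this returns 3. It counts code points, not displayed characters, so combining sequences still count as more than one.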
In this respect, because of the compatibility between ASCII and UTF-8, you can often get away with treating UTF-8 files as simply multibyte C strings, as long as you're careful.
There's a degree of flexibility as well. It's possible to convert a standard C string (as a multibyte string) to a wide character string easily:
char *stdtxt = "ASCII and UTF-8 €£¢";
wchar_t buf[100];
mbstowcs(buf, stdtxt, sizeof buf / sizeof buf[0]);
wprintf(L"%ls has %zu wide characters\n", buf, wcslen(buf));
Output:
ASCII and UTF-8 €£¢ has 19 wide characters
Once you've used a wide character function on a stream, it's set to wide orientation. If you later want to use standard byte i/o functions, you'll need to re-open the stream first. This is probably why the recommendation is not to use it on stdout. However, if you only use wide character functions on stdin and stdout (including any code that you link to), you will not have any problems.
Don't use fputs with anything other than ASCII.
If you want to write, let's say, UTF-8, then use a function that returns the real size used by the UTF-8 string, and use fwrite to write that number of bytes, without worrying about a stray '\0' inside the string.

Resources