Do strcmp and strstr test binary equivalence? - c

https://learn.microsoft.com/en-us/windows/win32/intl/security-considerations--international-features
This webpage makes me wonder.
Apparently some windows api may consider two strings equal when they are actually different byte sequences.
I want to know how C standard library behaves in this respect.
in other words, does strcmp(a,b)==0 imply strlen(a)==strlen(b)&&memcmp(a,b,strlen(a))==0?
and what about other string functions, including wide character versions?
edit:
for example, CompareStringW equates L"\x00C5" and L"\x212B"
printf("%d\n",CompareStringW(LOCALE_INVARIANT,0,L"\x00C5",-1,L"\x212B",-1)==CSTR_EQUAL); outputs 1
what I'm asking is whether C library functions never behave like this

two strings using different encodings can be the same even if their byte representation are different.
standard library strcmp does compare plain "character" strings and in this case strcmp(a,b)==0 implies strlen(a)==strlen(b)&&memcmp(a,b,strlen(a))==0
Functions like wcscmp require both strings to be encoded the same way, so their byte representation should be the same.

The regular string functions operate byte-by-byte. The specification says:
The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.
strcmp() and memcmp() do the same comparisons. The only difference is that strcmp() uses the null terminators in the strings as the limit, memcmp() uses a parameter for this, and strncmp() takes a limit parameter and uses whichever comes first.
The wide string function specification says:
Unless explicitly stated otherwise, the functions described in this subclause order two wide characters the same way as two integers of the underlying integer type designated by wchar_t.
wcscmp() doesn't say otherwise, so it's also comparing the wide characters numerically, not by converting their encodings to some common character representations. wcscmp() is to wmemcmp() as strcmp() is to memcmp().
On the other hand, wcscoll() compares the strings as interpreted according to the LC_COLLATE category of the current locale. So this may not be equivalent to memcmp().
For other functions you should check the documentation to see whether they reference the locale.

Apparently some windows api may consider two strings equal when they are actually different byte sequences.
Depending on context and where you got those strings from, that would actually be the semantically correct behavor.
There are multiple ways to encode certain characters. The German 'ä', for example. In Unicode, this could be U+00E4 LATIN SMALL LETTER A WITH DAERHESIS, or it could be the sequence of U+0308 COMBINING DIAERESIS and U+0061 LATIN SMALL LETTER A. You could desire a comparison function that actually compares these equal. Or you could have them not compare equal, but have a standalone function that turns one representation into the other ("normalization").
You could want a comparison function that compares '6' (six) as equal to '๖' (also six, just in Thai). ("Canonicalization")
The byte string functions (strcmp() etc.) are not capable of any of that. They only deal in byte sequences, and are unaware of anything I wrote above.
As for the wide string functions (wcscmp() etc.), well... they are not that either, really.
in other words, does strcmp(a,b)==0 imply strlen(a)==strlen(b)&&memcmp(a,b,strlen(a))==0? and what about other string functions, including wide character versions?
Either will test for binary equivalence, as there are no mechanics in the C Standard Library to normalize or canonicalize strings.[1]
If you are actually dealing in processing strings (as opposed to just passing them through, for which C byte strings and wide strings are adequate), you should use the ICU library, the de facto standard for C/C++ Unicode handling. It looks daunting but actually needs to be to handle all these things correctly.
Basically, any C/C++ API that promises to do the same is either using the ICU library itself, or is very likely not doing what it advertises.
[1]: Actually, strcoll() / strxfrm() and wcscoll() / wcsxfrm() actually provide enough wiggle room to squeeze in proper Unicode mechanics for collation, but I don't know of an implementation that actually bothers to do so.

Related

How does C uppercase letters?

I see this code in glibc-2.33/ctype/ctype.c:
// [...]
#define __ctype_toupper \
((int32_t *) _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128)
// [...]
int
toupper (int c)
{
return c >= -128 && c < 256 ? __ctype_toupper[c] : c;
}
libc_hidden_def (toupper)
I understand that it's checking if c is within -128 and 256 (inclusive) and returns the character as-is if it's outside that range, but what does _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128) mean and where do I actually find the source code of how letters are uppercased? This seems to be looking up the current locale, I am only interested in en_US.UTF-8. Also, how can a character be negative?
I don't care about glibc specifically, I just want to know how all the ASCII characters (all as in from NUL to DEL) are uppercased in C.
"C" doesn't convert characters to upper case. The C standard only mandates that there be a function in the standard library which does so correctly according to the current locale, and that it does so in a particular way in the "C" locale (which is the only locale which is guaranteed to exist).
Library implementations are free to accomplish that task as the implementers see fit, and they all do it in different ways. Even radically different ways. Some C libraries don't support locales other than the "C" locale with an ASCII character set. An example of such a C library is musl and it is hard to beat the simplicity of its implementation:
int toupper(int c)
{
if (islower(c)) return c & 0x5f;
return c;
}
As you can see, the above code depends on islower. Here it is:
int islower(int c)
{
return (unsigned)c-'a' < 26;
}
Because of the call to islower, toupper returns unchanged any argument outside of the range of lower case characters, even arguments not in the valid range for toupper. Since the standard doesn't define the behaviour of toupper for arguments outside of the valid range (essentially values which might be returned by fgetc), just returning invalid arguments unchanged is certainly as acceptable as any other behaviour. Glibc's toupper function will often segfault on invalid arguments, since it uses the argument as an index into an array (as you can see in the code you cite). That behaviour is also acceptable according to the standard.
The Glibc implementation is a lot more complicated. And behind the scenes it depends on the locale data which is compiled from locale definition files, a process which is completely outside of the C standard and somewhat defined by the Posix standard (although the GNU implementation diverges in some way from Posix).
But here's the scoop: If you're using single byte characters in a UTF-8 locale, none of glibc's complicated code makes the slightest difference. The musl implementation works precisely as required in a UTF-8 locale, because the only alphabetic characters representable in a single byte UTF-8 representation are the 52 characters in the "Roman" alphabet. All the other Unicode characters are only representable in wide characters and multibyte sequences.
Furthermore, environments which use a single-byte encoding other than UTF-8 are increasingly rare. There are certainly a lot of us who had to learn this stuff because our programs ran on a variety of platforms which used different ISO-8859-x code pages. Or different single-byte Windows codepages. But in the end, Unicode won out. (And many of us breathed huge sighs of relief.) So most of this apparatus is no longer really necessary except in legacy environments.
But that's not to say that Unicode magically solves all the complications involved in managing the huge variety of alphabets in use in the world. Far from it. What Unicode does do is two-fold: it clarifies what the complications are (most of which is not captured by C/Posix locales), and it provides some basic standards for implementations.
And, as a side effect, UTF-8 standardises single-byte codes to basically conform with the original ASCII 7-bit standard. So if you're only dealing with 7-bit characters (which, these days, is probably less than ideal), you don't need anything beyond musl-style implementations. And if you are dealing with "all the world's character sets", you'll be looking for a library which actually conforms to Unicode, and which uses something other than char to represent characters.
But one complication is going to remain forever, sadly: the fact that C does not standardise the signedness of char. On platforms on which char is signed (Unix X86 and Windows, for two major examples),
(char)0xA0 is (a) unspecified and (b) probably -96, which is what a single-byte 0xA0 represents in 2's complement. So if you write code which uses the various functions in ctype.h and don't take care of negative char values, and then you try to use that code with a UTF-8 encoded string which includes characters outside of the single-byte domain, then you will end up passing negative numbers to functions which might not be expecting them.
If you go back at the root and look for _NL_CTYPE_TOUPPER you will find a commit where it is written
[..] (ctype_output): Support for alternate locale format: Computation of
nelems changes. _NL_CTYPE_TOUPPER32 [...]
So basically _NL_CTYPE_TOUPPER is the macro for _NL_CTYPE_TOUPPER(8bits) as for example in French you have À as uppercase version of à
Following this link you will find the header file langinfo.h that has this enum starting at line 43 and with _NL_CTYPE_TOUPPER defined at line 259.
LC_CTYPE category: character classification.
256 This information is accessed by the functions in <ctype.h>.
LC_CTYPE is defined for each language, see for example for French:
fr_FR:2000"
Note that it doesn't make a lot of sense to call this function since characters with accent are not contained in the ASCII table, but since this function is the one handling both utf8 and ascii that's how it works.

What is the difference between sqlite3_bind_text, sqlite3_bind_text16 and sqlite3_bind_text64?

I am using sqlite3 C interface. After reading document at https://www.sqlite.org/c3ref/bind_blob.html , I am totally confused.
What is the difference between sqlite3_bind_text, sqlite3_bind_text16 and sqlite3_bind_text64?
The document only describe that sqlite3_bind_text64 can accept encoding parameter including SQLITE_UTF8, SQLITE_UTF16, SQLITE_UTF16BE, or SQLITE_UTF16LE.
So I guess, based on the parameters pass to these functions, that:
sqlite3_bind_text is for ANSI characters, char *
sqlite3_bind_text16 is for UTF-16 characters,
sqlite3_bind_text64 is for various encoding mentioned above.
Is that correct?
One more question:
The document said "If the fourth parameter to sqlite3_bind_text() or sqlite3_bind_text16() is negative, then the length of the string is the number of bytes up to the first zero terminator." But it does not said what will happen for sqlite3_bind_text64. Originally I thought this is a typo. However, when I pass -1 as the fourth parameter to sqlite3_bind_text64, I will always get SQLITE_TOOBIG error, that makes me think they remove sqlite3_bind_text64 from the above statement by purpose. Is that correct?
Thanks
sqlite3_bind_text() is for UTF-8 strings.
sqlite3_bind_text16() is for UTF-16 strings using your processor's native endianness.
sqlite3_bind_text64() lets you specify a particular encoding (utf-8, native utf-16, or a particular endian utf-16). You'll probably never need it.
sqlite3_bind_blob() should be used for non-Unicode strings that are just treated as binary blobs; all sqlite string functions work only with Unicode.

What is a "wide character string" in C language?

I came across this in the book:
wscanf(L"%lf", &variable);
where the first parameter is of type of wchar_t *.
This s different from scanf("%lf", &variable); where the first parameter is of type char *.
So what is the difference than. I have never heard "wide character string" before. I have heard something called Raw String Literals which is printing the string as it is (no need for things like escape sequences) but that was not in C.
The exact nature of wide characters is (purposefully) left implementation defined.
When they first invented the concept of wchar_t, ISO 10646 and Unicode were still competing with each other (whereas they now, mostly cooperate). Rather than try to decree that an international character would be one or the other (or possibly something else entirely) they simply provided a type (and some functions) that the implementation could define to support international character sets as they chose.
Different implementations have exercised that potential for variation. For example, if you use Microsoft's compiler on Windows, wchar_t will be a 16-bit type holding UTF-16 Unicode (originally it held UCS-2 Unicode, but that's now officially obsolete).
On Linux, wchar_t will more often be a 32-bit type, holding UCS-4/UTF-32 encoded Unicode. Ports of gcc to at least some other operating systems do the same, though I've never tried to confirm that it's always the case.
There is, however, no guarantee of that. At least in theory an implementation on Linux could use 16 bits, or one on Windows could use 32 bits, or either one could decide to use 64 bits (though I'd be a little surprised to see that in reality).
In any case, the general idea of how things are intended to work, is that a single wchar_t is sufficient to represent a code point. For I/O, the data is intended to be converted from the external representation (whatever it is) into wchar_ts, which (is supposed to) make them relatively easy to manipulate. Then during output, they again get transformed into the encoding of your choice (which may be entirely different from the encoding you read).
"Wide character string" is referring to the encoding of the characters in the string.
From Wikipedia:
A wide character is a computer character datatype that generally has a
size greater than the traditional 8-bit character. The increased
datatype size allows for the use of larger coded character sets.
UTF-16 is one of the most commonly used wide character encodings.
Further, wchar_t is defined by Microsoft as an unsigned short(16-bit) data object. This could be and is most likely a different definition in other operating systems or languages.
Taken from the Wikipedia article from the comment below:
"The width of wchar_t is compiler-specific and can be as small as 8
bits. Consequently, programs that need to be portable across any C or
C++ compiler should not use wchar_t for storing Unicode text. The
wchar_t type is intended for storing compiler-defined wide characters,
which may be Unicode characters in some compilers."

isLetter with accented characters in C

I'd like to create (or find) a C function to check if a char c is a letter...
I can do this for a-z and A-Z easily of course.
However i get an error if testing c == á,ã,ô,ç,ë, etc
Probably those special characters are stored in more then a char...
I'd like to know:
How these special characters are stored, which arguments my function needs to receive, and how to do it?
I'd also like to know if are there any standard function that already does this.
I think you're looking for the iswalpha() routine:
#include <wctype.h>
int iswalpha(wint_t wc);
DESCRIPTION
The iswalpha() function is the wide-character equivalent of
the isalpha(3) function. It tests whether wc is a wide
character belonging to the wide-character class "alpha".
It does depend upon the LC_CTYPE of the current locale(7), so its use in a program that is supposed to handle multiple types of input correctly simultaneously might not be ideal.
If you are working with single-byte codesets such as ISO 8859-1 or 8859-15 (or any of the other 8859-x codesets), then the isalpha() function will do the job if you also remember to use setlocale(LC_ALL, ""); (or some other suitable invocation of setlocale()) in your program. Without this, the program runs in the C locale, which only classifies the ASCII characters (8859-x characters in the range 0x00..0x7F).
If you are working with multibyte or wide character codesets (such as UTF8 or UTF16), then you need to look to the wide character functions found in <wchar.h> and <wctype.h>.
How these characters are stored is locale-dependent. On most UNIX systems, they'll be stored as UTF8, whereas a Win32 machine will likely represent them as UTF16. UTF8 is stored as a variable-amount of chars, whereas UTF16 is stored using surrogate pairs - and thus inside a wchar_t (or unsigned short) (though incidentally, sizeof(wchar_t) on Windows is only 2 (vs 4 on *nix), and thus you'll often need 2 wchar_t types to store the 1 character if a surrogate pair encoding is used - which it will be in many cases).
As was mentioned, the iswalpha() routine will do this for you, and is documented here. It should take care of locale-specific issues for you.
You probably want http://site.icu-project.org/. It provides a portable library with APIs for this.

C - isgraph() function

Does anyone know how the isgraph() function works in C? I understand its use and results, but the code behind it is what I'm interested in.
For example, does it look at only the char value of it and compare it to the ASCII table? Or does it actually check to see if it can be displayed? If so, how?
The code behind the isgraph() function varies by platform (or, more precisely, by implementation). One common technique is to use an initialized array of bit-fields, one per character in the (single-byte) codeset plus EOF (which has to be accepted by the functions), and then selecting the relevant bit. This allows for a simple implementation as a macro which is safe (only evaluates its argument once) and as a simple (possibly inline) function.
#define isgraph(x) (__charmap[(x)+1]&__PRINT)
where __charmap and __PRINT are names reserved for the implementation. The +1 part deals with the common situation where EOF is -1.
According to the C standard (ISO/IEC 9899:1999):
§7.4.1.6 The isgraph function
Synopsis
#include <ctype.h>
int isgraph(int c);
Description
The isgraph function tests for any printing character except space (' ').
And:
§7.4 Character handling <ctype.h>
¶1 The header declares several functions useful for classifying and mapping
characters.166) In all cases the argument is an int, the value of which shall be
representable as an unsigned char or shall equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.
¶2 The behavior of these functions is affected by the current locale. Those functions that
have locale-specific aspects only when not in the "C" locale are noted below.
¶3 The term printing character refers to a member of a locale-specific set of characters, each
of which occupies one printing position on a display device; the term control character
refers to a member of a locale-specific set of characters that are not printing
characters.167) All letters and digits are printing characters.
166) See ‘‘future library directions’’ (7.26.2).
167) In an implementation that uses the seven-bit US ASCII character set, the printing characters are those
whose values lie from 0x20 (space) through 0x7E (tilde); the control characters are those whose
values lie from 0 (NUL) through 0x1F (US), and the character 0x7F (DEL).
It's called isgraph, not isGraph (and char, not Char), and the POSIX Programmer's Manual says
The isgraph() function shall test
whether c is a character of class
graph in the program's current locale;
see the Base Definitions volume of
IEEE Std 1003.1-2001,
Chapter 7, Locale.
So yes, it looks it up in a table (or equivalent code). It can't check whether it can actually be displayed, since that would vary depending upon the output device, many of which can display chars in addition to those for which isgraph returns true.
isgraph checks for "printable" characters, but the definition of "printable" can vary depending on your locale. Your locale may use characters that aren't in the ASCII table. Internally, it's most likely either a table lookup, a range-based test ((x >= 'a') && (x <= 'z'), etc), or a combination of both. Different implementations may do it slightly differently.
The isgraph() macro only looks at the ASCII table, or your location/country/providence/planet/galaxy's version of the ASCII table.
Here's a test code Counting Words, which found you can increase performance by writing your own version, which initializes a bool array[256] using isgraph(). There are benchmark results with the code.
Since bool variables/arrays are actually BYTEs, not bits, you can do even better, in terms of memory efficiency, if you use a bit array, and test that. It happily takes up only 32 bytes. That's almost certainly going to get cashed on any general-purpose modern processor.
Importantly, if you want a slightly different test than the standard ones provided here (see graphic depiction of character tests), you are free to change the initialization provided by the standard test to include your own exceptions.

Resources