C - isgraph() function

Does anyone know how the isgraph() function works in C? I understand its use and results, but the code behind it is what I'm interested in.
For example, does it just take the char value and compare it to the ASCII table? Or does it actually check to see if it can be displayed? If so, how?

The code behind the isgraph() function varies by platform (or, more precisely, by implementation). One common technique uses an initialized array of bit-fields, one per character in the (single-byte) codeset plus EOF (which the functions must accept), and selects the relevant bit. This allows a simple implementation as a macro which is safe (it only evaluates its argument once) or as a simple (possibly inline) function:
#define isgraph(x) (__charmap[(x)+1]&__PRINT)
where __charmap and __PRINT are names reserved for the implementation. The +1 part deals with the common situation where EOF is -1.
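As a concrete illustration, here is a self-contained sketch of that technique; the names (my_charmap, MY_PRINT) are made up, since the real reserved names belong to the implementation, and the table only covers 7-bit ASCII:
#include <stdio.h>

#define MY_PRINT 0x01  /* hypothetical "printing, not space" classification bit */

/* one entry per character 0..127, plus one slot at index 0 for EOF (-1) */
static unsigned char my_charmap[1 + 128];

static void init_charmap(void)
{
    /* in ASCII, the graphic characters run from '!' (0x21) to '~' (0x7E) */
    for (int c = 0x21; c <= 0x7E; c++)
        my_charmap[c + 1] |= MY_PRINT;
}

/* same shape as the reserved-name macro above */
#define my_isgraph(x) (my_charmap[(x) + 1] & MY_PRINT)

int main(void)
{
    init_charmap();
    printf("%d %d %d\n", !!my_isgraph('A'), !!my_isgraph(' '), !!my_isgraph(EOF));
    return 0;  /* prints "1 0 0" */
}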
According to the C standard (ISO/IEC 9899:1999):
§7.4.1.6 The isgraph function
Synopsis
#include <ctype.h>
int isgraph(int c);
Description
The isgraph function tests for any printing character except space (' ').
And:
§7.4 Character handling <ctype.h>
¶1 The header declares several functions useful for classifying and mapping
characters.166) In all cases the argument is an int, the value of which shall be
representable as an unsigned char or shall equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.
¶2 The behavior of these functions is affected by the current locale. Those functions that
have locale-specific aspects only when not in the "C" locale are noted below.
¶3 The term printing character refers to a member of a locale-specific set of characters, each
of which occupies one printing position on a display device; the term control character
refers to a member of a locale-specific set of characters that are not printing
characters.167) All letters and digits are printing characters.
166) See ‘‘future library directions’’ (7.26.2).
167) In an implementation that uses the seven-bit US ASCII character set, the printing characters are those
whose values lie from 0x20 (space) through 0x7E (tilde); the control characters are those whose
values lie from 0 (NUL) through 0x1F (US), and the character 0x7F (DEL).

It's called isgraph, not isGraph (and char, not Char), and the POSIX Programmer's Manual says:
The isgraph() function shall test whether c is a character of class graph in the program's current locale; see the Base Definitions volume of IEEE Std 1003.1-2001, Chapter 7, Locale.
So yes, it looks it up in a table (or equivalent code). It can't check whether it can actually be displayed, since that would vary depending upon the output device, many of which can display chars in addition to those for which isgraph returns true.

isgraph checks for "printable" characters (excluding space), but the definition of "printable" can vary with your locale, and your locale may use characters that aren't in the ASCII table. Internally, it's most likely either a table lookup, a range-based test ((x >= 'a') && (x <= 'z'), etc.), or a combination of both; different implementations may do it slightly differently.
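For instance, a purely range-based version for plain 7-bit ASCII could look like the sketch below (illustration only; a real implementation must also accept EOF and respect the locale):
/* ASCII-only sketch: the graphic characters are '!' (0x21) through '~' (0x7E) */
int my_isgraph_ascii(int c)
{
    return c >= 0x21 && c <= 0x7E;
}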

The isgraph() macro only looks at the ASCII table, or your location/country/province/planet/galaxy's version of the ASCII table.
Here's some test code, Counting Words, which found that you can increase performance by writing your own version that initializes a bool array[256] using isgraph(). Benchmark results are included with the code.
Since bool variables/arrays are actually bytes, not bits, you can do even better in terms of memory efficiency by using a bit array and testing that. It takes up only 32 bytes, which is almost certainly going to stay cached on any general-purpose modern processor.
Importantly, if you want a slightly different test than the standard ones provided (see the graphic depiction of character tests), you are free to change the initialization provided by the standard test to include your own exceptions.
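Here is a sketch of that 32-byte bit array, built once from the standard isgraph(); the names are invented for illustration, and the initialization loop is where you would add your own exceptions:
#include <ctype.h>
#include <stdio.h>

static unsigned char graph_bits[256 / 8];  /* one bit per character value: 32 bytes */

static void init_graph_bits(void)
{
    for (int c = 0; c < 256; c++)
        if (isgraph(c))                    /* 0..255 are valid unsigned char values */
            graph_bits[c / 8] |= (unsigned char)(1u << (c % 8));
}

static int fast_isgraph(unsigned char c)
{
    return graph_bits[c / 8] & (1u << (c % 8));
}

int main(void)
{
    init_graph_bits();
    printf("%d %d\n", !!fast_isgraph('#'), !!fast_isgraph(' '));  /* prints "1 0" */
    return 0;
}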

Related

Do strcmp and strstr test binary equivalence?

https://learn.microsoft.com/en-us/windows/win32/intl/security-considerations--international-features
This webpage makes me wonder.
Apparently some Windows APIs may consider two strings equal when they are actually different byte sequences.
I want to know how C standard library behaves in this respect.
in other words, does strcmp(a,b)==0 imply strlen(a)==strlen(b)&&memcmp(a,b,strlen(a))==0?
and what about other string functions, including wide character versions?
edit:
for example, CompareStringW equates L"\x00C5" and L"\x212B"
printf("%d\n",CompareStringW(LOCALE_INVARIANT,0,L"\x00C5",-1,L"\x212B",-1)==CSTR_EQUAL); outputs 1
what I'm asking is whether C library functions never behave like this
Two strings using different encodings can be the same even if their byte representations are different.
The standard library strcmp does compare plain "character" strings, and in this case strcmp(a,b)==0 implies strlen(a)==strlen(b)&&memcmp(a,b,strlen(a))==0.
Functions like wcscmp require both strings to be encoded the same way, so their byte representations should be the same.
The regular string functions operate byte-by-byte. The specification says:
The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.
strcmp() and memcmp() perform the same comparison. The only difference is that strcmp() stops at the strings' null terminators, memcmp() stops at a length passed as a parameter, and strncmp() takes a length parameter as well and stops at whichever limit comes first.
The wide string function specification says:
Unless explicitly stated otherwise, the functions described in this subclause order two wide characters the same way as two integers of the underlying integer type designated by wchar_t.
wcscmp() doesn't say otherwise, so it's also comparing the wide characters numerically, not by converting their encodings to some common character representations. wcscmp() is to wmemcmp() as strcmp() is to memcmp().
On the other hand, wcscoll() compares the strings as interpreted according to the LC_COLLATE category of the current locale. So this may not be equivalent to memcmp().
For other functions you should check the documentation to see whether they reference the locale.
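In other words, for the byte string functions the implication in the question does hold. A quick self-check (purely illustrative):
#include <assert.h>
#include <string.h>

int main(void)
{
    const char *a = "caf\xC3\xA9";  /* "café" in UTF-8 */
    const char *b = "caf\xC3\xA9";
    if (strcmp(a, b) == 0) {
        /* strcmp compared every byte up to the common terminator, so: */
        assert(strlen(a) == strlen(b));
        assert(memcmp(a, b, strlen(a)) == 0);
    }
    return 0;
}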
Apparently some windows api may consider two strings equal when they are actually different byte sequences.
Depending on context and where you got those strings from, that would actually be the semantically correct behavior.
There are multiple ways to encode certain characters. Take the German 'ä', for example. In Unicode, this could be U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, or it could be the sequence U+0061 LATIN SMALL LETTER A followed by U+0308 COMBINING DIAERESIS. You could desire a comparison function that compares these as equal. Or you could have them compare unequal, but provide a standalone function that turns one representation into the other ("normalization").
You could want a comparison function that compares '6' (six) as equal to '๖' (also six, just in Thai). ("Canonicalization")
The byte string functions (strcmp() etc.) are not capable of any of that. They only deal in byte sequences, and are unaware of anything I wrote above.
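For example, the two encodings of 'ä' mentioned above are different byte sequences in UTF-8, and strcmp() reports them as unequal (sketch; the escapes are the UTF-8 byte values):
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *precomposed = "\xC3\xA4";  /* U+00E4 as UTF-8 */
    const char *decomposed  = "a\xCC\x88"; /* U+0061 then U+0308 as UTF-8 */
    /* both render as 'ä', but strcmp sees different byte sequences */
    printf("%d\n", strcmp(precomposed, decomposed) == 0);  /* prints 0 */
    return 0;
}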
As for the wide string functions (wcscmp() etc.), well... they are not that either, really.
in other words, does strcmp(a,b)==0 imply strlen(a)==strlen(b)&&memcmp(a,b,strlen(a))==0? and what about other string functions, including wide character versions?
Either will test for binary equivalence, as there are no mechanics in the C Standard Library to normalize or canonicalize strings.[1]
If you are actually processing strings (as opposed to just passing them through, for which C byte strings and wide strings are adequate), you should use the ICU library, the de facto standard for C/C++ Unicode handling. It looks daunting, but it needs to be in order to handle all these things correctly.
Basically, any C/C++ API that promises to do the same is either using the ICU library itself, or is very likely not doing what it advertises.
[1]: Actually, strcoll() / strxfrm() and wcscoll() / wcsxfrm() provide enough wiggle room to squeeze in proper Unicode collation mechanics, but I don't know of an implementation that actually bothers to do so.

How does C uppercase letters?

I see this code in glibc-2.33/ctype/ctype.c:
// [...]
#define __ctype_toupper \
  ((int32_t *) _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128)
// [...]
int
toupper (int c)
{
  return c >= -128 && c < 256 ? __ctype_toupper[c] : c;
}
libc_hidden_def (toupper)
I understand that it's checking whether c is between -128 and 255 (inclusive) and returns the character as-is if it's outside that range, but what does _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128 mean, and where do I actually find the source code of how letters are uppercased? This seems to be looking up the current locale; I am only interested in en_US.UTF-8. Also, how can a character be negative?
I don't care about glibc specifically, I just want to know how all the ASCII characters (all as in from NUL to DEL) are uppercased in C.
"C" doesn't convert characters to upper case. The C standard only mandates that there be a function in the standard library which does so correctly according to the current locale, and that it does so in a particular way in the "C" locale (which is the only locale which is guaranteed to exist).
Library implementations are free to accomplish that task as the implementers see fit, and they all do it in different ways. Even radically different ways. Some C libraries don't support locales other than the "C" locale with an ASCII character set. An example of such a C library is musl and it is hard to beat the simplicity of its implementation:
int toupper(int c)
{
    if (islower(c)) return c & 0x5f;  /* clearing bit 5 maps 'a' (0x61) to 'A' (0x41) */
    return c;
}
As you can see, the above code depends on islower. Here it is:
int islower(int c)
{
    return (unsigned)c-'a' < 26;  /* 'a'..'z' map to 0..25; anything below 'a' wraps to a huge unsigned value */
}
Because of the call to islower, toupper returns unchanged any argument outside of the range of lower case characters, even arguments not in the valid range for toupper. Since the standard doesn't define the behaviour of toupper for arguments outside of the valid range (essentially values which might be returned by fgetc), just returning invalid arguments unchanged is certainly as acceptable as any other behaviour. Glibc's toupper function will often segfault on invalid arguments, since it uses the argument as an index into an array (as you can see in the code you cite). That behaviour is also acceptable according to the standard.
The Glibc implementation is a lot more complicated. And behind the scenes it depends on the locale data which is compiled from locale definition files, a process which is completely outside of the C standard and somewhat defined by the Posix standard (although the GNU implementation diverges in some way from Posix).
But here's the scoop: If you're using single byte characters in a UTF-8 locale, none of glibc's complicated code makes the slightest difference. The musl implementation works precisely as required in a UTF-8 locale, because the only alphabetic characters representable in a single byte UTF-8 representation are the 52 characters in the "Roman" alphabet. All the other Unicode characters are only representable in wide characters and multibyte sequences.
Furthermore, environments which use a single-byte encoding other than UTF-8 are increasingly rare. There are certainly a lot of us who had to learn this stuff because our programs ran on a variety of platforms which used different ISO-8859-x code pages. Or different single-byte Windows codepages. But in the end, Unicode won out. (And many of us breathed huge sighs of relief.) So most of this apparatus is no longer really necessary except in legacy environments.
But that's not to say that Unicode magically solves all the complications involved in managing the huge variety of alphabets in use in the world. Far from it. What Unicode does do is two-fold: it clarifies what the complications are (most of which is not captured by C/Posix locales), and it provides some basic standards for implementations.
And, as a side effect, UTF-8 standardises single-byte codes to basically conform with the original ASCII 7-bit standard. So if you're only dealing with 7-bit characters (which, these days, is probably less than ideal), you don't need anything beyond musl-style implementations. And if you are dealing with "all the world's character sets", you'll be looking for a library which actually conforms to Unicode, and which uses something other than char to represent characters.
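For those non-ASCII cases, the standard-C route is the wide-character API. A sketch (assuming an en_US.UTF-8 locale is installed, which is system-dependent):
#include <locale.h>
#include <stdio.h>
#include <wctype.h>

int main(void)
{
    if (setlocale(LC_ALL, "en_US.UTF-8") == NULL)
        return 1;  /* the locale is not available on this system */
    wchar_t lower = L'\u00E4';                         /* ä */
    wchar_t upper = (wchar_t)towupper((wint_t)lower);  /* Ä, i.e. U+00C4 */
    printf("%04x -> %04x\n", (unsigned)lower, (unsigned)upper);  /* 00e4 -> 00c4 */
    return 0;
}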
But one complication is going to remain forever, sadly: the fact that C does not standardise the signedness of char. On platforms on which char is signed (Unix x86 and Windows, for two major examples), (char)0xA0 is (a) implementation-defined and (b) probably -96, which is what the single byte 0xA0 represents in 2's complement. So if you write code which uses the various functions in ctype.h without taking care of negative char values, and then use that code on a UTF-8 encoded string which includes characters outside of the single-byte range, you will end up passing negative numbers to functions which might not be expecting them.
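A sketch of the pitfall and of the conventional fix (the UTF-8 string is just an example):
#include <ctype.h>
#include <stdio.h>

int main(void)
{
    const char *s = "caf\xC3\xA9";  /* "café" in UTF-8 */
    for (const char *p = s; *p; p++) {
        /* toupper(*p) would receive -61 and -87 here if char is signed,
           which is undefined behavior; the cast keeps values in 0..255 */
        putchar(toupper((unsigned char)*p));
    }
    putchar('\n');  /* in the C locale this prints CAF plus the two bytes unchanged */
    return 0;
}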
If you go back to the root and look for _NL_CTYPE_TOUPPER, you will find a commit where it is written:
[..] (ctype_output): Support for alternate locale format: Computation of nelems changes. _NL_CTYPE_TOUPPER32 [...]
So basically _NL_CTYPE_TOUPPER is the macro for the 8-bit table (as opposed to _NL_CTYPE_TOUPPER32), since, for example, in French you have À as the uppercase version of à.
Following this link you will find the header file langinfo.h that has this enum starting at line 43 and with _NL_CTYPE_TOUPPER defined at line 259.
LC_CTYPE category: character classification.
This information is accessed by the functions in <ctype.h>.
LC_CTYPE is defined for each language, see for example for French:
fr_FR:2000
Note that it doesn't make a lot of sense to call this function for accented characters if you only care about ASCII, since accented characters are not contained in the ASCII table; but since this function is the one handling both UTF-8 and ASCII, that's how it works.

How does C language transform char literal to number and vice versa

I've been diving into C/low-level programming/system design recently. As a seasoned Java developer I still remember my attempts to pass the SUN Java Certification and its questions about whether the char type in Java can be cast to Integer and how that can be done. What I know and remember is that numbers up to 255 can be treated both as numbers and as characters, depending on the cast.
Getting to know C, I want to know more, but I find it hard to find a proper answer (when googling I usually get a gazillion results on just how to convert char to int in code) to how EXACTLY it works that the C compiler/system calls transform a number to a character and vice versa.
AFAIK, numbers are what is stored in memory. So let's assume a memory cell stores the value 65 (which is the letter 'A'). The C code then reads it and stores it in a char variable. So far so good. Then we call printf with %c formatting for that char parameter.
And here is where the magic happens: HOW EXACTLY does printf know that the character with value 65 is the letter 'A' (and should display it as a letter)? It is a basic character from the raw ASCII range (not some funny emoji-style UTF character). Does it call external STD libraries/system calls to consult an encoding system? I would love a nitty-gritty, low-level explanation, or at least a link to a trusted source.
The C language is largely agnostic about the actual encoding of characters. It has a source character set which defines how the compiler treats characters in the source code. So, for instance on an old IBM system the source character set might be EBCDIC where 65 does not represent 'A'.
C also has an execution character set which defines the meaning of characters in the running program. This is the one that seems more pertinent to your question. But it doesn't really affect the behavior of I/O functions like printf. Instead it affects the results of ctype.h functions like isalpha and toupper. printf just treats it as a char sized value which it receives as an int due to variadic functions using default argument promotions (any type smaller than int is promoted to int, and float is promoted to double). printf then shuffles off the same value to the stdout file and then it's somebody else's problem.
If the source character set and execution character set are different, then the compiler will perform the appropriate conversion so the source token 'A' will be manipulated in the running program as the corresponding A from the execution character set. The choice of actual encoding for the two character sets, ie. whether it's ASCII or EBCDIC or something else is implementation defined.
With a console application it is the console or terminal which receives the character value that has to look it up in a font's glyph table to display the correct image of the character.
Character constants are of type int. Except for the fact that it is implementation defined whether char is signed or unsigned, a char can mostly be treated as a narrow integer. The only conversion needed between the two is narrowing or widening (and possibly sign extension).
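A small demonstration of that equivalence:
#include <stdio.h>

int main(void)
{
    char c = 65;   /* identical to char c = 'A'; */
    int  i = 'A';  /* character constants already have type int */
    printf("%c %d\n", c, i);  /* prints "A 65" */
    printf("%c %d\n", i, c);  /* also "A 65": only the format directs the interpretation */
    return 0;
}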
"HOW EXACTLY printf knows that character with value 65 is letter 'A' (and should display it as a letter)."
It usually doesn't, and it does not even need to. Even the compiler does not see characters ', A and ' in the C language fragment
char a = 'A';
printf("%c", a);
If the source and execution character sets are both ASCII or ASCII-compatible, as is usually the case nowadays, the compiler will have among the stream of bytes the triplet 39, 65, 39 - or rather 00100111 01000001 00100111. And its parser has been programmed with a rule that something between two 00100111s is a character literal, and since 01000001 is not a magic value it is translated as is to the final program.
The C program, at runtime, then handles 01000001 all the time (though from time to time it might be 01000001 zero-extended to an int, e.g. 00000000 00000000 00000000 01000001 on 32-bit systems; adding leading zeroes does not change its numerical value). On some systems, printf - or rather the underlying internal file routines - might translate the character value 01000001 to something else. But on most systems, 01000001 will be passed to the operating system as is. Then the operating system - or possibly a GUI program receiving the output from the operating system - will want to display that character, so the display font is consulted for the glyph that corresponds to 01000001, and usually the glyph for letter 01000001 looks something like
A
And that will be displayed to the user.
At no point does the system really operate with glyphs or characters but just binary numbers. The system in itself is a Chinese room.
The real magic of printf is not how it handles characters, but how it handles numbers, as these are converted to more characters. While %c passes values as-is, %d will convert a simple integer value such as 0b101111000110000101001110 to the stream of bytes 0b00110001 0b00110010 0b00110011 0b00110100 0b00110101 0b00110110 0b00110111 0b00111000 so that the display routine will correctly display it as
12345678
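That conversion is just repeated division by ten. A minimal sketch of what a %d-style conversion does for a non-negative value (not glibc's actual code):
#include <stdio.h>

/* write the decimal digits of v into buf, return a pointer to the first digit */
static char *to_decimal(unsigned v, char buf[16])
{
    char *p = buf + 15;
    *p = '\0';
    do {
        *--p = (char)('0' + v % 10);  /* '0' is 0x30, so digit d becomes byte 0x30 + d */
        v /= 10;
    } while (v != 0);
    return p;
}

int main(void)
{
    char buf[16];
    puts(to_decimal(12345678u, buf));  /* prints 12345678 */
    return 0;
}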
char in C is just an integer CHAR_BIT bits long. Usually it is 8 bits long.
HOW EXACTLY printf knows that character with value 65 is letter 'A'
The implementation knows what character encoding it uses, and the printf function code takes the appropriate action to output the letter 'A'.

Is there a limit on the number of values that can be printed by a single call of printf?

Does the number of values printed by printf depend on the memory allocated for a specific program or it can keep on printing the values?
The C Standard documents the minimum number of arguments that a compiler should accept for a function call:
C11 5.2.4.1 Translation limits
The implementation shall be able to translate and execute at least one program that contains at least one instance of every one of the following limits:
...
127 arguments in one function call
...
Therefore, you should be able to pass at least 126 values to printf after the initial format string, assuming the format string is properly constructed and consistent with the actual arguments that follow.
If the format string is a string literal, the standard guarantees that the compiler can handle string literals at least 4095 bytes long, and source lines at least 4095 characters long. You can use string concatenation to split the literal on multiple source lines. If you use a char array for the format string, no such limitation exists.
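For example, adjacent string literals are concatenated during translation, so a long format string can be split across source lines without any runtime cost:
#include <stdio.h>

int main(void)
{
    /* the two adjacent literals become a single string literal before compilation */
    printf("first half of a long format string, "
           "second half of the same string: %d\n", 42);
    return 0;
}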
The only environmental limit documented for the printf family of functions is this:
The number of characters that can be produced by any single conversion shall be at least 4095
This makes the behavior of a format like %10000d at best defined by the implementation, since the standard does not mandate anything beyond that limit.
A compliant compiler/library combination should therefore accept at least 126 values for printf, whether your environment allows even more arguments may be defined by the implementation and documented as such, but is not guaranteed by the standard.

Behavior of extended bytes/characters in C/POSIX locale

C and POSIX both require only a very limited set of characters be present in the C/POSIX locale, but allow additional characters to exist. This leaves a great deal of freedom to the implementation; for instance, supporting all of Unicode (as UTF-8) in the C locale is conforming behavior. However, most historical implementations treat the C locale as having an "8-bit-clean" single-byte character encoding, either ISO-8859-1 (Latin-1) or a sort of "abstract 8-bit character set" where the non-ASCII bytes are abstract characters with no particular identity. (However, in the latter case, if the compiler defines __STDC_ISO_10646__, they normatively correspond to Unicode characters, usually the Latin-1 range.)
Another conforming option that seems much less popular is to treat all non-ASCII bytes as non-characters, i.e. respond to them with an EILSEQ error.
What I'm interested in knowing is whether there are implementations which take this or any other unusual options in implementing the C locale. Are there implementations where attempting to convert "high bytes" in the C locale results in EILSEQ or anything other than treating them as (abstract or Latin-1) single-byte characters or UTF-8?
From your comment to the previous answer:
The ways in which the assumption could be wrong are basically that bytes outside the portable character set could be illegal non-character bytes (EILSEQ) or make up some multibyte encoding (UTF-8 or a stateless legacy CJK encoding)
Here you can find one example.
Plan 9 only supports the "C" locale. As you can see in utf.c and rune.c, when it finds a rune outside the portable character set, it simply handles it as a character from a different encoding.
Other candidates could be Minix and the *BSD family (as far as they use Citrus). In the Minix source code I've also found the file command looking for a new encoding when the character size is not 8 bits.
Amusingly, I just found that the most widely-used implementation, glibc, is an example of what I'm looking for. Consider this simple program:
#include <stdlib.h>
#include <stdio.h>
int main()
{
    wchar_t wc = 0;
    /* no setlocale() call, so this runs in the default C locale */
    int n = mbtowc(&wc, "\x80", 1);
    printf("%d %.4x\n", n, (int)wc);  /* n is -1 if the byte was rejected */
}
On glibc, it prints -1 0000. If the byte 0x80 were an extended character in the implementation's C/POSIX locale, it would print 1 followed by some nonzero character number.
Thus, the "common knowledge" that the C/POSIX locale is "8-bit-clean" on glibc is simply false. What's going on is that there's a gross inconsistency; despite the fact that all the standard utilities, regular expression matching, etc. are specified to operate on (multibyte) characters as if read by mbrtowc, the implementations of these utilities/functions are taking a shortcut when they see MB_CUR_MAX==1 or LC_CTYPE containing "C" (or similar) and reading char values directly instead of processing input with mbrtowc or similar. This is leading to an inconsistency between the specified behavior (which, as their implementation of the C/POSIX locale is defined, would have to treat high bytes as illegal sequences) and the implementation behavior (which is bypassing the locale system entirely).
With all that said, I am still looking for other implementations with the properties requested in the question.
"What I'm interested in knowing is whether there are implementations which take this or any other unusual options in implementing the C locale."
This question is very difficult to answer because it mixes the "C Locale", which I'm assuming refers to the C Standard limited character set mentioned above, with "other unusual options", which I'm assuming refers to how the specific implementation handles characters outside the (limited) C locale. Every C implementation must implement the C locale; I don't think there are any unusual options surrounding that.
Let's assume for argument that the question is: "...unusual options in implementing additional/extended characters beyond the C locale." Now this becomes an implementation-dependent question, and as you have already mentioned, it "leaves a great deal of freedom to the implementation." So without knowing the target compiler/hardware, it would still be difficult to answer definitively.
Now the last part:
"...attempting to convert "high bytes" in the C locale results in EILSEQ or anything other than treating them as (abstract or Latin-1) single-byte characters or UTF-8?"
Instead of converting high bytes while in the C Locale, you might be able to set the Locale in your program as in this SO Question: Does the underlying character set depend only on the C implementation?
This way you can ensure that your characters will be treated in the Locale that you expect.
It is my understanding that the C Locale only concerns itself with the first 7 bits (of an 8-bit char type), based on the sources below:
http://www.cprogramming.com/tutorial/unicode.html
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
http://www.in-ulm.de/~mascheck/locale/
The terms "high bytes" and "Unicode" and "UTF-8" are in the class of multi-byte or wide-character encodings, and are very locale specific (and beyond the range of the minimal C Locale). I'm not clear on how it would be possible to "convert high bytes" in the (pure) C Locale. It's quite possible that implementations would pick a default (extended) locale if none was explicitly set (or pull it from the OS environment settings as stated in one of the links above).
The POSIX standard is quite clear in this regard.
The introduction to character sets in POSIX.1-2017 says:
6.2 Character Encoding
The POSIX locale shall contain 256 single-byte characters including the characters in Portable Character Set and Non-Portable Control Characters, which have the properties listed in LC_CTYPE. It is unspecified whether characters not listed in those two tables are classified as punct or cntrl, or neither. Other locales shall contain the characters in Portable Character Set and may contain any or all of the control characters identified in Non-Portable Control Characters; the presence, meaning, and representation of any additional characters are locale-specific.
(emphasis mine)
The page for mbtowc() says:
The mbtowc() function shall fail if:
[EILSEQ]
 An invalid character sequence is detected. In the POSIX locale an [EILSEQ] error cannot occur since all byte values are valid characters.
Note that the POSIX locale is defined to be identical to the C locale.
So if an operating system conforms to POSIX, mbtowc is a no-op in the POSIX locale. Characters 128–255 are passed through just as characters 0–127 are. Implementations that operate differently are in violation of the standard.
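A sketch that tests this directly: run every byte value through mbtowc in the default C locale and count rejections (on a POSIX-conforming system the count should be 0; on glibc, per the earlier answer, it will be 128):
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int rejected = 0;
    /* no setlocale() call, so the program runs in the default "C" locale */
    for (int b = 1; b < 256; b++) {  /* skip 0, the null byte */
        char s[2] = { (char)b, '\0' };
        wchar_t wc;
        mbtowc(NULL, NULL, 0);       /* reset any conversion state */
        if (mbtowc(&wc, s, 1) < 0)
            rejected++;
    }
    printf("%d byte values rejected\n", rejected);
    return 0;
}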
