According to C11 WG14 draft version N1570:
The header <ctype.h> declares several functions useful for classifying
and mapping characters. In all cases the argument is an int, the
value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF. If the argument has any other value,
the behavior is undefined.
Is the following undefined behaviour?
#include <ctype.h>
#include <limits.h>
#include <stdlib.h>
int main(void) {
char c = CHAR_MIN; /* let's assume that char is signed and CHAR_MIN < 0 */
return isspace(c) ? EXIT_FAILURE : EXIT_SUCCESS;
}
Does the standard allow passing a char to isspace() (i.e., char converted to int)? In other words, is a char, after conversion to int, representable as an unsigned char?
Here's how wiktionary defines "representable":
Capable of being represented.
Is char capable of being represented as unsigned char? Yes. §6.2.6.1/4:
Values stored in non-bit-field objects of any other object type
consist of n × CHAR_BIT bits, where n is the size of an object of that
type, in bytes. The value may be copied into an object of type
unsigned char [n] (e.g., by memcpy); the resulting set of bytes is
called the object representation of the value.
sizeof(char) == 1, therefore its object representation is unsigned char[1], i.e., char is capable of being represented as an unsigned char. Where am I wrong?
A concrete example: I can represent [-2, -1, 0, 1] as [0, 1, 2, 3]. If I can't, then why not?
Related: according to §6.3.1.3, isspace((unsigned char)c) is portable if INT_MAX >= UCHAR_MAX; otherwise it is implementation-defined.
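For reference, the usual portable idiom wraps that conversion; a minimal sketch (the wrapper name is my own invention):

#include <ctype.h>

/* Portable wrapper (name is illustrative): converting through unsigned
   char first maps any negative char value into the domain for which
   isspace() is defined. */
static int is_space_portable(char c) {
    return isspace((unsigned char)c);
}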
What does representable in a type mean?
Re-formulated, a type is a convention for what the underlying bit-patterns mean. A value is thus representable in a type, if that type assigns some bit-pattern that meaning.
A conversion (which might need a cast), is a mapping from a value (represented with a specific type) to a value (possibly different) represented in the target type.
Under the given assumption (that char is signed), CHAR_MIN is certainly negative, and the text you quoted leaves no room for interpretation:
Yes, it is undefined behavior, as unsigned char cannot represent any negative numbers.
If that assumption did not hold, your program would be well-defined, because CHAR_MIN would be 0, a valid value for unsigned char.
Thus, we have a case where it is implementation-defined whether the program is undefined or well-defined.
As an aside, there is no guarantee that sizeof(int) > 1 or that INT_MAX >= UCHAR_MAX, so int might not be able to represent all values possible for unsigned char.
As the conversion from signed char to int is value-preserving, a signed char can always be converted to int.
But if it was negative, that does not change the impossibility of representing a negative value as an unsigned char. (The conversion itself is defined: conversion from any integer type to any unsigned integer type is always defined, though a narrowing conversion may provoke a compiler warning without an explicit cast.)
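A minimal sketch illustrating both points, assuming char is signed:

#include <stdio.h>

int main(void) {
    signed char sc = -1;
    int i = sc;            /* value-preserving conversion: i == -1 */
    unsigned char uc = sc; /* defined, but cannot preserve the value:
                              uc == UCHAR_MAX */
    printf("%d %u\n", i, (unsigned)uc);
    return 0;
}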
Under the assumption that char is signed, this would be undefined behavior; otherwise it is well defined, since CHAR_MIN would have the value 0. It is easier to see the intention and meaning of:
the value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF
if we read section 7.4 Character handling <ctype.h> from the Rationale for International Standard—Programming Languages—C which says (emphasis mine going forward):
Since these functions are often used primarily as macros, their domain
is restricted to the small positive integers representable in an
unsigned char, plus the value of EOF. EOF is traditionally -1, but may
be any negative integer, and hence distinguishable from any valid
character code. These macros may thus be efficiently implemented by
using the argument as an index into a small array of attributes.
So valid values are:
Positive integers that can fit into unsigned char
EOF, which is some implementation-defined negative number
Even though this is the C99 rationale, the particular wording you are referring to does not change from C99 to C11, so the rationale still applies.
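To illustrate the array-of-attributes implementation the rationale alludes to, here is a sketch of a table-based isspace() (my own illustration, assuming EOF == -1 and CHAR_BIT == 8):

/* Index 0 holds the entry for EOF; indices 1..256 cover the unsigned
   char range, hence the +1 offset. */
static const unsigned char space_table[257] = {
    [1 + ' ']  = 1, [1 + '\t'] = 1, [1 + '\n'] = 1,
    [1 + '\v'] = 1, [1 + '\f'] = 1, [1 + '\r'] = 1,
};

static int my_isspace(int c) {
    /* c must be EOF or a value representable as unsigned char */
    return space_table[c + 1];
}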
We can also find out why the interface uses int as the argument type, as opposed to char, in section 7.1.4 Use of library functions, which says:
All library prototypes are specified in terms of the “widened” types:
an argument formerly declared as char is now written as int. This
ensures that most library functions can be called with or without a
prototype in scope, thus maintaining backwards compatibility with
pre-C89 code. Note, however, that since functions like printf and
scanf use variable-length argument lists, they must be called in the
scope of a prototype.
The revealing quote (for me) is §6.3.1.3/1:
if the value can be represented by the new type, it is unchanged.
i.e., if the value has to be changed then the value can't be represented by the new type.
Therefore an unsigned type can't represent a negative value.
To answer the question in the title: "representable" refers to "can be represented" from §6.3.1.3 and is unrelated to "object representation" from §6.2.6.1.
It seems trivial in retrospect. I might have been confused by the habit of treating b'\xFF', 0xff, 255, -1 as the same byte in Python:
>>> (255).to_bytes(1, 'big')
b'\xff'
>>> int.from_bytes(b'\xFF', 'big')
255
>>> 255 == 0xff
True
>>> (-1).to_bytes(1, 'big', signed=True)
b'\xff'
and by the disbelief that it is undefined behavior to pass a character to a character classification function, e.g., isspace(CHAR_MIN).
Related
A simple question on type conversion in C; assume this line of code:
signed char a = 133;
As the maximum value of a signed char is 127, does the above code have implementation-defined behaviour according to the third rule of casting?
otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
First of all, 133 is not unsigned. Since it will always fit in an int, it will be of type int, and signed (furthermore in C99+, all unsuffixed decimal constants are signed! To get unsigned numbers you must add U/u at the end).
Second, this isn't a cast but a conversion. A cast in C is an explicit conversion (or non-conversion) to a certain type, marked with the construct (type)expression. In this case you could write the initialization to use an explicit cast:
signed char a = (signed char)133;
In this case it would not change the behaviour of the initialization.
Third, this is indeed an initialization, not an assignment, so it has different rules for what is an acceptable expression. If the initializer is for an object with static storage duration, then it must be a certain kind of compile-time constant. But for this particular case, both assignment and initialization would do the conversion the same way.
Now we get to the point of whether the third integer conversion rule applies - for that you need to know what the first two are:
the target type is an integer type (not _Bool) with the value representable in it (does not apply in this case since as you well know 133 is not representable if SCHAR_MAX is 127)
the target type is unsigned (well it isn't)
So we get to C11 6.3.1.3p3:
Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
The question is whether it has implementation-defined behaviour - yes, the implementation must document what will happen - either how it calculates the result, or which signal it will raise in that occasion.
For GCC 10.2, the manual states:
The result of, or the signal raised by, converting an integer to a signed integer type when the value cannot be represented in an object of that type (C90 6.2.1.2, C99 and C11 6.3.1.3).
For conversion to a type of width N, the value is reduced modulo 2^N to be within range of the type; no signal is raised.
The Clang "documentation" is a "little" less accessible, you just have to read the source code...
This is an implicit type conversion at assignment. 133 will be copied bit by bit into the variable a. 133 in binary is 10000101, which, when copied into a, represents a negative number, since a leading 1 bit marks a negative value. The actual value of a is then determined using the two's complement method, which comes out to -123. (It also depends on how negative numbers are implemented for signed types.)
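A quick way to observe the implementation-defined result (a sketch; the stated output assumes two's complement and CHAR_BIT == 8):

#include <stdio.h>

int main(void) {
    signed char a = 133; /* implementation-defined conversion */
    printf("%d\n", a);   /* prints -123 (i.e., 133 - 256) on typical targets */
    return 0;
}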
The signedness of char is not standardized. Hence there are signed char and unsigned char types. Therefore, functions which work with a single character must use an argument type which can hold both signed char and unsigned char (this type was chosen to be int), because if the argument type were char, we would get type conversion warnings from the compiler (if -Wconversion is used) in code like this:
char c = 'ÿ';
if (islower((unsigned char) c)) ...
warning: conversion to ‘char’ from ‘unsigned char’ may change the sign of the result
(here we consider what would happen if the argument type of islower() was char)
And the thing that makes it work without explicit casting is the automatic promotion from char to int.
Further, the ISO C90 standard, where wchar_t was introduced, does not say anything
specific about the representation of wchar_t.
Some quotations from glibc reference:
it would be legitimate to define wchar_t as char
if wchar_t is defined as char the type wint_t must be defined as int due to the parameter promotion.
So, wchar_t can perfectly well be defined as char, which means that similar rules for wide character types must apply, i.e., there may be implementations where wchar_t is signed, and there may be implementations where wchar_t is unsigned.
From this it follows that there must exist unsigned wchar_t and signed wchar_t types (for the same reason as there are unsigned char and signed char types).
Private communication reveals that an implementation is allowed to support wide characters with >= 0 values only (independently of the signedness of wchar_t). Does anybody know what this means? Does this mean that when wchar_t is a 16-bit type (for example), we can only use 15 bits to store the value of a wide character?
In other words, is it true that a sign-extended wchar_t is a valid value?
Also, private communication reveals that the standard requires that any valid value of wchar_t must be representable by wint_t. Is it true?
Consider this example:
#include <locale.h>
#include <ctype.h>
int main (void)
{
setlocale(LC_CTYPE, "fr_FR.ISO-8859-1");
/* 11111111 */
char c = 'ÿ';
if (islower(c)) return 0;
return 1;
}
To make it portable, we need the (unsigned char) cast. This is necessary because char may be equivalent to signed char, in which case a byte where the top bit is set would be sign-extended when converting to int, yielding a value outside the range of unsigned char.
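So the portable version of the test above reads:

/* Portable: route the char through unsigned char first. */
if (islower((unsigned char) c)) return 0;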
Now, why is this scenario different from the following example for
wide characters?
#include <locale.h>
#include <wchar.h>
#include <wctype.h>
int main(void)
{
setlocale(LC_CTYPE, "");
wchar_t wc = L'ÿ';
if (iswlower(wc)) return 0;
return 1;
}
We need to use iswlower((unsigned wchar_t)wc) here, but
there is no unsigned wchar_t type.
Why are there no unsigned wchar_t and signed wchar_t types?
UPDATE
Are the standards saying that casting to unsigned int and to int in the following two programs is guaranteed to be correct?
(I just replaced wint_t and wchar_t with their actual meanings in glibc)
#include <locale.h>
#include <wchar.h>
int main(void)
{
setlocale(LC_CTYPE, "en_US.UTF-8");
unsigned int wc;
wc = getwchar();
putwchar((int) wc);
}
--
#include <locale.h>
#include <wchar.h>
#include <wctype.h>
int main(void)
{
setlocale(LC_CTYPE, "en_US.UTF-8");
int wc;
wc = L'ÿ';
if (iswlower((unsigned int) wc)) return 0;
return 1;
}
TL;DR:
Why are there no unsigned wchar_t and signed wchar_t types?
Because C's wide-character handling facilities were defined such that they are not needed.
In more detail,
The signedness of char is not standardized.
To be precise, "The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char." (C2011, 6.2.5/15)
Hence there are signed char and unsigned char types.
"Hence" implies causation, which would be hard to argue clearly, but certainly signed char and unsigned char are more appropriate when you want to handle numbers, as opposed to characters.
Therefore, functions which work with a single character must use an argument type which can hold both signed char and unsigned char
No, not at all. Standard library functions that work with individual characters could easily be defined in terms of type char, regardless of whether that type is signed, because the library implementation does know its signedness. If that were a problem then it would apply equally to the string functions, too -- char would be useless.
Your example of getchar() is non-apposite. It returns int rather than a character type because it needs to be able to return an error indicator that does not correspond to any character. Moreover, the code you present does not correspond to the accompanying warning message: it contains a conversion from int to unsigned char, but no conversion from char to unsigned char.
Some other character-handling functions accept int parameters or return values of type int both for compatibility with getchar() and other stdio functions, and for historic reasons. In days of yore, you couldn't actually pass a char at all -- it would always be promoted to int, and that is what the functions would (and must) accept. One cannot later change the argument type, evolution of the language notwithstanding.
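A sketch of that historic calling convention (valid C89 through C11, though unprototyped declarations were removed in C23):

#include <stdio.h>

void classify();   /* no prototype: arguments undergo the default promotions */

int main(void) {
    char c = 'a';
    classify(c);   /* c is promoted to int before the call */
    return 0;
}

void classify(c)
int c;             /* K&R-style definition must therefore take int, not char */
{
    printf("got %c\n", c);
}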
Further, the ISO C90 standard, where wchar_t was introduced, does not say anything specific about the representation of wchar_t.
C90 isn't really relevant any longer, but no doubt it says something very similar to C2011 (7.19/2), which describes wchar_t as
an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales [...].
Your quotations from the glibc reference are non-authoritative, except possibly for glibc itself. They appear in any case to be commentary, not specification, and it's unclear why you raise them. Certainly, though, at least the first is correct. Referring to the standard: if all the members of the largest extended character set specified among the locales supported by a given implementation could fit in a char, then that implementation could define wchar_t as char. Such implementations used to be much more common than they are today.
You ask several questions:
Private communication reveals that an implementation is allowed to support wide characters with >= 0 values only (independently of the signedness of wchar_t). Does anybody know what this means?
I think it means that whoever communicated that to you doesn't know what they are talking about, or perhaps that what they are talking about is something different than the requirements placed by the C standard. You will find that in practice, character sets are defined with only non-negative character codes, but that is not a constraint placed by the C standard.
Does this mean that when wchar_t is a 16-bit type (for example), we can only use 15 bits to store the value of a wide character?
The C standard does not say or imply that. You can store the value of any supported character in a wchar_t. In particular, if an implementation supports a character set containing character codes exceeding 32767, then you can store those in a wchar_t.
In other words, is it true that a sign-extended wchar_t is a valid value?
The C standard does not say or imply that. It does not even say whether wchar_t is a signed type (if not, then sign extension is meaningless for it). If it is a signed type, then there is no guarantee about whether sign-extending a value representing a character in some supported character set (which value could, in principle, be negative) will produce a value that also represents a character in that character set, or in any other supported character set. The same is true of adding 1 to a wchar_t value.
Also, private communication reveals that the standard requires that any valid value of wchar_t must be representable by wint_t. Is it true?
It depends what you mean by "valid". The standard says that wint_t
is an integer type unchanged by default argument promotions that can hold any value corresponding to members of the extended character set, as well as at least one value that does not correspond to any member of the extended character set.
(C2011, 7.29.1/2)
wchar_t must be able to hold any value corresponding to a member of the extended character set, in any supported locale. wint_t must be able to hold all of those values, too. It may be, however, that wchar_t is capable of representing values that do not correspond to any character in any supported character set. Such values are valid in the sense that the type can represent them. wint_t is not required to be able to represent such values.
For example, if the largest extended character set of any supported locale uses character codes up to but not exceeding 32767, then an implementation would be free to implement wchar_t as an unsigned 16-bit integer, and wint_t as a signed 16-bit integer. The values representable by wchar_t that do not correspond to extended characters are then not representable by wint_t (but wint_t still has many candidates for its required value that does not correspond to any character).
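A sketch of what such an implementation's type choices could look like (hypothetical names, not real typedefs; assumes a platform where int is 16 bits so that a 16-bit wint_t is unchanged by the default promotions):

/* Character codes run 0..32767, so every character fits both types. */
typedef unsigned short hyp_wchar_t;  /* like a 16-bit unsigned wchar_t  */
typedef short          hyp_wint_t;   /* like a 16-bit signed wint_t     */
#define HYP_WEOF ((hyp_wint_t)-1)    /* not a character code, as required */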
With respect to the character and wide-character classification functions, the only answer is that the differences simply arise from different specifications. The char classification functions are defined to work with the same values that getchar() is defined to return - either -1 or a character value converted, if necessary, to unsigned char. The wide-character classification functions, on the other hand, accept arguments of type wint_t, which can represent the values of all wide characters unchanged, so there is no need for a conversion.
You claim in this regard that
We need to use iswlower((unsigned wchar_t)wc) here, but there is no unsigned wchar_t type.
No and maybe. You do not need to convert the wchar_t argument to iswlower() to any other type, and in particular, you do not need to convert it to an explicitly unsigned type. The wide character classification functions are not analogous to the regular character classification functions in this respect, having been designed with the benefit of hindsight. As for unsigned wchar_t, C does not require such a type to exist, so portable code should not use it, but it may exist in some implementations.
Regarding the update appended to the question:
Are the standards saying that casting to unsigned int and to int in the following two programs is guaranteed to be correct? (I just replaced wint_t and wchar_t with their actual meanings in glibc)
The standard says nothing of the sort about conforming implementations in general. I'll suppose, however, that you mean to ask specifically about conforming implementations for which wchar_t is int and wint_t is unsigned int.
On such an implementation, your first program is flawed because it does not account for the possibility that getwchar() returns WEOF. Converting WEOF to type wchar_t, if doing so does not cause a signal to be raised, is not guaranteed to produce a value that corresponds to any wide character. Passing the result of such a conversion to putwchar() therefore does not exhibit defined behavior. Moreover, if WEOF is defined with the same value as UINT_MAX (which is not representable by int) then the conversion of that value to int has implementation-defined behavior independently of the putwchar() call.
On the other hand, I think the key point you are struggling with is that if the value returned by getwchar() in the first program is not WEOF, then it is guaranteed to be one that is unchanged by conversion to wchar_t. Your first program will perform as appears to be intended in that case, but the cast to int (or wchar_t) is unnecessary.
Similarly, the second program is correct provided that the wide-character literal corresponds to a character in the applicable extended character set, but the cast is unnecessary and changes nothing. The wchar_t value of such a literal is guaranteed to be representable by type wint_t, so the cast changes the type of its operand, but not the value. (But if the literal does not correspond to a character in the extended character set then the behavior is implementation-defined.)
On the third hand, if your objective is to write strictly-conforming code then the right thing to do, and indeed the intended usage mode of these particular wide-character functions, would be this:
#include <locale.h>
#include <wchar.h>
int main(void)
{
setlocale(LC_CTYPE, "en_US.UTF-8");
wint_t wc = getwchar();
if (wc != WEOF) {
// No cast is necessary or desirable
putwchar(wc);
}
}
and this:
#include <locale.h>
#include <wchar.h>
#include <wctype.h>
int main(void)
{
setlocale(LC_CTYPE, "en_US.UTF-8");
wchar_t wc = L'ÿ';
// No cast is necessary or desirable
if (iswlower(wc)) return 0;
return 1;
}
Suppose that we write in C the following character constant:
'\xFFFFAA'
Which is its numerical value?
The standard C99 says:
Character constants have type int.
Hexadecimal character constants can be represented as an unsigned char.
The value of a basic character constant is non-negative.
The value of any character constant fits in the range of char.
Besides:
The range of values of signed char is contained in the range of values of int.
The sizes of char, unsigned char and signed char are the same: 1 byte.
The size of a byte is given by CHAR_BIT, whose value is at least 8.
Let's suppose that we have the typical situation with CHAR_BIT == 8.
Also, let's suppose that char is signed char for us.
By following the rules: the constant '\xFFFFAA' has type int, but its value can be represented in an unsigned char, although its real value fits in a char.
From these rules, an example such as '\xFF' would give us:
(int)(char)(unsigned char)'\xFF' == -1
The 1st cast unsigned char comes from the "can be represented as unsigned char" requirement.
The 2nd cast char comes from the "the value fits in a char" requirement.
The 3rd cast int comes from the "has type int" requirement.
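On such an implementation the chain can be observed directly (assuming signed char and CHAR_BIT == 8):

#include <stdio.h>

int main(void) {
    /* 0xFF -> unsigned char 255 -> char -1 -> int -1 */
    printf("%d\n", '\xFF'); /* prints -1 under the stated assumptions */
    return 0;
}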
However, the constant '\xFFFFAA' is too big, and cannot be "represented" as an unsigned char.
Which is its value?
I think that the value is the result of (char)(0xFFFFAA % 256), since the standard says, more or less, the following:
For unsigned integer types, if a value is bigger than the maximum M that can be represented by the type, the value obtained is the remainder modulo M+1.
Am I right with this conclusion?
EDIT I have been convinced by @KeithThompson: he says that, according to the standards, a big hexadecimal character constant is a constraint violation.
So, I will accept that answer.
However: for example, with GCC 4.8 on MinGW, the compiler triggers a warning message and the program compiles following the behaviour I have described. Thus, a constant like '\x100020' was considered valid, and its value was 0x20.
The C standard defines the syntax and semantics in section 6.4.4.4. I'll cite the N1570 draft of the C11 standard.
Paragraph 6:
The hexadecimal digits that follow the backslash and the letter x in a
hexadecimal escape sequence are taken to be part of the construction
of a single character for an integer character constant or of a single
wide character for a wide character constant. The numerical value of
the hexadecimal integer so formed specifies the value of the desired
character or wide character.
Paragraph 9:
Constraints
The value of an octal or hexadecimal escape sequence shall be in the
range of representable values for the corresponding type:
followed by a table saying that with no prefix, the "corresponding type" is unsigned char.
So, assuming that 0xFFFFAA is outside the representable range for type unsigned char, the character constant '\xFFFFAA' is a constraint violation, requiring a compile-time diagnostic. A compiler is free to reject your source file altogether.
If your compiler doesn't at least warn you about this, it's failing to conform to the C standard.
Yes, the standard does say that unsigned types have modular (wraparound) semantics, but that only applies to arithmetic expressions and some conversions, not to the meanings of constants.
(If CHAR_BIT >= 24 on your system, it's perfectly valid, but that's rare; usually CHAR_BIT == 8.)
If a compiler chooses to issue a mere warning and then continue to compile your source, the behavior is undefined (simply because the standard doesn't define the behavior).
On the other hand, if you had actually meant 'xFFFFAA', that's not interpreted as hexadecimal. (I see it was merely a typo, and the question has been edited to correct it, but I'm going to leave this here anyway.) Its value is implementation-defined, as described in paragraph 10:
The value of an integer character constant containing more than one
character (e.g.,
'ab'), ..., is implementation-defined.
Character constants containing more than one character are a nearly useless language feature, used by accident more often than they're used intentionally.
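A sketch showing their implementation-defined flavour (the stated value is what GCC and Clang happen to compute on common targets, not a guarantee):

#include <stdio.h>

int main(void) {
    /* Typically evaluated as ('a' << 8) | 'b' == 0x6162 == 24930,
       usually with a -Wmultichar warning. */
    printf("%d\n", 'ab');
    return 0;
}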
Yes, the value of \xFFFFAA should be representable by unsigned char.
6.4.4.4 9 Constraints
The value of an octal or hexadecimal escape sequence shall be in the
range of representable values for the type unsigned char for an
integer character constant.
But C99 also says,
6.4.4.4 10 Semantics
The value of an integer character constant containing more than one
character (e.g., 'ab'), or containing a character or escape sequence
that does not map to a single-byte execution character, is
implementation-defined.
So the resulting value should be in the range of unsigned char ([0, 255] if CHAR_BIT == 8). But as to which one, it depends on the compiler, the architecture, etc.
The various is... functions (e.g. isalpha, isdigit) in ctype.h aren't entirely predictable. They take int arguments but expect character values in the unsigned char range, so on a platform where char is signed, passing a char value directly could lead to undesirable sign extension. I believe that the typical approach to handling this is to explicitly cast to an unsigned char first.
Okay, but what is the proper, portable way to deal with the various isw... functions in wctype.h? wchar_t, like char, also may be signed or unsigned, but because wchar_t is itself a typedef, a typename of unsigned wchar_t is illegal.
Isn't that what wint_t is for? The iswXxxxx() functions take a wint_t type:
ISO 9899:1999 covers this in various sections, working backwards:
§7.25 Wide character classification and mapping utilities <wctype.h>
§7.25.2.1.1 The iswalnum function
Synopsis
#include <wctype.h>
int iswalnum(wint_t wc);
Description
The iswalnum function tests for any wide character for which iswalpha or iswdigit is true.
§7.24 Extended multibyte and wide character utilities <wchar.h>
§7.24.1 Introduction:
wint_t
which is an integer type unchanged by default argument promotions that can hold any
value corresponding to members of the extended character set, as well as at least one
value that does not correspond to any member of the extended character set (see WEOF
below);
269) wchar_t and wint_t can be the same integer type.
The 'unchanged by default argument promotions' should mean that it has to be as big as an int, though it could be a short or unsigned short if sizeof(short) == sizeof(int) (which is seldom the case these days, though it was true for some 16-bit systems).
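That requirement can be approximated with a compile-time check (C11; a crude size-based sketch, not the exact rank-based rule):

#include <wchar.h>

/* A type unchanged by the default argument promotions cannot be
   narrower than int unless short and int have the same size. */
_Static_assert(sizeof(wint_t) >= sizeof(int) || sizeof(short) == sizeof(int),
               "wint_t is narrower than the promotion rules allow?");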
§7.17 Common definitions <stddef.h>
wchar_t
which is an integer type whose range of values can represent distinct codes for all
members of the largest extended character set specified among the supported locales; the
null character shall have the code value zero and each member of the basic character set
shall have a code value equal to its value when used as the lone character in an integer
character constant.
As long as the value passed to iswalnum() or its kin is a valid wchar_t or WEOF, the function will work correctly. If you manufactured the value out of thin air and manage to get the value wrong, you get undefined behaviour.
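For example, feeding iswalnum() the values that getwchar() produces is always safe, since those are exactly valid wide characters or WEOF:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

int main(void) {
    setlocale(LC_CTYPE, "");
    wint_t wc = getwchar();
    /* getwchar() returns either a valid wide character or WEOF,
       both of which are in iswalnum()'s domain. */
    if (wc != WEOF && iswalnum(wc))
        puts("alphanumeric");
    return 0;
}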
Upon re-reading the ISO C99 specification regarding wctype.h, it states:
For all functions described in this subclause that accept an argument of type wint_t, the value shall be representable as a wchar_t or shall equal the value of the macro WEOF. If this argument has any other value, the behavior is undefined. (§7.25.1/5)
Contrast this with the corresponding note for ctype.h:
In all cases the argument is an int, the value of which shall be
representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined. (§7.4/1)
(emphasis mine)
I think that it's also worth understanding the motivation for why the ctype.h functions require unsigned char representations. The standard requires that EOF be a negative int (§7.19.1/3), so the ctype.h functions use unsigned char representations to (try to) avoid potential ambiguity.
In contrast, that motivation doesn't exist for the wctype.h functions. The standard makes no such requirement of WEOF, as elaborated by footnote 270:
The value of the macro WEOF may differ from that of EOF and need not be negative.
because WEOF is already guaranteed to not conflict with any character represented by wchar_t (§7.24.1/3).
Therefore the wctype.h functions don't have or need any of the unsigned nonsense, and wchar_t values can be passed to them directly.
I've been reading the source code of SQLite and wondered about this function:
static int strlen30(const char *z){
const char *z2 = z;
while( *z2 ){ z2++; }
return 0x3fffffff & (int)(z2 - z);
}
Why use strlen30() instead of strlen() (in string.h)?
The commit message that went in with this change states:
[793aaebd8024896c] part of check-in [c872d55493] Never use strlen(). Use our own internal sqlite3Strlen30() which is guaranteed to never overflow an integer. Additional explicit casts to avoid nuisance warning messages. (CVS 6007) (user: drh branch: trunk)
(This is my answer from "Why reimplement strlen as loop+subtraction?", but that question was closed.)
I can't tell you the reason why they had to re-implement it, nor why they chose int instead of size_t as the return type. But about the function:
/*
** Compute a string length that is limited to what can be stored in
** lower 30 bits of a 32-bit signed integer.
*/
static int strlen30(const char *z){
const char *z2 = z;
while( *z2 ){ z2++; }
return 0x3fffffff & (int)(z2 - z);
}
Standard References
The standard says in (ISO/IEC 14882:2003(E)) 3.9.1 Fundamental Types, 4.:
Unsigned integers, declared unsigned, shall obey the laws of arithmetic modulo 2^n where n is the number of bits in the value representation of that particular size of integer. 41)
...
41): This implies that unsigned arithmetic does not overflow because a result that cannot be represented by the resulting unsigned integer
type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting unsigned integer
type
That part of the standard does not define overflow behaviour for signed integers. If we look at 5. Expressions, 5.:
If during the evaluation of an expression, the result is not mathematically defined or not in the range of representable values for its type, the behavior is undefined, unless such an expression is a constant expression
(5.19), in which case the program is ill-formed. [Note: most existing implementations of C++ ignore integer
overflows. Treatment of division by zero, forming a remainder using a zero divisor, and all floating point
exceptions vary among machines, and is usually adjustable by a library function. ]
So much for overflow.
As for subtracting two pointers to array elements, 5.7 Additive operators, 6.:
When two pointers to elements of the same array object are subtracted, the result is the difference of the subscripts of the two array elements. The type of the result is an implementation-defined signed integral type; this type shall be the same type that is defined as ptrdiff_t in the cstddef header (18.1). [...]
Looking at 18.1:
The contents are the same as the Standard C library header stddef.h
So let's look at the C standard (I only have a copy of C99, though), 7.17 Common Definitions :
The types used for size_t and ptrdiff_t should not have an integer conversion rank
greater than that of signed long int unless the implementation supports objects
large enough to make this necessary.
No further guarantee made about ptrdiff_t. Then, Annex E (still in ISO/IEC 9899:TC2) gives the minimum magnitude for signed long int, but not a maximum:
#define LONG_MAX +2147483647
Now what are the maxima for int, the return type for sqlite - strlen30()? Let's skip the C++ quotation that forwards us to the C-standard once again, and we'll see in C99, Annex E, the minimum maximum for int:
#define INT_MAX +32767
Summary
Usually, ptrdiff_t is not bigger than signed long, which is not smaller than 32 bits.
int is only required to be at least 16 bits long.
Therefore, subtracting two pointers may give a result that does not fit into the int of your platform.
We remember from above that for signed types, a result that does not fit yields undefined behaviour.
strlen30() applies a bitwise AND to the pointer-subtraction result:
| 32 bit |
ptr_diff |10111101111110011110111110011111| // could be even larger
& |00111111111111111111111111111111| // == 0x3FFFFFFF
----------------------------------
= |00111101111110011110111110011111| // truncated
That prevents undefined behaviour by truncating the pointer-subtraction result to a maximum value of 0x3FFFFFFF = 1073741823.
I am not sure why they chose exactly that value, because on most machines only the most significant bit carries the sign. It could have made sense, relative to the standard, to choose the minimum INT_MAX, but 1073741823 is slightly strange without knowing more details (though it does exactly what the comment above the function says: truncate to 30 bits and prevent overflow).
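For comparison, the same clamping idea can be written with strlen(), masking before the narrowing cast so the value is always in range (a sketch with a hypothetical name, not SQLite's code):

#include <string.h>

/* Same result as strlen30() for lengths that fit in 30 bits; the mask
   is applied to the size_t before the cast to int. */
static int strlen30_alt(const char *z) {
    size_t n = strlen(z);
    return (int)(n & 0x3fffffff);
}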
The CVS commit message says:
Never use strlen(). Use our own internal sqlite3Strlen30() which is guaranteed to never overflow an integer. Additional explicit casts to avoid nuisance warning messages. (CVS 6007)
I couldn't find any further reference to this commit or explanation how they got an overflow in that place. I believe that it was an error reported by some static code analysis tool.