strcmp returning unexpected results - c

I thought strcmp was supposed to return a positive number if the first string was larger than the second string. But this program
#include <stdio.h>
#include <string.h>

int main()
{
    char A[] = "A";
    char Aumlaut[] = "Ä";
    printf("%i\n", A[0]);
    printf("%i\n", Aumlaut[0]);
    printf("%i\n", strcmp(A, Aumlaut));
    return 0;
}
prints 65, -61 and -1.
Why? Is there something I'm overlooking?
I thought that maybe the fact that I'm saving the source as UTF-8 would influence things, since the Ä consists of 2 bytes there. But saving with an 8-bit encoding and making sure that both strings have length 1 doesn't help; the end result is the same.
What am I doing wrong?
Using GCC 4.3 under 32 bit Linux here, in case that matters.

strcmp and the other string.h functions aren't actually UTF-aware. On most POSIX machines, strings are encoded as UTF-8, which makes most things "just work" with regard to reading and writing, and leaves open the option of a library that understands and manipulates the UTF code points. But the default string.h functions are not culture-sensitive and know nothing about comparing UTF strings. You can look at the source code for strcmp and see for yourself: it's about as naïve an implementation as possible (which also means it's faster than an internationalization-aware comparison function).
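To see exactly what strcmp compares here, a minimal sketch (assuming the source is saved as UTF-8, as in the question) that dumps every byte as an unsigned char:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char A[] = "A";
    char Aumlaut[] = "Ä";   /* two bytes, 0xC3 0x84, when saved as UTF-8 */
    size_t i;

    /* print each byte as an unsigned value - this is what strcmp sees */
    for (i = 0; i < strlen(Aumlaut); i++)
        printf("byte %zu of Aumlaut: %u\n", i, (unsigned char)Aumlaut[i]);
    printf("byte 0 of A: %u\n", (unsigned char)A[0]);   /* 65 */
    return 0;
}

On a UTF-8 system this prints 195, 132 and 65, so strcmp's first comparison is 65 against 195, and the result is negative.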
I just answered this in another question - you need to use a UTF-aware string library such as IBM's excellent ICU - International Components for Unicode.

The strcmp and similar comparison functions treat the bytes in the strings as unsigned chars, as specified by the standard in section 7.24.4, point 1 (was 7.21.4 in C99):
The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.
(emphasis mine).
The reason is probably that such an interpretation maintains the ordering between code points in the common encodings, while interpreting them as signed chars doesn't.

strcmp() treats chars as unsigned values. So your A-with-umlaut isn't char -61, it's char 195 (that is, 256 - 61; in Latin-1 the character itself would be 196).

Saving with an 8-bit extended-ASCII encoding, 'A' == 65 and 'Ä' is some value strictly greater than 2^7 - 1 (196 in Latin-1); it only looks negative because you're printing it as if it were signed.
If you consider 'Ä' to be an unsigned char (which is how strcmp treats it), its value is 195 in your UTF-8 charset (the first byte of the two-byte sequence). Hence comparing 65 against 195 correctly yields a negative result.
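A minimal sketch of that reinterpretation, assuming a typical platform where plain char is signed and 8 bits wide:

#include <stdio.h>

int main(void)
{
    char c = '\xC3';                  /* first UTF-8 byte of 'Ä' */
    printf("%d\n", c);                /* -61 where plain char is signed */
    printf("%d\n", (unsigned char)c); /* 195 - what strcmp actually compares */
    return 0;
}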

Check the strcmp manpage:
The strcmp() function compares the two strings s1 and s2. It returns
an integer less than, equal to, or greater than zero if s1 is found,
respectively, to be less than, to match, or be greater than s2.

To do string handling correctly in C when the input character set exceeds ASCII, you should use the standard library's wide-character facilities for strings and I/O. Your program should be:
#include <wchar.h>
#include <stdio.h>

int main()
{
    wchar_t A[] = L"A";
    wchar_t Aumlaut[] = L"Ä";
    wprintf(L"%i\n", A[0]);
    wprintf(L"%i\n", Aumlaut[0]);
    wprintf(L"%i\n", wcscmp(A, Aumlaut));
    return 0;
}
and then it will give the correct results (GCC 4.6.3). You don't need a special library.

Related

Weird return value in strcmp [duplicate]

This question already has answers here: Inconsistent strcmp() return value when passing strings as pointers or as literals.
While checking the return value of strcmp function, I found some strange behavior in gcc. Here's my code:
#include <stdio.h>
#include <string.h>

char str0[] = "hello world!";
char str1[] = "Hello world!";

int main() {
    printf("%d\n", strcmp("hello world!", "Hello world!"));
    printf("%d\n", strcmp(str0, str1));
}
When I compile this with clang, both calls to strcmp return 32. However, when compiling with gcc, the first call returns 1, and the second call returns 32. I don't understand why the first and second calls to strcmp return different values when compiled using gcc.
Below is my test environment.
Ubuntu 18.04 64bit
gcc 7.3.0
clang 6.0.0
It looks like you didn't enable optimizations (e.g. -O2).
From my tests it looks like gcc always recognizes strcmp with constant arguments and optimizes it, even with -O0 (no optimizations). Clang needs at least -O1 to do so.
That's where the difference comes from: The code produced by clang calls strcmp twice, but the code produced by gcc just does printf("%d\n", 1) in the first case because it knows that 'h' > 'H' (ASCIIbetically, that is). It's just constant folding, really.
Live example: https://godbolt.org/z/8Hg-gI
As the other answers explain, any positive value will do to indicate that the first string is greater than the second, so the compiler optimizer simply chooses 1. The strcmp library function apparently uses a different value.
The standard defines the result of strcmp to be negative if lhs appears before rhs in lexical order, zero if they are equal, or a positive value if lhs appears lexically after rhs.
It's up to the implementation how to implement that and what exactly to return. You must not depend on a specific value in your programs, or they won't be portable. Simply check with comparisons (<, >, ==).
See https://en.cppreference.com/w/c/string/byte/strcmp
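A minimal sketch of the portable way to consume the result, checking only its sign:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *a = "hello world!";
    const char *b = "Hello world!";
    int r = strcmp(a, b);

    /* only the sign of r is portable, never its exact value */
    if (r < 0)
        printf("\"%s\" sorts before \"%s\"\n", a, b);
    else if (r > 0)
        printf("\"%s\" sorts after \"%s\"\n", a, b);
    else
        printf("the strings are equal\n");
    return 0;
}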
Background
One simple implementation might just calculate the difference of each pair of characters, c1 - c2, and do that until the result is non-zero or one of the strings ends. The result is then the numeric difference between the first pair of characters in which the two strings differ.
For example, this GLibC implementation: https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=string/strcmp.c;hb=HEAD
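A minimal sketch of that naive approach (not GLibC's actual code, just the idea described above; my_strcmp is a made-up name):

#include <stdio.h>

/* returns the difference of the first differing pair,
   both bytes interpreted as unsigned char, or 0 if equal */
int my_strcmp(const char *s1, const char *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;
    while (*p1 && *p1 == *p2) {
        p1++;
        p2++;
    }
    return *p1 - *p2;
}

int main(void)
{
    printf("%d\n", my_strcmp("hello world!", "Hello world!")); /* 32 on ASCII systems */
    return 0;
}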
The strcmp function is only specified to return a value larger than zero, zero, or less than zero. There's nothing specified what those positive and negative values have to be.
The exact values returned by strcmp in the case of the strings not being equal are not specified. From the man page:
#include <string.h>
int strcmp(const char *s1, const char *s2);
int strncmp(const char *s1, const char *s2, size_t n);
The strcmp() and strncmp() functions return an integer less than,
equal to, or greater than zero if s1 (or the first n bytes thereof) is
found, respectively, to be less than, to match, or be greater than s2.
Since str0 compares greater than str1 (and likewise for the string literals), the value must be positive, which it is in both cases.
As for the difference between the two compilers, it appears that clang is returning the difference between the ASCII values for the corresponding characters that mismatched, while gcc is opting for a simple -1, 0, or 1. Both are valid, so your code should only need to check if the value is 0, greater than 0, or less than 0.

C Language: Why int variable can store char?

I am recently reading The C Programming Language by Kernighan.
There is an example which defines a variable as int but uses getchar() to store into it.
int x;
x = getchar();
Why can we store char data in an int variable?
The only thing that I can think of is ASCII and Unicode.
Am I right?
The getchar function (and similar character input functions) returns an int because of EOF. There are cases when (char) EOF != EOF (like when char is an unsigned type).
Also, in many places where one uses a char value, it will silently be promoted to int anyway. And that includes character constants like 'A'.
getchar() attempts to read a byte from the standard input stream. The return value can be any possible value of the type unsigned char (from 0 to UCHAR_MAX), or the special value EOF which is specified to be negative.
On most current systems, UCHAR_MAX is 255 as bytes have 8 bits, and EOF is defined as -1, but the C Standard does not guarantee this: some systems have larger unsigned char types (9 bits, 16 bits...) and it is possible, although I have never seen it, that EOF be defined as another negative value.
Storing the return value of getchar() (or getc(fp)) to a char would prevent proper detection of end of file. Consider these cases (on common systems):
if char is an 8-bit signed type, a byte value of 255, which is the character ÿ in the ISO8859-1 character set, has the value -1 when converted to a char. Comparing this char to EOF will yield a false positive.
if char is unsigned, converting EOF to char will produce the value 255, which is different from EOF, preventing the detection of end of file.
These are the reasons for storing the return value of getchar() into an int variable. This value can later be converted to a char, once the test for end of file has failed.
Storing an int to a char has implementation defined behavior if the char type is signed and the value of the int is outside the range of the char type. This is a technical problem, which should have mandated the char type to be unsigned, but the C Standard allowed for many existing implementations where the char type was signed. It would take a vicious implementation to have unexpected behavior for this simple conversion.
The value of the char does indeed depend on the execution character set. Most current systems use ASCII or some extension of ASCII such as ISO8859-x, UTF-8, etc. But the C Standard supports other character sets such as EBCDIC, where the lowercase letters do not form a contiguous range.
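The canonical reading loop that follows from all this (a minimal sketch):

#include <stdio.h>

int main(void)
{
    int c;   /* must be int, not char, so EOF stays distinguishable */

    while ((c = getchar()) != EOF) {
        /* only after the EOF test is it safe to treat c as a character */
        putchar(c);
    }
    return 0;
}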
getchar is an old C standard function, and the philosophy back then was closer to how the language translates to assembly than to type correctness and readability. Keep in mind that compilers did not optimize code as much as they do today. In C, int is the default return type (i.e. if a function has no declaration, compilers assume it returns int), and returning a value is done via a register, so returning a char instead of an int would actually generate additional implicit code to mask out the extra bytes of the value. Thus, many old C functions prefer to return int.
C requires int be at least as many bits as char. Therefore, int can store the same values as char (allowing for signed/unsigned differences). In most cases, int is a lot larger than char.
char is an integer type that is intended to store a character code from the implementation-defined character set, which is required to be compatible with C's abstract basic character set. (ASCII qualifies, so do the source-charset and execution-charset allowed by your compiler, including the one you are actually using.)
For the sizes and ranges of the integer types (char included), see your <limits.h>. Here is somebody else's limits.h.
C was designed as a very low-level language, so it is close to the hardware. Usually, after a bit of experience, you can predict how the compiler will allocate memory, and even pretty accurately what the machine code will look like.
Your intuition is right: it goes back to ASCII. ASCII is really a simple 1:1 mapping from letters (which make sense in human language) to integer values (which hardware can work with); for every letter there is a unique integer. For example, the 'letter' CTRL-A is represented by the decimal number 1. (For historical reasons, lots of control characters came first - CTRL-G, which rang the bell on an old teletype terminal, is ASCII code 7. Upper-case 'A' and the 25 remaining upper-case letters start at 65, and so on. See http://www.asciitable.com/ for a full list.)
C lets you 'coerce' variables into other types. In other words, the compiler cares about (1) the size, in memory, of the var (see 'pointer arithmetic' in K&R), and (2) what operations you can do on it.
You can do arithmetic on a char directly: it is promoted to int in arithmetic expressions, so no explicit cast is needed. To convert all lower-case letters to upper-case (in ASCII, the two cases differ by exactly 32), you can do something like:
char letter;
....
if (letter >= 'a' && letter <= 'z') {
    letter = letter - 32;
}
but, in the end, the type 'char' is just a small integer type, really, since ASCII assigns a unique integer to each letter.
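For text beyond plain ASCII, the standard <ctype.h> functions are the safer route; a minimal sketch:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char s[] = "hello, World!";
    char *p;

    for (p = s; *p; p++)
        *p = (char)toupper((unsigned char)*p);  /* cast avoids UB for negative char values */
    printf("%s\n", s);  /* HELLO, WORLD! */
    return 0;
}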

Can int contain value of char?

Sorry about the beginner's question, but I've written this code:
#include <stdio.h>

int main()
{
    int y = 's';
    printf("%c\n", y);
    return 0;
}
The compiler (Visual Studio 2012) does not warn me about possible data loss (like it does from int to float).
I didn't find an answer (or didn't search correctly) on Google.
I wonder if this is because an int occupies 4 bytes in memory and can therefore hold a 1-byte char.
I am not sure about this.
Thanks in advance.
Yes, that's fine. Characters are simply small integers, so of course the smaller value fits in the larger int variable, there's nothing to warn about.
Many standard C functions use int to transport single characters, since they then also get the possibility to express EOF (which is not a character).
A char is just an 8-bit integer.
An int is a larger integer (on MSVC 32-bit builds it should be 4 bytes).
's' corresponds to the ASCII code of the lower-case letter 's', i.e. it's the integer number 115.
So, your code is similar to:
int y = 115; // 's'
In C, all characters are stored in and dealt with as small integers, according to the execution character set (ASCII on most systems). This is what allows functions such as strcmp() to work.
Despite appearances there is no char anywhere in your example.
For historical reasons character constants are actually ints so the line
int y = 's';
is actually assigning one int to another.
Furthermore the %c format specifier in printf actually expects to receive an int argument, not a char. This is because the default argument promotions are applied to variadic arguments and therefore any char in a call to printf is promoted to an int before the function is called.
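A small demonstration of both points (note this is C; in C++, sizeof('s') would be sizeof(char) instead):

#include <stdio.h>

int main(void)
{
    printf("%zu %zu\n", sizeof('s'), sizeof(int));  /* equal in C: 's' is an int */

    char c = 's';
    printf("%c %d\n", c, c);  /* c is promoted to int when passed to printf */
    return 0;
}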

In C, when would type casting a char to an integer not return a value between 0 and 127?

A common interview question asks to write an algorithm that detects duplicates in a string.
Using a character array of length 128 to keep track of the characters already seen is a good way to solve this problem in linear time.
In C we would type something like
char seen_chars[128];
unsigned char c;
/* set seen_chars to all zeros, assign c */
seen_chars[c] = 1;
To mark character c as seen. Of course this relies on
(int) c
returning a value between 0 and 127.
I'm wondering when would this fail? What are the assumptions that make this code work correctly?
The code will fail (and cause undefined behavior) every time when the integer value of the given char c is not between 0 and 127 (inclusive).
C in no way limits the maximum range of char - you are only guaranteed that it is at least 8 bits wide (so at least 256 distinct values) - so in any given C implementation a valid char value can be outside the 0..127 boundary. On most desktop systems a char can hold values from -128 to 127, or from 0 to 255. However, as an example:
char aFunction(void);

char c = aFunction();
if ((int)c > 1000000000)
    printf("This could be true on some systems\n");
The following would be valid (although it may exhaust your stack on systems with large chars):
#include <limits.h>
_Bool seen[1<<CHAR_BIT] = {0};
seen[(unsigned char)c] = 1;
/* etc. */
On most implementations, an unsigned char has a value going from 0 to 255. Now, ASCII defines values from 0 to 127, but if your string contains characters from an "extended ASCII" character set (Latin1, for instance), then you might get character values above 127.
So, if your text is plain-ASCII American English, you're safe. Otherwise, you will overflow your buffer.
It works with all 7-bit ascii characters. Other characters, like german umlauts, would be translated to a negative number, with the appropriate consequences for your algorithm.
You're on the safe side if you use unsigned char and make the array 256 entries wide.
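Putting that advice together, a minimal sketch of the duplicate detector that is safe for any byte value (the 256-entry array and unsigned char indexing are the point; has_duplicate is a made-up name):

#include <stdio.h>

/* returns 1 if the string contains a duplicated byte, 0 otherwise */
int has_duplicate(const char *s)
{
    unsigned char seen[256] = {0};

    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (seen[c])
            return 1;
        seen[c] = 1;
    }
    return 0;
}

int main(void)
{
    printf("%d\n", has_duplicate("abcd"));  /* 0 */
    printf("%d\n", has_duplicate("abca"));  /* 1 */
    return 0;
}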
I'm wondering when would this fail?
Whenever you have any characters above 127.
Whenever you have a multi-byte encoding like UTF-8.

C Compatibility Between Integers and Characters

How does C handle converting between integers and characters? Say you've declared an integer variable and ask the user for a number but they input a string instead. What would happen?
The user input is treated as a string that needs to be converted to an int using atoi or another conversion function. atoi will return 0 if the string cannot be interpreted as a number because it contains letters or other non-numeric characters.
You can read a bit more at the atoi documentation on MSDN - http://msdn.microsoft.com/en-us/library/yd5xkb5c(VS.80).aspx
Uh?
You always input a string. Then you parse and convert this string to a number, with various ways (asking again, taking a default value, etc.) of handling various errors (overflow, incorrect chars, etc.).
Another thing to note is that in C, characters and integers are "compatible" to some degree. Any character can be assigned to an int. The reverse also works, but you'll lose information if the integer value doesn't fit into a char.
char foo = 'a';           // the ASCII value of lower-case 'a' is 97
int bar = foo;            // bar now contains the value 97
bar = 255;                // 255 is 0x000000ff in hexadecimal
foo = bar;                // foo now contains -1 (0xff) where char is signed
unsigned char foo2 = foo; // foo2 now contains 255 (0xff)
As other people have noted, the data is normally entered as a string; the only question is which function is used for doing the reading. If you're using a GUI, the function may already deal with conversion to integer and report errors in an appropriate manner. If you're working with standard C, it is generally easier to read the value into a string (perhaps with fgets()) and then convert. Although atoi() can be used, it is seldom the best choice; the trouble is that you cannot determine whether the conversion failed or succeeded but produced zero because the user entered a legitimate representation of zero.
Generally, use strtol() or one of its relatives (strtoul(), strtoll(), strtoull()); for converting floating point numbers, use strtod() or a similar function. The advantage of the integer conversion routines include:
optional base selection (for example, base 10, or base 16 - hex, or base 8 - octal, or base 0 to accept any of the above using standard C conventions: 007 for octal, 0x07 for hex, 7 for decimal).
optional error detection (by knowing where the conversion stopped) - see the sketch below.
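A minimal sketch of strtol() with both kinds of error detection (end-pointer check, plus errno for overflow):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char buf[64];
    if (fgets(buf, sizeof buf, stdin) == NULL)
        return 1;

    char *end;
    errno = 0;
    long value = strtol(buf, &end, 10);  /* base 10; base 0 would auto-detect 0x/0 prefixes */

    if (end == buf)
        printf("no digits found\n");
    else if (errno == ERANGE)
        printf("value out of range for long\n");
    else
        printf("parsed %ld (conversion stopped at \"%s\")\n", value, end);
    return 0;
}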
The place I go for many of these function specifications (when I don't look at my copy of the actual C standard) is the POSIX web site (which includes C99 functions). It is Unix-centric rather than Windows-centric.
The program would not actually crash: a scanf("%d") call would simply fail to convert and leave the variable unchanged. Read the input as a string and call a conversion function such as atoi (or better, strtol).
