C strcmp implementation using subtraction of characters

I saw this implementation of strcmp a while back, and I have a question for purely educational purposes. Why is it necessary to convert the inputs to 16-bit integers, do the math, and then convert back to 8 bits? What is wrong with doing the subtraction in 8 bits?
int8_t strcmp (const uint8_t* s1, const uint8_t* s2)
{
while ( *s1 && (*s1 == *s2) )
{
s1++;
s2++;
}
return (int8_t)( (int16_t)*s1 - (int16_t)*s2 );
}
Note: the code assumes 16 bit int type.
EDIT:
It was mentioned that C does conversion to int (suppose 32bit) by default. Is that the case even when the code explicitly states to cast to 16bit int ?

The strcmp(a,b) function is expected to return
<0 if string a < string b
>0 if string a > string b
0 if string a == string b
The test is actually made on the first char being different in the two strings at the same location (0, the string terminator, works as well).
Here, since the function takes two uint8_t (unsigned char) pointers, the developer was probably worried that subtracting two unsigned chars would give a number between 0 and 255, so that a negative value would never be returned. For instance, 118 - 236 should give -118, but in 8 unsigned bits it would give 138.
Thus the programmer decided to cast to int16_t, a signed 16-bit integer.
That could have worked and given the correct negative/positive values (provided that the function returned int16_t instead of int8_t).
(*edit: per the comment from @zwol below, the integer promotion is unavoidable, thus this int16_t cast is not necessary)
However, the final int8_t cast breaks the logic. Since the difference may range from -255 to 255, some of these values will have their sign reversed after the cast to int8_t.
For instance, 255 - 0 gives the positive value 255 (in 16 bits, the lower 8 bits are all 1 and the upper bits are 0), but in the int8_t world (signed 8-bit integer) this is negative: only the low 8 bits remain, binary 11111111, which is decimal -1.
Definitely not a good programming example.
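A minimal sketch of the sign flip described above. The names narrow_strcmp and narrow_strcmp_demo are made up for this illustration; the first is the question's function, renamed so it does not clash with the library strcmp. Converting 255 to int8_t is implementation-defined, so the negative result is what typical two's-complement compilers produce rather than a guarantee.

```c
#include <stdint.h>

/* The function from the question under a hypothetical name. */
int8_t narrow_strcmp(const uint8_t *s1, const uint8_t *s2)
{
    while (*s1 && (*s1 == *s2)) {
        s1++;
        s2++;
    }
    /* The difference lies in -255..255; the final cast keeps only 8 bits. */
    return (int8_t)((int16_t)*s1 - (int16_t)*s2);
}

/* Hypothetical demo helper: compares the one-byte string {0xFF} with the
   empty string. Mathematically the difference is 255 - 0 = 255, so the
   first string should compare as "greater". */
int narrow_strcmp_demo(void)
{
    const uint8_t a[] = { 0xFF, 0 };
    const uint8_t b[] = { 0 };
    return narrow_strcmp(a, b);
}
```

On such compilers narrow_strcmp_demo() comes out negative, so the "greater" string is reported as "smaller".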
That working function from Apple is better
for ( ; *s1 == *s2; s1++, s2++)
if (*s1 == '\0')
return 0;
return ((*(unsigned char *)s1 < *(unsigned char *)s2) ? -1 : +1);
(Linux does it in assembly code...)

Actually, the difference must be done in at least 16 bits¹ for the obvious reason that the range of the result is -255 to 255 and that does not fit in 8 bits. However, sfstewman is correct in noting that it would happen due to implicit integer promotion anyway.
The eventual cast to 8 bits is incorrect, because it can overflow as the range still does not fit in 8 bits. And anyway, strcmp is indeed supposed to return plain int.
¹ 9 would suffice, but bits normally come in batches of 8.
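As a rough sketch of what the points above lead to, here is a subtraction-based comparison that returns plain int. The name byte_strcmp is hypothetical, and the code assumes that int can hold every unsigned char value (UCHAR_MAX <= INT_MAX), which holds on common platforms.

```c
/* strcmp-like comparison returning int; hypothetical name, not the
   library function. Bytes are compared as unsigned char. */
int byte_strcmp(const char *s1, const char *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    while (*p1 && (*p1 == *p2)) {
        p1++;
        p2++;
    }
    /* Both operands are promoted to int before the subtraction, so the
       result (-255..255 for 8-bit bytes) is exact and keeps its sign. */
    return *p1 - *p2;
}
```

Because nothing is narrowed afterwards, a pair such as "\xff" and "" keeps its positive difference.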

Input data is unsigned 8-bit, so to avoid truncation and effects of overflow/underflow it should be converted to at least 9-bit signed, therefore int16 is used.

return (int8_t)( (int16_t)*s1 - (int16_t)*s2 );
This could mean one of these two options:
Either the programmer was confused about how implicit type promotions work in C. Both operands will be implicitly converted to int no matter the casts to int16_t. So if int is, for example, 32 bits, the code is nonsense. Or, if int is equivalent to int16_t on the specific system, then no conversion takes place at all.
Or the programmer is well aware of how type promotions work and is writing code that needs to conform to a standard that bans implicit type promotions, such as MISRA-C. In that case, and if int is 16 bits on the given system, the code makes perfect sense: it forces an explicit type promotion to dodge warnings from the compiler/static analyser.
I would guess that the second option is the most likely, and that this code is intended for a small microcontroller system.
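One way to see that the int16_t casts do not change the type in which the subtraction is carried out is to look at the size of the expression's type. This is only a sketch with hypothetical helper names; sizeof does not evaluate its operand, it only inspects the type.

```c
#include <stddef.h>
#include <stdint.h>

/* Size of the result type when both operands are cast to int16_t first. */
size_t diff_size_with_casts(void)
{
    uint8_t a = 0, b = 0;
    return sizeof((int16_t)a - (int16_t)b);
}

/* Size of the result type without any casts. */
size_t diff_size_without_casts(void)
{
    uint8_t a = 0, b = 0;
    return sizeof(a - b);
}
```

Both helpers return sizeof(int): with or without the casts, the operands are promoted to int before the subtraction.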

There are certain values that would cause the difference between the two numbers to be different if the int16_t weren't there due to overflow. In an int8_t your range is -128 to 127, in a uint8_t your range is 0 to 255, and in a int16_t your range would be -32,768 to 32,767.
Casting to an int8_t from a uint8_t will cause values over 127 to change sign due to overflow, so the int16_t cast keeps that from happening. However, the return type should also be int16_t, because a result such as 255 - 0 would otherwise be truncated on return.

Related

What does this piece of code print infinite number of times?

I would like to know why this piece of code prints char c infinitely:
char c;
for (c = 0; c< 256; c++)
printf("Char c=%c\n", c);
Thanks

Assuming CHAR_BIT is 8 (which it almost always is), a char object cannot represent a value of 256; signed char will max out at 127, while unsigned char will max out at 255.
Signed case
When c is 127 and you add 1, the value "overflows". Because one's-complement and sign-magnitude representations are still a thing on some oddball architectures, the exact result can vary. For two's-complement, it will "wrap around" to -128, whereas for one's-complement it wraps around to -127, and for sign-magnitude it becomes -0.
All of these values are less than 256, so the condition c < 256 is always true.
Because the result can vary based on the platform, the C language definition does not guarantee any particular value here. Strictly speaking, c++ on a char computes c + 1 in int (after integer promotion) and converts the result back to char, so this is an out-of-range conversion with an implementation-defined result (C99 6.3.1.3p3) rather than undefined signed overflow, which is what you would get with an int at INT_MAX. A reasonably smart compiler might be able to detect that the condition will always evaluate to true and issue a warning, and it would be free to do that as far as the language definition is concerned. Or not.
Unsigned case
Unlike the signed case, unsigned integer overflow is well-defined; if c is 255 and you add 1, then the result wraps around back to 0. Again, 0 is less than 256, so c < 256 will always be true.
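A possible fix, sketched under the assumption that the goal is simply to visit the values 0 to 255 once: use an int as the loop counter, since int is guaranteed to be able to hold 256. The helper names are hypothetical, and the second helper assumes nothing beyond <limits.h>.

```c
#include <limits.h>

/* Counts the iterations of the loop when the counter is an int. The
   original loop body printed the value; here it is only counted. */
int count_char_values(void)
{
    int count = 0;
    for (int c = 0; c < 256; c++) {
        count++;
    }
    return count;
}

/* 1 if no char value can ever reach 256, which is why the condition
   c < 256 never becomes false for a char counter (true when CHAR_BIT is 8). */
int char_always_below_256(void)
{
    return CHAR_MAX < 256;
}
```

With the int counter the loop stops after exactly 256 iterations; the value can still be converted to char inside the body when it is printed.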

Because the char data type is only one byte long, and therefore can only hold the values 0-255 (or -128 to 127 if char is signed on your platform).
So when c is 255 and you do c++, it becomes 0. Thus, c is always < 256 in the loop.

Char multiplication in C

I have a code like this:
#include <stdio.h>
int main()
{
char a=20,b=30;
char c=a*b;
printf("%c\n",c);
return 0;
}
The output of this program is X .
How is this output possible if a*b=600, which overflows since char values lie between -128 and 127?

Whether char is signed or unsigned is implementation defined. Either way, it is an integer type.
Anyway, the multiplication is done as int due to integer promotions and the result is converted to char.
If the value does not fit into the "smaller" type, it is implementation-defined for a signed char how this is done. By far most (if not all) implementations simply cut off the upper bits.
For an unsigned char, the standard actually requires (briefly) cutting off the upper bits.
So:
(int)20 * (int)30 -> (int)600 -> (char)(600 % 256) -> 88 == 'X'
(Assuming 8 bit char).
See the link and its surrounding paragraphs for more details.
Note: If you enable compiler warnings (as always recommended), you should get a truncation warning for the assignment. This can be avoided by an explicit cast (only if you are really sure about all implications). The gcc option is -Wconversion.
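The chain above can be checked with a small sketch. It uses unsigned char so that the narrowing is fully defined by the standard (reduction modulo UCHAR_MAX + 1); the function name is made up, and the expected values assume an 8-bit char and an ASCII execution character set.

```c
/* Multiplies in int (after integer promotion), then narrows the product.
   The explicit cast also silences a -Wconversion style warning. */
unsigned char narrow_product(unsigned char a, unsigned char b)
{
    return (unsigned char)(a * b);
}
```

Under those assumptions, narrow_product(20, 30) yields 88, the code of 'X'.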

First off, the behavior is implementation-defined here. A char may be either unsigned char or signed char, so it may be able to hold 0 to 255 or -128 to 127, assuming CHAR_BIT == 8.
600 in decimal is 0x258. What happens is the least significant eight bits are stored, the value is 0x58 a.k.a. X in ASCII.

I initially thought this code would cause undefined behavior if char is signed, because overflow of a signed integer is undefined behavior. However, conversion to a smaller type is implementation-defined.
Quote from N1256 6.3.1.3 Signed and unsigned integers:
3 Otherwise, the new type is signed and the value cannot be represented in it; either the
result is implementation-defined or an implementation-defined signal is raised.
If the value is simply truncated to 8 bits, (20 * 30) & 0xff == 0x58 and 0x58 is the ASCII code for X. So, if your system does this and uses ASCII, the output will be X.

First, looks like you have unsigned char with a range from 0 to 255.
You're right about the overflow.
600 - 256 - 256 = 88
This is just an ASCII code of 'X'.

Is arithmetic overflow equivalent to modulo operation?

I need to do modulo 256 arithmetic in C. So can I simply do
unsigned char i;
i++;
instead of
int i;
i=(i+1)%256;

No. There is nothing that guarantees that unsigned char has eight bits. Use uint8_t from <stdint.h>, and you'll be perfectly fine. This requires an implementation which supports stdint.h: any C99 compliant compiler does, but older compilers may not provide it.
Note: unsigned arithmetic never overflows, and behaves as "modulo 2^n". Signed arithmetic overflows with undefined behavior.

Yes, the behavior of both of your examples is the same. See C99 6.2.5 §9 :
A computation involving unsigned operands can never overflow,
because a result that cannot be represented by the resulting unsigned integer type is
reduced modulo the number that is one greater than the largest value that can be
represented by the resulting type.
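A short sketch comparing the two approaches from the question over the whole range 0..255. The helper names are hypothetical, and uint8_t is used so that the width is exactly 8 bits.

```c
#include <stdint.h>

/* Increment relying on the wrap-around of an 8-bit unsigned type. */
uint8_t next_wrapping(uint8_t i)
{
    i++;    /* i + 1 is computed in int, then reduced modulo 256 on store */
    return i;
}

/* Increment with an explicit modulo on an int. */
int next_modulo(int i)
{
    return (i + 1) % 256;
}

/* Returns 1 if both approaches agree for every value 0..255. */
int wrapping_matches_modulo(void)
{
    for (int v = 0; v < 256; v++) {
        if (next_wrapping((uint8_t)v) != next_modulo(v)) {
            return 0;
        }
    }
    return 1;
}
```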

unsigned char c = UCHAR_MAX;
c++;
Basically yes, there is no overflow, but not because c is of an unsigned type. There is a hidden promotion of c to int here and an integer conversion from int to unsigned char and it is perfectly defined.
For example,
signed char c = SCHAR_MAX;
c++;
is also not undefined behavior, because it is actually equivalent to:
c = (int) c + 1;
and the conversion from int to signed char is implementation-defined here (see C99, 6.3.1.3p3 on integer conversions). For simplicity, CHAR_BIT == 8 is assumed.
For more information on the example above, I suggest reading this post:
"The Little C Function From Hell"
http://blog.regehr.org/archives/482

Very probably yes, but the reasons for it in this case are actually fairly complicated.
unsigned char i = 255;
i++;
The i++ is equivalent to i = i + 1.
(Well, almost. i++ yields the value of i before it was incremented, so it's really equivalent to (tmp=i; i = i + 1; tmp). But since the result is discarded in this case, that doesn't raise any additional issues.)
Since unsigned char is a narrow type, an unsigned char operand to the + operator is promoted to int (assuming int can hold all possible values in the range of unsigned char). So if i == 255, and UCHAR_MAX == 255, then the result of the addition is 256, and is of type (signed) int.
The assignment implicitly converts the value 256 from int back to unsigned char. Conversion to an unsigned type is well defined; the result is reduced modulo MAX+1, where MAX is the maximum value of the target unsigned type.
If i were declared as an unsigned int:
unsigned int i = UINT_MAX;
i++;
there would be no type conversion, but the semantics of the + operator for unsigned types also specify reduction modulo MAX+1.
Keep in mind that the value assigned to i is mathematically equivalent to (i+1) % (UCHAR_MAX+1). UCHAR_MAX is usually 255, and is guaranteed to be at least 255, but it can legally be bigger.
There could be an exotic system on which UCHAR_MAX is too big to be stored in a signed int object. This would require UCHAR_MAX > INT_MAX, which means the system would have to have at least 16-bit bytes. On such a system, the promotion would be from unsigned char to unsigned int. The final result would be the same. You're not likely to encounter such a system. I think there are C implementations for some DSPs that have bytes bigger than 8 bits. The number of bits in a byte is specified by CHAR_BIT, defined in <limits.h>.
CHAR_BIT > 8 does not necessarily imply UCHAR_MAX > INT_MAX. For example, you could have CHAR_BIT == 16 and sizeof (int) == 2 (i.e., 16-bit bytes and 32-bit ints).

There's another alternative that hasn't been mentioned, if you don't want to use another data type.
unsigned int i;
// ...
i = (i+1) & 0xFF; // 0xFF == 255
This works because the modulo element == 2^n, meaning the range will be [0, 2^n-1] and thus a bitmask will easily keep the value within your desired range. It's possible this method would not be much or any less efficient than the unsigned char/uint8_t version, either, depending on what magic your compiler does behind the scenes and how the targeted system handles non-word loads (for example, some RISC architectures require additional operations to load non-word-size values). This also assumes that your compiler won't detect the usage of power-of-two modulo arithmetic on unsigned values and substitute a bitmask for you, of course, as in cases like that the modulo usage would have greater semantic value (though using that as the basis for your decision is not exactly portable, of course).
An advantage of this method is that you can use it for powers of two that are not also the size of a data type, e.g.
i = (i+1) & 0x1FF; // i %= 512
i = (i+1) & 0x3FF; // i %= 1024
// etc.
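The masking idea can be wrapped in a small helper, sketched here with a made-up name. It assumes the modulus passed in is a power of two; the helper does not check this.

```c
/* Advances a counter modulo a power of two using a bitmask.
   `modulus` must be a power of two (2, 4, ..., 256, 512, ...). */
unsigned int next_masked(unsigned int i, unsigned int modulus)
{
    return (i + 1u) & (modulus - 1u);
}
```

For example, next_masked(255u, 256u) and next_masked(511u, 512u) both wrap to 0, matching (i + 1) % modulus.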

This should work fine because it should just overflow back to 0. As was pointed out in a comment on a different answer, you should only do this when the value is unsigned, as you may get undefined behavior with a signed value.
It is probably best to leave this using modulo, however, because the code will be better understood by other people maintaining the code, and a smart compiler may be doing this optimization anyway, which may make it pointless in the first place. Besides, the performance difference will probably be so small that it wouldn't matter in the first place.

It will work if the number of bits used to represent the number equals the number of bits in the (unsigned) binary representation of the divisor, minus one.
In this case the divisor 256 is 100000000 in binary (9 bits), so 9 - 1 = 8 bits (a char).

Type conversion - unsigned to signed int/char

I tried to execute the program below:
#include <stdio.h>
int main() {
signed char a = -5;
unsigned char b = -5;
int c = -5;
unsigned int d = -5;
if (a == b)
printf("\r\n char is SAME!!!");
else
printf("\r\n char is DIFF!!!");
if (c == d)
printf("\r\n int is SAME!!!");
else
printf("\r\n int is DIFF!!!");
return 0;
}
For this program, I am getting the output:
char is DIFF!!!
int is SAME!!!
Why are we getting different outputs for both?
Should the output be as below ?
char is SAME!!!
int is SAME!!!

This is because of the various implicit type conversion rules in C. There are two of them that a C programmer must know: the usual arithmetic conversions and the integer promotions (the latter are part of the former).
In the char case you have the types (signed char) == (unsigned char). These are both small integer types. Other such small integer types are bool and short. The integer promotion rules state that whenever a small integer type is an operand of an operation, its type will get promoted to int, which is signed. This will happen no matter if the type was signed or unsigned.
In the case of the signed char, the sign will be preserved and it will be promoted to an int containing the value -5. In the case of the unsigned char, it contains a value which is 251 (0xFB ). It will be promoted to an int containing that same value. You end up with
if( (int)-5 == (int)251 )
In the integer case you have the types (signed int) == (unsigned int). They are not small integer types, so the integer promotions do not apply. Instead, they are balanced by the usual arithmetic conversions, which state that if two operands have the same "rank" (size) but different signedness, the signed operand is converted to the same type as the unsigned one. You end up with
if( (unsigned int)-5 == (unsigned int)-5)
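The two cases can be reproduced with a pair of small helpers. This is only a sketch with hypothetical names; it assumes that int can represent every unsigned char value, so that both char operands are promoted to int.

```c
/* 1 if a signed char and an unsigned char, both set from -5, compare equal. */
int chars_equal(void)
{
    signed char a = -5;
    unsigned char b = (unsigned char)-5;  /* UCHAR_MAX - 4, i.e. 251 for 8-bit char */
    return a == b;                        /* both promoted to int: -5 == 251 */
}

/* 1 if an int and an unsigned int, both set from -5, compare equal. */
int ints_equal(void)
{
    int c = -5;
    unsigned int d = (unsigned int)-5;    /* UINT_MAX - 4 */
    /* The cast spells out what the usual arithmetic conversions do anyway. */
    return (unsigned int)c == d;
}
```

chars_equal() returns 0 and ints_equal() returns 1, matching the "DIFF" and "SAME" output in the question.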

Cool question!
The int comparison works, because both ints contain exactly the same bits, so they are essentially the same. But what about the chars?
Ah, C implicitly promotes chars to ints on various occasions. This is one of them. Your code says if(a==b), but what the compiler actually turns that to is:
if((int)a==(int)b)
(int)a is -5, but (int)b is 251. Those are definitely not the same.
EDIT: As @Carbonic-Acid pointed out, (int)b is 251 only if a char is 8 bits long. If a char were 16 bits long (with a 32-bit int), for example, (int)b would be 65531.
REDIT: There's a whole bunch of comments discussing the nature of the answer if a byte is not 8 bits long. The only difference in this case is that (int)b is not 251 but a different positive number, which isn't -5. This is not really relevant to the question which is still very cool.

Welcome to integer promotion. If I may quote from the website:
If an int can represent all values of the original type, the value is
converted to an int; otherwise, it is converted to an unsigned int.
These are called the integer promotions. All other types are unchanged
by the integer promotions.
C can be really confusing when you do comparisons such as these, I recently puzzled some of my non-C programming friends with the following tease:
#include <stdio.h>
#include <string.h>
int main()
{
const char* string = "One looooooooooong string";
printf("%zu\n", strlen(string)); /* strlen returns size_t, so use %zu */
if (strlen(string) < -1) printf("This cannot be happening :(");
return 0;
}
Which indeed does print This cannot be happening :( and seemingly demonstrates that 25 is smaller than -1!
What happens underneath, however, is that -1 is converted to the unsigned type returned by strlen (size_t), which yields the largest value of that type: 4294967295 on a system with a 32-bit size_t. And naturally 25 is smaller than 4294967295.
If we however explicitly cast the size_t type returned by strlen as a signed integer:
if ((int)(strlen(string)) < -1)
Then it will compare 25 against -1 and all will be well with the world.
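Both variants of the comparison can be written out explicitly. The sketch below uses hypothetical helper names, spells out the conversion of -1 to size_t instead of leaving it implicit, and assumes the string length fits in a long for the signed variant.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* What `strlen(s) < -1` effectively compares on typical platforms:
   -1 is converted to size_t and becomes SIZE_MAX. */
int length_below_minus_one_unsigned(const char *s)
{
    return strlen(s) < (size_t)-1;
}

/* The same comparison done in a signed type. */
int length_below_minus_one_signed(const char *s)
{
    return (long)strlen(s) < -1;
}
```

For the 25-character string from the example, the unsigned variant is true and the signed variant is false.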
A good compiler should warn you about the comparison between an unsigned and signed integer and yet it is still so easy to miss (especially if you don't enable warnings).
This is especially confusing for Java programmers, as all primitive integer types there (apart from char) are signed. Here's what James Gosling (one of the creators of Java) had to say on the subject:
Gosling: For me as a language designer, which I don't really count
myself as these days, what "simple" really ended up meaning was could
I expect J. Random Developer to hold the spec in his head. That
definition says that, for instance, Java isn't -- and in fact a lot of
these languages end up with a lot of corner cases, things that nobody
really understands. Quiz any C developer about unsigned, and pretty
soon you discover that almost no C developers actually understand what
goes on with unsigned, what unsigned arithmetic is. Things like that
made C complex. The language part of Java is, I think, pretty simple.
The libraries you have to look up.

The hex representation of -5 is:
8-bit, two's complement signed char: 0xfb
32-bit, two's complement signed int: 0xfffffffb
When you convert a signed number to an unsigned number of the same width, or vice versa, a two's-complement compiler does ... precisely nothing to the bits. What is there to do? Converting to unsigned is defined by the standard as reduction modulo 2^N, which on two's complement leaves the bit pattern untouched. Converting an out-of-range value to signed is implementation-defined (C99 6.3.1.3), and the most efficient implementation-defined behaviour is to do nothing.
So, the hex representation of (unsigned <type>)-5 is:
8-bit, unsigned char: 0xfb
32-bit, unsigned int: 0xfffffffb
Look familiar? They're bit-for-bit the same as the signed versions.
When you write if (a == b), where a and b are of type char, what the compiler is actually required to read is if ((int)a == (int)b). (This is that "integer promotion" that everyone else is banging on about.)
So, what happens when we convert char to int?
8-bit signed char to 32-bit signed int: 0xfb -> 0xfffffffb
Well, that makes sense because it matches the representations of -5 above!
It's called a "sign-extend", because it copies the top bit of the byte, the "sign-bit", leftwards into the new, wider value.
8-bit unsigned char to 32-bit signed int: 0xfb -> 0x000000fb
This time it does a "zero-extend" because the source type is unsigned, so there is no sign-bit to copy.
So, a == b really does 0xfffffffb == 0x000000fb => no match!
And, c == d really does 0xfffffffb == 0xfffffffb => match!
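The two widening paths can be observed directly. In this sketch the function names are made up, and the expected values assume an 8-bit char and a two's-complement machine; the final cast to unsigned int is only there to make the resulting bit pattern easy to compare.

```c
/* signed char -> int keeps the value -5, which on two's complement
   shows up as a sign-extended bit pattern. */
unsigned int widen_signed(signed char v)
{
    return (unsigned int)(int)v;
}

/* unsigned char -> int keeps the value 251, i.e. the upper bits are zero. */
unsigned int widen_unsigned(unsigned char v)
{
    return (unsigned int)(int)v;
}
```

With a 32-bit int, widen_signed(-5) is 0xfffffffb while widen_unsigned((unsigned char)-5) is 0x000000fb.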

My point is: didn't you get a warning at compile time "comparing signed and unsigned expression"?
The compiler is trying to inform you that it is entitled to do crazy stuff! :) I would add that crazy stuff will happen with big values, close to the capacity of the primitive type. And
unsigned int d = -5;
definitely assigns a big value to d; by the rules for conversion to an unsigned type, it is equivalent to:
unsigned int d = UINT_MAX - 4; // since (unsigned int)-1 is UINT_MAX
Edit:
However, it is interesting to notice that only the second comparison gives a warning (check the code). So it means that the compiler, applying the conversion rules, is confident that there won't be errors in the comparison between unsigned char and char (during comparison they will be converted to a type that can safely represent all their possible values). And it is right on this point. Then, it informs you that this won't be the case for unsigned int and int: during the comparison one of the two will be converted to a type that cannot fully represent it.
For completeness, I checked it also for short: the compiler behaves in the same way as for chars, and, as expected, there are no errors at runtime.
Related to this topic, I recently asked this question (yet, C++ oriented).

What does it mean for a char to be signed?

Given that signed and unsigned ints use the same registers, etc., and just interpret bit patterns differently, and C chars are basically just 8-bit ints, what's the difference between signed and unsigned chars in C? I understand that the signedness of char is implementation defined, and I simply can't understand how it could ever make a difference, at least when char is used to hold strings instead of to do math.

It won't make a difference for strings. But in C you can use a char to do math, when it will make a difference.
In fact, when working in constrained memory environments, like embedded 8 bit applications a char will often be used to do math, and then it makes a big difference. This is because there is no byte type by default in C.

In terms of the values they represent:
unsigned char:
spans the value range 0..255 (00000000..11111111)
values overflow around low edge as:
0 - 1 = 255 (00000000 - 00000001 = 11111111)
values overflow around high edge as:
255 + 1 = 0 (11111111 + 00000001 = 00000000)
bitwise right shift operator (>>) does a logical shift:
10000000 >> 1 = 01000000 (128 / 2 = 64)
signed char:
spans the value range -128..127 (10000000..01111111)
values overflow around low edge as:
-128 - 1 = 127 (10000000 - 00000001 = 01111111)
values overflow around high edge as:
127 + 1 = -128 (01111111 + 00000001 = 10000000)
bitwise right shift operator (>>) does an arithmetic shift:
10000000 >> 1 = 11000000 (-128 / 2 = -64)
I included the binary representations to show that the value wrapping behaviour is pure, consistent binary arithmetic and has nothing to do with a char being signed/unsigned (except for right shifts).
Update
Some implementation-specific behaviour mentioned in the comments:
char != signed char. The type "char" without "signed" or "unsigned" is implementation-defined, which means that it can act like a signed or an unsigned type.
Signed integer overflow leads to undefined behavior where a program can do anything, including dumping core or overrunning a buffer.

#include <stdio.h>
int main(int argc, char** argv)
{
char a = 'A';
char b = 0xFF;
signed char sa = 'A';
signed char sb = 0xFF;
unsigned char ua = 'A';
unsigned char ub = 0xFF;
printf("a > b: %s\n", a > b ? "true" : "false");
printf("sa > sb: %s\n", sa > sb ? "true" : "false");
printf("ua > ub: %s\n", ua > ub ? "true" : "false");
return 0;
}
[root]# ./a.out
a > b: true
sa > sb: true
ua > ub: false
It's important when sorting strings.
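For sorting, one way to get the same order regardless of whether plain char is signed is to compare the characters as unsigned char, which is also how the standard strcmp and memcmp interpret their bytes. A minimal sketch with a made-up function name:

```c
/* Compares two characters by their unsigned char values.
   Returns a negative, zero, or positive value like strcmp does. */
int compare_as_unsigned(char x, char y)
{
    unsigned char ux = (unsigned char)x;
    unsigned char uy = (unsigned char)y;
    return (ux > uy) - (ux < uy);
}
```

With this helper 'A' sorts before the byte 0xFF on every platform, matching the "ua > ub: false" line above.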

There are a couple of differences. Most importantly, if you overflow the valid range of a char by assigning it an integer that is too big or too small, and char is signed, the resulting value is implementation-defined or even a signal (in C) could be raised, as for all signed types. Contrast that with the case where you assign something too big or too small to an unsigned char: the value wraps around, and you get precisely defined semantics. For example, assigning -1 to an unsigned char gives you UCHAR_MAX. So whenever you have a byte, as in a number from 0 to 2^CHAR_BIT - 1, you should really use unsigned char to store it.
The sign also makes a difference when passing to vararg functions:
char c = getSomeCharacter(); // returns 0..255
printf("%d\n", c);
Assume the value assigned to c is too big for char to represent, and the machine uses two's complement. When you assign a too-big value to a char, many implementations simply leave the bit pattern unchanged. If an int can represent all values of char (which it can on most implementations), then the char is promoted to int before being passed to printf. So if char is signed, the value passed is negative, promoting to int retains that sign, and you get a negative result. However, if char is unsigned, then the value is non-negative, and promoting it to int yields a positive int. If you use unsigned char, you get precisely defined behavior both for the assignment to the variable and for passing it to printf, which will then print something positive.
Note that char, unsigned char and signed char are all at least 8 bits wide. There is no requirement that char is exactly 8 bits wide. That is true for most systems, but on some you will find 32-bit chars. A byte in C and C++ is defined to have the size of char, so a byte in C is also not always exactly 8 bits.
Another difference is that in C, an unsigned char must have no padding bits. That is, if you find CHAR_BIT is 8, then an unsigned char's values must range from 0 .. 2^CHAR_BIT-1. The same is true for char if it's unsigned. For signed char, you can't assume anything about the range of values: even if you know how your compiler implements the sign (two's complement or the other options), there may be unused padding bits in it. In C++, there are no padding bits in any of the three character types.

"What does it mean for a char to be signed?"
Traditionally, the ASCII character set consists of 7-bit character encodings. (As opposed to the 8-bit EBCDIC.)
When the C language was designed and implemented this was a significant issue. (For various reasons like data transmission over serial modem devices.) The extra bit has uses like parity.
A "signed character" happens to be perfect for this representation.
Binary data, OTOH, is simply taking the value of each 8-bit "chunk" of data, thus no sign is needed.

Arithmetic on bytes is important for computer graphics (where 8-bit values are often used to store colors). Aside from that, I can think of two main cases where char sign matters:
converting to a larger int
comparison functions
The nasty thing is, these won't bite you if all your string data is 7-bit. However, it promises to be an unending source of obscure bugs if you're trying to make your C/C++ program 8-bit clean.

Signedness works pretty much the same way in chars as it does in other integral types. As you've noted, chars are really just one-byte integers. (Not necessarily 8-bit, though! There's a difference; a byte might be bigger than 8 bits on some platforms, and chars are rather tied to bytes due to the definitions of char and sizeof(char). The CHAR_BIT macro, defined in <limits.h> or C++'s <climits>, will tell you how many bits are in a char.)
As for why you'd want a character with a sign: in C and C++, there is no standard type called byte. To the compiler, chars are bytes and vice versa, and it doesn't distinguish between them. Sometimes, though, you want to -- sometimes you want that char to be a one-byte number, and in those cases (particularly how small a range a byte can have), you also typically care whether the number is signed or not. I've personally used signedness (or unsignedness) to say that a certain char is a (numeric) "byte" rather than a character, and that it's going to be used numerically. Without a specified signedness, that char really is a character, and is intended to be used as text.
I used to do that, rather. Now the newer versions of C and C++ have (u?)int_least8_t (currently typedef'd in <stdint.h> or <cstdint>), which are more explicitly numeric (though they'll typically just be typedefs for signed and unsigned char types anyway).

The only situation I can imagine this being an issue is if you choose to do math on chars. It's perfectly legal to write the following code.
char a = (char)42;
char b = (char)120;
char c = a + b;
Depending on the signedness of char, c could be one of two values. If chars are unsigned, then c will be (char)162. If they are signed, then it is an overflow case, as the max value for a signed char is 127 (for an 8-bit char). I'm guessing most implementations would just return (char)-94.
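To make the two outcomes concrete without depending on the signedness of plain char, the sketch below uses unsigned char and signed char explicitly. The function names are hypothetical. The unsigned result is fixed by the standard; the signed result for an out-of-range sum is implementation-defined, and the value mentioned afterwards is what common two's-complement compilers produce for an 8-bit char.

```c
/* The sum is computed in int (42 + 120 = 162), then narrowed modulo 256. */
unsigned char add_unsigned(unsigned char a, unsigned char b)
{
    return (unsigned char)(a + b);
}

/* The sum is computed in int; narrowing an out-of-range value to signed
   char is implementation-defined (typically the low 8 bits are kept). */
signed char add_signed(signed char a, signed char b)
{
    return (signed char)(a + b);
}
```

add_unsigned(42, 120) is 162, while add_signed(42, 120) typically comes out as 162 - 256 = -94.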

One thing about signed chars is that you can test c >= ' ' (space) and be sure it's a normal printable ascii char. Of course, it's not portable, so not very useful.
