Why does while (*s++ == *t++) not work to compare two strings? - c

I'm implementing strcmp(char *s, char *t), which returns <0 if s<t, 0 if s==t, and >0 if s>t, by comparing the first character that differs between the two strings.
Implementing it with the postfix increments separated from the equality comparison works:
for (; *s==*t; s++, t++)
if (*s=='\0')
return 0;
return *s - *t;
However, combining the postfix increments with the equality comparison doesn't work:
while (*s++ == *t++)
if (*s=='\0')
return 0;
return *s - *t;
The latter always returns 0. I thought this could be because we're incrementing the pointers too soon, but even when the two strings differ at index 5 out of 10, it still produces the same result.
Example input:
strcomp("hello world", "hello xorld");
return value:
0
My hunch is that this is because of operator precedence, but I'm not positive, and if so, I cannot pinpoint exactly why.
Thank you for your time!

Because in the for loop, the increment (s++, t++ in your case) is not executed if the condition (*s==*t in your case) is false. In your while loop, the increment happens in that case too, so for strcomp("hello world", "hello xorld"), both pointers end up one position past the mismatch, pointing at the 'o' in each string, and *s - *t is 'o' - 'o', which is 0.
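To make the difference concrete, here is a minimal sketch that reports how far s has advanced when each loop form stops. The helper names stop_offset_for and stop_offset_while are hypothetical and not part of the question.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of s when the for-loop form stops: the increment is skipped
   once the condition *s == *t is false. */
size_t stop_offset_for(const char *s, const char *t)
{
    const char *start = s;
    assert(s != NULL && t != NULL);
    for (; *s == *t; s++, t++)
        if (*s == '\0')
            break;
    return (size_t)(s - start);
}

/* Offset of s when the while-loop form stops: the increment is part of
   the condition, so it also happens on the failing comparison. */
size_t stop_offset_while(const char *s, const char *t)
{
    const char *start = s;
    assert(s != NULL && t != NULL);
    while (*s++ == *t++)
        if (s[-1] == '\0')
            break;
    return (size_t)(s - start);
}
```

For "hello world" and "hello xorld", the for form stops at offset 6 (the mismatching 'w' and 'x'), while the while form stops at offset 7, one past the mismatch.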

Since you always increment s and t in the test, you should refer to s[-1] for the termination in case of equal strings and s[-1] and t[-1] in case they differ.
Also note that the order is determined by the comparison as unsigned char.
Here is a modified version:
int strcmp(const char *s, const char *t) {
while (*s++ == *t++) {
if (s[-1] == '\0')
return 0;
}
return (unsigned char)s[-1] - (unsigned char)t[-1];
}
Following the comments from LL chux, here is a fully conforming implementation for perverse architectures with non two's complement representation and/or CHAR_MAX > INT_MAX:
int strcmp(const char *s0, const char *t0) {
const unsigned char *s = (const unsigned char *)s0;
const unsigned char *t = (const unsigned char *)t0;
while (*s++ == *t++) {
if (s[-1] == '\0')
return 0;
}
return (s[-1] > t[-1]) - (s[-1] < t[-1]);
}

Everyone is giving the right advice, but the answers are still hardwired to inlining the increment operators within the comparison expression and doing weird off-by-one adjustments.
The following just feels simpler and easier to read. No pointer is ever incremented or decremented to an invalid address.
while ((*s == *t) && *s)
{
s++;
t++;
}
return *s - *t;

For completeness in addition to what was already well answered about the wrong offset during subtraction:
*s - *t is incorrect when *s or *t is negative.
The standard C library specifies that string functions compare as if char was unsigned char. Thus code that subtracts via a char * gives the wrong answer when the characters are negative.
For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value).
C17dr § 7.24.1 3
int strcmp(const char *s, const char *t) {
const unsigned char *us = (const unsigned char *) s;
const unsigned char *ut = (const unsigned char *) t;
while (*us == *ut && *us) {
us++;
ut++;
}
return (*us > *ut) - (*us < *ut);
}
This code also addresses obscure concerns of non-2's complement access of -0 and char range exceeding int.
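As a rough illustration of the point about negative characters, the following sketch compares the same two strings once through signed char (which is what *s - *t does on platforms where char is signed) and once through unsigned char. The function names are hypothetical, and the signed result assumes a two's complement representation.

```c
#include <assert.h>
#include <stddef.h>

/* Subtracts the differing bytes as signed char values. */
int diff_as_signed(const char *s, const char *t)
{
    const signed char *a = (const signed char *)s;
    const signed char *b = (const signed char *)t;
    assert(s != NULL && t != NULL);
    while (*a == *b && *a != 0) {
        a++;
        b++;
    }
    return *a - *b;
}

/* Compares the differing bytes as unsigned char, as the standard requires. */
int diff_as_unsigned(const char *s, const char *t)
{
    const unsigned char *a = (const unsigned char *)s;
    const unsigned char *b = (const unsigned char *)t;
    assert(s != NULL && t != NULL);
    while (*a == *b && *a != 0) {
        a++;
        b++;
    }
    return (*a > *b) - (*a < *b);
}
```

With the byte 0x80 compared against 'a', the signed version reports "less than" while the unsigned version reports "greater than", which is the ordering strcmp is specified to produce.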

Related

How come my function to calculate the length of a string returns a negative value?

I've written a function using pointer arithmetic to calculate the length of a string but it only seems to work properly by using a hackish methodology.
I've tried using my understanding of memory addressing to make the function work as intended.
int getLength(const char *str) {
int length;
while (*str != '\0') {
length += str - (++str);
}
return abs(length);
}
int getLength(const char *str) {
int length;
while (*str != '\0') {
length += str + (++str);
}
return length;
}
The first function returns the correct length, but the second one returns 0, why is this?
Both functions are incorrect because:
you do not initialize length, so the behavior is undefined.
taking the absolute value is a lame attempt at fixing the problem... correcting the symptoms, but not addressing the problem. Don't do this, investigate the issue.
length += str - (++str); has undefined behavior because the side effect on str may happen before or after taking the value of the left operand str.
length += str + (++str); is a constraint violation: adding 2 pointers is not allowed in C.
You should instead write:
size_t getLength(const char *str) {
size_t length = 0;
while (*str != '\0') {
length++;
str++;
}
return length;
}
Depending on the target architecture, it may be more efficient to only increment str and compute the difference at the end:
size_t getLength(const char *str) {
const char *p = str;
while (*p++ != '\0')
continue;
/* p was incremented beyond the null terminator, hence decrease the difference by 1 */
return p - str - 1;
}
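If the intent of the original expression was to add the distance between two successive pointer values, one possible sketch is to give each step its own statement so that no operand is read and modified within the same expression. The name getLengthSequenced is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

size_t getLengthSequenced(const char *str)
{
    size_t length = 0;
    assert(str != NULL);
    while (*str != '\0') {
        const char *next = str + 1;      /* compute the new pointer first */
        length += (size_t)(next - str);  /* both operands are plain reads here */
        str = next;                      /* update str in a separate statement */
    }
    return length;
}
```

This is only meant to show the sequencing; the simpler length++ version above does the same work.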
Your functions have a lot of issues:
You do not initialize the automatic variables.
Your pointer arithmetic does not make much sense (at least I can't understand the logic behind it).
The result of an expression that both reads an lvalue and applies a pre- or post-increment or decrement to it, with no sequencing between the two, is undefined (in short, Undefined Behaviour).
Below are two almost identical versions of the strlen function. Sometimes small changes can have a large impact on a function's performance, depending on the target hardware.
This version is more optimal for ARM targets:
size_t getlen(const char *s)
{
const char *p = s;
while(*p++);
return p - s - 1;
}
This one is better for x86 targets:
size_t getlen(const char *s)
{
const char *p = s;
while(*p)
{
p++;
}
return p - s;
}

Creating a simplified version of strchr() [duplicate]

Trying to create a simple function that looks for a single char in a string, like strchr() would, I did the following:
char* findchar(char* str, char c)
{
char* position = NULL;
int i = 0;
for(i = 0; str[i]!='\0';i++)
{
if(str[i] == c)
{
position = &str[i];
break;
}
}
return position;
}
So far it works. However, when i looked at the prototype of strchr():
char *strchr(const char *str, int c);
The second parameter is an int? I'm curious to know: why not a char? Does this mean that we can use an int for storing characters just like we use a char?
Which brings me to the second question: I tried to change my function to accept an int as a second parameter, but I'm not sure if it's correct and safe to do the following:
char* findchar(char* str, int c)
{
char* position = NULL;
int i = 0;
for(i = 0; str[i]!='\0';i++)
{
if(str[i] == c) //Specifically, is this line correct? Can we test an int against a char?
{
position = &str[i];
break;
}
}
return position;
}
Before ANSI C89, functions were declared without prototypes. The declaration for strchr looked like this back then:
char *strchr();
That's it. No parameters are declared at all. Instead, there were these simple rules:
all pointers are passed as parameters as-is
all integer values of a smaller range than int are converted to int
all floating point values are converted to double
So when you called strchr, what really happened was:
strchr(str, (int)chr);
When ANSI C89 was introduced, it had to maintain backwards compatibility. Therefore it defined the prototype of strchr as:
char *strchr(const char *str, int chr);
This preserves the exact behavior of the above sample call, including the conversion to int. This is important since an implementation may define that passing a char argument works differently than passing an int argument, which makes sense on 8 bit platforms.
Consider the return value of fgetc(), values in the range of unsigned char and EOF, some negative value. This is the kind of value to pass to strchr().
@Roland Illig presents a very good explanation of the history that led to retaining use of int ch with strchr().
OP's code fails/has trouble as follows.
1) char* str is treated like unsigned char *str per C11 §7.24.1 3
For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char
2) i should be type size_t, to handle the entire range of the character array.
3) For the purpose of strchr(), the null character is considered part of the search.
The terminating null character is considered to be part of the string.
4) Better to use const as str is not changed.
char* findchar(const char* str, int c) {
const char* position = NULL;
size_t i = 0;
for(i = 0; ;i++) {
if((unsigned char) str[i] == c) {
position = &str[i];
break;
}
if (str[i]=='\0') break;
}
return (char *) position;
}
Further detail
The strchr function locates the first occurrence of c (converted to a char) in the string pointed to by s. C11dr §7.24.5.2 2
So int c is treated like a char. This could imply
if((unsigned char) str[i] == (char) c) {
Yet I think what is meant is:
if((unsigned char) str[i] == (unsigned char)(char) c) {
or simply
if((unsigned char) str[i] == (unsigned char)c) {
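Putting these points together, a minimal sketch of a strchr-like function could look as follows. The name findchar_sketch is hypothetical. It converts c to char once, as the standard text describes, and compares char against char, which is one of several equivalent ways to write the comparison.

```c
#include <assert.h>
#include <stddef.h>

char *findchar_sketch(const char *str, int c)
{
    const char ch = (char)c;  /* strchr is specified to convert c to char */
    assert(str != NULL);
    for (;; str++) {
        if (*str == ch)
            return (char *)str;  /* also matches the terminator when ch is '\0' */
        if (*str == '\0')
            return NULL;         /* end of string reached without a match */
    }
}
```

Because the match is tested before the terminator check, searching for '\0' returns a pointer to the end of the string instead of NULL.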

strcmp() function implementation in C

I need to make an strcmp function by myself, using operations with pointers. That's what I got:
int mystrcmp(const char *str1, const char *str2) {
while ('\0' != *str1 && *str1 == *str2) {
str1 += 1;
str2++;
}
int result1 = (uint8_t)(*str2) - (uint8_t)(*str1); // I need (uint8_t) to use it with Russian symbols.
return result1;
}
But my tutor told me that there is a small mistake in my code. I spent a lot of time writing tests, but couldn't find it.
Does this answer the question of what you're doing wrong?
#include <stdio.h>
#include <stdint.h>
#include <string.h>
int mystrcmp(const char *str1, const char *str2);
int main(void)
{
char* javascript = "JavaScript";
char* java = "Java";
printf("%d\n", mystrcmp(javascript, java));
printf("%d\n", strcmp(javascript, java));
return 0;
}
int mystrcmp(const char *str1, const char *str2) {
while ('\0' != *str1 && *str1 == *str2) {
str1 += 1;
str2++;
}
int result1 = (uint8_t)(*str2) - (uint8_t)(*str1); // I need (uint8_t) to use it with Russian symbols.
return result1;
}
Output:
-83
83
I'll propose a quick fix:
Change
int result1 = (uint8_t)(*str2) - (uint8_t)(*str1);
To
int result1 = (uint8_t)(*str1) - (uint8_t)(*str2);
And why you were wrong:
The return values of strcmp() should be:
if Return value < 0 then it indicates str1 is less than str2.
if Return value > 0 then it indicates str2 is less than str1.
if Return value = 0 then it indicates str1 is equal to str2.
And you were doing exactly the opposite.
@yLaguardia answered the ordering problem well.
int strcmp(const char *s1, const char *s2);
The strcmp function returns an integer greater than, equal to, or less than zero, accordingly as the string pointed to by s1 is greater than, equal to, or less than the string pointed to by s2. C11dr §7.24.4.2 3
Using uint8_t is fine for the vast majority of cases. Rare machines do not use 8-bit char, so uint8_t is not available. In any case, it is not needed as unsigned char handles the required unsigned compare. (See below about unsigned compare.)
int result1 =
((unsigned char)*str1 - (unsigned char)*str2);
Even more portable code would use the following, which handles the case where the char range and the unsigned range match, as well as all other char, unsigned char, int, and unsigned sizes/ranges.
int result1 =
((unsigned char)*str1 > (unsigned char)*str2) -
((unsigned char)*str1 < (unsigned char)*str2);
strcmp() is defined as treating each character as unsigned char, regardless if char is signed or unsigned.
... each character shall be interpreted as if it had the type
unsigned char ... C11 §7.24.1 3
Whether char is ASCII or not is not relevant to the coding of strcmp(). Of course, under different character encodings, different results may occur. Example: strcmp("A", "a") may give a positive result with one encoding (the seldom-used EBCDIC) but a negative one with another (ASCII).
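As a quick check of the corrected operand order, here is a sketch of the complete function with the unsigned char subtraction, plus a small helper that tests whether two results agree in sign. Both mystrcmp_fixed and same_sign are hypothetical names; only the sign of a strcmp result is specified, so the helper compares signs and not exact values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

int mystrcmp_fixed(const char *str1, const char *str2)
{
    assert(str1 != NULL && str2 != NULL);
    while (*str1 != '\0' && *str1 == *str2) {
        str1++;
        str2++;
    }
    /* str1 first, str2 second: positive when str1 sorts after str2 */
    return (unsigned char)*str1 - (unsigned char)*str2;
}

/* Returns 1 if a and b are both negative, both zero, or both positive. */
int same_sign(int a, int b)
{
    return ((a > 0) - (a < 0)) == ((b > 0) - (b < 0));
}
```

A call such as same_sign(mystrcmp_fixed("JavaScript", "Java"), strcmp("JavaScript", "Java")) should then yield 1.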

Unsigned char vs char in C — comparison of strings

I have a small assignment to do in C and I tried to find the best way I can to compare two strings (char arrays of course since strings are not defined in C).
This is my code :
int equal(char *s1, char *s2)
{
int a = 0;
while(!(a = *(unsigned char *)s1 - *(unsigned char *)s2) && *s2) ++s1, ++s2;
return (a == 0) ? 1 : 0;
}
It works but I don't see why I have to convert my char to an unsigned char.
(Of course I cannot use <string.h> in my assignment.)
The original code is fairly optimal. For simple equality comparisons, there is no need for the (unsigned char *) casts. The following works fine (but see the corner case at the end of this answer, where char and int have the same range):
int equal(char *s1, char *s2) {
int a = 0;
while(!(a = *s1 - *s2) && *s2) ++s1, ++s2;
return (a == 0) ? 1 : 0;
}
In making more optimal code, there is no need to compare both strings for the null character '\0' as in if (*s1 || *s2) .... As code checks for a non-zero a, checking only 1 string is sufficient.
"... of course since strings are not defined in C" is not so. C does define "string", though not as a type :
"A string is a contiguous sequence of characters terminated by and including the first null character" C11 §7.1.1 1
Using (unsigned char *) makes sense if the code is attempting not only to compare for equality, but also to order the strings. Even in this case, the type could be char. But by casting to unsigned char, or even signed char, the code provides consistent results across platforms, even where some have char as signed char and others as unsigned char.
// return 0, -1 or +1
int order(const char *s1, const char *s2) {
const unsigned char *uc1 = (const unsigned char *) s1;
const unsigned char *uc2 = (const unsigned char *) s2;
while((*uc1 == *uc2) && *uc1) ++uc1, ++uc2;
return (*uc1 > *uc2) - (*uc1 < *uc2);
}
Using const in the function signature allows the code to be called with a const char *, as in order(buffer, "factorial");. Calling OP's equal(char *s1, char *s2) with equal(buffer, "factorial"); would be undefined behavior only if the routine modified *s1 or *s2, which it does not. Using const does reduce certain warnings and allows for some optimizations. Credit: @abligh
There is a corner case where the casting is needed. If the range of char is the same as the range of int (some graphics processors do that) and char is a signed char, then *s1 - *s2 can overflow, and that is undefined behavior (UB). Of course, platforms that have the same range for char and int are rare. IMO, it is doubtful that a non-casted version of this code would fail even on such machines, but it is technically UB.
How about
int equal(const char *s1, const char *s2)
{
int i;
for (i=0; s1[i] || s2[i]; i++)
if (s1[i] != s2[i])
return 0;
return 1;
}
Or if you prefer while loops:
int equal(const char *s1, const char *s2)
{
while (*s1 || *s2)
if (*s1++ != *s2++)
return 0;
return 1;
}
To answer your specific question, in order to compare two strings (or indeed two characters) there is no need to convert them to unsigned char. I hope you agree my method is a little more readable than yours.

Searching for 2 consecutive hex values in a char array of a file

I've read a file into an array of characters using fread. Now I want to search that array for two consecutive hex values, namely FF followed by D9 (it's a JPEG marker signifying end of file). Here is the code I use to do that:
char* searchBuffer(char* b) {
char* p1 = b;
char* p2 = ++b;
int count = 0;
while (*p1 != (unsigned char)0xFF && *p2 != (unsigned char)0xD9) {
p1++;
p2++;
count++;
}
count = count;
return p1;
}
Now I know this code works if I search for hex values that don't include 0xFF (e.g. 4E followed by 46), but every time I try searching for 0xFF it fails. When I don't cast the hex values to unsigned char, the program doesn't enter the while loop; when I do, the program goes through all the chars in the array and doesn't stop until I get an out-of-bounds error. I'm stumped, please help.
Ignore count, it's just a variable that helps me debug.
Thanks in advance.
Why not use memchr() to find potential matches?
Also, make sure you're dealing with promotions of potentially signed types (char may or may not be signed). Note that while 0xff and 0xd9 have the high bit set when looked at as 8-bit values, they are non-negative integer constants, so there is no 'sign extension' that occurs for them:
char* searchBuffer(char* b) {
unsigned char* p1 = (unsigned char*) b;
int count = 0;
for (;;) {
/* find the next 0xff char */
/* note - this highlights that we really should know the size */
/* of the buffer we're searching, in case we don't find a match */
/* at the moment we're making it up to be some large number */
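p1 = memchr(p1, 0xff, UINT_MAX); /* UINT_MAX is declared in <limits.h> */
if (!p1) {
/* no 0xff found: stop rather than advance a null pointer */
break;
}
if (*(p1 + 1) == 0xd9) {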
/* found the 0xff 0xd9 sequence */
break;
}
p1 += 1;
}
return (char *) p1;
}
Also, note that you really should be passing in some notion of the size of the buffer being searched, in case the target isn't found.
Here's a version that takes a buffer size parameter:
char* searchBuffer(char* b, size_t siz) {
unsigned char* p1 = (unsigned char*) b;
unsigned char* end = p1 + siz;
for (;;) {
/* find the next 0xff char */
p1 = memchr(p1, 0xff, end - p1);
if (!p1) {
/* sequence not found, return NULL */
break;
}
if (((p1 + 1) != end) && (*(p1 + 1) == 0xd9)) {
/* found the 0xff 0xd9 sequence */
break;
}
p1 += 1;
}
return (char *) p1;
}
You are falling foul of integer promotions. Both operands of != (and similar operators) are promoted to int. On typical platforms, where int can represent every value of unsigned char, the unsigned char operand becomes a non-negative int, not an unsigned int. So this:
*p1 != (unsigned char)0xFF
is equivalent to:
(int)*p1 != (int)(unsigned char)0xFF
On your platform, char is evidently signed, in which case *p1 lies in the range -128 to 127 and can never take on the value 0xFF (255).
So try casting *p1 as follows:
(unsigned char)*p1 != 0xFF
Alternatively, you could have the function take unsigned char arguments instead of char, and avoid all the casting.
[Note that on top of all of this, your loop logic is incorrect, as pointed out in various comments.]
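The effect of the promotion can be reproduced in isolation. The sketch below uses signed char explicitly so that it behaves the same whether or not plain char is signed. The function names are hypothetical, and the code assumes 8-bit bytes.

```c
#include <assert.h>
#include <limits.h>

/* Compares without a cast: c promotes to an int between -128 and 127,
   so it can never equal 255. */
int equals_ff_uncast(signed char c)
{
    assert(CHAR_BIT == 8);  /* assumption of this sketch */
    return c == 0xFF;
}

/* Converts to unsigned char first: the byte 0xFF becomes 255 before
   the promotion to int. */
int equals_ff_cast(signed char c)
{
    assert(CHAR_BIT == 8);  /* assumption of this sketch */
    return (unsigned char)c == 0xFF;
}
```

Passing a signed char holding -1 (the byte 0xFF) gives 0 from the first function and 1 from the second. Some compilers warn that the first comparison is always false, which is exactly the problem being illustrated.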
0x4E promotes to a positive int, but when char is signed, a byte holding FF makes *p1 negative; it then promotes to the int value -1, which can never compare equal to 0xFF (255).
You need to make p1 unsigned.
You can write the code a lot shorter as:
char* searchBuffer(const char* b) {
while (*b != '\xff' || *(b+1) != '\xd9') b++;
return (char *)b; /* cast away const to match the return type, as strchr does */
}
Also note the function will cause a segmentation fault (or worse, return invalid results) if b does not, in fact, contain the bytes FFD9.
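Use void *memmem(const void *haystack, size_t haystacklen, const void *needle, size_t needlelen);
which is declared in string.h and easy to use. Note that memmem is not part of standard C; it is an extension provided by glibc (which requires defining _GNU_SOURCE before including string.h) and the BSDs.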
char* searchBuffer(char* b, int len)
{
unsigned char needle[2] = {0xFF, 0XD9};
char * c;
c = memmem(b, len, needle, sizeof(needle));
return c;
}
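Where memmem is not available, a portable stand-in can be sketched with memchr and memcmp, both of which are standard C. The name find_bytes is hypothetical; it takes the same parameters as memmem but makes no claim to match its performance.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

void *find_bytes(const void *haystack, size_t haystacklen,
                 const void *needle, size_t needlelen)
{
    const unsigned char *h = haystack;
    const unsigned char *n = needle;
    assert(haystack != NULL && needle != NULL);
    if (needlelen == 0)
        return (void *)h;
    while (haystacklen >= needlelen) {
        /* look for the first needle byte, leaving room for the rest of it */
        const unsigned char *hit = memchr(h, n[0], haystacklen - needlelen + 1);
        if (hit == NULL)
            return NULL;
        if (memcmp(hit, n, needlelen) == 0)
            return (void *)hit;
        /* skip past this candidate and keep searching */
        haystacklen -= (size_t)(hit - h) + 1;
        h = hit + 1;
    }
    return NULL;
}
```

Calling it with the two-byte needle {0xFF, 0xD9} and the number of bytes actually read by fread returns a pointer to the marker, or NULL when the buffer does not contain it.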
