strcmp() function implementation in C

I need to write my own strcmp function using pointer operations. This is what I have:
int mystrcmp(const char *str1, const char *str2) {
while ('\0' != *str1 && *str1 == *str2) {
str1 += 1;
str2++;
}
int result1 = (uint8_t)(*str2) - (uint8_t)(*str1); // I need (uint8_t) to use it with Russian symbols.
return result1;
}
But my tutor told me that there is a small mistake in my code. I spent a lot of time writing tests, but couldn't find it.

Does this answer the question of what you're doing wrong?
#include <stdio.h>
#include <stdint.h>
#include <string.h>
int mystrcmp(const char *str1, const char *str2);
int main(void)
{
char* javascript = "JavaScript";
char* java = "Java";
printf("%d\n", mystrcmp(javascript, java));
printf("%d\n", strcmp(javascript, java));
return 0;
}
int mystrcmp(const char *str1, const char *str2) {
while ('\0' != *str1 && *str1 == *str2) {
str1 += 1;
str2++;
}
int result1 = (uint8_t)(*str2) - (uint8_t)(*str1); // I need (uint8_t) to use it with Russian symbols.
return result1;
}
Output:
-83
83
I'll propose a quick fix:
Change
int result1 = (uint8_t)(*str2) - (uint8_t)(*str1);
To
int result1 = (uint8_t)(*str1) - (uint8_t)(*str2);
And why you were wrong:
The return values of strcmp() should be:
if Return value < 0 then it indicates str1 is less than str2.
if Return value > 0 then it indicates str2 is less than str1.
if Return value = 0 then it indicates str1 is equal to str2.
And you were doing exactly the opposite.
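As a rough sketch of the fix above, the function below subtracts in the str1 minus str2 order. Because only the sign of the result is specified, a small helper compares signs against the library strcmp() rather than exact values. The names sign_of, my_strcmp_fixed and agrees_with_strcmp are hypothetical, and using unsigned char instead of uint8_t is an assumption made for portability.

```c
#include <string.h>

/* Hypothetical helper: collapse any int to -1, 0 or +1. */
static int sign_of(int v) {
    return (v > 0) - (v < 0);
}

/* Sketch of the corrected comparison: str1 minus str2,
   with both bytes read as unsigned char. */
static int my_strcmp_fixed(const char *str1, const char *str2) {
    while (*str1 != '\0' && *str1 == *str2) {
        str1++;
        str2++;
    }
    return (unsigned char)*str1 - (unsigned char)*str2;
}

/* Hypothetical test helper: nonzero when the sketch and the library
   strcmp() agree on the sign of the result. */
static int agrees_with_strcmp(const char *a, const char *b) {
    return sign_of(my_strcmp_fixed(a, b)) == sign_of(strcmp(a, b));
}
```

Calling agrees_with_strcmp("JavaScript", "Java") should then report agreement, where the original version produced the opposite sign.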

@yLaguardia answered the order problem well.
int strcmp(const char *s1, const char *s2);
The strcmp function returns an integer greater than, equal to, or less than zero, accordingly as the string pointed to by s1 is greater than, equal to, or less than the string pointed to by s2. C11dr §7.24.4.2 3
Using uint8_t is fine for the vast majority of cases. On rare machines char is wider than 8 bits, so uint8_t is not available. In any case, it is not needed, as unsigned char handles the required unsigned compare. (See below about unsigned compare.)
int result1 =
((unsigned char)*str1 - (unsigned char)*str2);
Even more portable code would use the following, which also handles the case where the range of char matches the range of unsigned, as well as every other combination of char, unsigned char, int, and unsigned sizes and ranges.
int result1 =
((unsigned char)*str1 > (unsigned char)*str2) -
((unsigned char)*str1 < (unsigned char)*str2);
strcmp() is defined as treating each character as unsigned char, regardless of whether char is signed or unsigned.
... each character shall be interpreted as if it had the type unsigned char ... C11 §7.24.1 3
Whether char uses ASCII or not is irrelevant to the coding of strcmp(). Of course, different character encodings may produce different results. Example: strcmp("A", "a") returns a positive value under the seldom-used EBCDIC encoding but a negative value under ASCII.

Related

why does while(*s++ == *t++) not work to compare two strings

I'm implementing strcmp(char *s, char *t), which returns <0 if s<t, 0 if s==t, and >0 if s>t by comparing the first value that differs between the two strings.
Implementing it by separating the postfix increment and relational equals operators works:
for (; *s==*t; s++, t++)
if (*s=='\0')
return 0;
return *s - *t;
However, grouping the postfix increment and relational equals operators doesn't work (like so):
while (*s++ == *t++)
if (*s=='\0')
return 0;
return *s - *t;
The latter always returns 0. I thought this could be because we're incrementing the pointers too soon, but even when the two strings differ at index 5 out of 10, it still produces the same result.
Example input:
strcomp("hello world", "hello xorld");
return value:
0
My hunch is this is because of operator precedence but I'm not positive and if so, I cannot exactly pinpoint why.
Thank you for your time!
Because in the for loop, the increment (s++, t++ in your case) is not executed if the condition (*s==*t in your case) is false. In your while loop, the increment is executed in that case too, so for strcomp("hello world", "hello xorld") both pointers step past the mismatch and end up pointing at the o that follows it in each string, and *s - *t is 0.
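To make the pointer positions concrete, here is a small sketch that reports the index at which each loop style leaves s. The function names stop_index_for and stop_index_while are hypothetical. The while variant checks s[-1] for the terminator, which is an adjustment to the question's code so the sketch never reads past the end of equal strings.

```c
#include <stddef.h>

/* Index at which the for-loop version stops: the increments run
   only after a successful comparison. */
static size_t stop_index_for(const char *s, const char *t) {
    const char *start = s;
    for (; *s == *t; s++, t++)
        if (*s == '\0')
            break;
    return (size_t)(s - start);
}

/* Index at which the while-loop version stops: the increments are
   part of the condition, so they also run when the comparison fails
   and the pointers step one position past the mismatch. */
static size_t stop_index_while(const char *s, const char *t) {
    const char *start = s;
    while (*s++ == *t++)
        if (s[-1] == '\0')
            break;
    return (size_t)(s - start);
}
```

For "hello world" and "hello xorld" the mismatch is at index 6, so the for version stops there while the while version stops at index 7, where both strings hold the same character.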
Since you always increment s and t in the test, you should refer to s[-1] for the termination in case of equal strings and s[-1] and t[-1] in case they differ.
Also note that the order is determined by the comparison as unsigned char.
Here is a modified version:
int strcmp(const char *s, const char *t) {
while (*s++ == *t++) {
if (s[-1] == '\0')
return 0;
}
return (unsigned char)s[-1] - (unsigned char)t[-1];
}
Following the comments from LL chux, here is a fully conforming implementation for perverse architectures with non two's complement representation and/or CHAR_MAX > INT_MAX:
int strcmp(const char *s0, const char *t0) {
const unsigned char *s = (const unsigned char *)s0;
const unsigned char *t = (const unsigned char *)t0;
while (*s++ == *t++) {
if (s[-1] == '\0')
return 0;
}
return (s[-1] > t[-1]) - (s[-1] < t[-1]);
}
Everyone is giving the right advice, but is still hardwired into inlining those increment operators within the comparison expression and doing weird off-by-one stuff.
The following just feels simpler and easier to read. No pointer is ever incremented or decremented to an invalid address.
while ((*s == *t) && *s)
{
s++;
t++;
}
return *s - *t;
For completeness in addition to what was already well answered about the wrong offset during subtraction:
*s - *t; is incorrect when *s or *t is negative.
The standard C library specifies that string functions compare as if char was unsigned char. Thus code that subtracts via a char * gives the wrong answer when the characters are negative.
For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value).
C17dr § 7.24.1 3
int strcmp(const char *s, const char *t) {
const unsigned char *us = (const unsigned char *) s;
const unsigned char *ut = (const unsigned char *) t;
while (*us == *ut && *us) {
us++;
ut++;
}
return (*us > *ut) - (*us < *ut);
}
This code also addresses obscure concerns of non-2's complement access of -0 and char range exceeding int.
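A minimal sketch of why the signedness matters, assuming 8-bit bytes. The helper names diff_as_signed and diff_as_unsigned are hypothetical, and signed char stands in for a plain char that happens to be signed.

```c
/* Difference computed directly on signed values, which is what
   *s - *t does when plain char is signed. */
static int diff_as_signed(signed char a, signed char b) {
    return a - b;
}

/* Difference after converting both values to unsigned char,
   which is how the library string functions interpret bytes. */
static int diff_as_unsigned(signed char a, signed char b) {
    return (unsigned char)a - (unsigned char)b;
}
```

With a = 99 and b = -4 (the byte 0xFC when bytes are 8 bits wide), the signed difference is 103 and the unsigned difference is -153, so the two approaches disagree on which string sorts first.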

A way to pass an optional argument to a function

Take the following function:
char * slice(const char * str, unsigned int start, unsigned int end) {
int string_len = strlen(str);
int slice_len = (end - start < string_len) ? end - start : string_len;
char * sliced_str = (char *) malloc (slice_len + 1);
sliced_str[slice_len] = '\0';
// Make sure we have a string of length > 0, and it's within the string range
if (slice_len == 0 || start >= string_len || end <= 0) return "";
for (int i=0, j=start; i < slice_len; i++, j++)
sliced_str[i] = str[j];
return sliced_str;
}
I can call this as follows:
char * new_string = slice("old string", 3, 5)
Is there a way to be able to "omit" an argument somehow in C? For example, passing something like the following:
char * new_string = slice("old string", 3, NULL)
// NULL means ignore the `end` parameter and just go all the way to the end.
How would something like that be done? Or is that not possible to do in C?
Optional arguments (or arguments that have default values) are not really a thing in C. I think you have the right idea by passing in NULL, except that NULL is equal to 0 and will be interpreted as an integer. Instead, I would recommend changing the argument from unsigned to a signed integer and passing in -1 as a flag to indicate that the argument should be ignored.
There are only two ways to pass optional arguments in C, and only one is common. Either pass a pointer to the optional argument and treat NULL as "not passed", or pass an out-of-range value as "not passed".
Way 1:
char * slice(const char * str, const unsigned int *start, const unsigned int *end);
// ...
const unsigned int three = 3;
char * new_string = slice("old string", &three, NULL);
Way 2:
#include <limits.h>
char * slice(const char * str, const unsigned int start, const unsigned int end);
char * new_string = slice("old string", 3, UINT_MAX);
BTW, this example should really be using size_t and SIZE_MAX but I copied your prototype.
The proposed dupe target is talking about variadic functions, which do have optional arguments, but it's not like what you're asking for. In such a function call it must always be possible to determine whether an argument is (intended to be) present by looking at the arguments that come before it. In this case, that won't help at all.
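The out-of-range approach could look roughly like the sketch below, using size_t and SIZE_MAX as suggested. The names slice_opt and SLICE_TO_END are hypothetical, and clamping out-of-range bounds to the string length is an assumption about the desired behavior rather than something the question specifies.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sentinel meaning "no end given, go to the end of the string". */
#define SLICE_TO_END SIZE_MAX

/* Returns a newly allocated copy of str[start..end), or NULL if
   allocation fails. The caller frees the result. Bounds beyond the
   string are clamped to its length. */
static char *slice_opt(const char *str, size_t start, size_t end) {
    size_t len = strlen(str);
    if (end == SLICE_TO_END || end > len)
        end = len;
    if (start > end)
        start = end;              /* yields an empty slice */
    size_t n = end - start;
    char *out = malloc(n + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, str + start, n);
    out[n] = '\0';
    return out;
}
```

A call such as slice_opt("old string", 4, SLICE_TO_END) then means "from index 4 to the end". Always returning heap memory, even for an empty slice, lets the caller free every result the same way.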

Creating a simplified version of strchr() [duplicate]

Trying to create a simple function that looks for a single char in a string, like strchr() would, I did the following:
char* findchar(char* str, char c)
{
char* position = NULL;
int i = 0;
for(i = 0; str[i]!='\0';i++)
{
if(str[i] == c)
{
position = &str[i];
break;
}
}
return position;
}
So far it works. However, when I looked at the prototype of strchr():
char *strchr(const char *str, int c);
The second parameter is an int? I'm curious to know: why not a char? Does this mean that we can use an int for storing characters just like we use a char?
Which brings me to the second question. I tried to change my function to accept an int as the second parameter, but I'm not sure if it's correct and safe to do the following:
char* findchar(char* str, int c)
{
char* position = NULL;
int i = 0;
for(i = 0; str[i]!='\0';i++)
{
if(str[i] == c) //Specifically, is this line correct? Can we test an int against a char?
{
position = &str[i];
break;
}
}
return position;
}
Before ANSI C89, functions were declared without prototypes. The declaration for strchr looked like this back then:
char *strchr();
That's it. No parameters are declared at all. Instead, there were these simple rules:
all pointers are passed as parameters as-is
all integer values of a smaller range than int are converted to int
all floating point values are converted to double
So when you called strchr, what really happened was:
strchr(str, (int)chr);
When ANSI C89 was introduced, it had to maintain backwards compatibility. Therefore it defined the prototype of strchr as:
char *strchr(const char *str, int chr);
This preserves the exact behavior of the above sample call, including the conversion to int. This is important since an implementation may define that passing a char argument works differently than passing an int argument, which makes sense on 8 bit platforms.
Consider the return value of fgetc(): either a value in the range of unsigned char or EOF, which is some negative value. That is the kind of value to pass to strchr().
@Roland Illig presents a very good explanation of the history that led to retaining the use of int ch with strchr().
OP's code fails/has trouble as follows.
1) char* str is treated like unsigned char *str per C11 §7.24.1 3
For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char
2) i should be type size_t, to handle the entire range of the character array.
3) For the purpose of strchr(), the null character is considered part of the search.
The terminating null character is considered to be part of the string.
4) Better to use const as str is not changed.
char* findchar(const char* str, int c) {
const char* position = NULL;
size_t i = 0;
for(i = 0; ;i++) {
if((unsigned char) str[i] == c) {
position = &str[i];
break;
}
if (str[i]=='\0') break;
}
return (char *) position;
}
Further detail
The strchr function locates the first occurrence of c (converted to a char) in the string pointed to by s. C11dr §7.24.5.2 2
So int c is treated like a char. This could imply
if((unsigned char) str[i] == (char) c) {
Yet I think what is meant is:
if((unsigned char) str[i] == (unsigned char)(char) c) {
or simply
if((unsigned char) str[i] == (unsigned char)c) {
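Putting these points together, a pointer-based sketch might look like the following. The name find_char is hypothetical, and comparing both sides as unsigned char follows the last variant above, which is one reading of the standard's wording rather than the only possible one.

```c
#include <stddef.h>

/* Sketch of a strchr-like search. Both sides are compared as
   unsigned char, and the terminating null character counts as part
   of the string, so searching for '\0' finds the terminator. */
static char *find_char(const char *str, int c) {
    const unsigned char target = (unsigned char)c;
    for (;; str++) {
        if ((unsigned char)*str == target)
            return (char *)str;   /* cast away const, as strchr does */
        if (*str == '\0')
            return NULL;          /* reached the end without a match */
    }
}
```

The match test comes before the end-of-string test so that a search for the null character succeeds instead of returning NULL.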

Unsigned char vs char in C — comparison of strings

I have a small assignment to do in C and I tried to find the best way I can to compare two strings (char arrays of course since strings are not defined in C).
This is my code :
int equal(char *s1, char *s2)
{
int a = 0;
while(!(a = *(unsigned char *)s1 - *(unsigned char *)s2) && *s2) ++s1, ++s2;
return (a == 0) ? 1 : 0;
}
It works but I don't see why I have to convert my char to an unsigned char.
(Of course I cannot use <string.h> in my assignment.)
The original code is fairly optimal. For simple equality comparisons, there is no need for the (unsigned char *) casts. The following works fine (but see the corner case described at the end of this answer):
int equal(char *s1, char *s2) {
int a = 0;
while(!(a = *s1 - *s2) && *s2) ++s1, ++s2;
return (a == 0) ? 1 : 0;
}
In making more optimal code, there is no need to compare both strings for the null character '\0' as in if (*s1 || *s2) .... As code checks for a non-zero a, checking only 1 string is sufficient.
"... of course since strings are not defined in C" is not so. C does define "string", though not as a type :
"A string is a contiguous sequence of characters terminated by and including the first null character" C11 §7.1.1 1
Using (unsigned char *) makes sense if code is attempting to compare not only equality but also order. Even in this case, the type could be char. But by casting to unsigned char or even signed char, code provides consistent results across platforms, even where some have char as signed char and others as unsigned char.
// return 0, -1 or +1
int order(const char *s1, const char *s2) {
const unsigned char *uc1 = (const unsigned char *) s1;
const unsigned char *uc2 = (const unsigned char *) s2;
while((*uc1 == *uc2) && *uc1) ++uc1, ++uc2;
return (*uc1 > *uc2) - (*uc1 < *uc2);
}
Using const in the function signature allows the function to be called with const char * arguments, as in order(buffer, "factorial");. Calling OP's equal(char *s1, char *s2) as equal(buffer, "factorial"); would be undefined behavior only if the routine modified *s1 or *s2, which it does not. Using const does reduce certain warnings and allows some optimizations. Credit: @abligh
Here is a corner case where the cast is needed. If the range of char is the same as the range of int (some graphics processors do that) and char is signed, then *s1 - *s2 can overflow, which is undefined behavior (UB). Of course, platforms that have the same range for char and int are rare. IMO, it is doubtful that a version of this code without the cast would fail even on such machines, but it is technically UB.
How about
int equal(const char *s1, const char *s2)
{
int i;
for (i=0; s1[i] || s2[i]; i++)
if (s1[i] != s2[i])
return 0;
return 1;
}
Or if you prefer while loops:
int equal(const char *s1, const char *s2)
{
while (*s1 || *s2)
if (*s1++ != *s2++)
return 0;
return 1;
}
To answer your specific question, in order to compare two strings (or indeed two characters) there is no need to convert them to unsigned char. I hope you agree my method is a little more readable than yours.

strcmp() and signed / unsigned chars

I am confused by strcmp(), or rather, how it is defined by the standard. Consider comparing two strings where one contains characters outside the ASCII-7 range (0-127).
The C standard defines:
int strcmp(const char *s1, const char *s2);
The strcmp function compares the string pointed to by s1 to the string pointed to by s2.
The strcmp function returns an integer greater than, equal to, or less than zero, accordingly as the string pointed to by s1 is greater than, equal to, or less than the string pointed to by s2.
The parameters are char *. Not unsigned char *. There is no notion that "comparison should be done as unsigned".
But all the standard libraries I checked consider the "high" character to be just that, higher in value than the ASCII-7 characters.
I understand this is useful and the expected behaviour. I don't want to say the existing implementations are wrong or something. I just want to know, which part in the standard specs have I missed?
int strcmp_default( const char * s1, const char * s2 )
{
while ( ( *s1 ) && ( *s1 == *s2 ) )
{
++s1;
++s2;
}
return ( *s1 - *s2 );
}
int strcmp_unsigned( const char * s1, const char *s2 )
{
unsigned char * p1 = (unsigned char *)s1;
unsigned char * p2 = (unsigned char *)s2;
while ( ( *p1 ) && ( *p1 == *p2 ) )
{
++p1;
++p2;
}
return ( *p1 - *p2 );
}
#include <stdio.h>
#include <string.h>
int main()
{
char x1[] = "abc";
char x2[] = "abü";
printf( "%d\n", strcmp_default( x1, x2 ) );
printf( "%d\n", strcmp_unsigned( x1, x2 ) );
printf( "%d\n", strcmp( x1, x2 ) );
return 0;
}
Output is:
103
-153
-153
7.21.4/1 (C99), emphasis is mine:
The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.
There is something similar in C90.
Note that strcoll() may be better suited than strcmp(), especially if you have characters outside the basic character set.
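As a small illustration of the strcoll() suggestion, the sketch below wraps it in a helper that reports only the sign of the result. The name collate_sign is hypothetical. strcoll() compares according to the current LC_COLLATE locale, and in the default "C" locale it orders strings the same way strcmp() does.

```c
#include <string.h>

/* Hypothetical helper: sign (-1, 0 or +1) of a locale-aware
   comparison performed by strcoll(). */
static int collate_sign(const char *a, const char *b) {
    int r = strcoll(a, b);
    return (r > 0) - (r < 0);
}
```

To pick up the user's collation rules instead of the "C" locale, a program would typically call setlocale(LC_COLLATE, "") from <locale.h> before comparing. The resulting order then depends on the locale installed on the system.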

Resources