I have a small assignment to do in C and I tried to find the best way I can to compare two strings (char arrays of course since strings are not defined in C).
This is my code :
int equal(char *s1, char *s2)
{
int a = 0;
while(!(a = *(unsigned char *)s1 - *(unsigned char *)s2) && *s2) ++s1, ++s2;
return (a == 0) ? 1 : 0;
}
It works but I don't see why I have to convert my char to an unsigned char.
(Of course I cannot use <string.h> in my assignment.)
The original code is fairly optimal. For simple equality comparisons, there is no need for the (unsigned char *) casts. The following works fine. (but see point #6):
int equal(char *s1, char *s2) {
int a = 0;
while(!(a = *s1 - *s2) && *s2) ++s1, ++s2;
return (a == 0) ? 1 : 0;
}
In making more optimal code, there is no need to compare both strings for the null character '\0' as in if (*s1 || *s2) .... As code checks for a non-zero a, checking only 1 string is sufficient.
"... of course since strings are not defined in C" is not so. C does define "string", though not as a type :
"A string is a contiguous sequence of characters terminated by and including the first null character" C11 §7.1.1 1
Using (unsigned char *) makes sense if code is attempting to compare not only equality but also order. Even in this case, the type could be char. But by casting to unsigned char (or even signed char), code provides consistent results across platforms, even where some have char as signed char and others as unsigned char.
// return 0, -1 or +1
int order(const char *s1, const char *s2) {
const unsigned char *uc1 = (const unsigned char *) s1;
const unsigned char *uc2 = (const unsigned char *) s2;
while((*uc1 == *uc2) && *uc1) ++uc1, ++uc2;
return (*uc1 > *uc2) - (*uc1 < *uc2);
}
Using const in the function signature allows code to be used with const char * as order(buffer, "factorial");. Calling OP's equal(char *s1, char *s2) with equal(buffer, "factorial"); would be undefined behavior only if the routine modified *s1 or *s2, and it does not. Using const does reduce certain warnings and allows for some optimizations. Credit: @abligh
This is a corner case where the casting is needed. If range of char is the same as the range of int (some graphics processors do that) and char is a signed char, then *s1 - *s2 can overflow and that is undefined behavior (UB). Of course, platforms that have the same range for char and int are rare. IMO, it is doubtful even on such machines, a non-casted version of this code would fail, but it is technically UB.
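The claim that testing only one string for the null character is enough can be checked by running the cast-free function over a few pairs, including cases where one string is a prefix of the other. The sketch below copies that function under the hypothetical name equal_nocast, adds const to the parameters, and uses an assumed helper check_equal for the self-check.

```c
#include <assert.h>

/* Cast-free equality test from the answer, with const added. */
int equal_nocast(const char *s1, const char *s2) {
    int a = 0;
    /* Stop at the first differing character, or at the end of s2. */
    while (!(a = *s1 - *s2) && *s2) {
        ++s1;
        ++s2;
    }
    return a == 0;
}

/* Hypothetical self-check covering equal, different and prefix inputs. */
void check_equal(void) {
    assert(equal_nocast("abc", "abc") == 1);
    assert(equal_nocast("", "") == 1);
    assert(equal_nocast("abc", "abd") == 0);
    assert(equal_nocast("abc", "ab") == 0);  /* s2 ends first: 'c' - '\0' is non-zero */
    assert(equal_nocast("ab", "abc") == 0);  /* s1 ends first: '\0' - 'c' is non-zero */
}
```

When s1 ends first, the difference is already non-zero, so the loop stops without ever looking at *s1 for the terminator.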
How about
int equal(const char *s1, const char *s2)
{
int i;
for (i=0; s1[i] || s2[i]; i++)
if (s1[i] != s2[i])
return 0;
return 1;
}
Or if you prefer while loops:
int equal(const char *s1, const char *s2)
{
while (*s1 || *s2)
if (*s1++ != *s2++)
return 0;
return 1;
}
To answer your specific question, in order to compare two strings (or indeed two characters) there is no need to convert them to unsigned char. I hope you agree my method is a little more readable than yours.
I am writing a re-implementation of strlcat as an exercise. I have performed several tests and they produce similar results. However, in one particular case my function gives a segmentation fault while the original does not. Could you explain to me why? I am not allowed to use any standard library functions, which is why I have re-implemented strlen().
Here is the code I have written :
#include <stdio.h>
#include <string.h>
int ft_strlen(char *s)
{
int i;
i = 0;
while (s[i] != '\0')
i++;
return (i);
}
unsigned int ft_strlcat(char *dest, char *src, unsigned int size)
{
size_t i;
int d_len;
int s_len;
i = 0;
d_len = ft_strlen(dest);
s_len = ft_strlen(src);
if (!src || !*src)
return (d_len);
while ((src[i] && (i < (size - d_len - 1))))
{
dest[i + d_len] = src[i];
i++;
}
dest[i + d_len] = '\0';
return (d_len + s_len);
}
int main(void)
{
char s1[5] = "Hello";
char s2[] = " World!";
printf("ft_strcat :: %s :: %u :: sizeof %lu\n", s1, ft_strlcat(s1, s2, sizeof(s1)), sizeof(s1));
// printf("strlcat :: %s :: %lu :: sizeof %lu\n", s1, strlcat(s1, s2, sizeof(s1)), sizeof(s1));
}
The output using strlcat is : strlcat :: Hello World! :: 12 :: sizeof 5. I am on macOS and I am using clang to compile if that can be of some help.
ft_strlcat() is not so bad, but it expects pointers to strings. main() is troublesome: s1 lacks a null character: so s1 is not a string.
//char s1[5] = "Hello";
char s1[] = "Hello"; // Use a string
s1[] is too small for the concatenated string "Hello World!"
char s1[13 /* or more */] = "Hello"; // Use a string
"%lu" matches unsigned long. size_t from sizeof matches "%zu".
Some ft_strlcat() issues:
unsigned, int vs. size_t
unsigned, int too narrow for long strings. Use size_t to handle all strings.
Test too late
if (!src || ...) is too late as prior ft_strlen(src); invokes UB when src == NULL.
const
ft_strlcat() should use a pointer to const to allow usage with const strings with src.
Advanced: restrict
Use restrict so the compiler can assume dest, src do not overlap and emit more efficient code - assuming they should not overlap.
Corner cases
It does not handle some pesky corner cases like when d_len >= size, but I will leave that detailed analysis for later.
Suggested signature
// unsigned int ft_strlcat(char *dest, char *src, unsigned int size)
size_t ft_strlcat(char * restrict dest, const char * restrict src, size_t size)
Some untested code for your consideration:
Tries to mimic strlcat().
Returns sum of string lengths, but not more than size.
Does not examine more than size characters to prevent reading out of bounds.
Does not append a null character when not enough room.
Does not check for dst, src as NULL. Add if you like.
Does not handle overlapping dest, src. To do so is tricky unless library routines are available.
Use unsigned char * pointer to properly handle rare signed non-2's complement char.
size_t my_strlcat(char * restrict dst, const char * restrict src, size_t size) {
const size_t size_org = size;
// Walk dst
unsigned char *d = (unsigned char*) dst;
while (size > 0 && *d) {
d++;
size--;
}
if (size == 0) {
return size_org;
}
// Copy src to dst
const unsigned char *s = (const unsigned char*) src;
while (size > 0 && *s) {
*d++ = *s++;
size--;
}
if (size == 0) {
return size_org;
}
*d = '\0';
return (size_t) (d - (unsigned char*) dst);
}
If the return value is less than size, success!
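For comparison, the BSD strlcat() contract as I understand it differs from the code above in two ways: it returns the length the full concatenation would have needed (so truncation shows up as a return value >= size), and it always terminates the result when dst already held a string shorter than size. The following is a hedged sketch of that contract, not the library's source; strlcat_sketch and check_strlcat_sketch are hypothetical names, and no library string functions are used.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the assumed BSD contract: returns
   min(size, initial length of dst) + length of src. */
size_t strlcat_sketch(char *dst, const char *src, size_t size) {
    size_t dlen = 0;
    size_t slen = 0;
    size_t i = 0;

    /* Find the end of dst without reading more than size bytes. */
    while (dlen < size && dst[dlen] != '\0') {
        dlen++;
    }
    while (src[slen] != '\0') {
        slen++;
    }
    if (dlen == size) {
        return size + slen;   /* no terminator within size: dst is left untouched */
    }
    /* Copy while one byte stays free for the terminator. */
    while (src[i] != '\0' && dlen + i + 1 < size) {
        dst[dlen + i] = src[i];
        i++;
    }
    dst[dlen + i] = '\0';
    return dlen + slen;
}

void check_strlcat_sketch(void) {
    char fits[13] = "Hello";
    char small[8] = "Hello";
    char raw[5] = {'H', 'e', 'l', 'l', 'o'};   /* not a string: no terminator */

    assert(strlcat_sketch(fits, " World!", sizeof fits) == 12);   /* 12 < 13: complete */
    assert(fits[11] == '!' && fits[12] == '\0');
    assert(strlcat_sketch(small, " World!", sizeof small) == 12); /* 12 >= 8: truncated */
    assert(small[6] == 'W' && small[7] == '\0');
    assert(strlcat_sketch(raw, " World!", sizeof raw) == 12);     /* 5 + 7, nothing written */
    assert(raw[4] == 'o');
}
```

With this convention a caller detects truncation with a single comparison of the return value against size.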
s1 is not even long enough to accommodate the "Hello"
Use the correct type for sizes.
size_t ft_strlcat(char *dest, const char *src, size_t len)
{
char *savedDest = dest;
if(dest && src && len)
{
while(*dest && len)
{
len--;
dest++;
}
if(len)
{
while(len && (*dest = *src)) // test len first so nothing is written past the buffer
{
len--;
dest++;
src++;
}
}
}
if(!len) dest[-1] = 0;
}
return dest ? dest - savedDest : 0;
}
Also your printf invokes undefined behaviour as order of function parameters evaluation is not determined. It should be:
int main(void)
{
char s1[5] = "Hello"; //will only work for len <= sizeof(s1) as s1 is not null character terminated
char s2[] = " World!";
size_t result = ft_strlcat(s1, s2, sizeof(s1));
printf("ft_strcat :: %s :: %zu :: sizeof %zu\n", s1, result, sizeof(s1));
}
https://godbolt.org/z/8hhbKjsbx
I'm implementing strcmp(char *s, char *t) which returns <0 if s<t, 0 if s==t, and >0 if s>t by comparing the first value that is different between the two strings.
Implementing it by separating the postfix increment and relational equals operators works:
for (; *s==*t; s++, t++)
if (*s=='\0')
return 0;
return *s - *t;
however, grouping the postfix increment and relational equals operators doesn't work (like so):
while (*s++ == *t++)
if (*s=='\0')
return 0;
return *s - *t;
The latter always returns 0. I thought this could be because we're incrementing the pointers too soon, but even a difference between the two strings at index 5 out of 10 still produces the same result.
Example input:
strcomp("hello world", "hello xorld");
return value:
0
My hunch is this is because of operator precedence but I'm not positive and if so, I cannot exactly pinpoint why.
Thank you for your time!
Because in the for loop, the increment (s++, t++ in your case) is not executed if the condition (*s==*t in your case) is false. But in your while loop, the increment is executed in that case too, so for strcomp("hello world", "hello xorld"), both pointers end up pointing at the 'o' characters that follow the mismatch ('w' versus 'x').
Since you always increment s and t in the test, you should refer to s[-1] for the termination in case of equal strings and s[-1] and t[-1] in case they differ.
Also note that the order is determined by the comparison as unsigned char.
Here is a modified version:
int strcmp(const char *s, const char *t) {
while (*s++ == *t++) {
if (s[-1] == '\0')
return 0;
}
return (unsigned char)s[-1] - (unsigned char)t[-1];
}
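The off-by-one can be made visible by recording where the pointer stands when each loop gives up. This is a small sketch with hypothetical helper names (stop_index_for, stop_index_while, check_stop_indexes); it only measures the stopping position and does not compare anything.

```c
#include <assert.h>
#include <stddef.h>

/* for-style loop: the increment runs only after a successful comparison. */
size_t stop_index_for(const char *s, const char *t) {
    const char *start = s;
    for (; *s == *t; s++, t++) {
        if (*s == '\0') {
            break;
        }
    }
    return (size_t)(s - start);
}

/* while-style loop: the increment is part of the test, so it also runs
   on the comparison that fails. */
size_t stop_index_while(const char *s, const char *t) {
    const char *start = s;
    while (*s++ == *t++) {
        if (s[-1] == '\0') {
            break;
        }
    }
    return (size_t)(s - start);
}

void check_stop_indexes(void) {
    /* The strings differ at index 6 ('w' versus 'x'). */
    assert(stop_index_for("hello world", "hello xorld") == 6);
    assert(stop_index_while("hello world", "hello xorld") == 7);
}
```

The while-style loop always stops one position further on, which is why the modified version reads s[-1] and t[-1].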
Following the comments from LL chux, here is a fully conforming implementation for perverse architectures with non two's complement representation and/or CHAR_MAX > INT_MAX:
int strcmp(const char *s0, const char *t0) {
const unsigned char *s = (const unsigned char *)s0;
const unsigned char *t = (const unsigned char *)t0;
while (*s++ == *t++) {
if (s[-1] == '\0')
return 0;
}
return (s[-1] > t[-1]) - (s[-1] < t[-1]);
}
Everyone is giving the right advice, but are still hardwired to inlining those increment operators within the comparison expression and doing weird off by 1 stuff.
The following just feels simpler and easier to read. No pointer is ever incremented or decremented to an invalid address.
while ((*s == *t) && *s)
{
s++;
t++;
}
return *s - *t;
For completeness in addition to what was already well answered about the wrong offset during subtraction:
*s - *t; is incorrect when *s, *t is negative.
The standard C library specifies that string functions compare as if char was unsigned char. Thus code that subtracts via a char * gives the wrong answer when the characters are negative.
For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value).
C17dr § 7.24.1 3
int strcmp(const char *s, const char *t) {
const unsigned char *us = (const unsigned char *) s;
const unsigned char *ut = (const unsigned char *) t;
while (*us == *ut && *us) {
us++;
ut++;
}
return (*us > *ut) - (*us < *ut);
}
This code also addresses obscure concerns of non-2's complement access of -0 and char range exceeding int.
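To illustrate why the subtraction goes wrong for negative characters, the sketch below compares one byte both ways. It makes the signed interpretation explicit with signed char so the result does not depend on whether plain char is signed. The helper names are hypothetical, and the expected values assume 8-bit two's complement bytes and an ASCII-compatible character set.

```c
#include <assert.h>

/* Difference of the first bytes read as signed values. */
int diff_as_signed(const char *s, const char *t) {
    return *(const signed char *)s - *(const signed char *)t;
}

/* Ordering of the first bytes read as unsigned char, as the standard requires. */
int order_as_unsigned(const char *s, const char *t) {
    unsigned char a = *(const unsigned char *)s;
    unsigned char b = *(const unsigned char *)t;
    return (a > b) - (a < b);
}

void check_sign_difference(void) {
    /* Byte 0xE9 is 233 as unsigned char but -23 as signed char. */
    assert(diff_as_signed("\xE9", "a") < 0);      /* claims "\xE9" < "a": wrong order */
    assert(order_as_unsigned("\xE9", "a") == 1);  /* 233 > 97: the required order */
}
```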
Trying to create a simple function that would look for a single char in a string "like strchr() would", I did the following:
char* findchar(char* str, char c)
{
char* position = NULL;
int i = 0;
for(i = 0; str[i]!='\0';i++)
{
if(str[i] == c)
{
position = &str[i];
break;
}
}
return position;
}
So far it works. However, when I looked at the prototype of strchr():
char *strchr(const char *str, int c);
The second parameter is an int? I'm curious to know: why not a char? Does this mean that we can use int for storing characters just like we use a char?
Which brings me to the second question: I tried to change my function to accept an int as a second parameter, but I'm not sure if it's correct and safe to do the following:
char* findchar(char* str, int c)
{
char* position = NULL;
int i = 0;
for(i = 0; str[i]!='\0';i++)
{
if(str[i] == c) //Specifically, is this line correct? Can we test an int against a char?
{
position = &str[i];
break;
}
}
return position;
}
Before ANSI C89, functions were declared without prototypes. The declaration for strchr looked like this back then:
char *strchr();
That's it. No parameters are declared at all. Instead, there were these simple rules:
all pointers are passed as parameters as-is
all integer values of a smaller range than int are converted to int
all floating point values are converted to double
So when you called strchr, what really happened was:
strchr(str, (int)chr);
When ANSI C89 was introduced, it had to maintain backwards compatibility. Therefore it defined the prototype of strchr as:
char *strchr(const char *str, int chr);
This preserves the exact behavior of the above sample call, including the conversion to int. This is important since an implementation may define that passing a char argument works differently than passing an int argument, which makes sense on 8 bit platforms.
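The same promotions still apply in modern C to arguments passed through "...", so the effect can be observed with a variadic function. This is only an illustrative sketch; sum_promoted_chars and check_promotion are hypothetical names.

```c
#include <assert.h>
#include <stdarg.h>

/* Sums `count` character arguments. Each one is fetched as int because
   a char passed through "..." has already been promoted to int. */
int sum_promoted_chars(int count, ...) {
    va_list ap;
    int total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; i++) {
        total += va_arg(ap, int);   /* va_arg(ap, char) would be undefined behavior */
    }
    va_end(ap);
    return total;
}

void check_promotion(void) {
    char a = 'A';
    char b = 'B';
    assert(sum_promoted_chars(2, a, b) == 'A' + 'B');
}
```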
Consider the return value of fgetc(), values in the range of unsigned char and EOF, some negative value. This is the kind of value to pass to strchr().
@Roland Illig presents a very good explanation of the history that led to retaining use of int ch with strchr().
OP's code fails/has trouble as follows.
1) char* str is treated like unsigned char *str per §7.23.1.1 3
For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char
2) i should be type size_t, to handle the entire range of the character array.
3) For the purpose of strchr(), the null character is considered part of the search.
The terminating null character is considered to be part of the string.
4) Better to use const as str is not changed.
char* findchar(const char* str, int c) {
const char* position = NULL;
size_t i = 0;
for(i = 0; ;i++) {
if((unsigned char) str[i] == c) {
position = &str[i];
break;
}
if (str[i]=='\0') break;
}
return (char *) position;
}
Further detail
The strchr function locates the first occurrence of c (converted to a char) in the string pointed to by s. C11dr §7.23.5.2 2
So int c is treated like a char. This could imply
if((unsigned char) str[i] == (char) c) {
Yet I think what is meant is:
if((unsigned char) str[i] == (unsigned char)(char) c) {
or simply
if((unsigned char) str[i] == (unsigned char)c) {
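Putting the last form into a complete function gives the sketch below. find_char_sketch and check_find_char are hypothetical names, and the test string assumes 8-bit bytes; the point is that the same conversion to unsigned char is applied to both sides, so a negative char argument and the value 0xE9 find the same byte.

```c
#include <assert.h>
#include <stddef.h>

/* strchr-like search: returns a pointer to the first match, or NULL. */
char *find_char_sketch(const char *str, int c) {
    const unsigned char target = (unsigned char)c;   /* same conversion on both sides */
    for (size_t i = 0; ; i++) {
        if ((unsigned char)str[i] == target) {
            return (char *)&str[i];   /* the terminating '\0' can match too */
        }
        if (str[i] == '\0') {
            return NULL;
        }
    }
}

void check_find_char(void) {
    const char *s = "caf\xE9";
    char high = s[3];   /* negative where plain char is signed */
    assert(find_char_sketch(s, 'f') == s + 2);
    assert(find_char_sketch(s, 'z') == NULL);
    assert(find_char_sketch(s, '\0') == s + 4);   /* terminator is part of the search */
    assert(find_char_sketch(s, high) == s + 3);
    assert(find_char_sketch(s, 0xE9) == s + 3);
}
```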
I need to make an strcmp function by myself, using operations with pointers. That's what I got:
int mystrcmp(const char *str1, const char *str2) {
while ('\0' != *str1 && *str1 == *str2) {
str1 += 1;
str2++;
}
int result1 = (uint8_t)(*str2) - (uint8_t)(*str1); // I need (uint8_t) to use it with Russian symbols.
return result1;
}
But my tutor told me that there is a small mistake in my code. I spent a lot of time writing tests, but couldn't find it.
Does this answer the question of what you're doing wrong?
#include <stdio.h>
#include <stdint.h>
#include <string.h>
int mystrcmp(const char *str1, const char *str2);
int main(void)
{
char* javascript = "JavaScript";
char* java = "Java";
printf("%d\n", mystrcmp(javascript, java));
printf("%d\n", strcmp(javascript, java));
return 0;
}
int mystrcmp(const char *str1, const char *str2) {
while ('\0' != *str1 && *str1 == *str2) {
str1 += 1;
str2++;
}
int result1 = (uint8_t)(*str2) - (uint8_t)(*str1); // I need (uint8_t) to use it with Russian symbols.
return result1;
}
Output:
-83
83
I'll propose a quick fix:
Change
int result1 = (uint8_t)(*str2) - (uint8_t)(*str1);
To
int result1 = (uint8_t)(*str1) - (uint8_t)(*str2);
And why you were wrong:
The return values of strcmp() should be:
if Return value < 0 then it indicates str1 is less than str2.
if Return value > 0 then it indicates str2 is less than str1.
if Return value = 0 then it indicates str1 is equal to str2.
And you were doing exactly the opposite.
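Since only the sign of the result is specified, a test that catches this kind of mistake can compare signs against the library strcmp() rather than exact values. The sketch below assumes the corrected subtraction order; sign_of, mystrcmp_fixed, same_sign_as_strcmp and check_sign_convention are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/* Collapse any int to -1, 0 or +1 so that only the sign is compared. */
int sign_of(int v) {
    return (v > 0) - (v < 0);
}

/* Corrected comparison: first string minus second, read as unsigned char. */
int mystrcmp_fixed(const char *str1, const char *str2) {
    while (*str1 != '\0' && *str1 == *str2) {
        str1++;
        str2++;
    }
    return (unsigned char)*str1 - (unsigned char)*str2;
}

/* Returns 1 when the sign agrees with strcmp() for this pair. */
int same_sign_as_strcmp(const char *a, const char *b) {
    return sign_of(mystrcmp_fixed(a, b)) == sign_of(strcmp(a, b));
}

void check_sign_convention(void) {
    assert(same_sign_as_strcmp("JavaScript", "Java"));
    assert(same_sign_as_strcmp("Java", "JavaScript"));
    assert(same_sign_as_strcmp("Java", "Java"));
    assert(mystrcmp_fixed("JavaScript", "Java") == 'S');   /* 'S' minus '\0' */
}
```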
@yLaguardia answered the order problem well.
int strcmp(const char *s1, const char *s2);
The strcmp function returns an integer greater than, equal to, or less than zero, accordingly as the string pointed to by s1 is greater than, equal to, or less than the string pointed to by s2. C11dr §7.24.4.2 3
Using uint8_t is fine for the vast majority of cases. Rare machines do not use 8-bit char, so uint8_t is not available. In any case, it is not needed as unsigned char handles the required unsigned compare. (See below about unsigned compare.)
int result1 =
((unsigned char)*str1 - (unsigned char)*str2);
Even more portable code would use the following to handle the case where the char range and the unsigned range match, as well as all other char, unsigned char, int, unsigned sizes/ranges.
int result1 =
((unsigned char)*str1 > (unsigned char)*str2) -
((unsigned char)*str1 < (unsigned char)*str2);
strcmp() is defined as treating each character as unsigned char, regardless if char is signed or unsigned.
... each character shall be interpreted as if it had the type
unsigned char ... C11 §7.24.1 3
Should the char be ASCII or not is not relevant to the coding of strcmp(). Of course under different character encoding, different results may occur. Example: strcmp("A", "a") may result in a positive answer (seldom used EBCDIC) with one encoding, but negative (ASCII) on another.
Do we have any alternative for the strrspn and strfind functions(libgen functions in Solaris) for gcc compiler in AIX?
The functionalities are mentioned below -
int strfind(const char *s1, const char *s2); - The strfind() function returns the offset of the first occurrence of the second string, s2, if it is a substring of string s1. If the second string is not a substring of the first string strfind() returns -1.
char *strrspn(const char *string, const char *cset); - The strrspn() function trims characters from a string. It searches from the end of string for the first character that is not contained in cset. If such a character is found, strrspn() returns a pointer to the next character; otherwise, it returns a pointer to string.
Please help with this?
There is nothing exactly like strfind that I know of, but you could implement it using strstr:
int
strfind (const char *haystack, const char *needle)
{
const char *res = strstr(haystack, needle);
// if not found, return -1
if (res == NULL)
return -1;
// else return the offset in haystack
return res - haystack;
}
strrspn is maybe a bit trickier, but you could do something along these lines:
char *
strrspn (const char *string, const char *cset)
{
const char *p = string + strlen(string);
// start from the back, and look for a char not in cset
while (p > string)
{
--p;
if (NULL == strchr(cset, *p))
return (char *)(p + 1); // pointer to the character after it
}
return (char *)string;
}
Needless to say, these functions are entirely untested and will likely not work as they stand, but they should give you an idea.
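As a rough check of the two ideas, here is a self-contained sketch with a few assertions. It follows the behaviour described in the question, not the Solaris sources, so details may differ from the real libgen functions; the _sketch names and check_libgen_sketches are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Offset of the first occurrence of needle in haystack, or -1. */
int strfind_sketch(const char *haystack, const char *needle) {
    const char *res = strstr(haystack, needle);
    return res == NULL ? -1 : (int)(res - haystack);
}

/* Pointer just past the last character of string that is not in cset,
   or string itself when every character is in cset. */
char *strrspn_sketch(const char *string, const char *cset) {
    const char *p = string + strlen(string);
    while (p > string) {
        --p;
        if (strchr(cset, *p) == NULL) {
            return (char *)(p + 1);
        }
    }
    return (char *)string;
}

void check_libgen_sketches(void) {
    char buf[] = "value   ";
    assert(strfind_sketch("hello world", "world") == 6);
    assert(strfind_sketch("hello world", "xyz") == -1);
    *strrspn_sketch(buf, " ") = '\0';   /* trim trailing spaces in place */
    assert(strcmp(buf, "value") == 0);
    assert(strrspn_sketch("   ", " ")[0] == ' ');   /* all in cset: returns string */
}
```

Writing a terminator through the returned pointer is how the trimming is done here; that only works when the caller owns a modifiable buffer.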