strcmp() and signed / unsigned chars - c

I am confused by strcmp(), or rather, how it is defined by the standard. Consider comparing two strings where one contains characters outside the ASCII-7 range (0-127).
The C standard defines:
int strcmp(const char *s1, const char *s2);
The strcmp function compares the string pointed to by s1 to the string
pointed to by s2.
The strcmp function returns an integer greater than, equal to, or
less than zero, accordingly as the
string pointed to by s1 is greater
than, equal to, or less than the
string pointed to by s2.
The parameters are char *. Not unsigned char *. There is no notion that "comparison should be done as unsigned".
But all the standard libraries I checked consider the "high" character to be just that, higher in value than the ASCII-7 characters.
I understand this is useful and the expected behaviour. I am not saying the existing implementations are wrong. I just want to know: which part of the standard have I missed?
int strcmp_default( const char * s1, const char * s2 )
{
while ( ( *s1 ) && ( *s1 == *s2 ) )
{
++s1;
++s2;
}
return ( *s1 - *s2 );
}
int strcmp_unsigned( const char * s1, const char *s2 )
{
unsigned char * p1 = (unsigned char *)s1;
unsigned char * p2 = (unsigned char *)s2;
while ( ( *p1 ) && ( *p1 == *p2 ) )
{
++p1;
++p2;
}
return ( *p1 - *p2 );
}
#include <stdio.h>
#include <string.h>
int main()
{
char x1[] = "abc";
char x2[] = "abü";
printf( "%d\n", strcmp_default( x1, x2 ) );
printf( "%d\n", strcmp_unsigned( x1, x2 ) );
printf( "%d\n", strcmp( x1, x2 ) );
return 0;
}
Output is:
103
-153
-153

7.21.4/1 (C99), emphasis is mine:
The sign of a nonzero value returned by the comparison functions memcmp, strcmp,
and strncmp is determined by the sign of the difference between the values of the first
pair of characters (both interpreted as unsigned char) that differ in the objects being
compared.
There is something similar in C90.
Note that strcoll() may be better suited than strcmp(), especially if you have characters outside the basic character set.
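The quoted rule can be sketched as a small comparison function. This is only an illustration under the assumption that returning -1, 0 or 1 is acceptable (the standard fixes the sign, not the magnitude); the name strcmp_sketch is hypothetical and is not how any particular library implements strcmp.

```c
/* Hypothetical name: a sketch of a strcmp-like comparison that follows
   the rule quoted above by reading every character as unsigned char. */
int strcmp_sketch(const char *s1, const char *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    /* Advance while both strings agree and the first has not ended. */
    while (*p1 != '\0' && *p1 == *p2) {
        ++p1;
        ++p2;
    }
    /* Only the sign is specified by the standard, so return -1, 0 or 1. */
    return (*p1 > *p2) - (*p1 < *p2);
}
```

With this sketch, a string containing a "high" byte such as "ab\xfc" compares greater than "abc" whether plain char is signed or unsigned on the platform.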

Related

Trying to create a strcat in c

So, I'm trying to code a strcat function using pointers, just for studying purposes.
#include <stdio.h>
#include <string.h>
char *strcpyy(char *dest, char *orig){
char *tmp = dest;
while (*dest++ = *orig++);
return tmp;
}
char *strcatt(char *dest, char *orig){
strcpyy(dest + strlen(dest), orig);
return dest;
}
int main(){
char *a = "one";
char *b = "two";
printf("%s", strcatt(a,b));
}
When I run this code, the output is empty. Can anyone point out the problem?
String literals are read-only. Any attempt to write to a string literal will invoke undefined behavior, which means that your program may crash or not behave as intended.
Therefore, you should not use a pointer to a string literal as the first argument to strcat or your equivalent function. Instead, you must provide a pointer to an object which is writable and has sufficient space for the result (including the terminating null character), for example a char array of length 7. This array can be initialized using a string literal.
Therefore, I recommend that you change the line
char *a = "one";
to the following:
char a[7] = "one";
After making this change, your program should work as intended.
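To make the "sufficient space" requirement explicit, a capacity-checked append could look roughly like the following. The helper append_checked and its cap parameter are hypothetical and not part of the standard library; this is a sketch, assuming the caller knows the size of the destination array.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: appends src to the string in dest, where cap is the
   total size of the dest array in bytes. Returns 0 on success and -1 if the
   result (including the terminating null character) would not fit. */
int append_checked(char *dest, size_t cap, const char *src)
{
    size_t used = strlen(dest);
    size_t need = strlen(src);

    if (used >= cap || need >= cap - used) {
        return -1; /* not enough room; dest is left unchanged */
    }
    memcpy(dest + used, src, need + 1); /* +1 copies the null terminator */
    return 0;
}
```

Called as append_checked(a, sizeof a, "two") with char a[7] = "one";, the sketch fills the array exactly; a further append is rejected instead of writing past the end.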
You declared two pointers to string literals
char *a = "one";
char *b = "two";
You may not append one string literal to another.
Instead you need to define the variable a as a character array large enough to contain the appended string literal pointed to by the pointer b.
Both functions should also be declared as
char *strcpyy(char *dest, const char *orig);
char *strcatt(char *dest, const char *orig);
Also, since you are already using standard C string functions such as strlen
strcpyy(dest + strlen(dest), orig);
it would be logically consistent to use the standard C function strcpy instead of your own function strcpyy.
Otherwise, without using standard string functions, your function strcatt could look like this:
char * strcatt( char *s1, const char *s2 )
{
char *p = s1;
while ( *p ) ++p;
while ( ( *p++ = *s2++ ) != '\0' );
return s1;
}
Here is a demonstration program.
#include <stdio.h>
char * strcatt( char *s1, const char *s2 )
{
char *p = s1;
while ( *p ) ++p;
while ( ( *p++ = *s2++ ) != '\0' );
return s1;
}
int main( void )
{
char a[7] = "one";
const char *b = "two";
puts( strcatt( a, b ) );
}
The program output is
onetwo
You cannot modify string literals; they are not mutable.
The usual idiom for this sort of operation is to build up the string in a temporary working buffer that is dimensioned in advance to be large enough to hold everything required.
The following also shows more explicit code for both of your functions.
#include <stdio.h>
char *strcpyy( char *dst, const char *org ) {
for( char *p = dst; (*p++ = *org++) != '\0'; /**/ )
; // loop
return dst;
}
char *strcatt( char *dst, const char *org ) {
char *p = dst;
while( *p != '\0' )
p++; //loop
while( (*p = *org++) != '\0' )
p++; // loop
return dst;
}
int main(){
const char *a = "one ";
const char *b = "two ";
const char *c = "three";
char wrk[ 64 ]; // sufficient mutable space defined
printf( "%s\n", strcatt( strcatt( strcpyy( wrk, a ), b ), c ) );
return 0;
}
one two three

passing string with pointer through function c

I'm trying to write this code. It should count the occurrences of a target string that appear as separate words in the text (that is, not embedded in another word), as in the examples below. In main I was trying to work with the string through a pointer, without using arrays, but it didn't work at all!
// count_target_string("abc of", "of") -> 1
// count_target_string("abcof", "of") -> 0
// count_target_string("abc of abc of", "of") -> 2
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int countTargetString(char* text , char* string){
char d[]=" ";
char * portion = strtok(text,d);
int result=0;
while (portion!=NULL){
if (strcmp(portion,string)==0){
result++;
}
portion = strtok(NULL,d);
}
return result;
}
int main(){
printf("%d\n",countTargetString("abc of abc of","of"));
char *test ="abc of abc of";
char *d = "of";
printf("%d\n",countTargetString(test,d));
return 0;
}
strtok modifies the string.
char *test ="abc of abc of"; defines a pointer to a string literal. Modifying a string literal invokes undefined behaviour (UB), which is why your code does "not work at all". The same applies if you pass a string literal directly to the function, as in countTargetString("abc of abc of","of").
Your pointer must reference a modifiable string:
int main()
{
char mystring[] = "abc of abc of";
char *test = mystring;
char *d = "of";
printf("%d\n",countTargetString(test,d));
}
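If the caller's string has to stay untouched (or may be a literal), one option is to let strtok work on a private copy. The sketch below assumes a single space as the delimiter, as in the question; the name count_target_copy and the -1 error return are hypothetical choices made for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant: counts space-separated words equal to `word`
   without touching the caller's string. Returns -1 if allocation fails. */
int count_target_copy(const char *text, const char *word)
{
    size_t len = strlen(text) + 1;
    char *copy = malloc(len);
    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, text, len); /* strtok will modify this private copy */

    int result = 0;
    for (char *tok = strtok(copy, " "); tok != NULL; tok = strtok(NULL, " ")) {
        if (strcmp(tok, word) == 0) {
            ++result;
        }
    }
    free(copy);
    return result;
}
```

Because both parameters are const char *, this version can be called directly with string literals, for example count_target_copy("abc of abc of", "of").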
In both calls of the function countTargetString
printf("%d\n",countTargetString("abc of abc of","of"));
char *test ="abc of abc of";
char *d = "of";
printf("%d\n",countTargetString(test,d));
you are passing pointers to string literals.
Although in C, unlike in C++, string literals have types of non-constant character arrays, you may not change a string literal. Any attempt to change a string literal results in undefined behavior.
From the C Standard (6.4.5 String literals)
7 It is unspecified whether these arrays are distinct provided their
elements have the appropriate values. If the program attempts to
modify such an array, the behavior is undefined.
The function strtok changes the source string by inserting terminating zero characters '\0' to extract substrings.
It is always better, even in C, to declare pointers to string literals with the qualifier const.
Instead of the function strtok, you can use the function strstr.
Here is a demonstration program.
#include <stdio.h>
#include <string.h>
#include <ctype.h>
size_t countTargetString( const char *s1, const char *s2 )
{
size_t count = 0;
size_t n = strlen( s2 );
for ( const char *p = s1; ( p = strstr( p, s2 ) ) != NULL; p += n )
{
if ( ( p == s1 || isblank( ( unsigned char )p[-1] ) ) &&
( p[n] == '\0' || isblank( ( unsigned char )p[n] ) ) )
{
++count;
}
}
return count;
}
int main( void )
{
printf("%zu\n",countTargetString("abc of abc of","of"));
const char *test ="abc of abc of";
const char *d = "of";
printf("%zu\n",countTargetString(test,d));
}
The program output is
2
2
As you can see, the function parameters are also declared with the qualifier const because the function does not change the passed strings.
Note that, in any case, it is a bad idea to change the original string just to count occurrences of substrings in it.
While strtok will not work with a string literal, strspn and strcspn can be used.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int countTargetString(const char* text , const char* string){
    char d[]=" ";
    int result = 0;
    size_t span = 0;
    size_t len = strlen ( string);
    while ( *text) {
        text += strspn ( text, d);   // skip leading delimiters
        span = strcspn ( text, d);   // length of the current word
        // count only words that match string in both length and content
        if ( span == len && strncmp ( text, string, span) == 0) {
            ++result;
        }
        text += span;
    }
    return result;
}
int main( void) {
    printf("%d\n",countTargetString("abc of abc of","of"));
    const char *test ="abc of abc of";
    const char *d = "of";
    printf("%d\n",countTargetString(test,d));
    return 0;
}
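#include <string.h>
int count_substr(const char* target, const char* searched) {
    int found = 0;
    size_t s_len = strlen(searched);
    if (s_len == 0) {
        return 0; // an empty pattern would otherwise never advance
    }
    for (size_t i = 0; target[i] != '\0'; i++) {
        // strncmp with s_len does NOT compare the null terminator of searched,
        // and it stops at target's terminator, so it never reads past the end
        if (strncmp(target + i, searched, s_len) == 0) {
            found++;
            i += s_len - 1; // skip the match; the loop's i++ steps past its last character
        }
    }
    return found;
}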
This is a very basic implementation of substring counting. For the fastest possible solution, take the Boyer-Moore pattern matching algorithm from Wikipedia or wherever you like and modify it to count matches instead of terminating on the first one.

Compare string with ">", "<", "=" in c

I want to know what is compared when we write s1 > s2, and what the result is (for example, when we compare chars, their corresponding ASCII codes are compared). I read that this operation compares the first chars of s1 and s2, but if I compare with s1[0] > s2[0] the result is different, so that cannot be it.
Comparing with == means checking if the pointers point to the same object, so this:
char s1[] = "foo";
char s2[] = "foo";
if(s1 == s2)
puts("Same object");
else
puts("Different object");
would print "Different object".
< and > make no sense unless both pointers point into the same object (for example, the same array); comparing pointers into different objects this way is undefined behavior. For instance, you could do this:
char s[] = "asdfasdf";
char *p1 = &s[3], *p2 = &s[6];
if(p1 < p2)
// Code
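The difference between comparing addresses and comparing contents can be sketched with two small helpers. The names same_object and same_content are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helpers contrasting the two kinds of comparison. */

/* True only when both pointers refer to the same address. */
bool same_object(const char *a, const char *b)
{
    return a == b;
}

/* True when the characters of both strings match, wherever they are stored. */
bool same_content(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

For two separate arrays that both hold "foo", same_object reports false while same_content reports true.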
If you have character arrays like for example
char s1[] = "Hello";
char s2[] = "Hello";
then in an if statement like for example this
if ( s1 == s2 ) { /*,,,*/ }
the character arrays are implicitly converted to pointers to their first elements.
So the above statement is equivalent to
if ( &s1[0] == &s2[0] ) { /*,,,*/ }
Since the arrays occupy different regions of memory, the result of such a comparison is always false.
If you want to compare strings stored in the arrays you need to use the standard string function strcmp declared in the header <string.h>.
For example
#include <string.h>
#include <stdio.h>
//...
if ( strcmp( s1, s2 ) == 0 )
{
puts( "The strings are equal." );
}
else if ( strcmp( s1, s2 ) < 0 )
{
puts( "The first string is less than the second string." );
}
else
{
puts( "The first string is greater than the second string." );
}
You could use the following macro, which might be considered a slight abuse of the preprocessor.
#define STROP(a, OP, b) (strcmp(a, b) OP 0)
Example usage:
STROP(s1, >=, s2) // expands to `strcmp(s1,s2) >= 0`
STROP(s1, ==, s2) // expands to `strcmp(s1, s2) == 0`

Strcmp() function realization on C

I need to make an strcmp function by myself, using operations with pointers. That's what I got:
int mystrcmp(const char *str1, const char *str2) {
while ('\0' != *str1 && *str1 == *str2) {
str1 += 1;
str2++;
}
int result1 = (uint8_t)(*str2) - (uint8_t)(*str1); // I need (uint8_t) to use it with Russian symbols.
return result1;
}
But my tutor told me that there is a small mistake in my code. I spent a really long time testing, but couldn't find it.
Does this answer the question of what you're doing wrong?
#include <stdio.h>
#include <stdint.h>
#include <string.h>
int mystrcmp(const char *str1, const char *str2);
int main(void)
{
char* javascript = "JavaScript";
char* java = "Java";
printf("%d\n", mystrcmp(javascript, java));
printf("%d\n", strcmp(javascript, java));
return 0;
}
int mystrcmp(const char *str1, const char *str2) {
while ('\0' != *str1 && *str1 == *str2) {
str1 += 1;
str2++;
}
int result1 = (uint8_t)(*str2) - (uint8_t)(*str1); // I need (uint8_t) to use it with Russian symbols.
return result1;
}
Output:
-83
83
I'll propose a quick fix:
Change
int result1 = (uint8_t)(*str2) - (uint8_t)(*str1);
To
int result1 = (uint8_t)(*str1) - (uint8_t)(*str2);
And why you were wrong:
The return values of strcmp() should be:
if Return value < 0 then it indicates str1 is less than str2.
if Return value > 0 then it indicates str2 is less than str1.
if Return value = 0 then it indicates str1 is equal to str2.
And you were doing exactly the opposite.
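One way to catch this kind of reversed result in tests is to compare only the sign of the result against the library strcmp, since the standard specifies the sign and not the magnitude. The names mystrcmp_fixed, sign_of and agrees_with_strcmp below are hypothetical; this is a sketch of such a check, not the tutor's intended solution.

```c
#include <string.h>

/* Hypothetical name: the question's function with the operands in the
   order strcmp uses (first string minus second string). */
int mystrcmp_fixed(const char *str1, const char *str2)
{
    while (*str1 != '\0' && *str1 == *str2) {
        str1++;
        str2++;
    }
    return (unsigned char)*str1 - (unsigned char)*str2;
}

/* Reduces any int to -1, 0 or 1 so results can be compared by sign only. */
int sign_of(int v)
{
    return (v > 0) - (v < 0);
}

/* Returns 1 when mystrcmp_fixed and the library strcmp agree in sign. */
int agrees_with_strcmp(const char *a, const char *b)
{
    return sign_of(mystrcmp_fixed(a, b)) == sign_of(strcmp(a, b));
}
```

Running agrees_with_strcmp on pairs where one string is a prefix of the other, such as "JavaScript" and "Java" in both orders, exercises exactly the case shown in the output above.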
@yLaguardia answered the order problem well.
int strcmp(const char *s1, const char *s2);
The strcmp function returns an integer greater than, equal to, or less than zero, accordingly as the string pointed to by s1 is greater than, equal to, or less than the string pointed to by s2. C11dr §7.24.4.2 3
Using uint8_t is fine for the vast majority of cases. Rare machines do not use an 8-bit char, and on those uint8_t is not available. In any case, it is not needed, as unsigned char handles the required unsigned compare. (See below about unsigned compare.)
int result1 =
((unsigned char)*str1 - (unsigned char)*str2);
Even more portable code would use the following, which handles the case where the char range and the unsigned range match, as well as all other char, unsigned char, int, and unsigned sizes/ranges.
int result1 =
((unsigned char)*str1 > (unsigned char)*str2) -
((unsigned char)*str1 < (unsigned char)*str2);
strcmp() is defined as treating each character as unsigned char, regardless of whether char is signed or unsigned.
... each character shall be interpreted as if it had the type
unsigned char ... C11 §7.24.1 3
Whether the characters are ASCII or not is not relevant to the coding of strcmp(). Of course, different character encodings may give different results. Example: strcmp("A", "a") may return a positive value with one encoding (the seldom-used EBCDIC) but a negative value with another (ASCII).

Unsigned char vs char in C — comparison of strings

I have a small assignment to do in C and I tried to find the best way I can to compare two strings (char arrays of course since strings are not defined in C).
This is my code :
int equal(char *s1, char *s2)
{
int a = 0;
while(!(a = *(unsigned char *)s1 - *(unsigned char *)s2) && *s2) ++s1, ++s2;
return (a == 0) ? 1 : 0;
}
It works but I don't see why I have to convert my char to an unsigned char.
(Of course I cannot use <string.h> in my assignment.)
The original code is fairly optimal. For simple equality comparisons, there is no need for the (unsigned char *) casts. The following works fine (but see the corner case further down, where char has the same range as int):
int equal(char *s1, char *s2) {
int a = 0;
while(!(a = *s1 - *s2) && *s2) ++s1, ++s2;
return (a == 0) ? 1 : 0;
}
In making more optimal code, there is no need to check both strings for the null character '\0' as in if (*s1 || *s2) .... Since the code already checks for a non-zero a, checking only one string is sufficient.
"... of course since strings are not defined in C" is not so. C does define "string", though not as a type :
"A string is a contiguous sequence of characters terminated by and including the first null character" C11 §7.1.1 1
Using (unsigned char *) makes sense if the code is attempting not only to compare for equality, but also to order. Even in this case, the type could be char. But by casting to unsigned char, or even signed char, the code provides consistent results across platforms, even where some have char as signed char and others as unsigned char.
// return 0, -1 or +1
int order(const char *s1, const char *s2) {
const unsigned char *uc1 = (const unsigned char *) s1;
const unsigned char *uc2 = (const unsigned char *) s2;
while((*uc1 == *uc2) && *uc1) ++uc1, ++uc2;
return (*uc1 > *uc2) - (*uc1 < *uc2);
}
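To see why the choice of signedness matters for ordering but not for equality, the same loop can be run through both pointer types. This sketch assumes 8-bit characters, an ASCII-compatible encoding and two's-complement signed char; the names order_unsigned and order_signed are hypothetical.

```c
/* Hypothetical helpers: the same loop, reading characters through two
   different pointer types. Each returns -1, 0 or +1. */
int order_unsigned(const char *s1, const char *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;
    while (*p1 == *p2 && *p1) ++p1, ++p2;
    return (*p1 > *p2) - (*p1 < *p2);
}

int order_signed(const char *s1, const char *s2)
{
    const signed char *p1 = (const signed char *)s1;
    const signed char *p2 = (const signed char *)s2;
    while (*p1 == *p2 && *p1) ++p1, ++p2;
    return (*p1 > *p2) - (*p1 < *p2);
}
```

Under those assumptions, comparing "a" with "\xe9" gives -1 from order_unsigned (97 is less than 233) but +1 from order_signed (97 is greater than -23), while both agree on 0 for equal strings. Plain char behaves like one or the other depending on the platform, which is why an explicit cast gives consistent ordering.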
Using const in the function signature allows the code to be called with const char * arguments, as in order(buffer, "factorial");. Calling OP's equal(char *s1, char *s2) as equal(buffer, "factorial"); would be undefined behavior only if the routine modified *s1 or *s2, which it does not. Using const does reduce certain warnings and allows for some optimizations. Credit: @abligh
There is a corner case where the casting is needed. If the range of char is the same as the range of int (some graphics processors do that) and char is a signed char, then *s1 - *s2 can overflow, and that is undefined behavior (UB). Of course, platforms that have the same range for char and int are rare. IMO, it is doubtful that a version of this code without the casts would fail even on such machines, but it is technically UB.
How about
int equal(const char *s1, const char *s2)
{
int i;
for (i=0; s1[i] || s2[i]; i++)
if (s1[i] != s2[i])
return 0;
return 1;
}
Or if you prefer while loops:
int equal(const char *s1, const char *s2)
{
while (*s1 || *s2)
if (*s1++ != *s2++)
return 0;
return 1;
}
To answer your specific question, in order to compare two strings (or indeed two characters) there is no need to convert them to unsigned char. I hope you agree my method is a little more readable than yours.

Resources