I'm trying to do exercise 5-4 in the K&R C book. I have written my versions of strncpy and strncat, but I'm having some trouble understanding exactly what to return for the strncmp part of the exercise.
The definition of strncmp (from Appendix B in K&R book) is:
compare at most n characters of string s to string t; return <0 if s<t, 0 if s==t, or >0 if s>t
Let's say I have 3 strings:
char s[128] = "abc";
char t[128] = "abcdefghijk";
char u[128] = "hello";
And I want to compare them using the strncmp function I have to write. I know that
strncmp(s, t, 3)
will return 0, because abc == abc. Where I'm confused is with the other comparisons. For example
strncmp(s, t, 5) and
strncmp(s, u, 4)
The first matches up to the 3rd position and after that they no longer match; the second example doesn't match at all.
I really just want to know what those two other comparisons return and why, so that I can write my version of strncmp and finish the exercise.
Both return a negative number (it just compares using character order). I just did a quick test and on my machine it's returning the difference of the last-compared characters. So:
strncmp(s, t, 5) = -100 // '\0' - 'd'
strncmp(s, u, 4) = -7 // 'a' - 'h'
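For reference, a quick test like the following reproduces those numbers on an implementation that returns the byte difference (such as glibc on x86-64); the exact values are not guaranteed by the standard, only their sign:
#include <stdio.h>
#include <string.h>

int main(void)
{
    char s[128] = "abc";
    char t[128] = "abcdefghijk";
    char u[128] = "hello";
    printf("%d\n", strncmp(s, t, 5)); /* -100 here: '\0' - 'd' */
    printf("%d\n", strncmp(s, u, 4)); /* -7 here: 'a' - 'h' */
    return 0;
}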
Is that what you're looking for?
The characters in the first non-matching positions are cast to unsigned char and then compared numerically - if that character in s1 is less than the corresponding character in s2, then a negative number is returned; if it's greater, a positive number is returned.
The contract for strncmp is to return an integral value whose sign indicates the result of the comparison:
a negative value indicates that the 1st operand compares as being "less than" the 2nd operand,
a positive, non-zero value indicates that the 1st operand compares as being "greater than" the 2nd operand, and
0 indicates that the two operands compare as being "equal to" each other.
The reason it's defined that way, rather than, say, "return -1 for less than, 0 for equal to, and +1 for greater than", is to avoid constraining the implementation.
The value returned for a particular C runtime library depends upon how the function is implemented. The POSIX specification (IEEE 1003.1) for strncmp() (which tracks the C Standard) says:
The strncmp() function shall compare not more than n bytes (bytes that follow a null
byte are not compared) from the array pointed to by s1 to the array pointed to by s2.
The sign of a non-zero return value is determined by the sign of the difference
between the values of the first pair of bytes (both interpreted as type unsigned
char) that differ in the strings being compared.
That should be about all you need to know to implement it (a sketch follows these notes). You should note, though, that:
strncmp() is not "safe", in the sense that it is subject to buffer overflows. A proper implementation will merrily compare characters until it encounters an ASCII NUL, hits the maximum length, or tries to access protected memory.
The specification says that the sign of the return value is based on the delta between the 1st pair of characters that differ; no particular return value is mandated.
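If it helps, here is a minimal sketch of an implementation that satisfies that contract. It is just one way to do it, not the library's code (the name my_strncmp is mine):
#include <stddef.h>

/* Compare at most n characters of s and t; only the sign of the
   result is meaningful to callers. */
int my_strncmp(const char *s, const char *t, size_t n)
{
    for (; n > 0; s++, t++, n--) {
        if (*s != *t)
            return (unsigned char)*s - (unsigned char)*t;
        if (*s == '\0')   /* both strings ended together: equal */
            return 0;
    }
    return 0;             /* first n characters all matched */
}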
Good luck.
It is lexicographic order: strings are compared in alphabetical order from left to right.
So abc < abcdefghijk < hello
strncmp(s, t, 5) = -1
strncmp(s, u, 4) = -1
(The -1 here is what one implementation happens to return; only the sign is guaranteed.)
I want to understand the following code:
//...
#define _C 0x20
extern const char *_ctype_;
//...
__only_inline int iscntrl(int _c)
{
    return (_c == -1 ? 0 : ((_ctype_ + 1)[(unsigned char)_c] & _C));
}
It originates from the file ctype.h in the OpenBSD operating system source code. This function checks whether a char is a control character or a printable letter inside the ASCII range. This is my current chain of thought:
iscntrl('a') is called and 'a' is converted to its integer value
first check if _c is -1 then return 0 else...
increment the address the undefined pointer points to by 1
declare this address as a pointer to an array of length (unsigned char)((int)'a')
apply the bitwise and operator to _C (0x20) and the array (???)
Somehow, strangely, it works, and every time 0 is returned the given char _c is not a printable character. Otherwise, when it's printable, the function just returns an integer value that's not of any special interest. My problem of understanding is in steps 3, 4 (a bit) and 5.
Thank you for any help.
_ctype_ appears to be a restricted internal version of the symbol table, and I'm guessing the + 1 is because they didn't bother saving index 0 of it, since that one isn't printable. Or possibly they are using a 1-indexed table instead of the 0-indexed that is customary in C.
The C standard dictates this for all ctype.h functions:
In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF
Going through the code step by step:
int iscntrl(int _c) The int types are really characters, but all ctype.h functions are required to handle EOF, so they must be int.
The check against -1 is a check against EOF, since it has the value -1.
_ctype_ + 1 is pointer arithmetic to get the address of an array item.
[(unsigned char)_c] is simply an array access of that array, where the cast is there to enforce the standard requirement of the parameter being representable as unsigned char. Note that char can actually hold a negative value, so this is defensive programming. The result of the [] array access is a single character from their internal symbol table.
The & masking is there to get a certain group of characters from the symbol table. Apparently all characters with bit 5 set (mask 0x20) are control characters. There's no making sense of this without viewing the table.
Anything with bit 5 set will return the value masked with 0x20, which is a non-zero value. This satisfies the requirement of the function returning non-zero in case of boolean true.
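To make the mechanism concrete, here is a toy version of the table-lookup technique. The flag value and names below are invented for illustration; this is not OpenBSD's actual table, which also reserves an extra slot for EOF (see the next answer):
#include <stdio.h>

#define MY_C 0x20 /* "control character" flag, mirroring _C */

/* One bit-mask entry per character; only a few entries filled in here. */
static const unsigned char my_table[256] = {
    ['\t'] = MY_C, ['\n'] = MY_C,
};

static int my_iscntrl(int c)
{
    return c == -1 ? 0 : (my_table[(unsigned char)c] & MY_C);
}

int main(void)
{
    printf("%d %d\n", my_iscntrl('\n') != 0, my_iscntrl('a') != 0); /* 1 0 */
    return 0;
}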
_ctype_ is a pointer to a global array of 257 bytes. I don't know what _ctype_[0] is used for. _ctype_[1] through _ctype_[256] represent the character categories of characters 0, …, 255 respectively: _ctype_[c + 1] represents the category of the character c. This is the same thing as saying that _ctype_ + 1 points to an array of 256 characters where (_ctype_ + 1)[c] represents the category of the character c.
(_ctype_ + 1)[(unsigned char)_c] is not a declaration. It's an expression using the array subscript operator. It's accessing position (unsigned char)_c of the array that starts at (_ctype_ + 1).
The cast of _c from int to unsigned char is not strictly necessary: ctype functions take char values cast to unsigned char (char is signed on OpenBSD), so a correct call is char c; … iscntrl((unsigned char)c). The cast has the advantage of guaranteeing that there is no buffer overflow: if the application calls iscntrl with a value that is outside the range of unsigned char and isn't -1, the function returns a value which may not be meaningful, but at least won't cause a crash or a leak of private data that happened to be at an address outside the array bounds. The value is even correct if the function is called as char c; … iscntrl(c), as long as c isn't -1.
The reason for the special case with -1 is that it's EOF. Many standard C functions that operate on a char, for example getchar, represent the character as an int value which is the char value wrapped to a positive range, and use the special value EOF == -1 to indicate that no character could be read. For functions like getchar, EOF indicates the end of the file, hence the name end-of-file. Eric Postpischil suggests that the code was originally just return _ctype_[_c + 1], and that's probably right: _ctype_[0] would be the value for EOF. This simpler implementation is subject to a buffer overflow if the function is misused, whereas the current implementation avoids this, as discussed above.
If v is the value found in the array, v & _C tests if the bit at 0x20 is set in v. The values in the array are masks of the categories that the character is in: _C is set for control characters, _U is set for uppercase letters, etc.
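A small sketch of that presumed layout (the values are invented) shows why (_ctype_ + 1)[c] and _ctype_[c + 1] are the same access, and how slot 0 can hold the entry for EOF:
#include <stdio.h>

/* 1 + 256 entries: slot 0 for EOF (-1), slots 1..256 for characters 0..255. */
static const unsigned char tbl[257] = { 0, /* EOF: no categories set */
                                        [1 + '\n'] = 0x20 /* control */ };

int main(void)
{
    int c = '\n';
    /* (tbl + 1)[c] and tbl[c + 1] denote the same element. */
    printf("%d %d\n", (tbl + 1)[(unsigned char)c], tbl[c + 1]); /* 32 32 */
    return 0;
}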
I'll start with step 3:
increment the address the undefined pointer points to by 1
The pointer is not undefined. It's just defined in some other compilation unit. That is what the extern part tells the compiler. So when all files are linked together, the linker will resolve the references to it.
So what does it point to?
It points to an array with information about each character. Each character has its own entry. An entry is a bitmap representation of characteristics for the character. For example: if bit 5 is set, it means that the character is a control character. Another example: if bit 0 is set, it means that the character is an uppercase character.
So something like (_ctype_ + 1)['x'] will get the characteristics that apply to 'x'. Then a bitwise and is performed to check if bit 5 is set, i.e. check whether it is a control character.
The reason for adding 1 is probably that the real index 0 is reserved for some special purpose.
All information here is based on analyzing the source code (and programming experience).
The declaration
extern const char *_ctype_;
tells the compiler that there is a pointer to const char somewhere named _ctype_.
(4) This pointer is accessed as an array.
(_ctype_ + 1)[(unsigned char)_c]
The cast (unsigned char)_c makes sure the index value is in the range of an unsigned char (0..255).
The pointer arithmetic _ctype_ + 1 effectively shifts the array position by 1 element. I don't know why they implemented the array this way. Using the range _ctype_[1].._ctype_[256] for the character values 0..255 leaves the value _ctype_[0] unused by this function. (The offset of 1 could be implemented in several alternative ways.)
The array access retrieves a value (of type char, to save space) using the character value as array index.
(5) The bitwise AND operation extracts a single bit from the value.
Apparently the value from the array is used as a bit field where the bit 5 (counting from 0 starting at least significant bit, = 0x20) is a flag for "is a control character". So the array contains bit field values describing the properties of the characters.
The key here is to understand what the expression (_ctype_ + 1)[(unsigned char)_c] does (the result of which is then fed to the bitwise AND operation, & 0x20, to get the result)!
Short answer: It returns element _c + 1 of the array pointed-to by _ctype_.
How?
First, although you seem to think _ctype_ is undefined, it actually isn't! The header declares it as an external variable - but it is defined in (almost certainly) one of the run-time libraries that your program is linked with when you build it.
To illustrate how the syntax corresponds to array indexing, try working through (even compiling) the following short program:
#include <stdio.h>

int main() {
    // Code like the following two lines will be defined somewhere in the run-time
    // libraries with which your program is linked, only using _ctype_ in place of _qlist_ ...
    const char list[] = "abcdefghijklmnopqrstuvwxyz";
    const char* _qlist_ = list;
    // These two lines show how expressions like (a)[b] and (a+1)[b] just boil down to
    // a[b] and a[b+1], respectively ...
    char p = (_qlist_)[6];
    char q = (_qlist_ + 1)[6];
    printf("p = %c q = %c\n", p, q); // prints: p = g q = h
    return 0;
}
Feel free to ask for further clarification and/or explanation.
The functions declared in ctype.h accept arguments of type int. Characters used as arguments are assumed to have been cast to the type unsigned char beforehand. This character is used as an index into a table that determines the characteristics of the character.
It seems the check _c == -1 is used for the case when _c contains the value of EOF. If it is not EOF, then _c is cast to the type unsigned char and used as an index into the table pointed to by the expression _ctype_ + 1. And if the bit specified by the mask 0x20 is set, then the character is a control character.
To understand the expression
(_ctype_ + 1)[(unsigned char)_c]
take into account that array subscripting is a postfix operator that is defined like
postfix-expression [ expression ]
You may not write it like
_ctype_ + 1[(unsigned char)_c]
because this expression is equivalent to
_ctype_ + ( 1[(unsigned char)_c] )
So the expression _ctype_ + 1 is enclosed in parentheses to get a primary expression.
So in fact you have
pointer[integral_expression]
that yields the element of the array at the index calculated from integral_expression, where pointer is (_ctype_ + 1) (this is where the pointer arithmetic is used) and the index integral_expression is (unsigned char)_c.
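The reason an expression like 1[(unsigned char)_c] is even legal is that a[b] is defined as *(a + b), so a[b] and b[a] denote the same object. A quick check:
#include <stdio.h>

int main(void)
{
    const char s[] = "abc";
    /* s[1] == *(s + 1) == *(1 + s) == 1[s] */
    printf("%c %c\n", s[1], 1[s]); /* prints: b b */
    return 0;
}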
#include <stdio.h>
#include <string.h>

int main()
{
    int test1 = 8410092; // 0x8053EC
    int test2 = 8404974; // 0x803FEE
    char *t1 = (char *) &test1;
    char *t2 = (char *) &test2;
    int ret2 = memcmp(t1, t2, 4);
    printf("%d\n", ret2);
}
Here's a very basic program that prints -2 when run. Maybe I am totally misunderstanding memcmp, but I thought it returns the difference between the first differing bytes. Since test1 is a larger number than test2, shouldn't the printed value be positive?
I am using the standard gcc 7 compiler on Ubuntu.
As pointed out in the comments, memcmp() performs a byte-by-byte comparison. Here is a quote from the man page:
int memcmp(const void *s1, const void *s2, size_t n);
RETURN VALUE:
The memcmp() function returns an integer less than, equal to, or
greater than zero if the first n bytes of s1 is found, respectively,
to be less than, to match, or be greater than the first n bytes of
s2
For a nonzero return value, the sign is determined by the sign of the
difference between the first pair of bytes (interpreted as unsigned
char) that differ in s1 and s2.
If n is zero, the return value is zero.
http://man7.org/linux/man-pages/man3/memcmp.3.html
If the bytes are not the same, the sign of the difference depends on the target endianness.
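To see this concretely, assuming a little-endian machine such as x86, dump the bytes of the two integers. The least significant byte is stored first, so memcmp compares 0xEC against 0xEE as the first pair, and 0xEC - 0xEE is -2, exactly the value printed above:
#include <stdio.h>

int main(void)
{
    int test1 = 0x8053EC;
    int test2 = 0x803FEE;
    unsigned char *b1 = (unsigned char *)&test1;
    unsigned char *b2 = (unsigned char *)&test2;
    /* On little-endian x86: EC 53 80 00 vs EE 3F 80 00 */
    for (size_t i = 0; i < sizeof(int); i++)
        printf("byte %zu: 0x%02X vs 0x%02X\n", i, b1[i], b2[i]);
    return 0;
}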
One application of memcmp() is testing whether two large arrays are the same, which can be faster than writing a loop that runs an element-by-element comparison. Refer to this Stack Overflow question for more details: Why is memcmp so much faster than a for loop check?
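For example, a sketch of that use, assuming both arrays have the same element type and count:
#include <string.h>

/* Non-zero when the two int arrays hold identical bytes. */
static int same_contents(const int *a, const int *b, size_t n)
{
    return memcmp(a, b, n * sizeof *a) == 0;
}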
memcmp compares memory. That is, it compares the bytes used to represent objects. The bytes used to represent objects may vary from one C implementation to another. Per C 2018 6.2.6 2:
Except for bit-fields, objects are composed of contiguous sequences of one or more bytes, the number,
order, and encoding of which are either explicitly specified or implementation-defined.
To compare the values represented by objects, use the ordinary operators <, <=, >, >=, ==, and !=. Comparing the memory of objects with memcmp should be used for limited purposes, such as inserting objects into a tree that only needs to be able to store and retrieve items without caring about what their values mean.
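A short contrast of the two kinds of comparison, using this question's integers; only the value comparison is portable here:
#include <stdio.h>
#include <string.h>

int main(void)
{
    int test1 = 0x8053EC, test2 = 0x803FEE;
    /* Value comparison: uses the numeric values, independent of layout. */
    printf("%d\n", test1 > test2); /* 1 */
    /* Representation comparison: depends on byte order. */
    printf("%d\n", memcmp(&test1, &test2, sizeof test1) > 0); /* 0 on x86 */
    return 0;
}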
My code for testing strcmp is as follows:
char s1[10] = "racecar";
char *s2 = "raceCar"; //yes, a capital 'C'
int diff;
diff = strcmp(s1,s2);
printf(" %d\n", diff);
So I am confused about why the output is 32. What exactly is it comparing to get that result? I appreciate your time and help.
Whatever it wants. In this case, it looks like the value you're getting is 'c' - 'C' (the difference between the two characters at the first point where the strings differ), which is equal to 32 on many systems, but you shouldn't by any means count on that. The only thing that you can count on is that the return will be 0 if the two strings are equal, negative if s1 comes before s2, and positive if s1 comes after s2.
The man page states that the output will be greater than 0 or less than 0 if the strings are not the same. It doesn't say anything else regarding the exact value (if not 0).
That being said, the ASCII codes for c and C differ by 32. That's probably where the result is coming from. You can't depend on this behavior being identical in any two given implementations however.
It is not specified. According to the standard:
7.24.4.2 The strcmp function
#include <string.h>
int strcmp(const char *s1, const char *s2);
Description
The strcmp function compares the string pointed to by s1 to the string pointed to by
s2.
Returns
The strcmp function returns an integer greater than, equal to, or less than zero,
accordingly as the string pointed to by s1 is greater than, equal to, or less than the string pointed to by s2.
According to the C standard (N1570 7.24.4.2):
The strcmp function returns an integer greater than, equal to,
or less than zero, accordingly as the string pointed to by s1 is
greater than, equal to, or less than the string pointed to by
s2.
It says nothing about which positive or negative value it will return if the strings are unequal, and portable code should only check whether the result is less than, equal to, or greater than zero.
Having said that, a straightforward implementation of strcmp would likely return the numeric difference in the values of the first characters that don't match. In your case, the first non-matching characters are 'c' and 'C', which happen to differ by 32 in ASCII.
Don't count on this.
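In practice that means testing only the sign, for example:
#include <stdio.h>
#include <string.h>

int main(void)
{
    char s1[10] = "racecar";
    const char *s2 = "raceCar";
    int diff = strcmp(s1, s2);
    if (diff == 0)
        puts("equal");
    else if (diff < 0)
        puts("s1 sorts before s2");
    else
        puts("s1 sorts after s2"); /* this branch runs: 'c' > 'C' */
    return 0;
}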
"strcmp" compares strings and when it reaches a different character, it will return the difference between them.
In your case, it reaches 'c' in your first string, and 'C' in your second string. 'c' in hex is 0x63 while 'C' is 0x43. Subtract and you get 0x20, which is 32 in decimal.
We use strcmp to check whether strings are equal: it returns 0 if they are.
strcmp compares the strings character by character until it reaches characters that don't match or the terminating null-character.
So the strcmp function sees that 'c' (which is 99 in ASCII) is greater than 'C' (which is 67 in ASCII), so it returns a positive integer. The exact positive integer it returns is defined by your implementation.
I have the program below:
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* needed for strcmp */

int main()
{
    char text1[30], text2[30];
    int diff;
    puts("Enter text1:");
    fgets(text1, 30, stdin); /* note: fgets keeps the trailing newline */
    puts("Enter text2:");
    fgets(text2, 30, stdin);
    diff = strcmp(text1, text2);
    printf("Difference between %s and %s is %d\n", text1, text2, diff);
}
If I give text1 as inputtext and text2 as differencetext, then the difference should be 5, but I am getting 1 for different inputs. I am not sure where I am going wrong.
The specification for strcmp in the C standard says only that it “returns an integer greater than, equal to, or less than zero, accordingly as the string pointed to by s1 is greater than, equal to, or less than the string pointed to by s2” (C 2011 N1570 7.24.4.2 3, C 2018 ibid).
You may not rely on more specific behavior, such as returning a specific value, unless you have an additional guarantee from your C implementation.
All that the specifications say is that strcmp will return a number "less than", "greater than" or "equal to" zero depending on the result of the comparison.
I'm not sure why you believe that the difference should be 5.
I think you misunderstood what strcmp does:
int strcmp(const char *s1, const char *s2);
Upon completion, strcmp() shall return an integer greater than, equal to, or less than 0, if the string pointed to by s1 is greater than, equal to, or less than the string pointed to by s2, respectively.
From cplusplus.com:
About strcmp return value
Returns an integral value indicating the relationship between the strings:
A zero value indicates that both strings are equal.
A value greater than zero indicates that the first character that does not match has a greater value in str1 than in str2; And a value less than zero indicates the opposite.
That's because strcmp returns an int: negative if the first string is less than the second, positive non-zero if the second is less than the first, and 0 if they are equal.
The following piece of code behaves differently in 32-bit and 64-bit operating systems.
char *cat = "v,a";
if (strcmp(cat, ",") == 1)
...
The above condition is true in 32-bit but false in 64-bit. I wonder why this is different?
Both 32-bit and 64-bit OS are Linux (Fedora).
The strcmp() function is only defined to return a negative value if argument 1 precedes argument 2, zero if they're identical, or a positive value if argument 1 follows argument 2.
There is no guarantee of any sort that the value returned will be +1 or -1 at any time. Any equality test based on that assumption is faulty. It is conceivable that the 32-bit and 64-bit versions of strcmp() return different numbers for a given string comparison, but any test that looks for +1 from strcmp() is inherently flawed.
Your comparison code should be one of:
if (strcmp(cat, ",") > 0) // cat > ","
if (strcmp(cat, ",") == 0) // cat == ","
if (strcmp(cat, ",") >= 0) // cat >= ","
if (strcmp(cat, ",") <= 0) // cat <= ","
if (strcmp(cat, ",") < 0) // cat < ","
if (strcmp(cat, ",") != 0) // cat != ","
Note the common theme — all the tests compare with 0. You'll also see people write:
if (strcmp(cat, ",")) // != 0
if (!strcmp(cat, ",")) // == 0
Personally, I prefer the explicit comparisons with zero; I mentally translate the shorthands into the appropriate longhand (and resent being made to do so).
Note that the specification of strcmp() says:
ISO/IEC 9899:2011 §7.24.4.2 The strcmp function
¶3 The strcmp function returns an integer greater than, equal to, or less than zero,
accordingly as the string pointed to by s1 is greater than, equal to, or less than the string pointed to by s2.
It says nothing about +1 or -1; you cannot rely on the magnitude of the result, only on its signedness (or that it is zero when the strings are equal).
Standard functions don't exhibit different behaviour based on the "bittedness" of your OS unless you're doing something silly like, for example, not including the relevant header file. They are required to exhibit exactly the behaviour specified in the standard, unless you violate the rules. Otherwise, your compiler, while close, will not be a C compiler.
However, as per the standard, the return value from strcmp() is either zero, positive or negative, it's not guaranteed to be +/-1 when non-zero.
Your expression would be better written as:
strcmp (cat, ",") > 0
The faultiness of using strcmp (cat, ",") == 1 has nothing to do with whether your OS is 32 or 64 bits, and everything to do with the fact you've misunderstood the return value. From the ISO C11 standard:
The strcmp function returns an integer greater than, equal to, or less than zero, accordingly as the string pointed to by s1 is greater than, equal to, or less than the string pointed to by s2.
The semantics guaranteed by strcmp() are well explained above in Jonathan's answer.
Coming back to your original question i.e.
Q. Why strcmp() behaviour differs in 32-bit and 64-bit systems?
Answer: strcmp() is implemented in glibc, wherein there exist different implementations for various architectures, all highly optimised for the corresponding architecture.
strcmp() on x86
strcmp() on x86-64
As the spec simply defines that the return value is one of 3 possibilities (-ve, 0, +ve), the various implementations are free to return any value as long as the sign indicates the result appropriately.
On certain architectures (in this case x86), it is faster to simply compare each byte without storing the result. Hence it's quicker to simply return -/+1 on a mismatch.
(Note that one could use subb instead of cmpb on x86 to obtain the difference in magnitude of the non-matching bytes. But this would require 1 additional clock cycle per byte. This would mean an additional 3% increase in total time taken, as each complete iteration runs in less than 30 clock cycles.)
On other architectures (in this case x86-64), the difference between the byte values of the corresponding characters is already available as a by-product of the comparison. Hence it's faster to simply return it rather than test it again and return -/+1.
Both are perfectly valid output as the strcmp() function is ONLY guaranteed to return the result using the proper sign and the magnitude is architecture/implementation specific.
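To make the two strategies concrete, here are rough C sketches of the inner loops; glibc's real code is hand-optimised assembly, so these only model the returned values:
/* Strategy A (the effect of the x86-64 version): return the byte difference. */
int strcmp_diff(const char *a, const char *b)
{
    while (*a && *a == *b) { a++; b++; }
    return (unsigned char)*a - (unsigned char)*b;
}

/* Strategy B (the effect of the x86 version): return only -1, 0 or +1. */
int strcmp_sign(const char *a, const char *b)
{
    while (*a && *a == *b) { a++; b++; }
    return ((unsigned char)*a > (unsigned char)*b) -
           ((unsigned char)*a < (unsigned char)*b);
}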