qsort function compare confused me - c

I see lots of people use subtraction in a qsort comparator function. I think it is wrong because when dealing with these numbers: int nums[]={-2147483648,1,2,3}; INT_MIN = -2147483648;
int compare (const void * a, const void * b)
{
return ( *(int*)a - *(int*)b );
}
I wrote this function to test:
#include <stdio.h>
#include <limits.h>
int compare (const void * a, const void * b)
{
return ( *(int*)a - *(int*)b );
}
int main(void)
{
int a = 1;
int b = INT_MIN;
printf("%d %d\n", a,b);
printf("%d\n",compare((void *)&a,(void *)&b));
return 0;
}
The output is:
1 -2147483648
-2147483647
but a > b so the output should be positive。
I have seen many books write like this. I think it is wrong; it should be written like this when dealing with int types:
int compare (const void * a, const void * b)
{
if(*(int *)a < *(int *)b)
return -1;
else if(*(int *)a > *(int *)b)
return 1;
else
return 0;
}
I just cannot figure out why many books and web sites write in such a misleading way.
If you have any different view, please let me know.

I think it is wrong
Yes, a simple subtraction can lead to int overflow which is undefined behavior and should be avoided.
return *(int*)a - *(int*)b; // Potential undefined behavior.
A common idiom is to subtract two integer compares. Various compilers recognize this and create efficient well behaved code. Preserving const-ness also is good form.
const int *ca = a;
const int *cb = b;
return (*ca > *cb) - (*ca < *cb);
why many books and web sites write in such a misleading way.
return *a - *b; is conceptually easy to digest - even if it provides the wrong answer with extreme values - often learner code omits edge conditions to get the idea across - "knowing" that values will never be large.
Or consider the complexities of comparing long doubles with regard to NaN.

Your understanding is absolutely correct. This common idiom cannot be used for int values.
Your proposed solution works correctly, although it would be more readable with local variables to avoid so many casts:
int compare(const void *a, const void *b) {
const int *aa = a;
const int *bb = b;
if (*aa < *bb)
return -1;
else if (*aa > *bb)
return 1;
else
return 0;
}
Note that modern compilers will generate the same code with or without these local variables: always prefer the more readable form.
A more compact solution with the same exact result is commonly used although a bit more difficult to understand:
int compare(const void *a, const void *b) {
const int *aa = a;
const int *bb = b;
return (*aa > *bb) - (*aa < *bb);
}
Note that this approach works for all numeric types, but will return 0 for NaN floating point values.
As for your remark: I just cannot figure out why many books and web sites write in such a misleading way:
Many books and websites contain mistakes, and so do most programs. Many programming bugs get caught and squashed before they reach production if the program is tested wisely. Code fragments in books are not tested, and although they never reach production, the bugs they contain do propagate virally via unsuspecting readers who learn bogus methods and idioms. A very bad and lasting side effect.
Kudos to you for catching this! You have a rare skill among programmers: you are a good reader. There are far more programmers who write code than programmers who can read code correctly and see mistakes. Hone this skill by reading other people's code, on stack overflow or from open source projects... And do report the bugs.
The subtraction method is in common use, I have seen it in many places like you and it does work for most value pairs. This bug may go unnoticed for eons. A similar problem was latent in the zlib for decades: int m = (a + b) / 2; causes a fateful integer overflow for large int values of a and b.
The author probably saw it used and thought the subtraction was cool and fast, worth showing in print.
Note however that the erroneous function does work correctly for types smaller than int: signed or unsigned char and short, if these types are indeed smaller than int on the target platform, which the C Standard does not mandate.
Indeed similar code can be found in The C Programming Language by Brian Kernighan and Dennis Ritchie, the famous K&R C bible by its inventors. They use this approach in a simplistic implementation of strcmp() in chapter 5. The code in the book is dated, going all the way back to the late seventies. Although it has implementation defined behavior, it does not invoke undefined behavior in any but the rarest architectures among which the infamous DeathStation-9000, yet it should not be used to compare int values.

You are correct, *(int*)a - *(int*)b poses a risk of integer overflow and ought to be avoided as a method of comparing two int values.
It is possible it could be valid code in a controlled situation where one knows the values are such that the subtraction will not overflow. In general, though, it should be avoided.

The reason why so many books are wrong is likely the root of all evil: the K&R book. In chapter 5.5 they try to teach how to implement strcmp:
int strcmp(char *s, char *t)
{
int i;
for (i = 0; s[i] == t[i]; i++)
if (s[i] == '\0')
return 0;
return s[i] - t[i];
}
This code is questionable since char has implementation-defined signedness. Ignoring that, and ignoring that they fail to use const correctness as in the standard C version, the code otherwise works, partially because it relies on implicit type promotion to int (which is ugly), partially since they assume 7 bit ASCII, and the worst case 0 - 127 cannot underflow.
Further down in the book, 5.11, they try to teach how to use qsort:
qsort((void**) lineptr, 0, nlines-1,
(int (*)(void*,void*))(numeric ? numcmp : strcmp));
Ignoring the fact that this code invokes undefined behavior, since strcmp is not compatible with the function pointer int (*)(void*, void*), they teach to use the above method from strcmp.
However, looking at their numcmp function, it looks like this:
/* numcmp: compare s1 and s2 numerically */
int numcmp(char *s1, char *s2)
{
double v1, v2;
v1 = atof(s1);
v2 = atof(s2);
if (v1 < v2)
return -1;
else if (v1 > v2)
return 1;
else
return 0;
}
Ignoring the fact that this code will crash and burn if an invalid character is found by atof (such as the very likely locale issue with . versus ,), they actually manage to teach the correct method of writing such a comparison function. Since this function uses floating point, there's really no other way to write it.
Now someone might want to come up with an int version of this. If they do it based on the strcmp implementation rather than the floating point implementation, they'll get bugs.
Overall, just by flipping a few pages in this once canonical book, we already found some 3-4 cases of reliance on undefined behavior and 1 case of reliance on implementation-defined behavior. So it is really no wonder if people who learn C from this book writes code full of undefined behavior.

First, it's of course correct that an integer during the comparison could create serious problems for you.
On the other hand, doing a single subtraction is cheaper than going through an if/then/else, and the comparison gets performed O(n^2) times in a quicksort, so if this sort is performance-critical and we can get away with it we may want to use the difference.
It will work fine so long as all the values are in some range of size less than 2^31, because then their differences have to be smaller. So if whatever is generating the list you want to sort is going to keep values between a billion and minus one billion then you're fine using subtraction.
Note that checking that the values are in such a range prior to the sort is an O(n) operation.
On the other hand if there's a chance that the overflow could happen, you'd want to use something like the code you wrote in your question
Note that lots of stuff you see doesn't explicitly take overflow into account; it's just that maybe that's more expected in something that's more obviously an "arithmetic" context.

Related

Is there a case where bitwise swapping won't work?

At school someday several years ago I had to do a swap function that swaps two integers, I wanted to do this using bitwise operations without using a third variable, so I came up with this:
void swap( int * a, int * b ) {
*a = *a ^ *b;
*b = *a ^ *b;
*a = *a ^ *b;
}
I thought it was good but when my function was tested by the school's correction program it found an error (of course when I asked they didn't want to tell me), and still today I don't know what didn't work, so I wonder in which case this method wouldn't work.
I wanted to do this using bitwise operations without using a third variable
Do you mind if I ask why? Was there a practical reason for this limitation, or was it just an intellectual puzzle?
when my function was tested by the school's correction program it found an error
I can't be sure what the correction program was complaining about, but one class of inputs this sort of solution is known to fail on is exemplified by
int x = 5;
swap(&x, &x);
printf("%d\n", x);
This prints 0, not 5.
You might say, "Why would anyone swap something with itself?"
They probably wouldn't, as I've shown it, but perhaps you can imagine that, in a mediocrely-written sort algorithm, it might end up doing the equivalent of
if(a[i] < a[j]) {
/* they are in order */
} else {
swap(&a[i], &a[j]);
}
Now, if it ever happens that i and j are the same, the swap function will wrongly zero out a[i].
See also What is the difference between two different swapping function?

Feel kind of confused by the book "Programming in C" (Stephen Kochan)

I've been teaching myself in C programming with the book recommended by a friend who is great in C. The book title is "Programming in C" by Stephen Kochan.
I have a background in Java, and I feel a little bit crazy with the way the codes were written in Stephen's book. For example, the following code, in which I commented my confusion. Maybe I'm missing something important here, so I'm looking to hear some inputs about the correct way of coding in C.
#include <stdio.h>
void test(int *int_pointer)
{
*int_pointer = 100;
}
int main(void)
{
void test(int *int_pointer); // why call the test() function here without any real argument? what's the point?
int i = 50, *p = &i;
printf("Before the call to test i = %i\n", i);
test(p);
printf("After the call to test i = %i\n", i);
int t;
for (t = 0; t < 5; ++t) // I'm more used to "t++" in a loop like this. As I know ++t is different than t++ in some cases. Writting ++t in a loop just drives me crazy
{
if (4 == t) // isn't it normal to write "t == 4" ?? this is driving me crazy again!
printf("skip the number %i\n", t);
else
printf("the value of t is now %i\n", t);
}
return 0;
}
// why call the test() function here without any real argument? what's the point?
It is not a call, it is function declaration. Completely unnecessary at this location, since the function is defined few lines before. In real world such declarations are not used often.
// I'm more used to "t++" in a loop like this. As I know ++t is different than t++ in some cases. Writting ++t in a loop just drives me crazy
In this case they are equivalent, but if you think of going to C++ it is better to switch completely to ++t form, since there in some cases (e.g. with iterators) it makes difference.
// isn't it normal to write "t == 4" ?? this is driving me crazy again!
Some people tend to use 4 == t to avoid a problem when t = 4 is used instead of t == 4 (both are valid in C as if condition). Since all normal compilers signal a warning for t = 4 anyway, 4 == t is rather unnecessary.
Please read about pointers then you will understand that a pointer to an int has been passed as an argument here...
void test(int *int_pointer);
You can see the difference between ++t and t++ nicely explained in this link . It doesn't make a difference in this code. Result will be the same.
if(4 == t) is same as if(t == 4) . Just different styles in writing. 4 == t is mostly used to avoid typing = instead of ==. Compiler will complain if you write 4 = t but wont complain if you write t = 4
why call the test() function here without any real argument? what's the point?
Here test is declared as function (with void return type) which expects an argument of the type a pointer to int.
I'm more used to "t++" in a loop like this. As I know ++t is different than t++ in some cases. Writting ++t in a loop just drives me crazy
Note that, when incrementing or decrementing a variable in a statement by itself (t++; or ++t), the pre-increment and post-increment have same effect.
The difference can be seen when these expression appears in a large or complex expressions ( int x = t++ and int x = ++t have different results for the same value of t).
isn't it normal to write "t == 4" ?? this is driving me crazy again!
4 == t is much safer than t == 4, although both have same meaning. In case of t == 4, if user type accidentally t = 4 then compiler would not going to throw any error and you may get erroneous result. While in case of 4 == t, if user accidentally type 4 = t then compiler would through you a warning like:
lvalue is required as left operand of assignment operator.
void test(int *int_pointer); is a function prototype. It's not required in this particular instance since the function is defined above main() but you would need it (though not necessarily in the function body) if test was defined later in the file. (Some folk rely on implicit declaration but let's not get into that here.)
++t will never be slower than t++ since, conceptually, the latter has to store and return the previous value. (Most compilers will optimise the copy out, although I prefer not to rely on that: I always use ++t but plenty of experienced programmers don't.)
4 == t is often used in place of t == 4 in case you accidentally omit one of the =. It's easily done but once you've spent a day or two hunting down a bug caused by a single = in place of == you won't ever do it again! 4 = t will generate a compile error but t = 4 is actually an expression of value 4 which will compare true and assigns the value of 4 to t: a particularly dangerous side-effect. Personally though I find 4 == t obfuscating.

if statement in C & assignment without a double call

Would it be possible to implement an if that checks for -1 and if not negative -1 than assign the value. But without having to call the function twice? or saving the return value to a local variable. I know this is possible in assembly, but is there a c implementation?
int i, x = -10;
if( func1(x) != -1) i = func1(x);
saving the return value to a local variable
In my experience, avoiding local variables is rarely worth the clarity forfeited. Most compilers (most of the time) can often avoid the corresponding load/stores and just use registers for those locals. So don't avoid it, embrace it! The maintainer's sanity that gets preserved just might be your own.
I know this is possible in assembly, but is there a c implementation?
If it turns out your case is one where assembly is actually appropriate, make a declaration in a header file and link against the assembly routine.
Suggestion:
const int x = -10;
const int y = func1(x);
const int i = y != -1
? y
: 0 /* You didn't really want an uninitialized value here, right? */ ;
It depends whether or not func1 generates any side-effects. Consider rand(), or getchar() as examples. Calling these functions twice in a row might result in different return values, because they generate side effects; rand() changes the seed, and getchar() consumes a character from stdin. That is, rand() == rand() will usually1 evaluate to false, and getchar() == getchar() can't be predicted reliably. Supposing func1 were to generate a side-effect, the return value might differ for consecutive calls with the same input, and hence func1(x) == func1(x) might evaluate to false.
If func1 doesn't generate any side-effect, and the output is consistent based solely on the input, then I fail to see why you wouldn't settle with int i = func1(x);, and base logic on whether or not i == -1. Writing the least repetitive code results in greater legibility and maintainability. If you're concerned about the efficiency of this, don't be. Your compiler is most likely smart enough to eliminate dead code, so it'll do a good job at transforming this into something fairly efficient.
1. ... at least in any sane standard library implementation.
int c;
if((c = func1(x)) != -1) i = c;
The best implementation I could think of would be:
int i = 0; // initialize to something
const int x = -10;
const int y = func1(x);
if (y != -1)
{
i = y;
}
The const would let the compiler to any optimizations that it thinks is best (perhaps inline func1). Notice that func is only called once, which is probably best. The const y would also allow y to be kept in a register (which it would need to be anyway in order to perform the if). If you wanted to give more of a suggestion, you could do:
register const int y = func1(x);
However, the compiler is not required to honor your register keyword suggestion, so its probably best to leave it out.
EDIT BASED ON INSPIRATION FROM BRIAN'S ANSWER:
int i = ((func1(x) + 1) ?:0) - 1;
BTW, I probably wouldn't suggest using this, but it does answer the question. This is based on the SO question here. To me, I'm still confused as to the why for the question, it seems like more of a puzzle or job interview question than something that would be encountered in a "real" program? I'd certainly like to hear why this would be needed.

Can I please get some feedback for this strcmp() function I implemented in C?

I'm learning C.
I find I learn programming well when I try things and received feedback from established programmers in the language.
I decided to write my own strcmp() function, just because I thought I could :)
int strcompare(char *a, char *b) {
while (*a == *b && *a != '\0') {
a++;
b++;
}
return *a - *b;
}
I was trying to get it to work by incrementing the pointer in the condition of the while but couldn't figure out how to do the return. I was going for the C style code, of doing as much as possible on one line :)
Can I please get some feedback from established C programmers? Can this code be improved? Do I have any bad habits?
Thanks.
If you want to do everything in the while statement, you could write
while (*a != '\0' && *a++ == *b++) {}
I'm not personally a huge fan of this style of programming - readers need to mentally "unpack" the order of operations anyway, when trying to understand it (and work out if the code is buggy or not). Memory bugs are particularly insidious in C, where overwriting memory one byte beyond or before where you should can cause all sorts of inexplicable crashes or bugs much later on, away from the original cause.
Modern styles of C programming emphasize correctness, consistency, and discipline more than terseness. The terse expression features, like pre- and post-increment operations, were originally a way of getting the compiler to generate better machine code, but optimizers can easily do that themselves these days.
As #sbi writes, I'd prefer const char * arguments instead of plain char * arguments.
The function doesn't change the content of a and b. it should probably announce that by taking pointers to const strings.
Most C styles are much terser than many other languages' styles, but don't try to be too clever. (In your code, with several conditions ANDed in the loop conditions, I don't think there's way to put incrementing in there, so this isn't even a question of style, but of correctness.)
I don't know since when putting as much as possible is considered as C-style... I rather associate (obfuscated) Perl with that..
Please DO NOT do this. The best thing to do is one command per line. You will understand why when you try to debug your code :)
To your implementation: Seems quite fine to me, but I would put in the condition that *b is not '\0' either, because you can't know that a is always bigger than b... Otherwise you risk reading in unallocated memory...
You may find this interesting, from eglibc-2.11.1. It's not far different to your own implementation.
/* Compare S1 and S2, returning less than, equal to or
greater than zero if S1 is lexicographically less than,
equal to or greater than S2. */
int
strcmp (p1, p2)
const char *p1;
const char *p2;
{
register const unsigned char *s1 = (const unsigned char *) p1;
register const unsigned char *s2 = (const unsigned char *) p2;
unsigned reg_char c1, c2;
do
{
c1 = (unsigned char) *s1++;
c2 = (unsigned char) *s2++;
if (c1 == '\0')
return c1 - c2;
}
while (c1 == c2);
return c1 - c2;
}
A very subtle bug: strcmp compares bytes interpreted as unsigned char, but your function is interpreting them as char (which is signed on most implementations). This will cause non-ascii characters to sort before ascii instead of after.
This function will fail if limits of (insigned) char is equal to or greater than limits of int because of integer overflow.
For example if you compile it on DSP which have 16 bit char with limits 0...65536 and 16 bit int with limits of -32768...32767, then if you try to compare strings like
"/uA640" and "A" the result will be negative, which is not true.
This is exotic and weird problem but it appear when you write universal implementation.

Behaviour of printf when printing a %d without supplying variable name

I've just encountered a weird problem, I'm trying to printf an integer variable, but I forgot to specify the variable name, i.e.
printf("%d");
instead of
printf("%d", integerName);
Surprisingly the program compiles, there is output and it is not random. In fact, it happens to be the very integer I wanted to print in the first place, which happens to be m-1.
The errorneous printf statement will consistently output m-1 for as long as the program keeps running... In other words, it's behaving exactly as if the statement reads
printf("%d", m-1);
Anybody knows the reason behind this behaviour? I'm using g++ without any command line options.
#include <iostream>
#define maxN 100
#define ON 1
#define OFF 0
using namespace std;
void clearArray(int* array, int n);
int fillArray(int* array, int m, int n);
int main()
{
int n = -1, i, m;
int array[maxN];
int found;
scanf("%d", &n);
while(n!=0)
{
found=0;
m = 1;
while(found!=1)
{
if(m != 2 && m != 3 && m != 4 && m != 6 && m != 12)
{
clearArray(array, n);
if(fillArray(array, m, n) == 0)
{
found = 1;
}
}
m++;
}
printf("%d\n");
scanf("%d", &n);
}
return 0;
}
void clearArray(int* array, int n)
{
for(int i = 1; i <= n; i++)
array[i] = ON;
}
int fillArray(int* array, int m, int n)
{
int i = 1, j, offCounter = 0, incrementCounter;
while(offCounter != n)
{
if(*(array+i)==ON)
{
*(array+i) = OFF;
offCounter++;
}
else
{
j = 0;
while((*array+i+j)==OFF)
{
j++;
}
*(array+i+j) = OFF;
offCounter++;
}
if(*(array+13) == OFF && offCounter != n) return 1;
if(offCounter ==n) break;
incrementCounter = 0;
while(incrementCounter != m)
{
i++;
if(i > n) i = 1;
if(*(array+i) == ON) incrementCounter++;
}
}
return 0;
}
You say that "surprisingly the program compiles". Actually, it is not surprising at all. C & C++ allow for functions to have variable argument lists. The definition for printf is something like this:
int printf(char*, ...);
The "..." signifies that there are zero or more optional arguments to the function. In fact, one of the main reasons C has optional arguments is to support the printf & scanf family of functions.
C has no special knowledge of the printf function. In your example:
printf("%d");
The compiler doesn't analyse the format string and determine that an integer argument is missing. This is perfectly legal C code. The fact that you are missing an argument is a semantic issue that only appears at runtime. The printf function will assume that you have supplied the argument and go looking for it on the stack. It will pick up whatever happens to be on there. It just happens that in your special case it is printing the right thing, but this is an exception. In general you will get garbage data. This behaviour will vary from compiler to compiler and will also change depending on what compile options you use; if you switch on compiler optimisation you will likely get different results.
As pointed out in one of the comments to my answer, some compilers have "lint" like capabilities that can actually detect erroneous printf/scanf calls. This involves the compiler parsing the format string and determining the number of extra arguments expected. This is very special compiler behaviour and will not detect errors in the general case. i.e. if you write your own "printf_better" function which has the same signature as printf, the compiler will not detect if any arguments are missing.
What happens looks like this.
printf("%d", m);
On most systems the address of the string will get pushed on the stack, and then 'm' as an integer (assuming it's an int/short/char). There is no warning because printf is basically declared as 'int printf(const char *, ...);' - the ... meaning 'anything goes'.
So since 'anything goes' some odd things happen when you put variables there. Any integral type smaller than an int goes as an int - things like that. Sending nothing at all is ok as well.
In the printf implementation (or at least a 'simple' implementation) you will find usage of va_list and va_arg (names sometime differ slightly based on conformance). These are what an implementation uses to walk around the '...' part of the argument list. Problem here is that there is NO type checking. Since there is no type checking, printf will pull random data off the execution stack when it looks at the format string ("%d") and thinks there is supposed to be an 'int' next.
Random shot in the dark would say that the function call you made just before printf possibly passed 'm-1' as it's second parm? That's one of many possibilities - but it would be interesting if this happened to be the case. :)
Good luck.
By the way - most modern compilers (GCC I believe?) have warnings that can be enabled to detect this problem. Lint does as well I believe. Unfortunately I think with VC you need to use the /analyze flag instead of getting for free.
It got an int off the stack.
http://en.wikipedia.org/wiki/X86_calling_conventions
You're peering into the stack. Change the optimizer values, and this may change. Change the order of the declarations your variables (particularly) m. Make m a register variable. Make m a global variable.
You'll see some variations in what happens.
This is similar to the famous buffer overrun hacks that you get when you do simplistic I/O.
While I would highly doubt this would result in a memory violation, the integer you get is undefined garbage.
You found one behavior. It could have been any other behavior, including an invalid memory access.

Resources