Divide without losing remainder - c

In C, is it possible to divide a dividend by a constant and get the result and the remainder at the same time?
I want to avoid execution of 2 division instructions, as in this example:
val=num / 10;
mod=num % 10;

I wouldn't worry about the instruction count, because the x86 instruction set provides an idivl instruction that computes the quotient and remainder in one instruction. Any decent compiler will make use of it. The documentation here http://programminggroundup.blogspot.com/2007/01/appendix-b-common-x86-instructions.html describes the division as follows (the quoted entry covers unsigned division; idivl is the signed form and uses the same registers):
Performs unsigned division. Divides the contents of the double-word
contained in the combined %edx:%eax registers by the value in the
register or memory location specified. The %eax register contains the
resulting quotient, and the %edx register contains the resulting
remainder. If the quotient is too large to fit in %eax, it triggers a
type 0 interrupt.
For example, compiling this sample program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
int x = 39;
int divisor = 1;
int div = 0;
int rem = 0;
printf("Enter the divisor: ");
scanf("%d", &divisor);
div = x/divisor;
rem = x%divisor;
printf("div = %d, rem = %d\n", div, rem);
}
Compiling it with gcc -S -O2 (-S keeps the temporary file that contains the asm listing) shows that the division and mod in the following lines
div = x/divisor;
rem = x%divisor;
is effectively reduced to the following instruction:
idivl 28(%esp)
As you can see, there's one instruction to perform both the division and the mod calculation. The idivl instruction remains even if the mod calculation in the C program is removed. After the idivl come several mov instructions:
movl $.LC2, (%esp)
movl %edx, 8(%esp)
movl %eax, 4(%esp)
call printf
These instructions copy the quotient and the remainder onto the stack for the call to printf.
Update
Interestingly the function div doesn't do anything special other than wrap the / and % operators in a function call. Therefore, from a performance perspective, it will not improve the performance by replacing the lines
val=num / 10;
mod=num % 10;
with a single call to div.
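A minimal sketch of the point above, assuming a hosted C implementation. The helper names split_ops and split_div are hypothetical; they only illustrate that the operator pair and div() give the same quotient and remainder.

```c
#include <stdlib.h>

/* Quotient and remainder via the plain operators. An optimizing
   compiler is free to merge the two into one division instruction. */
void split_ops(int num, int den, int *quot, int *rem)
{
    *quot = num / den;
    *rem  = num % den;
}

/* The same pair via the standard library's div(). */
void split_div(int num, int den, int *quot, int *rem)
{
    div_t r = div(num, den);
    *quot = r.quot;
    *rem  = r.rem;
}
```

Calling either helper with 39 and 10 yields a quotient of 3 and a remainder of 9, so the choice between them is about readability rather than results.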

There's div():
div_t result = div(num, 10);
// quotient is result.quot
// remainder is result.rem

Don't waste your time with div(). Like Nemo said, the compiler will easily optimize a division followed by a modulus operation on the same operands into one. Write code that makes optimal sense, and let the computer remove the cruft.

You could always use the div function.

Related

Why is using a third variable faster than an addition trick?

When computing fibonacci numbers, a common method is mapping the pair of numbers (a, b) to (b, a + b) multiple times. This can usually be done by defining a third variable c and doing a swap. However, I realised you could do the following, avoiding the use of a third integer variable:
b = a + b; // b2 = a1 + b1
a = b - a; // a2 = b2 - a1 = b1, Ta-da!
I expected this to be faster than using a third variable, since in my mind this new method should only have to consider two memory locations.
So I wrote the following C programs comparing the processes. These mimic the calculation of fibonacci numbers, but rest assured I am aware that they will not calculate the correct values due to size limitations.
(Note: I realise now that it was unnecessary to make n a long int, but I will keep it as it is because that is how I first compiled it)
File: PlusMinus.c
// Using the 'b=a+b;a=b-a;' method.
#include <stdio.h>
int main() {
long int n = 1000000; // Number of iterations.
long int a,b;
a = 0; b = 1;
while (n--) {
b = a + b;
a = b - a;
}
printf("%lu\n", a);
}
File: ThirdVar.c
// Using the third-variable method.
#include <stdio.h>
int main() {
long int n = 1000000; // Number of iterations.
long int a,b,c;
a = 0; b = 1;
while (n--) {
c = a;
a = b;
b = b + c;
}
printf("%lu\n", a);
}
When I run the two with GCC (no optimisations enabled) I notice a consistent difference in speed:
$ time ./PlusMinus
14197223477820724411
real 0m0.014s
user 0m0.009s
sys 0m0.002s
$ time ./ThirdVar
14197223477820724411
real 0m0.012s
user 0m0.008s
sys 0m0.002s
When I run the two with GCC with -O3, the assembly outputs are equal. (I suspect I had confirmation bias when stating that one just outperformed the other in previous edits.)
Inspecting the assembly for each, I see that PlusMinus.s actually has one less instruction than ThirdVar.s, but runs consistently slower.
Question
Why does this time difference occur? Not only at all, but also why is my addition/subtraction method slower contrary to my expectations?
Why does this time difference occur?
There is no time difference when compiled with optimizations (under recent versions of gcc and clang). For instance, gcc 8.1 for x86_64 compiles both to:
.LC0:
.string "%lu\n"
main:
sub rsp, 8
mov eax, 1000000
mov esi, 1
mov edx, 0
jmp .L2
.L3:
mov rsi, rcx
.L2:
lea rcx, [rdx+rsi]
mov rdx, rsi
sub rax, 1
jne .L3
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
add rsp, 8
ret
Not only at all, but also why is my addition/subtraction method slower contrary to my expectations?
Adding and subtracting could be slower than just moving. However, on most architectures (e.g. an x86 CPU), the cost is basically the same (1 cycle plus the memory latency), so this does not explain it.
The real problem is, most likely, the dependencies between the data. See:
b = a + b;
a = b - a;
To compute the second line, you have to have finished computing the value of the first. If the compiler uses the expressions as they are (which is the case under -O0), that is what the CPU will see.
In your second example, however:
c = a;
a = b;
b = b + c;
You can compute both the new a and b at the same time, since they do not depend on each other. And, in a modern processor, those operations can actually be computed in parallel. Or, putting it another way, you are not "stopping" the processor by making it wait on a previous result. This is called Instruction-level parallelism.
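To make the dependency argument concrete, here is a sketch of both loop bodies as functions. The names fib_plus_minus and fib_third_var are hypothetical, and unsigned long is an assumption chosen so that wrap-around is well defined; the comments mark which statements depend on each other.

```c
/* Both helpers advance (a, b) -> (b, a + b) n times and return a. */
unsigned long fib_plus_minus(unsigned long n)
{
    unsigned long a = 0, b = 1;
    while (n--) {
        b = a + b;   /* must finish before the next line can start */
        a = b - a;   /* depends on the b that was just computed */
    }
    return a;
}

unsigned long fib_third_var(unsigned long n)
{
    unsigned long a = 0, b = 1, c;
    while (n--) {
        c = a;
        a = b;       /* does not need the result of the addition below */
        b = b + c;   /* does not need the result of the move above */
    }
    return a;
}
```

Both functions return the same value for every n; only the shape of the dependency chain inside the loop differs.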

multiplication instruction error in inline assembly

Consider following program:
#include <stdio.h>
int main(void) {
int foo = 10, bar = 15;
__asm__ __volatile__("add %%ebx,%%eax"
:"=a"(foo)
:"a"(foo), "b"(bar)
);
printf("foo+bar=%d\n", foo);
}
I know that the add instruction is used for addition, the sub instruction for subtraction, and so on. But I didn't understand these lines:
__asm__ __volatile__("add %%ebx,%%eax"
:"=a"(foo)
:"a"(foo), "b"(bar)
);
What is the exact meaning of :"=a"(foo) :"a"(foo), "b"(bar) ); ? What does it do? And when I try to use the mul instruction here, I get the following error for this program:
#include <stdio.h>
int main(void) {
int foo = 10, bar = 15;
__asm__ __volatile__("mul %%ebx,%%eax"
:"=a"(foo)
:"a"(foo), "b"(bar)
);
printf("foo*bar=%d\n", foo);
}
Error: number of operands mismatch for `mul'
So, why am I getting this error? How do I solve it? I've searched on Google, but I couldn't find a solution to my problem. I am using Windows 10 and the processor is an Intel Core i3.
What is the exact meaning of :"=a"(foo) :"a"(foo), "b"(bar) );
There is a detailed description of how parameters are passed to the asm instruction here. In short, this is saying that bar goes into the ebx register, foo goes into eax, and after the asm is executed, eax will contain an updated value for foo.
Error: number of operands mismatch for `mul'
Yeah, that's not the right syntax for mul. Perhaps you should spend some time with an x86 assembler reference manual (for example, this).
I'll also add that using inline asm is usually a bad idea.
Edit: I can't fit a response to your question into a comment.
I'm not quite sure where to start. These questions seem to indicate that you don't have a very good grasp of how assembler works at all. Trying to teach you asm programming in a SO answer is not really practical.
But I can point you in the right direction.
First of all, consider this bit of asm code:
movl $10, %eax
movl $15, %ebx
addl %ebx, %eax
Do you understand what that does? What will be in eax when this completes? What will be in ebx? Now, compare that with this:
int foo = 10, bar = 15;
__asm__ __volatile__("add %%ebx,%%eax"
:"=a"(foo)
:"a"(foo), "b"(bar)
);
By using the "a" constraint, you are asking gcc to move the value of foo into eax. By using the "b" constraint you are asking it to move bar into ebx. It does this, then executes the instructions for the asm (ie add). On exit from the asm, the new value for foo will be in eax. Get it?
Now, let's look at mul. According to the docs I linked you to, we know that the syntax is mul value. That seems weird, doesn't it? How can there only be one parameter to mul? What does it multiply the value with?
But if you keep reading, you see "Always multiplies EAX by a value." Ahh. So the "eax" register is always implied here. So if you were to write mul %ebx, that would really mean mul ebx, eax, but since it always has to be eax, there's no real point in writing it out.
However, it's a little more complicated than that. ebx can hold a 32-bit number. Since we are using ints (instead of unsigned ints), that means that ebx could have a number as big as 2,147,483,647. But wait, what happens if you multiply 2,147,483,647 * 10? Well, since 2,147,483,647 is already as big a number as you can store in a register, the result is much too big to fit into eax. So the multiplication (always) uses 2 registers to output the result from mul. This is what that link meant when it said "stores the result in EDX:EAX."
So, you could write your multiplication like this:
int foo = 10, bar = 15;
int upper;
__asm__ ("mul %%ebx"
:"=a"(foo), "=d"(upper)
:"a"(foo), "b"(bar)
:"cc"
);
As before, this puts bar in ebx and foo in eax, then executes the multiplication instruction.
And after the asm is done, eax will contain the lower part of the result and edx will contain the upper. If foo * bar < 2,147,483,647, then foo will contain the result you need and upper will be zero. Otherwise, things get more complicated.
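For comparison, the same upper/lower split can be sketched in portable C without inline asm, assuming <stdint.h> is available. The helper name mul32_wide is hypothetical, and it uses unsigned operands because mul is an unsigned multiply.

```c
#include <stdint.h>

/* Unsigned 32x32 -> 64-bit multiply, split into the halves that mul
   would leave in EDX (upper) and EAX (lower). */
uint32_t mul32_wide(uint32_t a, uint32_t b, uint32_t *upper)
{
    uint64_t product = (uint64_t)a * b;   /* widen first so nothing is lost */
    *upper = (uint32_t)(product >> 32);
    return (uint32_t)product;
}
```

With 10 and 15 the upper half is zero and the return value is 150; with 2,147,483,647 and 10 the upper half becomes nonzero, which is the "more complicated" case mentioned above.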
But that's as far as I'm willing to go. Other than that, take an asm class. Read a book.
PS You might also look at this answer and the 3 comments that follow that show why even your "add" example is "wrong."
PPS If this answer has resolved your question, don't forget to click the check mark next to it so I get my karma points.

divide and store quotient and remainder in different arrays

The standard div() function returns a div_t struct, for example:
/* div example */
#include <stdio.h> /* printf */
#include <stdlib.h> /* div, div_t */
int main ()
{
div_t divresult;
divresult = div (38,5);
printf ("38 div 5 => %d, remainder %d.\n", divresult.quot, divresult.rem);
return 0;
}
My case is a bit different; I have this
#define NUM_ELTS 21433
int main ()
{
unsigned int quotients[NUM_ELTS];
unsigned int remainders[NUM_ELTS];
int i;
for(i=0;i<NUM_ELTS;i++) {
divide_single_instruction(&quotients[i],&remainders[i]);
}
}
I know that the assembly language division instruction does everything in a single instruction, so I need to do the same here to save on CPU cycles, which basically means moving the quotient from EAX and the remainder from EDX into the memory locations where my arrays are stored. How can this be done without including asm {} or SSE intrinsics in my C code? It has to be portable.
Since you're writing to the arrays in-place (replacing numerator and denominator with quotient and remainder) you should store the results to temporary variables before writing to the arrays.
void foo (unsigned *num, unsigned *den, int n) {
int i;
for(i=0;i<n;i++) {
unsigned q = num[i]/den[i], r = num[i]%den[i];
num[i] = q, den[i] = r;
}
}
produces this main loop assembly
.L5:
movl (%rdi,%rcx,4), %eax
xorl %edx, %edx
divl (%rsi,%rcx,4)
movl %eax, (%rdi,%rcx,4)
movl %edx, (%rsi,%rcx,4)
addq $1, %rcx
cmpl %ecx, %r8d
jg .L5
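The question asked for two separate output arrays rather than in-place results. A sketch of that variant follows; the name divide_all and its parameter list are hypothetical, and every den[i] is assumed to be nonzero. Whether the compiler emits a single division per element is something to confirm in the generated assembly.

```c
#include <stddef.h>

/* Divide num[i] by den[i] and store the results in two separate
   arrays. Every den[i] must be nonzero. */
void divide_all(const unsigned *num, const unsigned *den,
                unsigned *quotients, unsigned *remainders, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Same operands, adjacent statements: easy for the compiler
           to recognise as one divide. */
        unsigned q = num[i] / den[i];
        unsigned r = num[i] % den[i];
        quotients[i]  = q;
        remainders[i] = r;
    }
}
```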
There are some more complicated cases where it helps to save the quotient and remainder when they are first used. For example in testing for primes by trial division you often see a loop like this
for (p = 3; p <= n/p; p += 2)
if (!(n % p)) return 0;
It turns out that GCC does not use the remainder from the first division and therefore it does the division instruction twice which is unnecessary. To fix this you can save the remainder when the first division is done like this:
for (p = 3, q=n/p, r=n%p; p <= q; p += 2, q = n/p, r=n%p)
if (!r) return 0;
This speeds up the result by a factor of two.
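A complete version of that loop might look like the sketch below. The function name is_prime and the handling of small and even inputs are assumptions added to make it self-contained; the loop body is the same idea of computing the quotient and remainder once per candidate and reusing both.

```c
/* Trial division that computes n/p and n%p together and reuses them. */
int is_prime(unsigned n)
{
    if (n < 2) return 0;
    if (n < 4) return 1;              /* 2 and 3 */
    if (n % 2 == 0) return 0;
    for (unsigned p = 3; ; p += 2) {
        unsigned q = n / p, r = n % p;   /* adjacent, same operands */
        if (p > q) return 1;          /* p*p > n, so no divisor exists */
        if (r == 0) return 0;
    }
}
```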
So in general GCC does a good job particularly if you save the quotient and remainder when they are first calculated.
The general rule here is to trust your compiler to do something fast. You can always disassemble the code and check that the compiler is doing something sane. It's important to realise that a good compiler knows a lot about the machine, often more than you or me.
Also let's assume you have a good reason for needing to "count cycles".
For your example code I agree that the x86 "idiv" instruction is the obvious choice. Let's see what my compiler (MS visual C 2013) will do if I just write out the most naive code I can
#include <stdio.h>
struct divresult {
int quot;
int rem;
};
struct divresult divrem(int num, int den)
{
return (struct divresult) { num / den, num % den };
}
int main()
{
struct divresult res = divrem(5, 2);
printf("%d, %d", res.quot, res.rem);
}
And the compiler gives us:
struct divresult res = divrem(5, 2);
printf("%d, %d", res.quot, res.rem);
01121000 push 1
01121002 push 2
01121004 push 1123018h
01121009 call dword ptr ds:[1122090h] ;;; this is printf()
Wow, I was outsmarted by the compiler. Visual C knows how division works so it just precalculated the result and inserted constants. It didn't even bother to include my function in the final code. We have to read in the integers from console to force it to actually do the calculation:
int main()
{
int num, den;
scanf("%d, %d", &num, &den);
struct divresult res = divrem(num, den);
printf("%d, %d", res.quot, res.rem);
}
Now we get:
struct divresult res = divrem(num, den);
01071023 mov eax,dword ptr [num]
01071026 cdq
01071027 idiv eax,dword ptr [den]
printf("%d, %d", res.quot, res.rem);
0107102A push edx
0107102B push eax
0107102C push 1073020h
01071031 call dword ptr ds:[1072090h] ;;; printf()
So you see, the compiler (or this compiler at least) already does what you want, or something even more clever.
From this we learn to trust the compiler and only second-guess it when we know it isn't doing a good enough job already.

Bithacks: Determine whether value is less, greater, or equal to some value

An algorithm I am working on must frequently check whether some arbitrary integer value 'x' is less-than, greater-than, or equal to another arbitrary integer value 'y'. The language I am implementing it in is C.
A naive way of doing it would be to use if-then-else branching to check this, but that would not work optimally because the processor's branch predictor would mess up. I am trying to implement this comparison only using arithmetic / logical evaluations as well as bitwise operations but, honestly, my brain is stuck right now.
I will call the function f(x, y). The function will return 1, if x < y; 2, if x == y; or 3, if x > y.
One idea I had was to evaluate:
x = 3 * (x > y)
which will return 3 when x > y, and 0 otherwise. There could be an operation, which returns either 1 or 2, if x == 0 using some bitwise operators and either condition x == y or x < y, but I have not found any such combinations of operations to achieve what I need.
Finally, I am looking for any function f(x, y) which will give me my results with the least number of operations possible, be it with or without bithacks; it just needs to be fast. So if you have any other ideas I may not have considered, pointing me to another solution is also greatly appreciated.
The following expression will do what you want.
1 + (x >= y) + (x > y)
On x86-64 this compiles to fairly efficient code that uses SETcc instead of branches:
compare(int, int):
xorl %edx, %edx
cmpl %esi, %edi
setg %al
setge %dl
movzbl %al, %eax
leal 1(%rdx,%rax), %eax
ret
On ARM:
compare(int, int):
cmp r0, r1
ite lt
movlt r0, #1
movge r0, #2
it gt
addgt r0, r0, #1
bx lr
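Wrapped as the f(x, y) from the question, the expression can be sketched as follows. It relies only on the fact that a C comparison evaluates to 0 or 1.

```c
/* Returns 1 if x < y, 2 if x == y, 3 if x > y. No if/else appears in
   the source; each comparison contributes 0 or 1 to the sum. */
int f(int x, int y)
{
    return 1 + (x >= y) + (x > y);
}
```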
Simply subtract the 2 variables x and y.
You'll get:
if x<y result is res<0
if x>y result is res>0
if x==y result is res==0.
Implement it as a macro:
#define Chk(x, y) ((x)-(y))
Another advantage is that you can simply use the ! operator to check for equality or inequality:
if (!Chk(x, y))
{
// x == y
}
else
{
// x != y
}
P.S. This is the same kind of result that many standard functions, such as strcmp(), return.
P.P.S. Please consider that the processor's cmp machine instruction, at least on all CPU types I know, performs a subtraction of the two operands and sets the flags to reflect the result. Even just comparing two values in C produces code that has a cmp instruction and some branch like jz, jl, etc.
Storing just the difference of the values, a single value, lets you keep all the information you may need, even for later evaluation.
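One caveat with plain subtraction on int: when the operands are far apart (for example a large negative x and a large positive y), x - y overflows, which is undefined behaviour in C. A sketch of a sign-only alternative that avoids the subtraction is shown here; the name cmp3 is hypothetical.

```c
/* Negative if x < y, zero if x == y, positive if x > y.
   Built from comparisons only, so it cannot overflow. */
int cmp3(int x, int y)
{
    return (x > y) - (x < y);
}
```

The ! test works the same way as with Chk: !cmp3(x, y) is true exactly when x == y.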
One option is:
#include <stdio.h>
int f(int x,int y)
{
return ((x-y)>>31)-((y-x)>>31) + 2;
}
int main(int argc, char *argv[])
{
int x,y;
for(x=-3;x<=3;x++)
for(y=-3;y<=3;y++)
printf("x=%d y=%d f(x,y)=%d\n",x,y,f(x,y));
return 0;
}
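This relies on the int type being a 32-bit quantity, on >> performing an arithmetic (sign-extending) shift for negative values, which C leaves implementation-defined, and on x-y not overflowing.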
You may also want to look into SIMD instructions (e.g. SSE on x86 or Neon on Arm) as these may help you accelerate your code.

Profile C Execution

So, just for fun, and out of curiosity, I wanted to see what executes faster when doing an even-odd check, modulus or bitwise comparisons.
So, I whipped up the following, but I'm not sure that it's behaving correctly, as the difference is so small. I read somewhere online that bitwise should be an order of magnitude faster than modulus checking.
Is it possible that it's getting optimized away? I've just started tinkering with assembly, otherwise I'd attempt to dissect the executable a bit.
EDIT 3: Here is a working test, thanks in a large way to #phonetagger:
#include <stdio.h>
#include <time.h>
#include <stdint.h>
// to reset the global
static const int SEED = 0x2A;
// 5B iterations, each
static const int64_t LOOPS = 5000000000;
int64_t globalVar;
// gotta call something
int64_t doSomething( int64_t input )
{
return 1 + input;
}
int main(int argc, char *argv[])
{
globalVar = SEED;
// mod
clock_t startMod = clock();
for( int64_t i=0; i<LOOPS; ++i )
{
if( ( i % globalVar ) == 0 )
{
globalVar = doSomething(globalVar);
}
}
clock_t endMod = clock();
double modTime = (double)(endMod - startMod) / CLOCKS_PER_SEC;
globalVar = SEED;
// bit
clock_t startBit = clock();
for( int64_t j=0; j<LOOPS; ++j )
{
if( ( j & globalVar ) == 0 )
{
globalVar = doSomething(globalVar);
}
}
clock_t endBit = clock();
double bitTime = (double)(endBit - startBit) / CLOCKS_PER_SEC;
printf("Mod: %lf\n", modTime);
printf("Bit: %lf\n", bitTime);
printf("Dif: %lf\n", ( modTime > bitTime ? modTime-bitTime : bitTime-modTime ));
}
5 billion iterations of each loop, with a global variable defeating the compiler's optimization, yields the following:
Mod: 93.099101
Bit: 16.701401
Dif: 76.397700
gcc foo.c -std=c99 -S -O0 (note, I specifically did -O0) for x86 gave me the same assembly for both loops. Operator strength reduction meant that both ifs used an andl to get the job done (which is faster than a modulo on Intel machines):
First Loop:
.L6:
movl 72(%esp), %eax
andl $1, %eax
testl %eax, %eax
jne .L5
call doNothing
.L5:
addl $1, 72(%esp)
.L4:
movl LOOPS, %eax
cmpl %eax, 72(%esp)
jl .L6
Second Loop:
.L9:
movl 76(%esp), %eax
andl $1, %eax
testl %eax, %eax
jne .L8
call doNothing
.L8:
addl $1, 76(%esp)
.L7:
movl LOOPS, %eax
cmpl %eax, 76(%esp)
jl .L9
The minuscule difference you see is probably because of the resolution/inaccuracy of clock.
Most compilers will compile both of the following to EXACTLY the same machine instruction(s):
if( ( i % 2 ) == 0 )
if( ( i & 1 ) == 0 )
...even without ANY "optimization" turned on. The reason is that you are MOD-ing and AND-ing with constant values, and a %2 operation is, as any compiler writer should know, functionally equivalent to an &1 operation. In fact, MOD by any power-of-2 has an equivalent AND operation. If you really want to test the difference, you'll need to make the right-hand-side of both operations be variable, and to be absolutely sure the compiler's cleverness isn't thwarting your efforts, you'll need to bury the variables' initializations somewhere that the compiler can't tell at that point what its runtime value will be; i.e. you'll need to pass the values into a GLOBALLY-DECLARED (i.e. not 'static') test function as parameters, in which case the compiler can't trace back to their definition & substitute the variables with constants, because theoretically any external caller could pass any values in for those parameters. Alternatively, you could leave the code in main() and define the variables globally, in which case the compiler can't substitute them with constants because it can't know for sure that another function may have altered the value of the global variables.
Incidentally, this same issue exists for divide operations. Divisions by constant powers-of-two can be substituted with an equivalent right-shift (>>) operation. The same trick works for multiplication (<<), but the benefits are less (or nonexistent) for multiplications. True division operations just take a long time in hardware. Though significant improvements have been made in most modern processors vs. even 15 years ago, division operations can still take up to maybe 80 clock cycles, while a >> operation takes only a single cycle. You're not going to see an "order of magnitude" improvement using bitwise tricks on modern processors, but most compilers will still use those tricks because there is still some noticeable improvement.
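The power-of-two rewrites described above can be sketched for unsigned operands as follows. The helper names are hypothetical, and the constant 8 is just an example.

```c
/* Power-of-two rewrites for unsigned operands. */
unsigned mod8_and(unsigned x)   { return x & 7u; }   /* same as x % 8 */
unsigned div8_shift(unsigned x) { return x >> 3; }   /* same as x / 8 */
unsigned mul8_shift(unsigned x) { return x << 3; }   /* same as x * 8, with wrap-around */
```

For signed operands the plain rewrite is not exact: -3 % 4 is -3 in C, while -3 & 3 is 1 on a two's-complement machine, so a compiler has to add a small fix-up for negative values. The even/odd test is unaffected, because (i % 2) == 0 and (i & 1) == 0 agree for every int.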
EDIT: On some embedded processors (and, unbelievable though it was, the original Sparc desktop/workstation processor versions before v8), there isn't even a divide instruction at all. All true divide & mod operations on such processors must be performed entirely in software, which can be a monstrously expensive operation. In that sort of environment, you surely would see an order of magnitude difference.
Bitwise checking takes only a single machine instruction ("and ...,0x01"); that's pretty hard to beat.
Modulo check will absolutely be slower if you have a dumb compiler that actually computes modulo by taking remainders (sometimes including a subroutine call to modulo routine!). Smart compilers know about the modulo function and generate code for it directly; if they have any decent optimization they know that "modulo(x,2)" can be implemented with the same AND trick above.
Our PARLANSE compiler does this as a matter of course. I'd be surprised if widely available C and C++ compilers don't do this too.
With such "good" compilers, it won't matter which way you write odd/even (or even "is power of two") checks; it will be pretty damn fast.

Resources