Assembly Modulo and Divide in IDA [duplicate]


I've been reading about div and mul assembly operations, and I decided to see them in action by writing a simple program in C:
File division.c
#include <stdlib.h>
#include <stdio.h>
int main()
{
size_t i = 9;
size_t j = i / 5;
printf("%zu\n",j);
return 0;
}
And then generating assembly language code with:
gcc -S division.c -O0 -masm=intel
But looking at the generated division.s file, it doesn't contain any div operations! Instead, it does some kind of black magic with bit shifting and magic numbers. Here's a code snippet that computes i/5:
mov rax, QWORD PTR [rbp-16] ; Move i (=9) to RAX
movabs rdx, -3689348814741910323 ; Move some magic number to RDX (?)
mul rdx ; Multiply 9 by magic number
mov rax, rdx ; Take only the upper 64 bits of the result
shr rax, 2 ; Shift these bits 2 places to the right (?)
mov QWORD PTR [rbp-8], rax ; Magically, RAX contains 9/5=1 now,
; so we can assign it to j
What's going on here? Why doesn't GCC use div at all? How does it generate this magic number and why does everything work?

Integer division is one of the slowest arithmetic operations you can perform on a modern processor, with latency of up to dozens of cycles and poor throughput. (For x86, see Agner Fog's instruction tables and microarch guide.)
If you know the divisor ahead of time, you can avoid the division by replacing it with a set of other operations (multiplications, additions, and shifts) which have the equivalent effect. Even if several operations are needed, it's often still a heck of a lot faster than the integer division itself.
Implementing the C / operator this way instead of with a multi-instruction sequence involving div is just GCC's default way of doing division by constants. It doesn't require optimizing across operations and doesn't change anything even for debugging. (Using -Os for small code size does get GCC to use div, though.) Using a multiplicative inverse instead of division is like using lea instead of mul and add.
As a result, you only tend to see div or idiv in the output if the divisor isn't known at compile-time.
For information on how the compiler generates these sequences, as well as code to let you generate them for yourself (almost certainly unnecessary unless you're working with a braindead compiler), see libdivide.

Dividing by 5 is the same as multiplying by 1/5, which is again the same as multiplying by 4/5 and shifting right 2 bits. The value concerned is CCCCCCCCCCCCCCCD in hex, which is the binary representation of 4/5 if put after a hexadecimal point (i.e. the binary for four fifths is 0.110011001100 recurring - see below for why). I think you can take it from here! You might want to check out fixed point arithmetic (though note it's rounded to an integer at the end).
As to why, multiplication is faster than division, and when the divisor is fixed, this is a faster route.
See Reciprocal Multiplication, a tutorial for a detailed writeup about how it works, explaining in terms of fixed-point. It shows how the algorithm for finding the reciprocal works, and how to handle signed division and modulo.
Let's consider for a minute why 0.CCCCCCCC... (hex) or 0.110011001100... binary is 4/5. Divide the binary representation by 4 (shift right 2 places), and we'll get 0.001100110011... which by trivial inspection can be added to the original to get 0.111111111111..., which is obviously equal to 1, the same way 0.9999999... in decimal is equal to one. Therefore, we know that x + x/4 = 1, so 5x/4 = 1, x=4/5. This is then represented as CCCCCCCCCCCCCCCD in hex for rounding (as the binary digit beyond the last one present would be a 1).
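The fixed-point reading above can be tried out in C. This is only a sketch: it assumes a compiler with the unsigned __int128 extension (GCC and Clang on 64-bit targets), and the function name div5 is made up for illustration.

```c
#include <stdint.h>

/* Illustrative helper (not from the original post): n / 5 without div.
   Assumes the unsigned __int128 extension of GCC/Clang. */
uint64_t div5(uint64_t n)
{
    /* 0xCCCCCCCCCCCCCCCD is 4/5 rounded up, read as a 0.64 fixed-point value. */
    unsigned __int128 product = (unsigned __int128)n * 0xCCCCCCCCCCCCCCCDull;
    uint64_t high = (uint64_t)(product >> 64);  /* what mul leaves in RDX */
    return high >> 2;                           /* turn n * 4/5 into n / 5 */
}
```

For the value from the question, div5(9) takes the upper half of the product (7) and shifts it right by 2, giving 1.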

In general multiplication is much faster than division. So if we can get away with multiplying by the reciprocal instead, we can significantly speed up division by a constant.
A wrinkle is that we cannot represent the reciprocal exactly (unless the division was by a power of two but in that case we can usually just convert the division to a bit shift). So to ensure correct answers we have to be careful that the error in our reciprocal does not cause errors in our final result.
-3689348814741910323 is 0xCCCCCCCCCCCCCCCD which is a value of just over 4/5 expressed in 0.64 fixed point.
When we multiply a 64 bit integer by a 0.64 fixed point number we get a 64.64 result. We truncate the value to a 64-bit integer (effectively rounding it towards zero) and then perform a further shift which divides by four and again truncates. By looking at the bit level it is clear that we can treat both truncations as a single truncation.
This clearly gives us at least an approximation of division by 5 but does it give us an exact answer correctly rounded towards zero?
To get an exact answer the error needs to be small enough not to push the answer over a rounding boundary.
The exact answer to a division by 5 will always have a fractional part of 0, 1/5, 2/5, 3/5 or 4/5 . Therefore a positive error of less than 1/5 in the multiplied and shifted result will never push the result over a rounding boundary.
The error in our constant is (1/5) * 2^-64. The value of i is less than 2^64, so the error after multiplying is less than 1/5. After the division by 4 the error is less than (1/5) * 2^-2.
(1/5) * 2^-2 < 1/5, so the answer will always be equal to doing an exact division and rounding towards zero.
Unfortunately this doesn't work for all divisors.
If we try to represent 4/7 as a 0.64 fixed point number with rounding away from zero we end up with an error of (6/7) * 2^-64. After multiplying by an i value of just under 2^64 we end up with an error just under 6/7, and after dividing by four we end up with an error of just under 1.5/7, which is greater than 1/7.
So to implement division by 7 correctly we need to multiply by a 0.65 fixed point number. We can implement that by multiplying by the lower 64 bits of our fixed point number, then adding the original number (this may overflow into the carry bit), then doing a rotate through carry.

Here is a link to a document describing an algorithm that produces the values and code I see with Visual Studio (in most cases) and that I assume is still used in GCC for division of a variable integer by a constant integer.
http://gmplib.org/~tege/divcnst-pldi94.pdf
In the article, a uword has N bits, a udword has 2N bits, n = numerator = dividend, d = denominator = divisor, ℓ is initially set to ceil(log2(d)), shpre is pre-shift (used before multiply) = e = number of trailing zero bits in d, shpost is post-shift (used after multiply), prec is precision = N - e = N - shpre. The goal is to optimize calculation of n/d using a pre-shift, multiply, and post-shift.
Scroll down to figure 6.2, which defines how a udword multiplier (max size is N+1 bits), is generated, but doesn't clearly explain the process. I'll explain this below.
Figure 4.2 and figure 6.2 show how the multiplier can be reduced to a N bit or less multiplier for most divisors. Equation 4.5 explains how the formula used to deal with N+1 bit multipliers in figure 4.1 and 4.2 was derived.
In the case of modern X86 and other processors, multiply time is fixed, so pre-shift doesn't help on these processors, but it still helps to reduce the multiplier from N+1 bits to N bits. I don't know if GCC or Visual Studio have eliminated pre-shift for X86 targets.
Going back to Figure 6.2. The numerator (dividend) for mlow and mhigh can be larger than a udword only when denominator (divisor) > 2^(N-1) (when ℓ == N => mlow = 2^(2N)), in this case the optimized replacement for n/d is a compare (if n>=d, q = 1, else q = 0), so no multiplier is generated. The initial values of mlow and mhigh will be N+1 bits, and two udword/uword divides can be used to produce each N+1 bit value (mlow or mhigh). Using X86 in 64 bit mode as an example:
; upper 8 bytes of dividend = 2^(ℓ) = (upper part of 2^(N+ℓ))
; lower 8 bytes of dividend for mlow = 0
; lower 8 bytes of dividend for mhigh = 2^(N+ℓ-prec) = 2^(ℓ+shpre) = 2^(ℓ+e)
dividend dq 2 dup(?) ;16 byte dividend
divisor dq 1 dup(?) ; 8 byte divisor
; ...
mov rcx,divisor
mov rdx,0
mov rax,dividend+8 ;upper 8 bytes of dividend
div rcx ;after div, rax == 1
mov rax,dividend ;lower 8 bytes of dividend
div rcx
mov rdx,1 ;rdx:rax = N+1 bit value = 65 bit value
You can test this with GCC. You've already seen how j = i/5 is handled. Take a look at how j = i/7 is handled (which should be the N+1 bit multiplier case).
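The mlow/mhigh search described above can be sketched in C. To keep every intermediate value inside a uint64_t, this sketch scales the procedure down to N = 32 and assumes 1 <= d <= 2^31 (larger divisors are the compare-only case mentioned earlier). The struct and function names are hypothetical, the pre-shift is left out (prec = N), and the code has not been checked against the GCC or Visual Studio sources.

```c
#include <stdint.h>

/* Hypothetical result type: the multiplier may need N+1 = 33 bits. */
struct magic32 {
    uint64_t multiplier;
    unsigned post_shift;
};

/* Sketch of the multiplier search for N = 32, assuming 1 <= d <= 2^31. */
struct magic32 choose_multiplier32(uint32_t d)
{
    const unsigned N = 32, prec = 32;      /* no pre-shift: prec = N */
    unsigned l = 0;
    while (((uint64_t)1 << l) < d)         /* l = ceil(log2(d)) */
        l++;

    unsigned post_shift = l;
    uint64_t m_low = ((uint64_t)1 << (N + l)) / d;
    uint64_t m_high = (((uint64_t)1 << (N + l)) +
                       ((uint64_t)1 << (N + l - prec))) / d;

    /* Shrink the multiplier while a shorter one still lies in the interval. */
    while (m_low / 2 < m_high / 2 && post_shift > 0) {
        m_low /= 2;
        m_high /= 2;
        post_shift--;
    }

    struct magic32 result = { m_high, post_shift };
    return result;
}
```

With these assumptions, d = 5 yields the 32-bit multiplier 0xCCCCCCCD with a post-shift of 2, while d = 7 yields the 33-bit multiplier 0x124924925 with a post-shift of 3, which is the N+1 bit case.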
On most current processors, multiply has a fixed timing, so a pre-shift is not needed. For X86, the end result is a two instruction sequence for most divisors, and a five instruction sequence for divisors like 7 (in order to emulate a N+1 bit multiplier as shown in equation 4.5 and figure 4.2 of the pdf file). Example X86-64 code:
; rbx = dividend, rax = 64 bit (or less) multiplier, rcx = post shift count
; two instruction sequence for most divisors:
mul rbx ;rdx = upper 64 bits of product
shr rdx,cl ;rdx = quotient
;
; five instruction sequence for divisors like 7
; to emulate 65 bit multiplier (rbx = lower 64 bits of multiplier)
mul rbx ;rdx = upper 64 bits of product
sub rbx,rdx ;rbx -= rdx
shr rbx,1 ;rbx >>= 1
add rdx,rbx ;rdx = upper 64 bits of corrected product
shr rdx,cl ;rdx = quotient
; ...
To explain the 5 instruction sequence, a simple 3 instruction sequence could overflow. Let u64() mean upper 64 bits (all that is needed for quotient)
mul rbx ;rdx = u64(dvnd*mplr)
add rdx,rbx ;rdx = u64(dvnd*(2^64 + mplr)), could overflow
shr rdx,cl
To handle this case, cl = post_shift-1. rax = multiplier - 2^64, rbx = dividend. u64() is upper 64 bits. Note that rax = rax<<1 - rax. Quotient is:
u64( ( rbx * (2^64 + rax) )>>(cl+1) )
u64( ( rbx * (2^64 + rax<<1 - rax) )>>(cl+1) )
u64( ( (rbx * 2^64) + (rbx * rax)<<1 - (rbx * rax) )>>(cl+1) )
u64( ( (rbx * 2^64) - (rbx * rax) + (rbx * rax)<<1 )>>(cl+1) )
u64( ( ( ((rbx * 2^64) - (rbx * rax))>>1 ) + (rbx*rax) )>>(cl) )
mul rbx ; (rbx*rax)
sub rbx,rdx ; (rbx*2^64)-(rbx*rax)
shr rbx,1 ;( (rbx*2^64)-(rbx*rax))>>1
add rdx,rbx ;( ((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax)
shr rdx,cl ;((((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax))>>cl
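The five instruction sequence can be mirrored step by step in C for the divisor 7. This is a sketch under assumptions: it relies on the unsigned __int128 extension of GCC/Clang, the name div7 is made up, and the constant was worked out by hand as the low 64 bits of the 65-bit multiplier (2^67 + 5) / 7, with a total post-shift of 3.

```c
#include <stdint.h>

/* Illustrative helper: n / 7 using the sub/shr/add correction sequence. */
uint64_t div7(uint64_t n)
{
    /* Low 64 bits of the 65-bit multiplier (2^67 + 5) / 7. */
    const uint64_t low_multiplier = 0x2492492492492493ull;

    /* mul: keep only the upper 64 bits of the 128-bit product. */
    uint64_t t = (uint64_t)(((unsigned __int128)n * low_multiplier) >> 64);

    uint64_t q = n - t;   /* sub: cannot wrap, because t <= n */
    q >>= 1;              /* shr 1 */
    q += t;               /* add: now q = floor((n + t) / 2), no overflow */
    return q >> 2;        /* shr cl, with cl = post_shift - 1 = 2 */
}
```

The sub/shr/add steps compute (n + t) / 2 without ever forming the 65-bit sum n + t, which is the overflow the simple three instruction sequence runs into.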

I will answer from a slightly different angle: Because it is allowed to do it.
C and C++ are defined against an abstract machine. The compiler transforms this program in terms of the abstract machine to concrete machine following the as-if rule.
The compiler is allowed to make ANY changes as long as it doesn't change the observable behaviour as specified by the abstract machine. There is no reasonable expectation that the compiler will transform your code in the most straightforward way possible (even though a lot of C programmers assume that). Usually, it does this because the compiler wants to optimize the performance compared to the straightforward approach (as discussed in the other answers at length).
If under any circumstances the compiler "optimizes" a correct program to something that has a different observable behaviour, that is a compiler bug.
Any undefined behaviour in our code (signed integer overflow is a classical example) and this contract is void.

Related

Is this how the + operator is implemented in C?

When understanding how primitive operators such as +, -, * and / are implemented in C, I found the following snippet from an interesting answer.
// replaces the + operator
int add(int x, int y) {
while(x) {
int t = (x & y) <<1;
y ^= x;
x = t;
}
return y;
}
It seems that this function demonstrates how + actually works in the background. However, it's too confusing for me to understand it. I believed that such operations are done using assembly directives generated by the compiler for a long time!
Is the + operator implemented as the code posted on MOST implementations? Does this take advantage of two's complement or other implementation-dependent features?

To be pedantic, the C specification does not specify how addition is implemented.
But to be realistic, the + operator on integer types smaller than or equal to the word size of your CPU gets translated directly into an addition instruction for the CPU, and larger integer types get translated into multiple addition instructions with some extra bits to handle overflow.
The CPU internally uses logic circuits to implement the addition, and does not use loops, bitshifts, or anything that has a close resemblance to how C works.

When you add two bits, following is the result: (truth table)
a | b | sum (a^b) | carry bit (a&b) (goes to next)
--+---+-----------+--------------------------------
0 | 0 | 0 | 0
0 | 1 | 1 | 0
1 | 0 | 1 | 0
1 | 1 | 0 | 1
So if you do bitwise xor, you can get the sum without carry.
And if you do bitwise and you can get the carry bits.
Extending this observation for multibit numbers a and b
a+b = sum_without_carry(a, b) + carry_bits(a, b) shifted by 1 bit left
= a^b + ((a&b) << 1)
Once b is 0:
a+0 = a
So algorithm boils down to:
Add(a, b)
if b == 0
return a;
else
carry_bits = a & b;
sum_bits = a ^ b;
return Add(sum_bits, carry_bits << 1);
If you get rid of recursion and convert it to a loop
Add(a, b)
while(b != 0) {
carry_bits = a & b;
sum_bits = a ^ b;
a = sum_bits;
b = carry_bits << 1; // In next loop, add carry bits to a
}
return a;
With above algorithm in mind explanation from code should be simpler:
int t = (x & y) << 1;
Carry bits. Carry bit is 1 if 1 bit to the right in both operands is 1.
y ^= x; // x is used now
Addition without carry (Carry bits ignored)
x = t;
Reuse x to set it to carry
while(x)
Repeat while there are more carry bits
A recursive implementation (easier to understand) would be:
int add(int x, int y) {
return (y == 0) ? x : add(x ^ y, (x&y) << 1);
}
Seems that this function demonstrates how + actually works in the
background
No. Usually (almost always) integer addition translates to the machine instruction add. This just demonstrates an alternate implementation using bitwise xor and and.

Seems that this function demonstrates how + actually works in the background
No. This is translated to the native add machine instruction, which is actually using the hardware adder, in the ALU.
If you're wondering how does the computer add, here is a basic adder.
Everything in the computer is done using logic gates, which are mostly made of transistors. The full adder has half-adders in it.
For a basic tutorial on logic gates, and adders, see this. The video is extremely helpful, though long.
In that video, a basic half-adder is shown. If you want a brief description, this is it:
The half adder adds the two bits given. The possible combinations are:
Add 0 and 0 = 0
Add 1 and 0 = 1
Add 1 and 1 = 10 (binary)
So now how does the half adder work? Well, it is made up of two logic gates, the xor and the and. The xor gives a positive output only if one of the inputs is positive and the other negative, so it produces the sum bit: 1 for the case of 1 and 0, and 0 for the cases of 0 and 0 and of 1 and 1. The and gives a positive output only if both the inputs are positive, so it produces the carry bit for the case of 1 and 1. So basically, we have now got our half-adder. But we still can only add bits.
Now we make our full-adder. A full adder consists of calling the half-adder again and again. Now this has a carry. When we add 1 and 1, we get a carry 1. So what the full-adder does is, it takes the carry from the half-adder, stores it, and passes it as another argument to the half-adder.
If you're confused how can you pass the carry, you basically first add the bits using the half-adder, and then add the sum and the carry. So now you've added the carry, with the two bits. So you do this again and again, till the bits you have to add are over, and then you get your result.
Surprised? This is how it actually happens. It looks like a long process, but the computer does it in fractions of a nanosecond, or to be more specific, in half a clock cycle. Sometimes it is performed even in a single clock cycle. Basically, the computer has the ALU (a major part of the CPU), memory, buses, etc..
If you want to learn computer hardware, from logic gates, memory and the ALU, and simulate a computer, you can see this course, from which I learnt all this: Build a Modern Computer from First Principles
It's free if you do not want an e-certificate. The part two of the course is coming up in spring this year

C uses an abstract machine to describe what C code does. So how it works is not specified. There are C "compilers" that actually compile C into a scripting language, for example.
But, in most C implementations, + between two integers smaller than the machine integer size will be translated into an assembly instruction (after many steps). The assembly instruction will be translated into machine code and embedded within your executable. Assembly is a language "one step removed" from machine code, intended to be easier to read than a bunch of packed binary.
That machine code (after many steps) is then interpreted by the target hardware platform, where it is interpreted by the instruction decoder on the CPU. This instruction decoder takes the instruction, and translates it into signals to send along "control lines". These signals route data from registers and memory through the CPU, where the values are added together often in an arithmetic logic unit.
The arithmetic logic unit might have separate adders and multipliers, or might mix them together.
The arithmetic logic unit has a bunch of transistors that perform the addition operation, then produce the output. Said output is routed via the signals generated from the instruction decoder, and stored in memory or registers.
The layout of said transistors in both the arithmetic logic unit and instruction decoder (as well as parts I have glossed over) is etched into the chip at the plant. The etching pattern is often produced by compiling a hardware description language, which takes an abstraction of what is connected to what and how they operate and generates transistors and interconnect lines.
The hardware description language can contain shifts and loops that don't describe things happening in time (like one after another) but rather in space -- it describes the connections between different parts of hardware. Said code may look very vaguely like the code you posted above.
The above glosses over many parts and layers and contains inaccuracies. This is both from my own incompetence (I have written both hardware and compilers, but am an expert in neither) and because full details would take a career or two, and not a SO post.
Here is a SO post about an 8-bit adder. Here is a non-SO post, where you'll note some of the adders just use operator+ in the HDL! (The HDL itself understands + and generates the lower level adder code for you).

Almost any modern processor that can run compiled C code will have builtin support for integer addition. The code you posted is a clever way to perform integer addition without executing an integer add opcode, but it is not how integer addition is normally performed. In fact, the function linkage probably uses some form of integer addition to adjust the stack pointer.
The code you posted relies on the observation that when adding x and y, you can decompose it into the bits they have in common and the bits that are unique to one of x or y.
The expression x & y (bitwise AND) gives the bits common to x and y. The expression x ^ y (bitwise exclusive OR) gives the bits that are unique to one of x or y.
The sum x + y can be rewritten as the sum of two times the bits they have in common (since both x and y contribute those bits) plus the bits that are unique to x or y.
(x & y) << 1 is twice the bits they have in common (the left shift by 1 effectively multiplies by two).
x ^ y is the bits that are unique to one of x or y.
So if we replace x by the first value and y by the second, the sum should be unchanged. You can think of the first value as the carries of the bitwise additions, and the second as the low-order bit of the bitwise additions.
This process continues until x is zero, at which point y holds the sum.
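A single step of this decomposition can be checked in C. The sketch below uses unsigned 32-bit values so that the shift and the addition wrap around in a well-defined way; the function name is made up for illustration.

```c
#include <stdint.h>

/* One step of the decomposition: shifted common bits plus unique bits. */
uint32_t add_one_step(uint32_t x, uint32_t y)
{
    uint32_t carries = (x & y) << 1;  /* bits both operands have, counted twice */
    uint32_t partial = x ^ y;         /* bits only one of the operands has */
    return carries + partial;         /* equals x + y modulo 2^32 */
}
```

For x = 2 and y = 6 the common bits are 2, doubled to 4, and the unique bits are 4, so the result is 8.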

The code that you found tries to explain how very primitive computer hardware might implement an "add" instruction. I say "might" because I can guarantee that this method isn't used by any CPU, and I'll explain why.
In normal life, you use decimal numbers and you have learned how to add them: To add two numbers, you add the lowest two digits. If the result is less than 10, you write down the result and proceed to the next digit position. If the result is 10 or more, you write down the result minus 10, proceed to the next digit, but you remember to add 1 more. For example: 23 + 37, you add 3+7 = 10, you write down 0 and remember to add 1 more for the next position. At the 10s position, you add (2+3) + 1 = 6 and write that down. Result is 60.
You can do the exact same thing with binary numbers. The difference is that the only digits are 0 and 1, so the only possible sums are 0, 1, 2. For a 32 bit number, you would handle one digit position after the other. And that is how really primitive computer hardware would do it.
This code works differently. You know the sum of two binary digits is 2 if both digits are 1. So if both digits are 1 then you would add 1 more at the next binary position and write down 0. That's what the calculation of t does: It finds all places where both binary digits are 1 (that's the &) and moves them to the next digit position (<< 1). Then it does the addition: 0+0 = 0, 0+1 = 1, 1+0 = 1, 1+1 is 2, but we write down 0. That's what the exclusive or operator does.
But all the 1's that you had to handle in the next digit position haven't been handled. They still need to be added. That's why the code does a loop: In the next iteration, all the extra 1's are added.
Why does no processor do it that way? Because it's a loop, and processors don't like loops, and it is slow. It's slow, because in the worst case, 32 iterations are needed: If you add 1 to the number 0xffffffff (32 1-bits), then the first iteration clears bit 0 of y and sets x to 2. The second iteration clears bit 1 of y and sets x to 4. And so on. It takes 32 iterations to get the result. However, each iteration has to process all bits of x and y, which takes a lot of hardware.
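The worst case can be reproduced with a small C sketch that runs the loop from the question on unsigned 32-bit values and counts the iterations. The helper name and the out-parameter are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: runs the question's loop on unsigned values,
   stores the sum in *sum and returns the number of iterations used. */
unsigned add_loop_iterations(uint32_t x, uint32_t y, uint32_t *sum)
{
    unsigned iterations = 0;
    while (x) {
        uint32_t t = (x & y) << 1;  /* carries, moved one position up */
        y ^= x;                     /* addition without carries */
        x = t;                      /* carries still to be added */
        iterations++;
    }
    *sum = y;
    return iterations;
}
```

Called with x = 1 and y = 0xffffffff, the carry ripples through every bit position, so the loop runs 32 times before the sum wraps around to 0.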
A primitive processor would do things just as quick in the way you do decimal arithmetic, from the lowest position to the highest. It also takes 32 steps, but each step processes only two bits plus one value from the previous bit position, so it is much easier to implement. And even in a primitive computer, one can afford to do this without having to implement loops.
A modern, fast and complex CPU will use a "conditional sum adder". Especially if the number of bits is high, for example a 64 bit adder, it saves a lot of time.
A 64 bit adder consists of two parts: First, a 32 bit adder for the lowest 32 bit. That 32 bit adder produces a sum, and a "carry" (an indicator that a 1 must be added to the next bit position). Second, two 32 bit adders for the higher 32 bits: One adds x + y, the other adds x + y + 1. All three adders work in parallel. Then when the first adder has produced its carry, the CPU just picks which one of the two results x + y or x + y + 1 is the correct one, and you have the complete result. So a 64 bit adder only takes a tiny bit longer than a 32 bit adder, not twice as long.
The 32 bit adder parts are again implemented as conditional sum adders, using multiple 16 bit adders, and the 16 bit adders are conditional sum adders, and so on.
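The top level of this scheme can be modelled in C. This is only a software sketch of the idea, one level deep: real hardware computes the three partial sums in parallel, whereas C evaluates them one after another. The function name is made up.

```c
#include <stdint.h>

/* Software model of one level of the scheme: a 64-bit add built from
   three 32-bit adds and a final selection. */
uint64_t carry_select_add64(uint64_t x, uint64_t y)
{
    uint32_t xl = (uint32_t)x, yl = (uint32_t)y;
    uint32_t xh = (uint32_t)(x >> 32), yh = (uint32_t)(y >> 32);

    uint32_t low = xl + yl;          /* first adder: low 32 bits */
    uint32_t carry = low < xl;       /* its carry out (0 or 1) */

    uint32_t high0 = xh + yh;        /* second adder: assumes carry = 0 */
    uint32_t high1 = xh + yh + 1u;   /* third adder: assumes carry = 1 */

    uint32_t high = carry ? high1 : high0;  /* pick the correct result */
    return ((uint64_t)high << 32) | low;
}
```

For example, adding 1 to 0x00000000ffffffff produces a carry out of the low half, so the precomputed x + y + 1 result is selected for the high half.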

My question is: Is the + operator implemented as the code posted on MOST implementations?
Let's answer the actual question. All operators are implemented by the compiler as some internal data structure that eventually gets translated into code after some transformations. You can't say what code will be generated by a single addition because almost no real world compiler generates code for individual statements.
The compiler is free to generate any code as long as it behaves as if the actual operations were performed according to the standard. But what actually happens can be something completely different.
A simple example:
static int
foo(int a, int b)
{
return a + b;
}
[...]
int a = foo(1, 17);
int b = foo(x, x);
some_other_function(a, b);
There's no need to generate any addition instructions here. It's perfectly legal for the compiler to translate this into:
some_other_function(18, x * 2);
Or maybe the compiler notices that you call the function foo a few times in a row and that it is a simple arithmetic and it will generate vector instructions for it. Or that the result of the addition is used for array indexing later and the lea instruction will be used.
You simply can't talk about how an operator is implemented because it is almost never used alone.

In case a breakdown of the code helps anyone else, take the example x=2, y=6:
x isn't zero, so commence adding to y:
while(2) {
x & y = 2 because
x: 0 0 1 0 //2
y: 0 1 1 0 //6
x&y: 0 0 1 0 //2
2 <<1 = 4 because << 1 shifts all bits to the left:
x&y: 0 0 1 0 //2
(x&y) <<1: 0 1 0 0 //4
In summary, stash that result, 4, in t with
int t = (x & y) <<1;
Now apply the bitwise XOR y^=x:
x: 0 0 1 0 //2
y: 0 1 1 0 //6
y^=x: 0 1 0 0 //4
So x=2, y=4. Finally, sum t+y by resetting x=t and going back to the beginning of the while loop:
x = t;
When t=0 (or, at the beginning of the loop, x=0), finish with
return y;

Just out of interest, on the Atmega328P processor, with the avr-g++ compiler, the following code implements adding one by subtracting -1 :
volatile char x;
int main ()
{
x = x + 1;
}
Generated code:
00000090 <main>:
volatile char x;
int main ()
{
x = x + 1;
90: 80 91 00 01 lds r24, 0x0100
94: 8f 5f subi r24, 0xFF ; 255
96: 80 93 00 01 sts 0x0100, r24
}
9a: 80 e0 ldi r24, 0x00 ; 0
9c: 90 e0 ldi r25, 0x00 ; 0
9e: 08 95 ret
Notice in particular that the add is done by the subi instruction (subtract constant from register) where 0xFF is effectively -1 in this case.
Also of interest is that this particular processor does not have an addi instruction, which implies that the designers thought that doing a subtract of the complement would be adequately handled by the compiler-writers.
Does this take advantage of two's complement or other implementation-dependent features?
It would probably be fair to say that compiler-writers would attempt to implement the wanted effect (adding one number to another) in the most efficient way possible for that particular architecture. If that requires subtracting the complement, so be it.

Point inside 2D axis aligned rectangle, no branches

I'm searching for the most optimized method to detect whether a point is inside an axis aligned rectangle.
The easiest solution needs 4 branches (if) which is bad for performance.

Given a segment [x0, x1], a point x is inside the segment when (x0 - x) * (x1 - x) <= 0.
In the two-dimensional case, you need to do it twice, so it requires two conditionals.
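As a sketch, the product test for both axes could look like this in C. It assumes integer coordinates whose magnitude stays below 2^30, so that the widened differences and their products fit in 64 bits; the function name is made up. Whether the compiler emits branch-free code for the comparisons depends on the target and optimization level.

```c
#include <stdint.h>

/* Returns 1 when (x, y) lies in [x0, x1] x [y0, y1], edges included.
   Assumes every coordinate has magnitude below 2^30. */
int inside_rect_product(int32_t x, int32_t y,
                        int32_t x0, int32_t x1, int32_t y0, int32_t y1)
{
    /* Widen first so neither the differences nor the products overflow. */
    int64_t px = ((int64_t)x0 - x) * ((int64_t)x1 - x);
    int64_t py = ((int64_t)y0 - y) * ((int64_t)y1 - y);

    /* Bitwise & instead of && avoids a short-circuit branch in the source. */
    return (px <= 0) & (py <= 0);
}
```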

Consider BITWISE-ANDing the values of XMin-X, X-XMax, YMin-Y, Y-YMax and use the resulting sign bit.
Will work with both ints and floats.
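For integers, the sign-bit idea can be sketched as follows. The function name is made up, and the differences are widened to 64 bits so they cannot overflow. Note that with these four differences the sign bit of the AND is set only when all of them are negative, so points lying exactly on an edge are reported as outside.

```c
#include <stdint.h>

/* Returns 1 when xmin < x < xmax and ymin < y < ymax, else 0. */
int strictly_inside_rect(int32_t x, int32_t y,
                         int32_t xmin, int32_t xmax,
                         int32_t ymin, int32_t ymax)
{
    int64_t a = (int64_t)xmin - x;  /* negative when x > xmin */
    int64_t b = (int64_t)x - xmax;  /* negative when x < xmax */
    int64_t c = (int64_t)ymin - y;  /* negative when y > ymin */
    int64_t d = (int64_t)y - ymax;  /* negative when y < ymax */

    /* The sign bit survives the AND only if all four values are negative. */
    uint64_t all = (uint64_t)a & (uint64_t)b & (uint64_t)c & (uint64_t)d;
    return (int)(all >> 63);
}
```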

I think you will need the four tests no matter what, but if you know if the point is more likely to be in or out of the rectangle, you can make sure those four tests are only run in the worst case.
If the likelihood of the point being inside is higher, you can do
if ((x>Xmax) || (x<Xmin) || (y>Ymax) || (y<Ymin)) {
// point not in rectangle
}
Otherwise, do the opposite:
if ((x<=Xmax) && (x>=Xmin) && (y<=Ymax) && (y>=Ymin)) {
// point in rectangle
}
I am curious whether there really is anything better... (unless you can make some assumption about where the rectangle edges are, like they are aligned to powers of 2 or something funky like that)

Many architectures support branchless absolute value operation. If not, it can be simulated by multiplication, or left shifting a signed value and having faith on particular "implementation dependent" behaviour.
Also it's quite possible that in Intel and ARM architectures the operation can be made branchless with
((x0<x) && (x<x1))&((y0<y) && (y<y1))
The reason is that the range check is often optimized to a sequence:
mov ebx, 1 // not needed on arm
sub eax, imm0
sub eax, imm1 // this will cause a carry only when both conditions are met
cmovc eax, ebx // movcs reg, #1 on ARM
The bitwise and between (x) and (y) expressions is also branchless.
EDIT Original idea was:
Given test range: a<=x<=b, first define the middle point. Then both sides can be tested with |(x-mid)| < A; multiplying with a factor B to have A a power of two...
(x-mid)*B < 2^n and squaring
((x-mid)*B)^2 < 2^2n
This value has only bits set at the least significant 2n bits (if the condition is satisfied). Do the same for range y and OR them. In this case the factor C must be chosen so that (y-midy)^2 scales to the same 2^2n.
return (((x-mid)*B)*(((x-mid)*B) | ((y-mid)*C)*((y-mid)*C))) >> (n*2);
The return value is 0 for x,y inside the AABB and non-zero for x,y outside.
(Here the operation is or, as one is interested in the complement of (a&&b) & (c&&d), which is (!(a&&b)) | (!(c&&d)).)

You don't tell us what you know about the range of possible values and resolution required, nor on what criterion you want to optimize.
A solution is to precompute a 2D array of booleans (if you can afford it) that you look up for your pair of coordinates. Costs 1 multiply (or shift), 1 add (for address computation) and 1 memory read.
Or two 1D arrays of booleans. Costs 2 adds, two memory reads and 1 AND, with much smaller tables.

How to properly add/subtract a 128-bit number (as two uint64_t)?

I'm working in C and need to add and subtract a 64-bit number and a 128-bit number. The result will be held in the 128-bit number. I am using an integer array to store the upper and lower halves of the 128-bit number (i.e. uint64_t bigNum[2], where bigNum[0] is the least significant).
Can anybody help with an addition and subtraction function that can take in bigNum and add/subtract a uint64_t to it?
I have seen many incorrect examples on the web, so consider this:
bigNum[0] = 0;
bigNum[1] = 1;
subtract(&bigNum, 1);
At this point bigNum[0] should have all bits set, while bigNum[1] should have no bits set.

In many architectures it's very easy to add/subtract any arbitrarily-long integers because there's a carry flag and add/sub-with-flag instruction. For example on x86 rdx:rax += r8:r9 can be done like this
add rax, r9 # add the low parts and store the carry
adc rdx, r8 # add the high parts with carry
In C there's no way to access this carry flag, so you must calculate the carry on your own. The easiest way is to check whether the unsigned sum is less than either of the operands. For example, to do a += b we'll do
aL += bL;
aH += bH + (aL < bL);
This is exactly how multi-word add is done in architectures that don't have a flag register. For example in MIPS it's done like this
# alow = blow + clow
addu alow, blow, clow
# set tmp = 1 if alow < clow, else 0
sltu tmp, alow, clow
addu ahigh, bhigh, chigh
addu ahigh, ahigh, tmp
Here's some example assembly output
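Applied to the layout from the question (bigNum[0] is the least significant half), the carry check could be sketched like this. The function names add_u64 and sub_u64 are made up for illustration.

```c
#include <stdint.h>

/* big[0] is the low 64 bits, big[1] the high 64 bits, as in the question. */
void add_u64(uint64_t big[2], uint64_t v)
{
    big[0] += v;
    big[1] += (big[0] < v);      /* the sum wrapped, so carry into the high half */
}

void sub_u64(uint64_t big[2], uint64_t v)
{
    uint64_t borrow = big[0] < v;  /* low half too small: borrow from the high half */
    big[0] -= v;                   /* unsigned subtraction wraps as needed */
    big[1] -= borrow;
}
```

With bigNum[0] = 0 and bigNum[1] = 1, calling sub_u64(bigNum, 1) leaves all bits set in bigNum[0] and none set in bigNum[1], which is the result the question asks for.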

This should work for the subtraction:
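#include <stdint.h>
typedef uint64_t bigNum[2];
/* Note: this answer treats (*a)[1] as the low half and (*a)[0] as the
   high half, the opposite of the layout described in the question. */
void subtract(bigNum *a, uint64_t b)
{
const uint64_t borrow = b > (*a)[1];
(*a)[1] -= b;
(*a)[0] -= borrow;
}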
Addition is very similar. The above could of course be expressed with an explicit test, too, but I find it cleaner to always do the borrowing. Optimization left as an exercise.
For a bigNum equal to { 0, 1 }, subtracting two would make it equal { ~0UL, ~0UL }, which is the proper bit pattern to represent -1. Here, UL is assumed to promote an integer to 64 bits, which is compiler-dependent of course.

In grade 1 or 2, you should have learned how to break down the addition of 1 and 10 into parts, by splitting it into multiple separate additions of tens and units. When dealing with big numbers, the same principles can be applied to compute arithmetic operations on arbitrarily large numbers, by realizing your units are now units of 2^bits, your "tens" are 2^bits larger and so on.

If the value that you are subtracting is less than or equal to bignum[0], you don't have to touch bignum[1].
If it isn't, you subtract it from bignum[0] anyhow. This operation will wrap around, but that is the behavior you need here. In addition, you then have to subtract 1 from bignum[1].
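The rule above extends to any number of words by letting the borrow ripple upward. Here is a rough sketch, assuming words[0] is the least significant word as in the question; `bignum_sub_word` is a hypothetical name, not something from the original posts:

```c
#include <stddef.h>
#include <stdint.h>

/* Subtracts b from an n-word number (words[0] = least significant word).
 * Returns 1 if the borrow ran off the top word (result wrapped), else 0. */
unsigned bignum_sub_word(uint64_t *words, size_t n, uint64_t b)
{
    uint64_t borrow = b;  /* amount still to subtract from the current word */
    for (size_t i = 0; i < n && borrow != 0; i++) {
        uint64_t old = words[i];
        words[i] = old - borrow;  /* wraps around when old < borrow */
        borrow = old < borrow;    /* 1 if we had to borrow from the next word */
    }
    return (unsigned)borrow;
}
```

With the question's example, {0, 1} minus 1 becomes {all bits set, 0}.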

GCC and Clang support an __int128 type natively on 64-bit targets.
Try it and you might be lucky.
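If the compiler does provide it, the two-word subtraction collapses to ordinary arithmetic. A sketch assuming GCC or Clang on a 64-bit target, with words[0] as the low word; the `sub128` helper is hypothetical:

```c
#include <stdint.h>

/* Assumes the compiler-specific unsigned __int128 extension is available. */
void sub128(uint64_t words[2], uint64_t b)
{
    /* Pack the two words, subtract, and unpack again. */
    unsigned __int128 v = ((unsigned __int128)words[1] << 64) | words[0];
    v -= b;
    words[0] = (uint64_t)v;
    words[1] = (uint64_t)(v >> 64);
}
```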

Most optimized way to calculate modulus in C

I have to minimize the cost of calculating a modulus in C.
say I have a number x and n is the number which will divide x
when n == 65536 (which happens to be 2^16):
mod = x % n (11 assembly instructions as produced by GCC)
or
mod = x & 0xffff which is equal to mod = x & 65535 (4 assembly instructions)
so, GCC doesn't optimize it to this extent.
In my case n is not 2^(int) but is the largest prime less than 2^16, which is 65521.
As I showed for n == 2^16, bit-wise operations can optimize the computation. What bit-wise operations can I perform when n == 65521 to calculate the modulus?

First, make sure you're looking at optimized code before drawing conclusions about what GCC is producing (and make sure this particular expression really needs to be optimized). Second, don't count instructions to draw your conclusions; an 11-instruction sequence may well perform better than a shorter sequence that includes a div instruction.
Also, you can't conclude that because x mod 65536 can be calculated with a simple bit mask that any mod operation can be implemented that way. Consider how easy dividing by 10 in decimal is as opposed to dividing by an arbitrary number.
With all that out of the way, you may be able to use some of the 'magic number' techniques from Henry Warren's Hacker's Delight book:
Archive of http://www.hackersdelight.org/
Archive of http://www.hackersdelight.org/magic.htm
There was an added chapter on the website that contained "two methods of computing the remainder of division without computing the quotient!", which you may find of some use. The 1st technique applies only to a limited set of divisors, so it won't work for your particular instance. I haven't actually read the online chapter, so I don't know exactly how applicable the other technique might be for you.

x mod 65536 is only equivalent to x & 0xffff if x is unsigned - for signed x, it gives the wrong result for negative numbers. For unsigned x, gcc does indeed optimise x % 65536 to a bitwise and with 65535 (even on -O0, in my tests).
Because 65521 is not a power of 2, x mod 65521 can't be calculated so simply. gcc 4.3.2 on -O3 calculates it using x - (x / 65521) * 65521; the integer division by a constant is done using integer multiplication by a related constant.
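To illustrate the "multiply by a related constant" idea, here is a sketch in portable C. It is not the exact constant or instruction sequence GCC emits; it uses c = ceil(2^64 / 65521), which gives the exact quotient for every 32-bit x because the rounding error in c, scaled by x, stays below one unit. The names `ADLER_MOD` and `mod65521` are made up for this example:

```c
#include <stdint.h>

#define ADLER_MOD 65521u  /* hypothetical name for the divisor */

uint32_t mod65521(uint32_t x)
{
    /* c = ceil(2^64 / 65521); a constant expression the compiler can fold. */
    const uint64_t c = UINT64_MAX / ADLER_MOD + 1;
    const uint64_t c_hi = c >> 32;
    const uint64_t c_lo = c & 0xffffffffu;

    /* q = floor(c * x / 2^64), i.e. the top bits of the 96-bit product,
     * assembled from two 64-bit partial products. This equals x / 65521. */
    uint64_t q = (c_hi * x + ((c_lo * x) >> 32)) >> 32;

    /* remainder = x - quotient * divisor */
    return x - (uint32_t)q * ADLER_MOD;
}
```

In practice, writing x % 65521u and letting the compiler pick the constant is simpler; the sketch only shows why no div instruction is needed.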

If you don't have to fully reduce your integers modulo 65521, then you can use the fact that 65521 is close to 2**16. I.e. if x is an unsigned int you want to reduce then you can do the following:
unsigned int low = x &0xffff;
unsigned int hi = (x >> 16);
x = low + 15 * hi;
This uses that 2**16 % 65521 == 15. Note that this is not a full reduction. I.e. starting with a 32-bit input, you only are guaranteed that the result is at most 20 bits and that it is of course congruent to the input modulo 65521.
This trick can be used in applications where there are many operations that have to be reduced modulo the same constant, and where intermediary results do not have to be the smallest element in its residue class.
E.g. one application is the implementation of Adler-32, which uses the modulus 65521. This hash function does a lot of operations modulo 65521. To implement it efficiently one would only do modular reductions after a carefully computed number of additions. A reduction as shown above is enough for the intermediate values; only the final hash value needs a full modulo operation.
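A small sketch of that partial reduction as a function, plus one possible way to finish it off: applying it twice brings any 32-bit input below 2 * 65521, so a single conditional subtraction completes the job. Both function names are hypothetical, and the two-pass finish is an addition for illustration rather than part of the answer above:

```c
#include <stdint.h>

/* One partial reduction: the result is congruent to x mod 65521
 * and fits in 20 bits, but is not necessarily below 65521. */
uint32_t reduce_partial(uint32_t x)
{
    uint32_t low = x & 0xffffu;
    uint32_t hi  = x >> 16;
    return low + 15u * hi;  /* uses 2^16 % 65521 == 15 */
}

/* Full reduction built from two partial passes and one subtraction. */
uint32_t reduce_full(uint32_t x)
{
    x = reduce_partial(x);  /* now below 2^20 */
    x = reduce_partial(x);  /* now below 2 * 65521 */
    if (x >= 65521u)
        x -= 65521u;
    return x;
}
```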

The bitwise operation only works well if the divisor is of the form 2^n. In the general case, there is no such bit-wise operation.

If the constant with which you want to take the modulo is known at compile time
and you have a decent compiler (e.g. gcc), it's usually best to let the compiler
work its magic. Just declare the modulo const.
If you don't know the constant at compile time, but you are going to take - say -
a billion modulos with the same number, then use this http://libdivide.com/

When the divisor is a power of 2, one approach to consider is this (mostly C flavored):
.
.
#define THE_DIVISOR 0x8U /* The modulo value (POWER OF 2). No trailing semicolon, or (THE_DIVISOR - 1) will not compile. */
.
.
uint8 CheckIfModulo(const sint32 TheDividend)
{
uint8 RetVal = 1; /* TheDividend is not modulus THE_DIVISOR. */
if (0 == (TheDividend & (THE_DIVISOR - 1)))
{
/* code if modulo is satisfied */
RetVal = 0; /* TheDividend IS modulus THE_DIVISOR. */
}
else
{
/* code if modulo is NOT satisfied */
}
return RetVal;
}

If x is an increasing index, and the increment i is known to be less than n (e.g. when iterating over a circular array of length n), avoid the modulus completely.
A loop going
x += i; if (x >= n) x -= n;
is way faster than
x = (x + i) % n;
which you unfortunately find in many text books...
If you really need an expression (e.g. because you are using it in a for statement), you can use the ugly but efficient
x = x + (x+i < n ? i : i-n)
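A minimal sketch of the wrap-around step as a helper, assuming x < n and i < n on entry as stated above; `advance` is a hypothetical name:

```c
#include <stddef.h>

/* Advances index x by i positions in a circular buffer of length n.
 * Precondition (assumed, not checked): x < n and i < n. */
size_t advance(size_t x, size_t i, size_t n)
{
    x += i;
    if (x >= n)
        x -= n;  /* at most one wrap is needed because i < n */
    return x;
}
```

For example, advance(6, 3, 8) wraps around to 1, the same value (6 + 3) % 8 would give.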

idiv — Integer Division
The idiv instruction divides the contents of the 64 bit integer EDX:EAX (constructed by viewing EDX as the most significant four bytes and EAX as the least significant four bytes) by the specified operand value. The quotient result of the division is stored into EAX, while the remainder is placed in EDX.
source: http://www.cs.virginia.edu/~evans/cs216/guides/x86.html

I need a fast 96-bit on 64-bit specific division algorithm for a fixed-point math library

I am currently writing a fast 32.32 fixed-point math library. I succeeded at making adding, subtraction and multiplication work correctly, but I am quite stuck at division.
A little reminder for those who can't remember: a 32.32 fixed-point number is a number having 32 bits of integer part and 32 bits of fractional part.
The best algorithm I came up with needs 96-bit integer division, which is something compilers usually don't have built-ins for.
Anyway, here it goes:
G = 2^32
notation: x is the 64-bit fixed-point number, x1 is its low 32-bit half and x2 is its high 32-bit half
G*(a/b) = ((a1 + a2*G) / (b1 + b2*G))*G // Decompose this
G*(a/b) = (a1*G) / (b1 + b2*G) + (a2*G*G) / (b1 + b2*G)
As you can see, the (a2*G*G) is guaranteed to be larger than the regular 64-bit integer. If uint128_t's were actually supported by my compiler, I would simply do the following:
((uint128_t)x << 32) / y
Well they aren't and I need a solution. Thank you for your help.

You can decompose a larger division into multiple chunks that do division with fewer bits. As another poster already mentioned, the algorithm can be found in TAOCP from Knuth.
However, no need to buy the book!
There is code on the Hacker's Delight website that implements the algorithm in C. It's written to do 64-bit unsigned divisions using 32-bit arithmetic only, so you can't directly cut'n'paste the code. To get from 64 to 128-bit you have to double the width of all types, masks and constants, e.g. a short becomes an int, 0xffff becomes 0xffffffffll, etc.
After this easy change you should be able to do 128-bit divisions.
The code is mirrored on GitHub, but was originally posted on Hackersdelight.org (original link no longer accessible).
Since your largest values only need 96 bits, one of the 64-bit divisions will always return zero, so you can even simplify the code a bit.
Oh - and before I forget this: The code only works with unsigned values. To convert from signed to unsigned divide you can do something like this (pseudo-code style):
fixpoint Divide (fixpoint a, fixpoint b)
{
// check if the integers are of different sign:
fixpoint sign_difference = a ^ b;
// do unsigned division:
fixpoint x = unsigned_divide (abs(a), abs(b));
// if the signs have been different: negate the result.
if (sign_difference < 0)
{
x = -x;
}
return x;
}
The website itself is worth checking out as well: http://www.hackersdelight.org/
By the way - nice task that you're working on.. Do you mind telling us for what you need the fixed-point library?
By the way - the ordinary shift and subtract algorithm for division would work as well.
If you target x86 you can implement it using MMX or SSE intrinsics. The algorithm relies only on primitive operations, so it could perform quite fast as well.
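For reference, a rough sketch of the shift-and-subtract approach applied to the unsigned 32.32 case, computing (a << 32) / b one bit at a time without any 128-bit type. The name `fix_udiv` is hypothetical; the sketch assumes b != 0 and that the quotient fits in 64 bits (higher quotient bits are silently dropped):

```c
#include <stdint.h>

/* Unsigned 32.32 fixed-point division by restoring (shift-and-subtract)
 * long division over the 96-bit dividend a * 2^32.
 * Assumptions: b != 0, and the true quotient fits in 64 bits. */
uint64_t fix_udiv(uint64_t a, uint64_t b)
{
    uint64_t rem = 0;  /* running remainder, always < b between steps */
    uint64_t quo = 0;

    for (int i = 95; i >= 0; i--) {
        /* Bit i of the dividend: the low 32 bits are the appended zeros. */
        uint64_t bit = (i >= 32) ? ((a >> (i - 32)) & 1u) : 0u;
        uint64_t carry = rem >> 63;  /* bit about to be shifted out of rem */

        rem = (rem << 1) | bit;
        quo <<= 1;

        /* If a bit was shifted out, the true remainder exceeds 2^64 > b. */
        if (carry || rem >= b) {
            rem -= b;  /* wraps to the correct value when carry is set */
            quo |= 1u;
        }
    }
    return quo;
}
```

Dividing 1.0 (0x100000000) by 2.0 (0x200000000) yields 0x80000000, which is 0.5 in 32.32 format. Signed operands can go through the sign-handling wrapper shown earlier.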

Better self-adjusting answer:
Forgive the C#-ism of the answer, but the following should work in all cases. There is likely a solution possible that finds the right shifts to use quicker, but I'd have to think much deeper than I can right now. This should be reasonably efficient though:
int upshift = 32;
ulong mask = 0xFFFFFFFF00000000;
ulong mod = x % y;
while ((mod & mask) != 0)
{
// Current upshift of the remainder would overflow... so adjust
y >>= 1;
mask <<= 1;
upshift--;
mod = x % y;
}
ulong div = ((x / y) << upshift) + (mod << upshift) / y;
Simple but unsafe answer:
This calculation can cause an overflow in the upshift of the x % y remainder if this remainder has any bits set in the high 32 bits, causing an incorrect answer.
((x / y) << 32) + ((x % y) << 32) / y
The first part uses integer division and gives you the high bits of the answer (shift them back up).
The second part calculates the low bits from the remainder of the high-bit division (the bit that could not be divided any further), shifted up and then divided.

I like Nils' answer, which is probably the best. It's just long division, like we all learned in grade school, except the digits are base 2^32 instead of base 10.
However, you might also consider using Newton's approximation method for division:
x := x * (2 - D * x)
where D is the denominator. x converges to 1/D, and the quotient is then N * x, where N is the numerator.
This just uses multiplies and adds, which you already have, and it converges very quickly to about 1 ULP of precision. On the other hand, you won't be able to achieve the exact 0.5-ULP answer in all cases.
In any case, the tricky bit is detecting and handling the overflows.
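To make the convergence concrete, here is an illustrative sketch of the standard Newton-Raphson reciprocal iteration, x := x * (2 - D * x), followed by a multiply by N. It uses double rather than 32.32 fixed point purely to keep it short, and it assumes the denominator has already been scaled into [0.5, 1]; the `newton_divide` name and the linear starting guess are choices made for this example:

```c
/* Approximates n / d with Newton-Raphson on the reciprocal of d.
 * Assumption: 0.5 <= d <= 1.0 (a real implementation would first
 * shift d into this range and compensate afterwards). */
double newton_divide(double n, double d)
{
    /* Linear first guess for 1/d; its relative error is at most 1/17. */
    double x = 48.0 / 17.0 - (32.0 / 17.0) * d;

    /* Each pass roughly squares the relative error, so 4 passes suffice. */
    for (int i = 0; i < 4; i++)
        x = x * (2.0 - d * x);

    return n * x;
}
```

A fixed-point version would follow the same steps with fixed-point multiplies, which is where the overflow handling mentioned above comes in.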

Quick -n- dirty.
Do the A/B divide with double precision floating point.
This gives you C~=A/B. It's only approximate because of floating point precision and 53 bits of mantissa.
Round off C to a representable number in your fixed point system.
Now compute (again with your fixed point) D=A-C*B. This should have significantly lower magnitude than A.
Repeat, now computing D/B with floating point. Again, round the answer to a representable fixed-point value. Add each division result together as you go. You can stop when your remainder is so small that your floating point divide returns 0 after rounding.
You're still not done. Now you're very close to the answer, but the divisions weren't exact.
To finalize, you'll have to do a binary search. Using the (very good) starting estimate, see if increasing it improves the error.. you basically want to bracket the proper answer and keep dividing the range in half with new tests.
Yes, you could do Newton iteration here, but binary search will likely be easier since you need only simple multiplies and adds using your existing 32.32 precision toolkit.
This is not the most efficient method, but it's by far the easiest to code.
