How to do the modulo operation in ARM assembly? - arm

I'm trying to add the values in two registers, and modulo them by 8.
So, in C code, it would be like this:
a = a + b;
c = a % 8;
How do I do the above operation in ARM assembly?

Not all ARM processors have a direct instruction for division or modulo, so in most cases a modulo operation ends up as a library call, e.g. to __modsi3 (on EABI targets, __aeabi_idivmod).
In this particular case, taking the value modulo 8, if the values can be assumed to be nonnegative, you can do the % 8 part as & 7. In that case, the assembly for your case would be:
add rA, rA, rB
and rC, rA, #7
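For reference, here is the same transformation in C, as a minimal sketch (the helper name is just for illustration), assuming the operands are unsigned; a compiler will typically do this rewrite by itself for unsigned types:
/* For unsigned values, (a + b) % 8 reduces to a single AND after the add. */
unsigned int add_mod8(unsigned int a, unsigned int b)
{
    return (a + b) & 7u;
}
For signed types the compiler must emit extra fixup code, because C defines the remainder to carry the sign of the dividend (e.g. -3 % 8 == -3).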

Related

Assembly Modulo and Divide in IDA [duplicate]

I've been reading about div and mul assembly operations, and I decided to see them in action by writing a simple program in C:
File division.c
#include <stdlib.h>
#include <stdio.h>
int main()
{
size_t i = 9;
size_t j = i / 5;
printf("%zu\n",j);
return 0;
}
And then generating assembly language code with:
gcc -S division.c -O0 -masm=intel
But looking at generated division.s file, it doesn't contain any div operations! Instead, it does some kind of black magic with bit shifting and magic numbers. Here's a code snippet that computes i/5:
mov rax, QWORD PTR [rbp-16] ; Move i (=9) to RAX
movabs rdx, -3689348814741910323 ; Move some magic number to RDX (?)
mul rdx ; Multiply 9 by magic number
mov rax, rdx ; Take only the upper 64 bits of the result
shr rax, 2 ; Shift these bits 2 places to the right (?)
mov QWORD PTR [rbp-8], rax ; Magically, RAX contains 9/5=1 now,
; so we can assign it to j
What's going on here? Why doesn't GCC use div at all? How does it generate this magic number and why does everything work?
Integer division is one of the slowest arithmetic operations you can perform on a modern processor, with latency of up to a few dozen cycles and poor throughput. (For x86, see Agner Fog's instruction tables and microarch guide.)
If you know the divisor ahead of time, you can avoid the division by replacing it with a set of other operations (multiplications, additions, and shifts) which have the equivalent effect. Even if several operations are needed, it's often still a heck of a lot faster than the integer division itself.
Implementing the C / operator this way instead of with a multi-instruction sequence involving div is just GCC's default way of doing division by constants. It doesn't require optimizing across operations, and it doesn't change anything even for debugging. (Using -Os for small code size does get GCC to use div, though.) Using a multiplicative inverse instead of division is like using lea instead of mul and add.
As a result, you only tend to see div or idiv in the output if the divisor isn't known at compile-time.
For information on how the compiler generates these sequences, as well as code to let you generate them for yourself (almost certainly unnecessary unless you're working with a braindead compiler), see libdivide.
Dividing by 5 is the same as multiplying by 1/5, which is again the same as multiplying by 4/5 and shifting right 2 bits. The value concerned is CCCCCCCCCCCCCCCD in hex, which is the binary representation of 4/5 if put after a hexadecimal point (i.e. the binary for four fifths is 0.110011001100 recurring - see below for why). I think you can take it from here! You might want to check out fixed point arithmetic (though note it's rounded to an integer at the end).
As to why, multiplication is faster than division, and when the divisor is fixed, this is a faster route.
See Reciprocal Multiplication, a tutorial, for a detailed writeup about how it works; it explains things in terms of fixed point. It shows how the algorithm for finding the reciprocal works, and how to handle signed division and modulo.
Let's consider for a minute why 0.CCCCCCCC... (hex) or 0.110011001100... binary is 4/5. Divide the binary representation by 4 (shift right 2 places), and we'll get 0.001100110011..., which by trivial inspection can be added to the original to get 0.111111111111..., which is obviously equal to 1, the same way 0.9999999... in decimal is equal to one. Therefore, we know that x + x/4 = 1, so 5x/4 = 1, x = 4/5. This is then represented as CCCCCCCCCCCCCCCD in hex for rounding (as the binary digit beyond the last one present would be a 1).
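To see the whole trick end to end, here is a small C sketch, assuming a compiler with the common unsigned __int128 extension for grabbing the upper half of the 64x64-bit product (the helper and test code are just for illustration); the constant and shifts match the assembly above:
#include <stdint.h>
#include <stdio.h>

/* q = floor(x/5) via multiply-high: the constant is ceil(2^66/5), i.e. just
   over 4/5 in 0.64 fixed point, and the final >>2 performs the division by 4. */
static uint64_t div5(uint64_t x)
{
    const uint64_t magic = 0xCCCCCCCCCCCCCCCDull;
    uint64_t hi = (uint64_t)(((unsigned __int128)x * magic) >> 64);
    return hi >> 2;
}

int main(void)
{
    for (uint64_t x = 0; x < 1000000; ++x)
        if (div5(x) != x / 5) { printf("mismatch at %llu\n", (unsigned long long)x); return 1; }
    puts("all checks passed");
    return 0;
}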
In general multiplication is much faster than division. So if we can get away with multiplying by the reciprocal instead, we can significantly speed up division by a constant.
A wrinkle is that we cannot represent the reciprocal exactly (unless the division was by a power of two, but in that case we can usually just convert the division to a bit shift). So to ensure correct answers we have to be careful that the error in our reciprocal does not cause errors in our final result.
-3689348814741910323 is 0xCCCCCCCCCCCCCCCD which is a value of just over 4/5 expressed in 0.64 fixed point.
When we multiply a 64 bit integer by a 0.64 fixed point number we get a 64.64 result. We truncate the value to a 64-bit integer (effectively rounding it towards zero) and then perform a further shift which divides by four and again truncates. By looking at the bit level it is clear that we can treat both truncations as a single truncation.
This clearly gives us at least an approximation of division by 5 but does it give us an exact answer correctly rounded towards zero?
To get an exact answer the error needs to be small enough not to push the answer over a rounding boundary.
The exact answer to a division by 5 will always have a fractional part of 0, 1/5, 2/5, 3/5 or 4/5. Therefore a positive error of less than 1/5 in the multiplied and shifted result will never push the result over a rounding boundary.
The error in our constant is (1/5) * 2^-64. The value of i is less than 2^64 so the error after multiplying is less than 1/5. After the division by 4 the error is less than (1/5) * 2^-2.
(1/5) * 2^-2 < 1/5 so the answer will always be equal to doing an exact division and rounding towards zero.
Unfortunately this doesn't work for all divisors.
If we try to represent 4/7 as a 0.64 fixed point number with rounding away from zero we end up with an error of (6/7) * 2^-64. After multiplying by an i value of just under 2^64 we end up with an error just under 6/7 and after dividing by four we end up with an error of just under 1.5/7 which is greater than 1/7.
So to implement division by 7 correctly we need to multiply by a 0.65 fixed point number. We can implement that by multiplying by the lower 64 bits of our fixed point number, then adding the original number (this may overflow into the carry bit), then doing a rotate through carry.
Here is a link to a document describing an algorithm that produces the values and code I see from Visual Studio (in most cases), and that I assume is still used in GCC for division of a variable integer by a constant integer.
http://gmplib.org/~tege/divcnst-pldi94.pdf
In the article, a uword has N bits, a udword has 2N bits, n = numerator = dividend, d = denominator = divisor, ℓ is initially set to ceil(log2(d)), shpre is pre-shift (used before multiply) = e = number of trailing zero bits in d, shpost is post-shift (used after multiply), prec is precision = N - e = N - shpre. The goal is to optimize calculation of n/d using a pre-shift, multiply, and post-shift.
Scroll down to figure 6.2, which defines how a udword multiplier (max size is N+1 bits) is generated, but doesn't clearly explain the process. I'll explain this below.
Figure 4.2 and figure 6.2 show how the multiplier can be reduced to an N bit or less multiplier for most divisors. Equation 4.5 explains how the formula used to deal with N+1 bit multipliers in figures 4.1 and 4.2 was derived.
In the case of modern X86 and other processors, multiply time is fixed, so pre-shift doesn't help on these processors, but it still helps to reduce the multiplier from N+1 bits to N bits. I don't know if GCC or Visual Studio have eliminated pre-shift for X86 targets.
Going back to Figure 6.2. The numerator (dividend) for mlow and mhigh can be larger than a udword only when denominator (divisor) > 2^(N-1) (when ℓ == N => mlow = 2^(2N)), in this case the optimized replacement for n/d is a compare (if n>=d, q = 1, else q = 0), so no multiplier is generated. The initial values of mlow and mhigh will be N+1 bits, and two udword/uword divides can be used to produce each N+1 bit value (mlow or mhigh). Using X86 in 64 bit mode as an example:
; upper 8 bytes of dividend = 2^(ℓ) = (upper part of 2^(N+ℓ))
; lower 8 bytes of dividend for mlow = 0
; lower 8 bytes of dividend for mhigh = 2^(N+ℓ-prec) = 2^(ℓ+shpre) = 2^(ℓ+e)
dividend dq 2 dup(?) ;16 byte dividend
divisor dq 1 dup(?) ; 8 byte divisor
; ...
mov rcx,divisor
mov rdx,0
mov rax,dividend+8 ;upper 8 bytes of dividend
div rcx ;after div, rax == 1
mov rax,dividend ;lower 8 bytes of dividend
div rcx
mov rdx,1 ;rdx:rax = N+1 bit value = 65 bit value
You can test this with GCC. You've already seen how j = i/5 is handled. Take a look at how j = i/7 is handled (which should be the N+1 bit multiplier case).
On most current processors, multiply has a fixed timing, so a pre-shift is not needed. For X86, the end result is a two instruction sequence for most divisors, and a five instruction sequence for divisors like 7 (in order to emulate an N+1 bit multiplier, as shown in equation 4.5 and figure 4.2 of the pdf file). Example X86-64 code:
; rbx = dividend, rax = 64 bit (or less) multiplier, rcx = post shift count
; two instruction sequence for most divisors:
mul rbx ;rdx = upper 64 bits of product
shr rdx,cl ;rdx = quotient
;
; five instruction sequence for divisors like 7
; to emulate 65 bit multiplier (rbx = lower 64 bits of multiplier)
mul rbx ;rdx = upper 64 bits of product
sub rbx,rdx ;rbx -= rdx
shr rbx,1 ;rbx >>= 1
add rdx,rbx ;rdx = upper 64 bits of corrected product
shr rdx,cl ;rdx = quotient
; ...
To explain the need for the 5 instruction sequence: a simple 3 instruction sequence could overflow. Let u64() mean the upper 64 bits (all that is needed for the quotient):
mul rbx ;rdx = u64(dvnd*mplr)
add rdx,rbx ;rdx = u64(dvnd*(2^64 + mplr)), could overflow
shr rdx,cl
To handle this case, cl = post_shift-1. rax = multiplier - 2^64, rbx = dividend. u64() is upper 64 bits. Note that rax = rax<<1 - rax. Quotient is:
u64( ( rbx * (2^64 + rax) )>>(cl+1) )
u64( ( rbx * (2^64 + rax<<1 - rax) )>>(cl+1) )
u64( ( (rbx * 2^64) + (rbx * rax)<<1 - (rbx * rax) )>>(cl+1) )
u64( ( (rbx * 2^64) - (rbx * rax) + (rbx * rax)<<1 )>>(cl+1) )
u64( ( (((rbx * 2^64) - (rbx * rax))>>1) + (rbx*rax) )>>cl )
mul rbx ; (rbx*rax)
sub rbx,rdx ; (rbx*2^64)-(rbx*rax)
shr rbx,1 ;( (rbx*2^64)-(rbx*rax))>>1
add rdx,rbx ;( ((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax)
shr rdx,cl ;((((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax))>>cl
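For comparison, here is the same five-op sequence expressed in C, as a sketch assuming the common unsigned __int128 extension for the upper half of the product (the function name is just for illustration); 0x2492492492492493 is the low 64 bits of ceil(2^67/7), i.e. the N+1 = 65 bit multiplier discussed above:
#include <stdint.h>

/* x/7 for 64-bit unsigned x, mirroring the mul/sub/shr/add/shr sequence. */
uint64_t div7(uint64_t x)
{
    uint64_t hi = (uint64_t)(((unsigned __int128)x * 0x2492492492492493ull) >> 64);
    uint64_t t  = ((x - hi) >> 1) + hi; /* recovers the 65th multiplier bit without overflow */
    return t >> 2;                      /* shift by cl = post_shift - 1 = 2 */
}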
I will answer from a slightly different angle: Because it is allowed to do it.
C and C++ are defined against an abstract machine. The compiler transforms a program written in terms of the abstract machine into a concrete machine program, following the as-if rule.
The compiler is allowed to make ANY changes as long as it doesn't change the observable behaviour as specified by the abstract machine. There is no reasonable expectation that the compiler will transform your code in the most straightforward way possible (even though a lot of C programmers assume that). Usually, it does this because the compiler wants to optimize the performance compared to the straightforward approach (as discussed in the other answers at length).
If under any circumstances the compiler "optimizes" a correct program to something that has a different observable behaviour, that is a compiler bug.
If there is any undefined behaviour in your code (signed integer overflow is a classic example), this contract is void.

Is this how the + operator is implemented in C?

While trying to understand how primitive operators such as +, -, * and / are implemented in C, I found the following snippet in an interesting answer.
// replaces the + operator
int add(int x, int y) {
    while (x) {
        int t = (x & y) << 1;
        y ^= x;
        x = t;
    }
    return y;
}
It seems that this function demonstrates how + actually works in the background. However, it's too confusing for me to understand it. For a long time, I believed that such operations were done using assembly instructions generated by the compiler!
Is the + operator implemented like the code posted above on MOST implementations? Does this take advantage of two's complement or other implementation-dependent features?
To be pedantic, the C specification does not specify how addition is implemented.
But to be realistic, the + operator on integer types smaller than or equal to the word size of your CPU gets translated directly into an addition instruction for the CPU, and larger integer types get translated into multiple addition instructions with some extra bits to handle overflow.
The CPU internally uses logic circuits to implement the addition, and does not use loops, bitshifts, or anything that has a close resemblance to how C works.
When you add two bits, the following is the result (truth table):
a | b | sum (a^b) | carry bit (a&b) (goes to next)
--+---+-----------+-------------------------------
0 | 0 |     0     |               0
0 | 1 |     1     |               0
1 | 0 |     1     |               0
1 | 1 |     0     |               1
So if you do bitwise xor, you can get the sum without carry.
And if you do bitwise and, you can get the carry bits.
Extending this observation for multibit numbers a and b
a+b = sum_without_carry(a, b) + carry_bits(a, b) shifted by 1 bit left
= a^b + ((a&b) << 1)
Once b is 0:
a+0 = a
So the algorithm boils down to:
Add(a, b)
  if b == 0
    return a;
  else
    carry_bits = a & b;
    sum_bits = a ^ b;
    return Add(sum_bits, carry_bits << 1);
If you get rid of the recursion and convert it to a loop:
Add(a, b)
  while (b != 0) {
    carry_bits = a & b;
    sum_bits = a ^ b;
    a = sum_bits;
    b = carry_bits << 1; // In next loop, add carry bits to a
  }
  return a;
With the above algorithm in mind, the explanation of the code should be simpler:
int t = (x & y) << 1;
Carry bits: a bit of t is 1 if the bit one position to the right is 1 in both operands.
y ^= x; // x is used now
Addition without carry (carry bits ignored).
x = t;
Reuse x to hold the carry bits.
while(x)
Repeat while there are more carry bits.
A recursive implementation (easier to understand) would be:
int add(int x, int y) {
    return (y == 0) ? x : add(x ^ y, (x & y) << 1);
}
Seems that this function demonstrates how + actually works in the
background
No. Usually (almost always) integer addition translates to the machine instruction add. This just demonstrates an alternative implementation using bitwise xor and and.
Seems that this function demonstrates how + actually works in the background
No. This is translated to the native add machine instruction, which actually uses the hardware adder in the ALU.
If you're wondering how does the computer add, here is a basic adder.
Everything in the computer is done using logic gates, which are mostly made of transistors. The full adder has half-adders in it.
For a basic tutorial on logic gates, and adders, see this. The video is extremely helpful, though long.
In that video, a basic half-adder is shown. If you want a brief description, this is it:
The half adder adds the two given bits. The possible combinations are:
Add 0 and 0 = 0
Add 1 and 0 = 1
Add 1 and 1 = 10 (binary)
So now how does the half adder work? Well, it is made up of three logic gates: the and, the xor and the nand. The nand gives a positive output if both the inputs are negative, so that means it solves the case of 0 and 0. The xor gives a positive output if one of the inputs is positive and the other negative, so that means it solves the problem of 1 and 0. The and gives a positive output only if both the inputs are positive, so that solves the problem of 1 and 1. So basically, we have now got our half-adder. But we can still only add single bits.
Now we make our full-adder. A full adder consists of calling the half-adder again and again. Now this has a carry. When we add 1 and 1, we get a carry 1. So what the full-adder does is, it takes the carry from the half-adder, stores it, and passes it as another argument to the half-adder.
If you're confused about how you can pass the carry: you basically first add the bits using the half-adder, and then add the sum and the carry. So now you've added the carry with the two bits. You do this again and again, till the bits you have to add are over, and then you get your result.
Surprised? This is how it actually happens. It looks like a long process, but the computer does it in fractions of a nanosecond, or to be more specific, in half a clock cycle. Sometimes it is performed even in a single clock cycle. Basically, the computer has the ALU (a major part of the CPU), memory, buses, etc..
If you want to learn computer hardware, from logic gates, memory and the ALU, and simulate a computer, you can see this course, from which I learnt all this: Build a Modern Computer from First Principles
It's free if you do not want an e-certificate. Part two of the course is coming up this spring.
C uses an abstract machine to describe what C code does. So how it works is not specified. There are C "compilers" that actually compile C into a scripting language, for example.
But, in most C implementations, + between two integers smaller than the machine integer size will be translated into an assembly instruction (after many steps). The assembly instruction will be translated into machine code and embedded within your executable. Assembly is a language "one step removed" from machine code, intended to be easier to read than a bunch of packed binary.
That machine code (after many steps) is then executed by the target hardware platform, where it is interpreted by the instruction decoder on the CPU. This instruction decoder takes the instruction and translates it into signals to send along "control lines". These signals route data from registers and memory through the CPU, where the values are added together, often in an arithmetic logic unit.
The arithmetic logic unit might have separate adders and multipliers, or might mix them together.
The arithmetic logic unit has a bunch of transistors that perform the addition operation, then produce the output. Said output is routed via the signals generated from the instruction decoder, and stored in memory or registers.
The layout of said transistors in both the arithmetic logic unit and instruction decoder (as well as parts I have glossed over) is etched into the chip at the plant. The etching pattern is often produced by compiling a hardware description language, which takes an abstraction of what is connected to what and how they operate and generates transistors and interconnect lines.
The hardware description language can contain shifts and loops that don't describe things happening in time (like one after another) but rather in space -- it describes the connections between different parts of hardware. Said code may look very vaguely like the code you posted above.
The above glosses over many parts and layers and contains inaccuracies. This is both from my own incompetence (I have written both hardware and compilers, but am an expert in neither) and because full details would take a career or two, and not a SO post.
Here is an SO post about an 8-bit adder. Here is a non-SO post, where you'll note some of the adders just use operator+ in the HDL! (The HDL itself understands + and generates the lower-level adder code for you.)
Almost any modern processor that can run compiled C code will have builtin support for integer addition. The code you posted is a clever way to perform integer addition without executing an integer add opcode, but it is not how integer addition is normally performed. In fact, the function linkage probably uses some form of integer addition to adjust the stack pointer.
The code you posted relies on the observation that when adding x and y, you can decompose it into the bits they have in common and the bits that are unique to one of x or y.
The expression x & y (bitwise AND) gives the bits common to x and y. The expression x ^ y (bitwise exclusive OR) gives the bits that are unique to one of x or y.
The sum x + y can be rewritten as the sum of two times the bits they have in common (since both x and y contribute those bits) plus the bits that are unique to x or y.
(x & y) << 1 is twice the bits they have in common (the left shift by 1 effectively multiplies by two).
x ^ y is the bits that are unique to one of x or y.
So if we replace x by the first value and y by the second, the sum should be unchanged. You can think of the first value as the carries of the bitwise additions, and the second as the low-order bit of the bitwise additions.
This process continues until x is zero, at which point y holds the sum.
The code that you found tries to explain how very primitive computer hardware might implement an "add" instruction. I say "might" because I can guarantee that this method isn't used by any CPU, and I'll explain why.
In normal life, you use decimal numbers and you have learned how to add them: to add two numbers, you add the lowest two digits. If the result is less than 10, you write down the result and proceed to the next digit position. If the result is 10 or more, you write down the result minus 10, proceed to the next digit, but you remember to add 1 more. For example: 23 + 37, you add 3+7 = 10, you write down 0 and remember to add 1 more for the next position. At the 10s position, you add (2+3) + 1 = 6 and write that down. The result is 60.
You can do the exact same thing with binary numbers. The difference is that the only digits are 0 and 1, so the only possible sums are 0, 1, 2. For a 32 bit number, you would handle one digit position after the other. And that is how really primitive computer hardware would do it.
This code works differently. You know the sum of two binary digits is 2 if both digits are 1. So if both digits are 1, you write down 0 and add 1 more at the next binary position. That's what the calculation of t does: it finds all places where both binary digits are 1 (that's the &) and moves them to the next digit position (<< 1). Then it does the addition: 0+0 = 0, 0+1 = 1, 1+0 = 1, and 1+1 is 2, but we write down 0. That's what the exclusive or operator does.
But all the 1's that you had to handle in the next digit position haven't been handled. They still need to be added. That's why the code does a loop: In the next iteration, all the extra 1's are added.
Why does no processor do it that way? Because it's a loop, and processors don't like loops, and it is slow. It's slow, because in the worst case, 32 iterations are needed: If you add 1 to the number 0xffffffff (32 1-bits), then the first iteration clears bit 0 of y and sets x to 2. The second iteration clears bit 1 of y and sets x to 4. And so on. It takes 32 iterations to get the result. However, each iteration has to process all bits of x and y, which takes a lot of hardware.
A primitive processor would do things just the way you do decimal arithmetic, from the lowest position to the highest. It also takes 32 steps, but each step processes only two bits plus one value from the previous bit position, so it is much easier to implement. And even in a primitive computer, one can afford to do this without having to implement loops.
A modern, fast and complex CPU will use a "conditional sum adder". Especially if the number of bits is high, for example a 64 bit adder, it saves a lot of time.
A 64 bit adder consists of two parts: First, a 32 bit adder for the lowest 32 bit. That 32 bit adder produces a sum, and a "carry" (an indicator that a 1 must be added to the next bit position). Second, two 32 bit adders for the higher 32 bits: One adds x + y, the other adds x + y + 1. All three adders work in parallel. Then when the first adder has produced its carry, the CPU just picks which one of the two results x + y or x + y + 1 is the correct one, and you have the complete result. So a 64 bit adder only takes a tiny bit longer than a 32 bit adder, not twice as long.
The 32 bit adder parts are again implemented as conditional sum adders, using multiple 16 bit adders, and the 16 bit adders are conditional sum adders, and so on.
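A sketch of the conditional-sum idea in C, purely illustrative (in hardware the three narrower additions run simultaneously, whereas here they are necessarily sequential; the function name is made up):
#include <stdint.h>

/* 64-bit conditional-sum addition built from 32-bit pieces. */
uint64_t csum_add64(uint64_t a, uint64_t b)
{
    uint32_t lo   = (uint32_t)a + (uint32_t)b;
    uint32_t cy   = lo < (uint32_t)a;                          /* carry out of the low half */
    uint32_t hi_0 = (uint32_t)(a >> 32) + (uint32_t)(b >> 32); /* assumes carry = 0 */
    uint32_t hi_1 = hi_0 + 1;                                  /* assumes carry = 1 */
    return ((uint64_t)(cy ? hi_1 : hi_0) << 32) | lo;          /* pick once cy is known */
}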
My question is: Is the + operator implemented like the code posted above on MOST implementations?
Let's answer the actual question. All operators are implemented by the compiler as some internal data structure that eventually gets translated into code after some transformations. You can't say what code will be generated by a single addition because almost no real world compiler generates code for individual statements.
The compiler is free to generate any code as long as it behaves as if the actual operations were performed according to the standard. But what actually happens can be something completely different.
A simple example:
static int
foo(int a, int b)
{
    return a + b;
}

[...]

int a = foo(1, 17);
int b = foo(x, x);
some_other_function(a, b);
There's no need to generate any addition instructions here. It's perfectly legal for the compiler to translate this into:
some_other_function(18, x * 2);
Or maybe the compiler notices that you call the function foo a few times in a row, that it is simple arithmetic, and it will generate vector instructions for it. Or that the result of the addition is used for array indexing later, and the lea instruction will be used.
You simply can't talk about how an operator is implemented because it is almost never used alone.
In case a breakdown of the code helps anyone else, take the example x=2, y=6:
x isn't zero, so commence adding to y:
while(2) {
x & y = 2 because

x:    0 0 1 0   //2
y:    0 1 1 0   //6
x&y:  0 0 1 0   //2
2 << 1 = 4 because << 1 shifts all bits to the left:

x&y:        0 0 1 0   //2
(x&y) << 1: 0 1 0 0   //4
In summary, stash that result, 4, in t with
int t = (x & y) <<1;
Now apply the bitwise XOR y ^= x:

x:     0 0 1 0   //2
y:     0 1 1 0   //6
y^=x:  0 1 0 0   //4
So x=2, y=4. Finally, sum t+y by resetting x=t and going back to the beginning of the while loop:
x = t;
When t=0 (or, at the beginning of the loop, x=0), finish with
return y;
Just out of interest: on the Atmega328P processor, with the avr-g++ compiler, the following code implements adding one by subtracting -1:
volatile char x;

int main ()
{
    x = x + 1;
}
Generated code:
00000090 <main>:
volatile char x;
int main ()
{
x = x + 1;
90: 80 91 00 01 lds r24, 0x0100
94: 8f 5f subi r24, 0xFF ; 255
96: 80 93 00 01 sts 0x0100, r24
}
9a: 80 e0 ldi r24, 0x00 ; 0
9c: 90 e0 ldi r25, 0x00 ; 0
9e: 08 95 ret
Notice in particular that the add is done by the subi instruction (subtract constant from register) where 0xFF is effectively -1 in this case.
Also of interest is that this particular processor does not have an addi instruction, which implies that the designers thought that doing a subtract of the complement would be adequately handled by the compiler-writers.
Does this take advantage of two's complement or other implementation-dependent features?
It would probably be fair to say that compiler-writers would attempt to implement the wanted effect (adding one number to another) in the most efficient way possible for that particular architecture. If that requires subtracting the complement, so be it.

Fastest Integer Square Root in the least amount of instructions

I am in need of a fast integer square root that does not involve any explicit division. The target RISC architecture can do operations like add, mul, sub, shift in one cycle (well, the operation's result is written in the third cycle, really, but there's interleaving), so any integer algorithm that uses these ops and is fast would be very appreciated.
This is what I have right now and I'm thinking that a binary search should be faster, since the following loop executes 16 times every single time (regardless of the value). I haven't debugged it extensively yet (but soon), so perhaps it's possible to have an early exit there:
unsigned short int int_sqrt32(unsigned int x)
{
    unsigned short int res = 0;
    unsigned short int add = 0x8000;
    int i;
    for (i = 0; i < 16; i++)
    {
        unsigned short int temp = res | add;
        unsigned int g2 = (unsigned int)temp * temp; /* cast avoids signed-overflow UB from the int promotion of temp*temp */
        if (x >= g2)
        {
            res = temp;
        }
        add >>= 1;
    }
    return res;
}
Looks like the current performance cost of the above [in the context of the target RISC] is a loop of 5 instructions (bitset, mul, compare, store, shift). Probably no space in cache to unroll fully (but this will be the prime candidate for a partial unroll [e.g. A loop of 4 rather than 16], for sure). So, the cost is 16*5 = 80 instructions (plus loop overhead, if not unrolled). Which, if fully interleaved, would cost only 80 (+2 for last instruction) cycles.
Can I get some other sqrt implementation (using only add, mul, bitshift, store/cmp) under 82 cycles?
FAQ:
Why don't you rely on the compiler to produce a good fast code?
There is no working C → RISC compiler for the platform. I will be porting the current reference C code into hand-written RISC ASM.
Did you profile the code to see if sqrt is actually a bottleneck?
No, there is no need for that. The target RISC chip is about twenty MHz, so every single instruction counts. The core loop (calculating the energy transfer form factor between the shooter and receiver patch), where this sqrt is used, will be run ~1,000 times each rendering frame (assuming it will be fast enough, of course), up to 60,000 per second, and roughly 1,000,000 times for whole demo.
Have you tried to optimize the algorithm to perhaps remove the sqrt?
Yes, I did that already. In fact, I got rid of 2 sqrts already and lots of divisions (removed or replaced by shifting). I can see a huge performance boost (compared to the reference float version) even on my gigahertz notebook.
What is the application?
It's a real-time progressive-refinement radiosity renderer for the compo demo. The idea is to have one shooting cycle each frame, so it would visibly converge and look better with each rendered frame (e.g. up to 60 times per second, though the SW rasterizer probably won't be that fast [but at least it can run on the other chip in parallel with the RISC - so if it takes 2-3 frames to render the scene, the RISC will have worked through 2-3 frames of radiosity data, in parallel]).
Why don't you work directly in target ASM?
Because radiosity is a slightly involved algorithm and I need the instant edit & continue debugging capability of Visual Studio. What I've done over the weekend in VS (a couple hundred code changes to convert the floating-point math to integer-only) would take me 6 months on the target platform with only printf debugging.
Why can't you use a division?
Because it's 16 times slower on the target RISC than any of the following: mul, add, sub, shift, compare, load/store (which take just 1 cycle). So, it's used only when absolutely required (a couple of times already, unfortunately, when shifting could not be used).
Can you use look-up tables?
The engine needs other LUTs already and copying from main RAM to RISC's little cache is prohibitively expensive (and definitely not each and every frame). But, I could perhaps spare 128-256 Bytes if it gave me at least a 100-200% boost for sqrt.
What's the range of the values for sqrt?
I managed to reduce it to a mere unsigned 32-bit int (max 4,294,967,295).
EDIT1: I have ported two versions into the target RISC ASM, so I now have an exact count of ASM instructions during the execution (for the test scene).
Number of sqrt calls: 2,800.
Method1: The same method in this post (loop executing 16 times)
Method2: fred_sqrt (3c from http://www.azillionmonkeys.com/qed/sqroot.html)
Method1: 152.98 instructions per sqrt
Method2: 39.48 instructions per sqrt (with Final Rounding and 2 Newton iterations)
Method2: 21.01 instructions per sqrt (without Final Rounding and 2 Newton iterations)
Method2 uses LUT with 256 values, but since the target RISC can only use 32-bit access within its 4 KB cache, it actually takes 256*4 = 1 KB. But given its performance, I guess I will have to spare that 1 KB (out of 4).
Also, I have found out that there is NO visible visual artifact when I disable the Final rounding and two Newton iterations at the end (of Method2).
Meaning, the precision of that LUT is apparently good enough. Who knew...
The final cost is then 21.01 instructions per sqrt, which is almost an order of magnitude faster than the very first solution. There's also the possibility of reducing it further by sacrificing a few of the 32 available registers for the constants used in the conditions and jump labels. Each condition must fill 2 registers: one with the actual constant we are comparing against (only values less than 16 are allowed within the CMPQ instruction; larger ones must be put into a register), and one for the jump to the else label (the then jump is a fall-through), as a direct relative jump is only possible within ~10 instructions (impossible with such a large if-then-else chain, other than for the innermost 2 conditions).
EDIT2: ASM micro-optimizations
While benchmarking, I added counters for each of the 26 If.Then.Else codeblocks, to see if there aren't any blocks executed most often.
Turns out that Blocks 0/10/11 are executed in 99.57%/99.57%/92.57% of cases. This means I can justify sacrificing 3 registers (out of 32) for those comparison constants (in those 3 blocks), e.g. r26 = $1.0000, r25 = $100.0000, r24 = $10.0000.
This brought down the total instruction cost from 58,812 (avg:21.01) to 50,448 (avg:18.01)
So, now the average sqrt cost is just 18.01 ASM instructions (and there is no division!), though it will have to be inlined.
EDIT3: ASM micro-optimizations
Since we know that those 3 blocks (0/10/11) are executed most often, we can use local short jumps (16 bytes in both directions, which is usually just a couple of instructions (hence mostly useless), especially when the 6-byte MOVEI #jump_label, register is used during conditions) in those particular conditions. Of course, the Else branch will then incur an additional 2 ops (that it would not have otherwise), but that's worth it. Block 10 will have to be swapped (Then block with Else block), which will make it harder to read and debug, but I documented the reasons profusely.
Now the total instruction cost (in test scene) is just 42,500 with an average of 15.18 ASM instructions per sqrt.
EDIT4: ASM micro-optimizations
The Block 11 condition splits into the innermost Blocks 12 & 13. It just so happens that those blocks don't need an additional +1 math op, hence the local short jump can actually reach the Else block, if I merge the bitshift right with the necessary bitshift left #2 (as all offsets within cache must be 32-bit). This saves on filling the jump register, though I do need to sacrifice one more register, r23, for the comparison value of $40.000.
The final cost is then 34,724 instructions with an average of 12.40 ASM instructions per sqrt.
I am also realizing that I could reshuffle the order of conditions (which will make the other ranges a few ops more expensive, but that happens in only ~7% of cases), favoring this particular range ($10.000, $40.000) first, saving at least 1 or maybe even 2 conditions.
In which case, it should fall down to ~8.40 per sqrt.
I am realizing that the range depends directly on the intensity of the light and the distance to the wall. Meaning, I have direct control over the RGB value of the light and the distance from the wall. And while I would like the solution to be as generic as possible, given this realization (~12 ops per sqrt is mind-blowing), I will gladly sacrifice some flexibility in light colors if I can get sqrt this fast. Besides, there are maybe 10-15 different lights in the whole demo, so I can simply find color combinations that eventually result in the same sqrt range, but will get an insanely fast sqrt. For sure, that's worth it too. And I still have a generic fallback (spanning the entire int range) working just fine. Best of both worlds, really.
Have a look here.
For instance, at 3(a) there is this method, which is trivially adaptable to do a 64->32 bit square root, and also trivially transcribable to assembler:
/* by Jim Ulery */
static unsigned julery_isqrt(unsigned long val) {
    unsigned long temp, g = 0, b = 0x8000, bshft = 15;
    do {
        if (val >= (temp = (((g << 1) + b) << bshft--))) {
            g += b;
            val -= temp;
        }
    } while (b >>= 1);
    return g;
}
No divisions, no multiplications, bit shifts only. However, the time taken will be somewhat unpredictable, particularly if you use a branch (on ARM, conditional instructions would work).
In general, this page lists ways to calculate square roots. If you happen to want to produce a fast inverse square root (i.e. x**(-0.5) ), or are just interested in amazing ways to optimise code, take a look at this, this and this.
This is the same as yours, but with fewer ops. (I count 9 ops in the loop in your code, including test and increment i in the for loop and 3 assignments, but perhaps some of those disappear when coded in ASM? There are 6 ops in the code below, if you count g*g>n as two (no assignment)).
unsigned isqrt(unsigned n) { /* unsigned: with int, g*g could overflow a signed int */
    unsigned g = 0x8000;
    unsigned c = 0x8000;
    for (;;) {
        if (g*g > n) {
            g ^= c;
        }
        c >>= 1;
        if (c == 0) {
            return g;
        }
        g |= c;
    }
}
I got it here. You can maybe eliminate a comparison if you unroll the loop and jump to the appropriate spot based on the highest non-zero bit in the input.
Update
I've been thinking more about using Newton's method. In theory, the number of bits of accuracy should double for each iteration. That means Newton's method is much worse than any of the other suggestions when there are few correct bits in the answer; however, the situation changes where there are a lot of correct bits in the answer. Considering that most suggestions seem to take 4 cycles per bit, that means that one iteration of Newton's method (16 cycles for division + 1 for addition + 1 for shift = 18 cycles) is not worthwhile unless it gives over 4 bits.
So, my suggestion is to build up 8 bits of the answer by one of the suggested methods (8*4 = 32 cycles) then perform one iteration of Newton's method (18 cycles) to double the number of bits to 16. That's a total of 50 cycles (plus maybe an extra 4 cycles to get 9 bits before applying Newton's method, plus maybe 2 cycles to overcome the overshoot occasionally experienced by Newton's method). That's a maximum of 56 cycles which as far as I can see rivals any of the other suggestions.
Second Update
I coded the hybrid algorithm idea. Newton's method itself has no overhead; you just apply and double the number of significant digits. The issue is to have a predictable number of significant digits before you apply Newton's method. For that, we need to figure out where the most significant bit of the answer will appear. Using a modification of the fast DeBruijn sequence method given by another poster, we can perform that calculation in about 12 cycles in my estimation. On the other hand, knowing the position of the msb of the answer speeds up all methods (average, not worst case), so it seems worthwhile no matter what.
After calculating the msb of the answer, I run a number of rounds of the algorithm suggested above, then finish it off with one or two rounds of Newton's method. We decide when to run Newton's method by the following calculation: one bit of the answer takes about 8 cycles according to calculation in the comments; one round of Newton's method takes about 18 cycles (division, addition, and shift, and maybe assignment), so we should only run Newton's method if we're going to get at least three bits out of it. So for 6 bit answers, we can run the linear method 3 times to get 3 significant bits, then run Newton's method 1 time to get another 3. For 15 bit answers, we run the linear method 4 times to get 4 bits, then Newton's method twice to get another 4 then another 7. And so on.
Those calculations depend on knowing exactly how many cycles are required to get a bit by the linear method vs. how many are required by Newton's method. If the "economics" change, e.g., by discovering a faster way to build up bits in a linear fashion, the decision of when to invoke Newton's method will change.
I unrolled the loops and implemented the decisions as switches, which I hope will translate into fast table lookups in assembly. I'm not absolutely sure that I've got the minimum number of cycles in each case, so maybe further tuning is possible. E.g., for s=10, you can try to get 5 bits then apply Newton's method once instead of twice.
I've tested the algorithm thoroughly for correctness. Some additional minor speedups are possible if you're willing to accept slightly incorrect answers in some cases. At least two cycles are used after applying Newton's method to correct an off-by-one error that occurs with numbers of the form m^2-1. And a cycle is used testing for input 0 at the beginning, as the algorithm can't handle that input. If you know you're never going to take the square root of zero you can eliminate that test. Finally, if you only need 8 significant bits in the answer, you can drop one of the Newton's method calculations.
#include <inttypes.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

uint32_t isqrt1(uint32_t n);

int main() {
    uint32_t n;
    bool it_works = true;
    for (n = 0; n < UINT32_MAX; ++n) {
        uint32_t sr = isqrt1(n);
        if (sr*sr > n || (sr < 65535 && (sr+1)*(sr+1) <= n)) {
            it_works = false;
            printf("isqrt(%" PRIu32 ") = %" PRIu32 "\n", n, sr);
        }
    }
    if (it_works) {
        printf("it works\n");
    }
    return 0;
}

/* table modified to return shift s to move 1 to msb of square root of x */
/*
static const uint8_t debruijn32[32] = {
    0, 31, 9, 30, 3, 8, 13, 29, 2, 5, 7, 21, 12, 24, 28, 19,
    1, 10, 4, 14, 6, 22, 25, 20, 11, 15, 23, 26, 16, 27, 17, 18
};
*/
static const uint8_t debruijn32[32] = {
    15, 0, 11, 0, 14, 11, 9, 1, 14, 13, 12, 5, 9, 3, 1, 6,
    15, 10, 13, 8, 12, 4, 3, 5, 10, 8, 4, 2, 7, 2, 7, 6
};

/* based on CLZ emulation for non-zero arguments, from
 * http://stackoverflow.com/questions/23856596/counting-leading-zeros-in-a-32-bit-unsigned-integer-with-best-algorithm-in-c-pro
 */
uint8_t shift_for_msb_of_sqrt(uint32_t x) {
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    x++;
    return debruijn32[x * 0x076be629 >> 27];
}

uint32_t isqrt1(uint32_t n) {
    if (n == 0) return 0;
    uint32_t s = shift_for_msb_of_sqrt(n);
    uint32_t c = 1 << s;
    uint32_t g = c;
    switch (s) {
    case 9:
    case 5:
        if (g*g > n) {
            g ^= c;
        }
        c >>= 1;
        g |= c;
    case 15:
    case 14:
    case 13:
    case 8:
    case 7:
    case 4:
        if (g*g > n) {
            g ^= c;
        }
        c >>= 1;
        g |= c;
    case 12:
    case 11:
    case 10:
    case 6:
    case 3:
        if (g*g > n) {
            g ^= c;
        }
        c >>= 1;
        g |= c;
    case 2:
        if (g*g > n) {
            g ^= c;
        }
        c >>= 1;
        g |= c;
    case 1:
        if (g*g > n) {
            g ^= c;
        }
        c >>= 1;
        g |= c;
    case 0:
        if (g*g > n) {
            g ^= c;
        }
    }
    /* now apply one or two rounds of Newton's method */
    switch (s) {
    case 15:
    case 14:
    case 13:
    case 12:
    case 11:
    case 10:
        g = (g + n/g) >> 1;
    case 9:
    case 8:
    case 7:
    case 6:
        g = (g + n/g) >> 1;
    }
    /* correct potential error at m^2-1 for Newton's method */
    return (g == 65536 || g*g > n) ? g - 1 : g;
}
In light testing on my machine (which admittedly is nothing like yours), the new isqrt1 routine runs about 40% faster on average than the previous isqrt routine I gave.
If multiplication is the same speed (or faster than!) addition and shifting, or if you lack a fast shift-by-amount-contained-in-a-register instruction, then the following will not be helpful. Otherwise:
You're computing temp*temp afresh on each loop cycle, but temp = res | add, which is the same as res + add since their bits don't overlap. Also, (a) you have already computed res*res on a previous loop cycle, and (b) add has special structure: it's always just a single bit. So, using the fact that (a+b)^2 = a^2 + 2ab + b^2: you already have a^2; b^2 is just a single bit shifted twice as far to the left as the single-bit b; and 2ab is a shifted left by one position more than the location of the single bit in b. This lets you get rid of the multiplication:
unsigned short int int_sqrt32(unsigned int x)
{
    unsigned short int res = 0;
    unsigned int res2 = 0;           /* invariant: res2 == res*res */
    unsigned short int add = 0x8000; /* the candidate bit: 2^(15-i) */
    unsigned int add2 = 0x40000000;  /* add*add: 2^(30-2i) */
    int i;
    for (i = 0; i < 16; i++)
    {
        /* (res+add)^2 = res^2 + 2*res*add + add^2
                       = res2 + (res << (16 - i)) + add2 */
        unsigned int g2 = res2 + ((unsigned int)res << (16 - i)) + add2;
        if (x >= g2)
        {
            res |= add;
            res2 = g2;
        }
        add >>= 1;
        add2 >>= 2;
    }
    return res;
}
Also I would guess that it's a better idea to use the same type (unsigned int) for all variables, since according to the C standard, all arithmetic requires promotion (conversion) of narrower integer types to the widest type involved before the arithmetic operation is performed, followed by subsequent back-conversion if necessary. (This may of course be optimised away by a sufficiently intelligent compiler, but why take the risk?)
From the comment trail, it seems that the RISC processor only provides 32x32->32 bit multiplication and 16x16->32 bit multiplication. A 32x32->64 bit widening multiply, or a MULHI instruction returning the upper 32 bits of a 64-bit product, is not provided.
This would seem to exclude approaches based on Newton-Raphson iteration, which would likely be inefficient, as they typically require either MULHI instruction or widening multiply for the intermediate fixed-point arithmetic.
The C99 code below uses a different iterative approach that requires only 16x16->32 bit multiplies, but converges somewhat linearly, requiring up to six iterations. This approach requires CLZ functionality to quickly determine a starting guess for the iterations. Asker stated in the comments that the RISC processor used does not provide CLZ functionality. So emulation of CLZ is required, and since the emulation adds to both storage and instruction count, this may make this approach uncompetitive. I performed a brute-force search to determine the deBruijn lookup table with the smallest multiplier.
This iterative algorithm delivers raw results quite close to the desired results, i.e. (int)sqrt(x), but always somewhat on the high side due to the truncating nature of integer arithmetic. To arrive at the final result, the result is conditionally decremented until the square of the result is less than or equal to the original argument.
The use of the volatile qualifier in the code only serves to establish that all named variables can in fact be allocated as 16-bit data without impacting the functionality. I do not know whether this provides any advantage, but I noticed that the OP specifically used 16-bit variables in their code. For production code, volatile should be removed.
Note that for most processors, the correction steps at the end should not involve any branching. The product y*y can be subtracted from x with carry-out (or borrow-out), then y is corrected by a subtract with carry-in (or borrow-in). So each step should be a sequence MUL, SUBcc, SUBC.
Because implementation of the iteration by a loop incurs substantial overhead, I have elected to completely unroll the loop, but provide two early-out checks. Tallying the operations manually I count 46 operations for the fastest case, 54 operations for the average case, and 60 operations for the worst case.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

static const uint8_t clz_tab[32] = {
    31, 22, 30, 21, 18, 10, 29, 2, 20, 17, 15, 13, 9, 6, 28, 1,
    23, 19, 11, 3, 16, 14, 7, 24, 12, 4, 8, 25, 5, 26, 27, 0};

uint8_t clz (uint32_t a)
{
    a |= a >> 16;
    a |= a >> 8;
    a |= a >> 4;
    a |= a >> 2;
    a |= a >> 1;
    return clz_tab[0x07c4acdd * a >> 27];
}

/* 16 x 16 -> 32 bit unsigned multiplication; should be single instruction */
uint32_t umul16w (uint16_t a, uint16_t b)
{
    return (uint32_t)a * b;
}

/* Reza Hashemian, "Square Rooting Algorithms for Integer and Floating-Point
   Numbers", IEEE Transactions on Computers, Vol. 39, No. 8, Aug. 1990, p. 1025
*/
uint16_t isqrt (uint32_t x)
{
    volatile uint16_t y, z, lsb, mpo, mmo, lz, t;
    if (x == 0) return x;       // early out, code below can't handle zero
    lz = clz (x);               // #leading zeros, 32-lz = #bits of argument
    lsb = lz & 1;
    mpo = 17 - (lz >> 1);       // m+1, result has roughly half the #bits of argument
    mmo = mpo - 2;              // m-1
    t = 1 << mmo;               // power of two for two's complement of initial guess
    y = t - (x >> (mpo - lsb)); // initial guess for sqrt
    t = t + t;                  // power of two for two's complement of result
    z = y;
    y = (umul16w (y, y) >> mpo) + z;
    y = (umul16w (y, y) >> mpo) + z;
    if (x >= 0x40400) {
        y = (umul16w (y, y) >> mpo) + z;
        y = (umul16w (y, y) >> mpo) + z;
        if (x >= 0x1002000) {
            y = (umul16w (y, y) >> mpo) + z;
            y = (umul16w (y, y) >> mpo) + z;
        }
    }
    y = t - y;                  // raw result is 2's complement of iterated solution
    y = y - umul16w (lsb, (umul16w (y, 19195) >> 16)); // mult. by sqrt(0.5)
    if ((int32_t)(x - umul16w (y, y)) < 0) y--;        // iteration may overestimate
    if ((int32_t)(x - umul16w (y, y)) < 0) y--;        // result, adjust downward if
    if ((int32_t)(x - umul16w (y, y)) < 0) y--;        // necessary
    return y;                   // (int)sqrt(x)
}

int main (void)
{
    uint32_t x = 0;
    uint16_t res, ref;
    do {
        ref = (uint16_t)sqrt((double)x);
        res = isqrt (x);
        if (res != ref) {
            printf ("!!!! x=%08x res=%08x ref=%08x\n", x, res, ref);
            return EXIT_FAILURE;
        }
        x++;
    } while (x);
    return EXIT_SUCCESS;
}
Another possibility is to use the Newton iteration for the square root, despite the high cost of division. For small inputs only one iteration will be required. Although the asker did not state this, based on the execution time of 16 cycles for the DIV operation I strongly suspect that this is actually a 32/16->16 bit division which requires additional guard code to avoid overflow, defined as a quotient that does not fit into 16 bits. I have added appropriate safeguards to my code based on this assumption.
Since the Newton iteration doubles the number of good bits each time it is applied, we only need a low-precision initial guess which can easily be retrieved from a table based on the five leading bits of the argument. In order to grab these, we first normalize the argument into 2.30 fixed-point format with an additional implicit scale factor of 2^(32-(lz & ~1)), where lz is the number of leading zeros in the argument. As in the previous approach, the iteration doesn't always deliver an accurate result, so a correction must be applied should the preliminary result be too big. I count 49 cycles for the fast path, 70 cycles for the slow path (average 60 cycles).
static const uint16_t sqrt_tab[32] = {
    0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    0x85ff, 0x8cff, 0x94ff, 0x9aff, 0xa1ff, 0xa7ff, 0xadff, 0xb3ff,
    0xb9ff, 0xbeff, 0xc4ff, 0xc9ff, 0xceff, 0xd3ff, 0xd8ff, 0xdcff,
    0xe1ff, 0xe6ff, 0xeaff, 0xeeff, 0xf3ff, 0xf7ff, 0xfbff, 0xffff
};

/* 32/16->16 bit division. Note: Will overflow if x[31:16] >= y */
uint16_t udiv_32_16 (uint32_t x, uint16_t y)
{
    uint16_t r = x / y;
    return r;
}

/* table lookup for initial guess followed by division-based Newton iteration */
uint16_t isqrt (uint32_t x)
{
    volatile uint16_t q, lz, y, i, xh;
    if (x == 0) return x; // early out, code below can't handle zero
    // initial guess based on leading 5 bits of argument normalized to 2.30
    lz = clz (x);
    i = ((x << (lz & ~1)) >> 27);
    y = sqrt_tab[i] >> (lz >> 1);
    xh = (x >> 16); // needed for overflow check on division
    // first Newton iteration, guard against overflow in division
    q = 0xffff;
    if (xh < y) q = udiv_32_16 (x, y);
    y = (q + y) >> 1;
    if (lz < 10) {
        // second Newton iteration, guard against overflow in division
        q = 0xffff;
        if (xh < y) q = udiv_32_16 (x, y);
        y = (q + y) >> 1;
    }
    if (umul16w (y, y) > x) y--; // adjust quotient if too large
    return y; // (int)sqrt(x)
}
I don't know how to turn it into an efficient algorithm but when I investigated this in the '80s an interesting pattern emerged. When rounding square roots, there are two more integers with that square root than the preceding one (after zero).
So, one number (zero) has a square root of zero, two have a square root of 1 (1 and 2), 4 have a square root of two (3, 4, 5 and 6) and so on. Probably not a useful answer but interesting nonetheless.
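The pattern is easy to check with a few lines of C; this is just an illustrative snippet of mine (the counts come out as 1, 2, 4, 6, ..., since the integers whose square root rounds to k run from k^2-k+1 through k^2+k):
#include <math.h>
#include <stdio.h>

int main(void)
{
    int count[7] = {0};
    for (int n = 0; n <= 42; ++n) /* 42 = 6*6 + 6, the last integer rounding to 6 */
        count[(int)lround(sqrt((double)n))]++;
    for (int k = 0; k <= 6; ++k)
        printf("%d integers round to a square root of %d\n", count[k], k);
    return 0;
}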
Here is a less incremental version of the technique #j_random_hacker described. On at least one processor it was just a bit faster when I fiddled with this a couple of years ago. I have no idea why.
// assumes unsigned is 32 bits
unsigned isqrt1(unsigned x) {
    unsigned r = 0, r2 = 0;
    for (int p = 15; p >= 0; --p) {
        unsigned tr2 = r2 + (r << (p + 1)) + (1u << (p + p));
        if (tr2 <= x) {
            r2 = tr2;
            r |= (1u << p);
        }
    }
    return r;
}
/*
gcc 6.3 -O2
isqrt(unsigned int):
        mov     esi, 15
        xor     r9d, r9d
        xor     eax, eax
        mov     r8d, 1
.L3:
        lea     ecx, [rsi+1]
        mov     edx, eax
        mov     r10d, r8d
        sal     edx, cl
        lea     ecx, [rsi+rsi]
        sal     r10d, cl
        add     edx, r10d
        add     edx, r9d
        cmp     edx, edi
        ja      .L2
        mov     r11d, r8d
        mov     ecx, esi
        mov     r9d, edx
        sal     r11d, cl
        or      eax, r11d
.L2:
        sub     esi, 1
        cmp     esi, -1
        jne     .L3
        rep ret
*/
If you turn up gcc 9 x86 optimization, it completely unrolls the loop and folds constants. The result is still only about 100 instructions.

How to properly add/subtract a 128-bit number (as two uint64_t)?

I'm working in C and need to add and subtract a 64-bit number and a 128-bit number. The result will be held in the 128-bit number. I am using an integer array to store the upper and lower halves of the 128-bit number (i.e. uint64_t bigNum[2], where bigNum[0] is the least significant).
Can anybody help with an addition and subtraction function that can take in bigNum and add/subtract a uint64_t to it?
I have seen many incorrect examples on the web, so consider this:
bigNum[0] = 0;
bigNum[1] = 1;
subtract(&bigNum, 1);
At this point bigNum[0] should have all bits set, while bigNum[1] should have no bits set.
In many architectures it's very easy to add/subtract arbitrarily-long integers because there's a carry flag and an add/sub-with-carry instruction. For example on x86, rdx:rax += r8:r9 can be done like this:
add rax, r9 # add the low parts and store the carry
adc rdx, r8 # add the high parts with carry
In C there's no way to access this carry flag, so you must calculate the flag on your own. The easiest way is to check if the unsigned sum is less than either of the operands, like this. For example, to do a += b we'll do:
aL += bL;
aH += bH + (aL < bL);
This is exactly how multi-word add is done in architectures that don't have a flag register. For example in MIPS it's done like this
# alow = blow + clow
addu alow, blow, clow
# set tmp = 1 if alow < clow, else 0
sltu tmp, alow, clow
addu ahigh, bhigh, chigh
addu ahigh, ahigh, tmp
Here's some example assembly output
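Packaged into the asker's representation (bigNum[0] = least significant word), the addition becomes this small sketch (the function name is just for illustration):
#include <stdint.h>

/* bigNum += b, where bigNum[0] is the low 64 bits and bigNum[1] the high 64 bits. */
void add_u64(uint64_t bigNum[2], uint64_t b)
{
    bigNum[0] += b;
    bigNum[1] += (bigNum[0] < b); /* the low word wrapped around, so carry into the high word */
}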
This should work for the subtraction (note that it keeps the question's convention that bigNum[0] is the least significant half, so the borrow propagates from element 0 into element 1):
#include <stdint.h>

typedef uint64_t bigNum[2];

void subtract(bigNum *a, uint64_t b)
{
    const uint64_t borrow = b > (*a)[0];
    (*a)[0] -= b;
    (*a)[1] -= borrow;
}
Addition is very similar. The above could of course be expressed with an explicit test, too, but I find it cleaner to always do the borrowing. Optimization left as an exercise.
For a bigNum equal to { 1, 0 } (the value 1), subtracting two would make it equal { ~0UL, ~0UL }, which is the proper bit pattern to represent -1. Here, UL is assumed to promote an integer to 64 bits, which is compiler-dependent of course.
In grade 1 or 2, you learned how to break down the addition of numbers like 1 and 10 into parts, by splitting it into multiple separate additions of tens and units. When dealing with big numbers, the same principles can be applied to compute arithmetic operations on arbitrarily large numbers, by realizing your units are now units of 2^bits, your "tens" are 2^bits larger, and so on.
For the case where the value you are subtracting is less than or equal to bignum[0], you don't have to touch bignum[1].
If it isn't, you subtract it from bignum[0] anyhow. This operation will wrap around, but that is the behavior you need here. In addition, you'd then have to subtract 1 from bignum[1].
Most compilers support a __int128 type intrinsically.
Try it and you might be lucky.

Most optimized way to calculate modulus in C

I have to minimize the cost of calculating the modulus in C.
Say I have a number x, and n is the number that will divide x.
when n == 65536 (which happens to be 2^16):
mod = x % n (11 assembly instructions as produced by GCC)
or
mod = x & 0xffff which is equal to mod = x & 65535 (4 assembly instructions)
so, GCC doesn't optimize it to this extent.
In my case n is not a power of 2; it is the largest prime less than 2^16, which is 65521.
As I showed for n == 2^16, bit-wise operations can optimize the computation. What bit-wise operations can I perform when n == 65521 to calculate the modulus?
First, make sure you're looking at optimized code before drawing conclusions about what GCC is producing (and make sure this particular expression really needs to be optimized). Finally, don't count instructions to draw your conclusions; an 11 instruction sequence might well perform better than a shorter sequence that includes a div instruction.
Also, you can't conclude that because x mod 65536 can be calculated with a simple bit mask that any mod operation can be implemented that way. Consider how easy dividing by 10 in decimal is as opposed to dividing by an arbitrary number.
With all that out of the way, you may be able to use some of the 'magic number' techniques from Henry Warren's Hacker's Delight book:
Archive of http://www.hackersdelight.org/
Archive of http://www.hackersdelight.org/magic.htm
There was an added chapter on the website that contained "two methods of computing the remainder of division without computing the quotient!", which you may find of some use. The 1st technique applies only to a limited set of divisors, so it won't work for your particular instance. I haven't actually read the online chapter, so I don't know exactly how applicable the other technique might be for you.
x mod 65536 is only equivalent to x & 0xffff if x is unsigned - for signed x, it gives the wrong result for negative numbers. For unsigned x, gcc does indeed optimise x % 65536 to a bitwise and with 65535 (even on -O0, in my tests).
Because 65521 is not a power of 2, x mod 65521 can't be calculated so simply. gcc 4.3.2 on -O3 calculates it using x - (x / 65521) * 65521; the integer division by a constant is done using integer multiplication by a related constant.
If you don't have to fully reduce your integers modulo 65521, then you can use the fact that 65521 is close to 2**16. I.e. if x is an unsigned int you want to reduce, then you can do the following:
unsigned int low = x & 0xffff;
unsigned int hi = (x >> 16);
x = low + 15 * hi;
This uses that 2**16 % 65521 == 15. Note that this is not a full reduction. I.e. starting with a 32-bit input, you are only guaranteed that the result is at most 20 bits and that it is of course congruent to the input modulo 65521.
This trick can be used in applications where there are many operations that have to be reduced modulo the same constant, and where intermediary results do not have to be the smallest element in their residue class.
E.g. one application is the implementation of Adler-32, which uses the modulus 65521. This hash function does a lot of operations modulo 65521. To implement it efficiently, one would only do modular reductions after a carefully computed number of additions. A reduction as shown above is enough; only the final computation of the hash needs a full modulo operation.
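A sketch of how the folding step composes into a full reduction, for the curious (the function name is made up): two folds bring any 32-bit value down to at most 65760, after which a single conditional subtract finishes the job, since 65760 < 2 * 65521.
/* Full x % 65521 for 32-bit unsigned x, using 2^16 % 65521 == 15 twice. */
unsigned int mod65521(unsigned int x)
{
    x = (x & 0xffff) + 15 * (x >> 16); /* now at most 20 bits */
    x = (x & 0xffff) + 15 * (x >> 16); /* now at most 65535 + 15*15 = 65760 */
    if (x >= 65521)
        x -= 65521;
    return x;
}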
The bitwise operation only works well if the divisor is of the form 2^n. In the general case, there is no such bit-wise operation.
If the constant with which you want to take the modulo is known at compile time and you have a decent compiler (e.g. gcc), it's usually best to let the compiler work its magic. Just declare the modulus const.
If you don't know the constant at compile time, but you are going to take, say, a billion modulos with the same number, then use libdivide: http://libdivide.com/
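For example, with libdivide's C API (the function names below are as I recall them from libdivide.h; verify against your version), the divisor is preprocessed once into magic-number form and then reused for every division:
#include <stdint.h>
#include "libdivide.h"

/* Precompute the magic-number form of d once, then divide many times. */
void divide_all(const uint32_t *in, uint32_t *out, int count, uint32_t d)
{
    struct libdivide_u32_t div = libdivide_u32_gen(d);
    for (int i = 0; i < count; ++i)
        out[i] = libdivide_u32_do(in[i], &div); /* same result as in[i] / d */
}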
When dealing with powers of 2, an approach like the following can be considered (mostly C flavored):
#include <stdint.h>

typedef uint8_t uint8;   /* project-style type names used below */
typedef int32_t sint32;

#define THE_DIVISOR 0x8U /* The modulo value (POWER OF 2). */

uint8 CheckIfModulo(const sint32 TheDividend)
{
    uint8 RetVal = 1; /* TheDividend is NOT a multiple of THE_DIVISOR. */
    if (0 == (TheDividend & (THE_DIVISOR - 1)))
    {
        /* code if modulo is satisfied */
        RetVal = 0; /* TheDividend IS a multiple of THE_DIVISOR. */
    }
    else
    {
        /* code if modulo is NOT satisfied */
    }
    return RetVal;
}
If x is an increasing index, and the increment i is known to be less than n (e.g. when iterating over a circular array of length n), avoid the modulus completely.
A loop going
x += i; if (x >= n) x -= n;
is way faster than
x = (x + i) % n;
which you unfortunately find in many text books...
If you really need an expression (e.g. because you are using it in a for statement), you can use the ugly but efficient
x = x + (x+i < n ? i : i-n)
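For instance, stepping an index around a ring buffer (a toy, self-contained example; the values of n and i are arbitrary, with i < n):
#include <stdio.h>

int main(void)
{
    const unsigned int n = 10, i = 3; /* ring size and step, with i < n */
    unsigned int x = 0;
    for (int k = 0; k < 12; ++k) {
        printf("%u ", x);
        x += i;
        if (x >= n) x -= n; /* replaces x = (x + i) % n */
    }
    printf("\n"); /* prints: 0 3 6 9 2 5 8 1 4 7 0 3 */
    return 0;
}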
idiv — Integer Division
The idiv instruction divides the contents of the 64 bit integer EDX:EAX (constructed by viewing EDX as the most significant four bytes and EAX as the least significant four bytes) by the specified operand value. The quotient result of the division is stored into EAX, while the remainder is placed in EDX.
source: http://www.cs.virginia.edu/~evans/cs216/guides/x86.html
