How can I make such bitwise operations faster?

How can I make such bitwise operations faster? - c

The code is very simple.
int foo(int a, int b, int c, int d, int e, int f, int g)
{
int r = (1 << a) | (1 << b) | (1 << c) | (1 << d) | (1 << e ) | (1 << f) | (1 << g);
return r;
}
Assume all the arguments are no greater than 30.
It seems to be a very primitive function, but after compiling with the "-Ofast" flag, it still takes 28 instructions to compute r.
Is there an alternative code that can make these bitwise operations faster?

28 instructions is rather fast.
Consider what you're doing here. You have:
7 shift operations
6 OR operations
1 memory assignment operation
That already requires at least 14 instructions. Now there are additional instructions that are necessary such as storing the intermediate results and loading operands into registers.
If you want deeper analysis, post the assembly output.
Edit: Now to the possible optimization of your algorithm.
You might be able to gain a bit more speed by sacrificing some memory. Precompute the values for each possible bit being set in 32-bit value, e.g. something like that: int bit2value[32]={1,2,4,8,16,32,64,...}; In your function instead of performing the shift operations you can replace them with looking up into the precomputed map: int r = bit2value[a] | bit2value[b] | bit2value[c]...; This can theoretically save the need for some intermediate storing operations.

For each argument you need:
mov cl, argument
mov edx, 1
shl edx, cl
or eax, edx
I believe your function had 7 arguments? Up to g? 27 (4*7 - 1) is as low as it goes, since you can calculate the result of the shift directly in eax for the first argument. You can't not load each argument into a register, as the instruction doesn't operate on memory directly. You can't shift 1 without setting something to 1 every time. Obviously you can't do without the shift instructions or the or.

Using Visual Studio 2015 (32bit, optimized for space), the following code results in 21 rather than 24 instructions:
typedef struct container
{
int a, b, c, d, e, f, g;
} CONTAINER;
int foo2(CONTAINER *ct)
{
int r = (1 << ct->a) | (1 << ct->b) | (1 << ct->c) | (1 << ct->d) | (1 << ct->e) | (1 << ct->f) | (1 << ct->g);
return r;
}
The assembly code (21 rather than 20 instructions, sorry!):
1 push esi
2 mov edx, ecx
3 xor esi, esi
4 inc esi
5 mov ecx, DWORD PTR [edx+24]
6 mov eax, DWORD PTR [edx+20]
7 shl esi, cl
8 mov ecx, DWORD PTR [edx+8]
9 bts esi, eax
10 mov eax, DWORD PTR [edx+16]
11 bts esi, eax
12 mov eax, DWORD PTR [edx+12]
13 bts esi, eax
14 bts esi, ecx
15 mov ecx, DWORD PTR [edx+4]
16 bts esi, ecx
17 mov ecx, DWORD PTR [edx]
18 bts esi, ecx
19 mov eax, esi
20 pop esi
21 ret 0

Related

Converting a loop from x86 assembly to C language with AND, OR, SHR and SHL instructions and an array

I don't understand what is the problem because the result is right, but there is something wrong in it and i don't get it.
1.This is the x86 code I have to convert to C:
%include "io.inc"
SECTION .data
mask DD 0xffff, 0xff00ff, 0xf0f0f0f, 0x33333333, 0x55555555
SECTION .text
GLOBAL CMAIN
CMAIN:
GET_UDEC 4, EAX
MOV EBX, mask
ADD EBX, 16
MOV ECX, 1
.L:
MOV ESI, DWORD [EBX]
MOV EDI, ESI
NOT EDI
MOV EDX, EAX
AND EAX, ESI
AND EDX, EDI
SHL EAX, CL
SHR EDX, CL
OR EAX, EDX
SHL ECX, 1
SUB EBX, 4
CMP EBX, mask - 4
JNE .L
PRINT_UDEC 4, EAX
NEWLINE
XOR EAX, EAX
RET
2.My converted C code, when I input 0 it output me the right answer but there is something false in my code I don't understand what is:
#include "stdio.h"
int main(void)
{
int mask [5] = {0xffff, 0xff00ff, 0xf0f0f0f, 0x33333333, 0x55555555};
int eax;
int esi;
int ebx;
int edi;
int edx;
char cl = 0;
scanf("%d",&eax);
ebx = mask[4];
ebx = ebx + 16;
int ecx = 1;
L:
esi = ebx;
edi = esi;
edi = !edi;
edx = eax;
eax = eax && esi;
edx = edx && edi;
eax = eax << cl;
edx = edx >> cl ;
eax = eax || edx;
ecx = ecx << 1;
ebx = ebx - 4;
if(ebx == mask[1]) //mask - 4
{
goto L;
}
printf("%d",eax);
return 0;
}

Assembly AND is C bitwise &, not logical &&. (Same for OR). So you want eax &= esi.
(Using &= "compound assignment" makes the C even look like x86-style 2-operand asm so I'd recommend that.)
NOT is also bitwise flip-all-the-bits, not booleanize to 0/1. In C that's edi = ~edi;
Read the manual for x86 instructions like https://www.felixcloutier.com/x86/not, and for C operators like ~ and ! to check that they are / aren't what you want. https://en.cppreference.com/w/c/language/expressions https://en.cppreference.com/w/c/language/operator_arithmetic
You should be single-stepping your C and your asm in a debugger so you notice the first divergence, and know which instruction / C statement to fix. Don't just run the whole thing and look at one number for the result! Debuggers are massively useful for asm; don't waste your time without one.
CL is the low byte of ECX, not a separate C variable. You could use a union between uint32_t and uint8_t in C, or just use eax <<= ecx&31; since you don't have anything that writes CL separately from ECX. (x86 shifts mask their count; that C statement could compile to shl eax, cl. https://www.felixcloutier.com/x86/sal:sar:shl:shr). The low 5 bits of ECX are also the low 5 bits of CL.
SHR is a logical right shift, not arithmetic, so you need to be using unsigned not int at least for the >>. But really just use it for everything.
You're handling EBX completely wrong; it's a pointer.
MOV EBX, mask
ADD EBX, 16
This is like unsigned int *ebx = mask+4;
The size of a dword is 4 bytes, but C pointer math scales by the type size, so +1 is a whole element, not 1 byte. So 16 bytes is 4 dwords = 4 unsigned int elements.
MOV ESI, DWORD [EBX]
That's a load using EBX as an address. This should be easy to see if you single-step the asm in a debugger: It's not just copying the value.
CMP EBX, mask - 4
JNE .L
This is NASM syntax; it's comparing against the address of the dword before the start of the array. It's effectively the bottom of a fairly normal do{}while loop. (Why are loops always compiled into "do...while" style (tail jump)?)
do { // .L
...
} while(ebx != &mask[-1]); // cmp/jne
It's looping from the end of the mask array, stopping when the pointer goes past the end.
Equivalently, the compare could be ebx !-= mask - 1. I wrote it with unary & (address-of) cancelling out the [] to make it clear that it's the address of what would be one element before the array.
Note that it's jumping on not equal; you had your if()goto backwards, jumping only on equality. This is a loop.
unsigned mask[] should be static because it's in section .data, not on the stack. And not const, because again it's in .data not .rodata (Linux) or .rdata (Windows))
This one doesn't affect the logic, only that detail of decompiling.
There may be other bugs; I didn't try to check everything.

if(ebx != mask[1]) //mask - 4
{
goto L;
}
//JNE IMPLIES a !=

NASM x86 core dump writing on memory [duplicate]

This question already has answers here:
How can I pass parameters in assembler x86 function call?
(3 answers)
Can't pass parameter from C to Assembly code
(3 answers)
Why does IA-32 have a non-intuitive caller and callee register saving convention?
(4 answers)
What are the calling conventions for UNIX & Linux system calls (and user-space functions) on i386 and x86-64
(4 answers)
Closed 3 years ago.
I am learning assembly NASM and trying to do a LFSR code and call it on a C program to evaluate the execution time diference, but failed to figure out the problem with my code.
My LFSR C version works just fine and is defined as follows:
int lfsr(){
int cont = 0;
uint32_t start_state = SEED;
uint32_t lfsr = start_state;
uint32_t bit;
while (cont != 16777216) {
bit = ((lfsr >> 1) ^ (lfsr >> 5) ^ (lfsr >> 7) ^ (lfsr >> 13)) & 1;
lfsr = (lfsr >> 1) | (bit << 23);
lfsr_nums[cont] = lfsr;
cont++;
}
return cont;
}
My NASM x86 was based on the C version and it generates the numbers the same way the C code does. It should take a pointer to an array as parameter, and return (as reference) the same array with the numbers created and return (as value) the amount of the numbers. The LFSR logic works just fine, I checked the numbers created, but the code still give me a SegFault Core Dump error.
With gdb the message is that the error is in the do procedure. While I tried to debug my code I found out that the error was in the mov dword [esi + 4 * eax], ebx, if I comment it out the code doesn't output a segfault.
section .text
global lfsr_nasm
lfsr_nasm:
push dword ebx;
mov esi, edi ; vec
mov eax, 0 ;Cont = 0
mov ebx, 0x1313 ; lfst = start_state = seed
do:
mov ecx, ebx ; ecx = lfst
shr ecx, 1 ; lfsr >> 1
mov edx, ebx ; edx = lfst
shr edx, 5; lfst >> 5
xor ecx, edx ; lfst >> 1 ^ lfsr >> 5
mov edx, ebx ; edx = lfsr
shr edx, 7 ; edx = lfst >> 7
xor ecx, edx; lfst >> 1 ^ lfsr >> 5 ^ lfsr >> 7
mov edx, ebx ; edx = lfsr
shr edx, 13 ; edx = lfst >> 13
xor ecx, edx; lfst >> 1 ^ lfsr >> 5 ^ lfsr >> 7 ^ lfsr >> 13
and ecx, 1 ;ecx = bit
shr ebx, 1 ;lfsr >> 1
shl ecx, 23 ; bit << 23
or ebx, ecx ; lfsr = (lfsr >> 1) | (bit << 23);
mov dword [esi + 4 * eax], ebx
inc eax ; cont++
cmp eax, 16777216; cont != 16777216
jne do ;
pop dword ebx;
ret
The way I make the call in C, and declare my vector and NASM function:
extern int lfsr_nasm (uint32_t *vec);
uint32_t lfsr_nums[16777216];
int main(int argc, char *argv[]){
int cont;
cont = lfsr_nasm(lfsr_nums);
for(int i = 0; i < 16777216; i++){
printf("%d ", lfsr_nums[i]);
}
}
I believe that the vector is too big for the NASM or C and maybe the program is trying to access bad memory, but I couldn't find anything to confirm my believes neither a fix to the problem. Already tried with malloc and calloc.

Assembly Power(A,b) function

I'm trying to make an assembly power(a, b) function, based on this c code:
int power(int x, int y)
{
int z;
z = 1;
while (y > 0) {
if ((y % 2) == 1) {
y = y - 1;
z = z * x;
} else {
y = y / 2;
x = x * x;
}
}
return z;
}
Though, for some reason, it only gets some outputs right, and I can't figure out where the problem is. Here is my assembly code:
;* int power(int x, int y); *
;*****************************************************************************
%define y [ebp+12]
%define x [ebp+8]
%define z [ebp-4]
power:
push ebp
mov ebp,esp
sub esp,16
mov byte[ebp-4], 1
jmp L3;
L1:
mov eax,[ebp+12]
and eax,0x1
test eax,eax
je L2;
sub byte[ebp+12],1
mov eax,[ebp-4]
imul eax, [ebp+8]
mov [ebp-4],eax
jmp L3;
L2:
mov eax,[ebp+12]
mov edx,eax
shr edx,31
add eax,edx
sar eax,1
mov [ebp+12],eax
mov eax,[ebp+8]
imul eax,[ebp+8]
mov [ebp+8],eax
L3:
cmp byte[ebp+12],0
jg L1;
mov eax,[ebp-4]
leave
ret
When I run it, this is my output:
power(2, 0) returned -1217218303 - incorrect, should return 1
power(2, 1) returned 2 - correct
power(2, 2) returned -575029244 - incorrect, should return 4
power(2, 3) returned 8 - correct
Could someone please help me figure out where I've gone wrong, or to correct my code, any help would be appreciated.

You're storing a single byte, then reading back that byte + 3 garbage bytes.
%define z [ebp-4] ; what's the point of this define if you write it out explicitly instead of mov eax, z?
; should just be a comment if you don't want use it
mov byte[ebp-4], 1 ; leaves [ebp-3..ebp-1] unmodified
...
mov eax, [ebp-4] ; loads 4 bytes, including whatever garbage was there
Using a debugger would have shown you that you were getting garbage in EAX at this point. You could fix it by using movzx eax, byte [ebp-4], or by storing 4B in the first place.
Or, better, don't use any extra stack memory at all, since you can use ECX without even saving/restoring it in the usual 32-bit calling conventions. Keep your data in registers, that's what they're for.
Your "solution" of using [ebp+16] is writing to stack space owned by the caller. You're just getting lucky that it happens to have the upper 3 bytes zeroed, I guess, and that clobbering it doesn't lead to a crash.
and eax,1
test eax,eax
is redundant: test eax, 1 is sufficient. (Or if you want to destroy eax, and eax, 1 sets flags based on the result).
There's a huge amount of other inefficiency in your code, too.

Seeking maximum bitmap (aka bit array) performance with C/Intel assembly

Following on from my two previous questions, How to improve memory performance/data locality of 64-bit C/intel assembly program and Using C/Intel assembly, what is the fastest way to test if a 128-byte memory block contains all zeros?, I have further reduced the running time of the test program mentioned in these questions from 150 seconds down to 62 seconds, as I will describe below.
The 64-bit program has five 4 GB lookup tables (bytevecM, bytevecD, bytevecC, bytevecL, bytevecX). To reduce the (huge) number of cache misses, analysed in my last question, I added five 4 MB bitmaps, one per lookup table.
Here is the original inner loop:
psz = (size_t*)&bytevecM[(unsigned int)m7 & 0xffffff80];
if (psz[0] == 0 && psz[1] == 0
&& psz[2] == 0 && psz[3] == 0
&& psz[4] == 0 && psz[5] == 0
&& psz[6] == 0 && psz[7] == 0
&& psz[8] == 0 && psz[9] == 0
&& psz[10] == 0 && psz[11] == 0
&& psz[12] == 0 && psz[13] == 0
&& psz[14] == 0 && psz[15] == 0) continue;
// ... rinse and repeat for bytevecD, bytevecC, bytevecL, bytevecX
// expensive inner loop that scans 128 byte chunks from the 4 GB lookup tables...
The idea behind this simple "pre-check" was to avoid the expensive inner loop if all 128 bytes were zero. However, profiling showed that this pre-check was the primary bottleneck due to huge numbers of cache misses, as discussed last time. So I created a 4 MB bitmap to do the pre-check. (BTW, around 36% of 128-byte blocks are zero, not 98% as I mistakenly reported last time).
Here is the code I used to create a 4 MB bitmap from a 4 GB lookup table:
// Last chunk index (bitmap size=((LAST_CHUNK_IDX+1)>>3)=4,194,304 bytes)
#define LAST_CHUNK_IDX 33554431
void make_bitmap(
const unsigned char* bytevec, // in: byte vector
unsigned char* bitvec // out: bitmap
)
{
unsigned int uu;
unsigned int ucnt = 0;
unsigned int byte;
unsigned int bit;
const size_t* psz;
for (uu = 0; uu <= LAST_CHUNK_IDX; ++uu)
{
psz = (size_t*)&bytevec[uu << 7];
if (psz[0] == 0 && psz[1] == 0
&& psz[2] == 0 && psz[3] == 0
&& psz[4] == 0 && psz[5] == 0
&& psz[6] == 0 && psz[7] == 0
&& psz[8] == 0 && psz[9] == 0
&& psz[10] == 0 && psz[11] == 0
&& psz[12] == 0 && psz[13] == 0
&& psz[14] == 0 && psz[15] == 0) continue;
++ucnt;
byte = uu >> 3;
bit = (uu & 7);
bitvec[byte] |= (1 << bit);
}
printf("ucnt=%u hits from %u\n", ucnt, LAST_CHUNK_IDX+1);
}
Suggestions for a better way to do this are welcome.
With the bitmaps created via the function above, I then changed the "pre-check" to use the 4 MB bitmaps, instead of the 4 GB lookup tables, like so:
if ( (bitvecM[m7 >> 10] & (1 << ((m7 >> 7) & 7))) == 0 ) continue;
// ... rinse and repeat for bitvecD, bitvecC, bitvecL, bitvecX
// expensive inner loop that scans 128 byte chunks from the 4 GB lookup tables...
This was "successful" in that the running time was reduced from 150 seconds down to 62 seconds in the simple single-threaded case. However, VTune still reports some pretty big numbers, as shown below.
I profiled a more realistic test with eight simultaneous threads running across different ranges. The VTune output of the inner loop check for zero blocks is shown below:
> m7 = (unsigned int)( (m6 ^ q7) * H_PRIME );
> if ( (bitvecM[m7 >> 10] & (1 << ((m7 >> 7) & 7))) == 0 ) continue;
0x1400025c7 Block 15:
mov eax, r15d 1.058s
mov edx, ebx 0.109s
xor eax, ecx 0.777s
imul eax, eax, 0xf4243 1.088s
mov r9d, eax 3.369s
shr eax, 0x7 0.123s
and eax, 0x7 1.306s
movzx ecx, al 1.319s
mov eax, r9d 0.156s
shr rax, 0xa 0.248s
shl edx, cl 1.321s
test byte ptr [rax+r10*1], dl 1.832s
jz 0x140007670 2.037s
> d7 = (unsigned int)( (s6.m128i_i32[0] ^ q7) * H_PRIME );
> if ( (bitvecD[d7 >> 10] & (1 << ((d7 >> 7) & 7))) == 0 ) continue;
0x1400025f3 Block 16:
mov eax, dword ptr [rsp+0x30] 104.983s
mov edx, ebx 1.663s
xor eax, r15d 0.062s
imul eax, eax, 0xf4243 0.513s
mov edi, eax 1.172s
shr eax, 0x7 0.140s
and eax, 0x7 0.062s
movzx ecx, al 0.575s
mov eax, edi 0.689s
shr rax, 0xa 0.016s
shl edx, cl 0.108s
test byte ptr [rax+r11*1], dl 1.591s
jz 0x140007670 1.087s
> c7 = (unsigned int)( (s6.m128i_i32[1] ^ q7) * H_PRIME );
> if ( (bitvecC[c7 >> 10] & (1 << ((c7 >> 7) & 7))) == 0 ) continue;
0x14000261f Block 17:
mov eax, dword ptr [rsp+0x34] 75.863s
mov edx, 0x1 1.097s
xor eax, r15d 0.031s
imul eax, eax, 0xf4243 0.265s
mov ebx, eax 0.512s
shr eax, 0x7 0.016s
and eax, 0x7 0.233s
movzx ecx, al 0.233s
mov eax, ebx 0.279s
shl edx, cl 0.109s
mov rcx, qword ptr [rsp+0x58] 0.652s
shr rax, 0xa 0.171s
movzx ecx, byte ptr [rax+rcx*1] 0.126s
test cl, dl 77.918s
jz 0x140007667
> l7 = (unsigned int)( (s6.m128i_i32[2] ^ q7) * H_PRIME );
> if ( (bitvecL[l7 >> 10] & (1 << ((l7 >> 7) & 7))) == 0 ) continue;
0x140002655 Block 18:
mov eax, dword ptr [rsp+0x38] 0.980s
mov edx, 0x1 0.794s
xor eax, r15d 0.062s
imul eax, eax, 0xf4243 0.187s
mov r11d, eax 0.278s
shr eax, 0x7 0.062s
and eax, 0x7 0.218s
movzx ecx, al 0.218s
mov eax, r11d 0.186s
shl edx, cl 0.031s
mov rcx, qword ptr [rsp+0x50] 0.373s
shr rax, 0xa 0.233s
movzx ecx, byte ptr [rax+rcx*1] 0.047s
test cl, dl 55.060s
jz 0x14000765e
In addition to that, large amounts of time were (confusingly to me) attributed to this line:
> for (q6 = 1; q6 < 128; ++q6) {
0x1400075a1 Block 779:
inc edx 0.124s
mov dword ptr [rsp+0x10], edx
cmp edx, 0x80 0.031s
jl 0x140002574
mov ecx, dword ptr [rsp+0x4]
mov ebx, dword ptr [rsp+0x48]
...
0x140007575 Block 772:
mov edx, dword ptr [rsp+0x10] 0.699s
...
0x14000765e Block 789 (note: jz in l7 section above jumps here if zero):
mov edx, dword ptr [rsp+0x10] 1.169s
jmp 0x14000757e 0.791s
0x140007667 Block 790 (note: jz in c7 section above jumps here if zero):
mov edx, dword ptr [rsp+0x10] 2.261s
jmp 0x140007583 1.461s
0x140007670 Block 791 (note: jz in m7/d7 section above jumps here if zero):
mov edx, dword ptr [rsp+0x10] 108.355s
jmp 0x140007588 6.922s
I don't fully understand the big numbers in the VTune output above. If anyone can shed more light on these numbers, I'm all ears.
It seems to me that my five 4 MB bitmaps are bigger than my Core i7 3770 processor can fit into its 8 MB L3 cache, leading to many cache misses (though far fewer than before). If my CPU had a 30 MB L3 cache (as the upcoming Ivy Bridge-E has), I speculate that this program would run a lot faster because all five bitmaps would comfortably fit into the L3 cache. Is that right?
Further to that, since the code to test the bitmaps, namely:
m7 = (unsigned int)( (m6 ^ q7) * H_PRIME );
bitvecM[m7 >> 10] & (1 << ((m7 >> 7) & 7))) == 0
now appears five times in the inner loop, any suggestions for speeding up this code are very welcome.

Within the core bits of the loop, using the _bittest() MSVC intrinsic for the bitmap check combines the shl/test combo the compiler creates into a single instruction with (on SandyBridge) no latency/throughput penalty, i.e. it should shave a few cycles off.
Beyond that, can only think of calculating the bitmaps by map-reducing bit sets via recursive POR, as a variation on your zero testing that might be worth benchmarking:
for (int i = 0; i < MAX_IDX; i++) {
__m128i v[8];
__m128i* ptr = ...[i << ...];
v[0] = _mm_load_si128(ptr[0]);
v[1] = _mm_load_si128(ptr[1]);
v[2] = _mm_load_si128(ptr[2]);
v[3] = _mm_load_si128(ptr[3]);
v[4] = _mm_load_si128(ptr[4]);
v[5] = _mm_load_si128(ptr[5]);
v[6] = _mm_load_si128(ptr[6]);
v[7] = _mm_load_si128(ptr[7]);
v[0] = _mm_or_si128(v[0], v[1]);
v[2] = _mm_or_si128(v[2], v[3]);
v[4] = _mm_or_si128(v[4], v[5]);
v[6] = _mm_or_si128(v[6], v[7]);
v[0] = _mm_or_si128(v[0], v[2]);
v[2] = _mm_or_si128(v[4], v[6]);
v[0] = _mm_or_si128(v[0], v[2]);
if (_mm_movemask_epi8(_mm_cmpeq_epi8(_mm_setzero_si128(), v[0]))) {
// the contents aren't all zero
}
...
}
At this point, the pure load / accumulate-OR / extract mask might be better than a tight loop of SSE4.2 PTEST because there's no flags dependency and no branches.

For the 128-byte buffer, do the comparisons with larger integers.
unsigned char cbuf[128];
unsigned long long *lbuf = cbuf;
int i;
for (i=0; i < 128/sizeof(long long); i++) {
if (lbuf[i]) return false; // something not a zero
}
return true; // all zero

Fastest way to calculate a 128-bit integer modulo a 64-bit integer

I have a 128-bit unsigned integer A and a 64-bit unsigned integer B. What's the fastest way to calculate A % B - that is the (64-bit) remainder from dividing A by B?
I'm looking to do this in either C or assembly language, but I need to target the 32-bit x86 platform. This unfortunately means that I cannot take advantage of compiler support for 128-bit integers, nor of the x64 architecture's ability to perform the required operation in a single instruction.
Edit:
Thank you for the answers so far. However, it appears to me that the suggested algorithms would be quite slow - wouldn't the fastest way to perform a 128-bit by 64-bit division be to leverage the processor's native support for 64-bit by 32-bit division? Does anyone know if there is a way to perform the larger division in terms of a few smaller divisions?
Re: How often does B change?
Primarily I'm interested in a general solution - what calculation would you perform if A and B are likely to be different every time?
However, a second possible situation is that B does not vary as often as A - there may be as many as 200 As to divide by each B. How would your answer differ in this case?

You can use the division version of Russian Peasant Multiplication.
To find the remainder, execute (in pseudo-code):
X = B;
while (X <= A/2)
{
X <<= 1;
}
while (A >= B)
{
if (A >= X)
A -= X;
X >>= 1;
}
The modulus is left in A.
You'll need to implement the shifts, comparisons and subtractions to operate on values made up of a pair of 64 bit numbers, but that's fairly trivial (likely you should implement the left-shift-by-1 as X + X).
This will loop at most 255 times (with a 128 bit A). Of course you need to do a pre-check for a zero divisor.

Perhaps you're looking for a finished program, but the basic algorithms for multi-precision arithmetic can be found in Knuth's Art of Computer Programming, Volume 2. You can find the division algorithm described online here. The algorithms deal with arbitrary multi-precision arithmetic, and so are more general than you need, but you should be able to simplify them for 128 bit arithmetic done on 64- or 32-bit digits. Be prepared for a reasonable amount of work (a) understanding the algorithm, and (b) converting it to C or assembler.
You might also want to check out Hacker's Delight, which is full of very clever assembler and other low-level hackery, including some multi-precision arithmetic.

If your B is small enough for the uint64_t + operation to not wrap:
Given A = AH*2^64 + AL:
A % B == (((AH % B) * (2^64 % B)) + (AL % B)) % B
== (((AH % B) * ((2^64 - B) % B)) + (AL % B)) % B
If your compiler supports 64-bit integers, then this is probably the easiest way to go.
MSVC's implementation of a 64-bit modulo on 32-bit x86 is some hairy loop filled assembly (VC\crt\src\intel\llrem.asm for the brave), so I'd personally go with that.

This is almost untested partly speed modificated Mod128by64 'Russian peasant' algorithm function. Unfortunately I'm a Delphi user so this function works under Delphi. :) But the assembler is almost the same so...
function Mod128by64(Dividend: PUInt128; Divisor: PUInt64): UInt64;
//In : eax = #Dividend
// : edx = #Divisor
//Out: eax:edx as Remainder
asm
//Registers inside rutine
//Divisor = edx:ebp
//Dividend = bh:ebx:edx //We need 64 bits + 1 bit in bh
//Result = esi:edi
//ecx = Loop counter and Dividend index
push ebx //Store registers to stack
push esi
push edi
push ebp
mov ebp, [edx] //Divisor = edx:ebp
mov edx, [edx + 4]
mov ecx, ebp //Div by 0 test
or ecx, edx
jz #DivByZero
xor edi, edi //Clear result
xor esi, esi
//Start of 64 bit division Loop
mov ecx, 15 //Load byte loop shift counter and Dividend index
#SkipShift8Bits: //Small Dividend numbers shift optimisation
cmp [eax + ecx], ch //Zero test
jnz #EndSkipShiftDividend
loop #SkipShift8Bits //Skip 8 bit loop
#EndSkipShiftDividend:
test edx, $FF000000 //Huge Divisor Numbers Shift Optimisation
jz #Shift8Bits //This Divisor is > $00FFFFFF:FFFFFFFF
mov ecx, 8 //Load byte shift counter
mov esi, [eax + 12] //Do fast 56 bit (7 bytes) shift...
shr esi, cl //esi = $00XXXXXX
mov edi, [eax + 9] //Load for one byte right shifted 32 bit value
#Shift8Bits:
mov bl, [eax + ecx] //Load 8 bits of Dividend
//Here we can unrole partial loop 8 bit division to increase execution speed...
mov ch, 8 //Set partial byte counter value
#Do65BitsShift:
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
setc bh //Save 65th bit
sub edi, ebp //Compare dividend and divisor
sbb esi, edx //Subtract the divisor
sbb bh, 0 //Use 65th bit in bh
jnc #NoCarryAtCmp //Test...
add edi, ebp //Return privius dividend state
adc esi, edx
#NoCarryAtCmp:
dec ch //Decrement counter
jnz #Do65BitsShift
//End of 8 bit (byte) partial division loop
dec cl //Decrement byte loop shift counter
jns #Shift8Bits //Last jump at cl = 0!!!
//End of 64 bit division loop
mov eax, edi //Load result to eax:edx
mov edx, esi
#RestoreRegisters:
pop ebp //Restore Registers
pop edi
pop esi
pop ebx
ret
#DivByZero:
xor eax, eax //Here you can raise Div by 0 exception, now function only return 0.
xor edx, edx
jmp #RestoreRegisters
end;
At least one more speed optimisation is possible! After 'Huge Divisor Numbers Shift Optimisation' we can test divisors high bit, if it is 0 we do not need to use extra bh register as 65th bit to store in it. So unrolled part of loop can look like:
shl bl,1 //Shift dividend left for one bit
rcl edi,1
rcl esi,1
sub edi, ebp //Compare dividend and divisor
sbb esi, edx //Subtract the divisor
jnc #NoCarryAtCmpX
add edi, ebp //Return privius dividend state
adc esi, edx
#NoCarryAtCmpX:

I know the question specified 32-bit code, but the answer for 64-bit may be useful or interesting to others.
And yes, 64b/32b => 32b division does make a useful building-block for 128b % 64b => 64b. libgcc's __umoddi3 (source linked below) gives an idea of how to do that sort of thing, but it only implements 2N % 2N => 2N on top of a 2N / N => N division, not 4N % 2N => 2N.
Wider multi-precision libraries are available, e.g. https://gmplib.org/manual/Integer-Division.html#Integer-Division.
GNU C on 64-bit machines does provide an __int128 type, and libgcc functions to multiply and divide as efficiently as possible on the target architecture.
x86-64's div r/m64 instruction does 128b/64b => 64b division (also producing remainder as a second output), but it faults if the quotient overflows. So you can't directly use it if A/B > 2^64-1, but you can get gcc to use it for you (or even inline the same code that libgcc uses).
This compiles (Godbolt compiler explorer) to one or two div instructions (which happen inside a libgcc function call). If there was a faster way, libgcc would probably use that instead.
#include <stdint.h>
uint64_t AmodB(unsigned __int128 A, uint64_t B) {
return A % B;
}
The __umodti3 function it calls calculates a full 128b/128b modulo, but the implementation of that function does check for the special case where the divisor's high half is 0, as you can see in the libgcc source. (libgcc builds the si/di/ti version of the function from that code, as appropriate for the target architecture. udiv_qrnnd is an inline asm macro that does unsigned 2N/N => N division for the target architecture.
For x86-64 (and other architectures with a hardware divide instruction), the fast-path (when high_half(A) < B; guaranteeing div won't fault) is just two not-taken branches, some fluff for out-of-order CPUs to chew through, and a single div r64 instruction, which takes about 50-100 cycles1 on modern x86 CPUs, according to Agner Fog's insn tables. Some other work can be happening in parallel with div, but the integer divide unit is not very pipelined and div decodes to a lot of uops (unlike FP division).
The fallback path still only uses two 64-bit div instructions for the case where B is only 64-bit, but A/B doesn't fit in 64 bits so A/B directly would fault.
Note that libgcc's __umodti3 just inlines __udivmoddi4 into a wrapper that only returns the remainder.
Footnote 1: 32-bit div is over 2x faster on Intel CPUs. On AMD CPUs, performance only depends on the size of the actual input values, even if they're small values in a 64-bit register. If small values are common, it might be worth benchmarking a branch to a simple 32-bit division version before doing 64-bit or 128-bit division.
For repeated modulo by the same B
It might be worth considering calculating a fixed-point multiplicative inverse for B, if one exists. For example, with compile-time constants, gcc does the optimization for types narrower than 128b.
uint64_t modulo_by_constant64(uint64_t A) { return A % 0x12345678ABULL; }
movabs rdx, -2233785418547900415
mov rax, rdi
mul rdx
mov rax, rdx # wasted instruction, could have kept using RDX.
movabs rdx, 78187493547
shr rax, 36 # division result
imul rax, rdx # multiply and subtract to get the modulo
sub rdi, rax
mov rax, rdi
ret
x86's mul r64 instruction does 64b*64b => 128b (rdx:rax) multiplication, and can be used as a building block to construct a 128b * 128b => 256b multiply to implement the same algorithm. Since we only need the high half of the full 256b result, that saves a few multiplies.
Modern Intel CPUs have very high performance mul: 3c latency, one per clock throughput. However, the exact combination of shifts and adds required varies with the constant, so the general case of calculating a multiplicative inverse at run-time isn't quite as efficient each time its used as a JIT-compiled or statically-compiled version (even on top of the pre-computation overhead).
IDK where the break-even point would be. For JIT-compiling, it will be higher than ~200 reuses, unless you cache generated code for commonly-used B values. For the "normal" way, it might possibly be in the range of 200 reuses, but IDK how expensive it would be to find a modular multiplicative inverse for 128-bit / 64-bit division.
libdivide can do this for you, but only for 32 and 64-bit types. Still, it's probably a good starting point.

I have made both version of Mod128by64 'Russian peasant' division function: classic and speed optimised. Speed optimised can do on my 3Ghz PC more than 1000.000 random calculations per second and is more than three times faster than classic function.
If we compare the execution time of calculating 128 by 64 and calculating 64 by 64 bit modulo than this function is only about 50% slower.
Classic Russian peasant:
function Mod128by64Clasic(Dividend: PUInt128; Divisor: PUInt64): UInt64;
//In : eax = #Dividend
// : edx = #Divisor
//Out: eax:edx as Remainder
asm
//Registers inside rutine
//edx:ebp = Divisor
//ecx = Loop counter
//Result = esi:edi
push ebx //Store registers to stack
push esi
push edi
push ebp
mov ebp, [edx] //Load divisor to edx:ebp
mov edx, [edx + 4]
mov ecx, ebp //Div by 0 test
or ecx, edx
jz #DivByZero
push [eax] //Store Divisor to the stack
push [eax + 4]
push [eax + 8]
push [eax + 12]
xor edi, edi //Clear result
xor esi, esi
mov ecx, 128 //Load shift counter
#Do128BitsShift:
shl [esp + 12], 1 //Shift dividend from stack left for one bit
rcl [esp + 8], 1
rcl [esp + 4], 1
rcl [esp], 1
rcl edi, 1
rcl esi, 1
setc bh //Save 65th bit
sub edi, ebp //Compare dividend and divisor
sbb esi, edx //Subtract the divisor
sbb bh, 0 //Use 65th bit in bh
jnc #NoCarryAtCmp //Test...
add edi, ebp //Return privius dividend state
adc esi, edx
#NoCarryAtCmp:
loop #Do128BitsShift
//End of 128 bit division loop
mov eax, edi //Load result to eax:edx
mov edx, esi
#RestoreRegisters:
lea esp, esp + 16 //Restore Divisors space on stack
pop ebp //Restore Registers
pop edi
pop esi
pop ebx
ret
#DivByZero:
xor eax, eax //Here you can raise Div by 0 exception, now function only return 0.
xor edx, edx
jmp #RestoreRegisters
end;
Speed optimised Russian peasant:
function Mod128by64Oprimized(Dividend: PUInt128; Divisor: PUInt64): UInt64;
//In : eax = #Dividend
// : edx = #Divisor
//Out: eax:edx as Remainder
asm
//Registers inside rutine
//Divisor = edx:ebp
//Dividend = ebx:edx //We need 64 bits
//Result = esi:edi
//ecx = Loop counter and Dividend index
push ebx //Store registers to stack
push esi
push edi
push ebp
mov ebp, [edx] //Divisor = edx:ebp
mov edx, [edx + 4]
mov ecx, ebp //Div by 0 test
or ecx, edx
jz #DivByZero
xor edi, edi //Clear result
xor esi, esi
//Start of 64 bit division Loop
mov ecx, 15 //Load byte loop shift counter and Dividend index
#SkipShift8Bits: //Small Dividend numbers shift optimisation
cmp [eax + ecx], ch //Zero test
jnz #EndSkipShiftDividend
loop #SkipShift8Bits //Skip Compute 8 Bits unroled loop ?
#EndSkipShiftDividend:
test edx, $FF000000 //Huge Divisor Numbers Shift Optimisation
jz #Shift8Bits //This Divisor is > $00FFFFFF:FFFFFFFF
mov ecx, 8 //Load byte shift counter
mov esi, [eax + 12] //Do fast 56 bit (7 bytes) shift...
shr esi, cl //esi = $00XXXXXX
mov edi, [eax + 9] //Load for one byte right shifted 32 bit value
#Shift8Bits:
mov bl, [eax + ecx] //Load 8 bit part of Dividend
//Compute 8 Bits unroled loop
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
jc #DividentAbove0 //dividend hi bit set?
cmp esi, edx //dividend hi part larger?
jb #DividentBelow0
ja #DividentAbove0
cmp edi, ebp //dividend lo part larger?
jb #DividentBelow0
#DividentAbove0:
sub edi, ebp //Return privius dividend state
sbb esi, edx
#DividentBelow0:
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
jc #DividentAbove1 //dividend hi bit set?
cmp esi, edx //dividend hi part larger?
jb #DividentBelow1
ja #DividentAbove1
cmp edi, ebp //dividend lo part larger?
jb #DividentBelow1
#DividentAbove1:
sub edi, ebp //Return privius dividend state
sbb esi, edx
#DividentBelow1:
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
jc #DividentAbove2 //dividend hi bit set?
cmp esi, edx //dividend hi part larger?
jb #DividentBelow2
ja #DividentAbove2
cmp edi, ebp //dividend lo part larger?
jb #DividentBelow2
#DividentAbove2:
sub edi, ebp //Return privius dividend state
sbb esi, edx
#DividentBelow2:
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
jc #DividentAbove3 //dividend hi bit set?
cmp esi, edx //dividend hi part larger?
jb #DividentBelow3
ja #DividentAbove3
cmp edi, ebp //dividend lo part larger?
jb #DividentBelow3
#DividentAbove3:
sub edi, ebp //Return privius dividend state
sbb esi, edx
#DividentBelow3:
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
jc #DividentAbove4 //dividend hi bit set?
cmp esi, edx //dividend hi part larger?
jb #DividentBelow4
ja #DividentAbove4
cmp edi, ebp //dividend lo part larger?
jb #DividentBelow4
#DividentAbove4:
sub edi, ebp //Return privius dividend state
sbb esi, edx
#DividentBelow4:
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
jc #DividentAbove5 //dividend hi bit set?
cmp esi, edx //dividend hi part larger?
jb #DividentBelow5
ja #DividentAbove5
cmp edi, ebp //dividend lo part larger?
jb #DividentBelow5
#DividentAbove5:
sub edi, ebp //Return privius dividend state
sbb esi, edx
#DividentBelow5:
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
jc #DividentAbove6 //dividend hi bit set?
cmp esi, edx //dividend hi part larger?
jb #DividentBelow6
ja #DividentAbove6
cmp edi, ebp //dividend lo part larger?
jb #DividentBelow6
#DividentAbove6:
sub edi, ebp //Return privius dividend state
sbb esi, edx
#DividentBelow6:
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
jc #DividentAbove7 //dividend hi bit set?
cmp esi, edx //dividend hi part larger?
jb #DividentBelow7
ja #DividentAbove7
cmp edi, ebp //dividend lo part larger?
jb #DividentBelow7
#DividentAbove7:
sub edi, ebp //Return privius dividend state
sbb esi, edx
#DividentBelow7:
//End of Compute 8 Bits (unroled loop)
dec cl //Decrement byte loop shift counter
jns #Shift8Bits //Last jump at cl = 0!!!
//End of division loop
mov eax, edi //Load result to eax:edx
mov edx, esi
#RestoreRegisters:
pop ebp //Restore Registers
pop edi
pop esi
pop ebx
ret
#DivByZero:
xor eax, eax //Here you can raise Div by 0 exception, now function only return 0.
xor edx, edx
jmp #RestoreRegisters
end;

I'd like to share a few thoughts.
It's not as simple as MSN proposes I'm afraid.
In the expression:
(((AH % B) * ((2^64 - B) % B)) + (AL % B)) % B
both multiplication and addition may overflow. I think one could take it into account and still use the general concept with some modifications, but something tells me it's going to get really scary.
I was curious how 64 bit modulo operation was implemented in MSVC and I tried to find something out. I don't really know assembly and all I had available was Express edition, without the source of VC\crt\src\intel\llrem.asm, but I think I managed to get some idea what's going on, after a bit of playing with the debugger and disassembly output. I tried to figure out how the remainder is calculated in case of positive integers and the divisor >=2^32. There is some code that deals with negative numbers of course, but I didn't dig into that.
Here is how I see it:
If divisor >= 2^32 both the dividend and the divisor are shifted right as much as necessary to fit the divisor into 32 bits. In other words: if it takes n digits to write the divisor down in binary and n > 32, n-32 least significant digits of both the divisor and the dividend are discarded. After that, the division is performed using hardware support for dividing 64 bit integers by 32 bit ones. The result might be incorrect, but I think it can be proved, that the result may be off by at most 1. After the division, the divisor (original one) is multiplied by the result and the product subtracted from the dividend. Then it is corrected by adding or subtracting the divisor if necessary (if the result of the division was off by one).
It's easy to divide 128 bit integer by 32 bit one leveraging hardware support for 64-bit by 32-bit division. In case the divisor < 2^32, one can calculate the remainder performing just 4 divisions as follows:
Let's assume the dividend is stored in:
DWORD dividend[4] = ...
the remainder will go into:
DWORD remainder;
1) Divide dividend[3] by divisor. Store the remainder in remainder.
2) Divide QWORD (remainder:dividend[2]) by divisor. Store the remainder in remainder.
3) Divide QWORD (remainder:dividend[1]) by divisor. Store the remainder in remainder.
4) Divide QWORD (remainder:dividend[0]) by divisor. Store the remainder in remainder.
After those 4 steps the variable remainder will hold what You are looking for.
(Please don't kill me if I got the endianess wrong. I'm not even a programmer)
In case the divisor is grater than 2^32-1 I don't have good news. I don't have a complete proof that the result after the shift is off by no more than 1, in the procedure I described earlier, which I believe MSVC is using. I think however that it has something to do with the fact, that the part that is discarded is at least 2^31 times less than the divisor, the dividend is less than 2^64 and the divisor is greater than 2^32-1, so the result is less than 2^32.
If the dividend has 128 bits the trick with discarding bits won't work. So in general case the best solution is probably the one proposed by GJ or caf. (Well, it would be probably the best even if discarding bits worked. Division, multiplication subtraction and correction on 128 bit integer might be slower.)
I was also thinking about using the floating point hardware. x87 floating point unit uses 80 bit precision format with fraction 64 bits long. I think one can get the exact result of 64 bit by 64 bit division. (Not the remainder directly, but also the remainder using multiplication and subtraction like in the "MSVC procedure"). IF the dividend >=2^64 and < 2^128 storing it in the floating point format seems similar to discarding least significant bits in "MSVC procedure". Maybe someone can prove the error in that case is bound and find it useful. I have no idea if it has a chance to be faster than GJ's solution, but maybe it's worth it to try.

The solution depends on what exactly you are trying to solve.
E.g. if you are doing arithmetic in a ring modulo a 64-bit integer then using
Montgomerys reduction is very efficient. Of course this assumes that you the same modulus many times and that it pays off to convert the elements of the ring into a special representation.
To give just a very rough estimate on the speed of this Montgomerys reduction: I have an old benchmark that performs a modular exponentiation with 64-bit modulus and exponent in 1600 ns on a 2.4Ghz Core 2. This exponentiation does about 96 modular multiplications (and modular reductions) and hence needs about 40 cycles per modular multiplication.

The accepted answer by #caf was real nice and highly rated, yet it contain a bug not seen for years.
To help test that and other solutions, I am posting a test harness and making it community wiki.
unsigned cafMod(unsigned A, unsigned B) {
assert(B);
unsigned X = B;
// while (X < A / 2) { Original code used <
while (X <= A / 2) {
X <<= 1;
}
while (A >= B) {
if (A >= X) A -= X;
X >>= 1;
}
return A;
}
void cafMod_test(unsigned num, unsigned den) {
if (den == 0) return;
unsigned y0 = num % den;
unsigned y1 = mod(num, den);
if (y0 != y1) {
printf("FAIL num:%x den:%x %x %x\n", num, den, y0, y1);
fflush(stdout);
exit(-1);
}
}
unsigned rand_unsigned() {
unsigned x = (unsigned) rand();
return x * 2 ^ (unsigned) rand();
}
void cafMod_tests(void) {
const unsigned i[] = { 0, 1, 2, 3, 0x7FFFFFFF, 0x80000000,
UINT_MAX - 3, UINT_MAX - 2, UINT_MAX - 1, UINT_MAX };
for (unsigned den = 0; den < sizeof i / sizeof i[0]; den++) {
if (i[den] == 0) continue;
for (unsigned num = 0; num < sizeof i / sizeof i[0]; num++) {
cafMod_test(i[num], i[den]);
}
}
cafMod_test(0x8711dd11, 0x4388ee88);
cafMod_test(0xf64835a1, 0xf64835a);
time_t t;
time(&t);
srand((unsigned) t);
printf("%u\n", (unsigned) t);fflush(stdout);
for (long long n = 10000LL * 1000LL * 1000LL; n > 0; n--) {
cafMod_test(rand_unsigned(), rand_unsigned());
}
puts("Done");
}
int main(void) {
cafMod_tests();
return 0;
}

As a general rule, division is slow and multiplication is faster, and bit shifting is faster yet. From what I have seen of the answers so far, most of the answers have been using a brute force approach using bit-shifts. There exists another way. Whether it is faster remains to be seen (AKA profile it).
Instead of dividing, multiply by the reciprocal. Thus, to discover A % B, first calculate the reciprocal of B ... 1/B. This can be done with a few loops using the Newton-Raphson method of convergence. To do this well will depend upon a good set of initial values in a table.
For more details on the Newton-Raphson method of converging on the reciprocal, please refer to http://en.wikipedia.org/wiki/Division_(digital)
Once you have the reciprocal, the quotient Q = A * 1/B.
The remainder R = A - Q*B.
To determine if this would be faster than the brute force (as there will be many more multiplies since we will be using 32-bit registers to simulate 64-bit and 128-bit numbers, profile it.
If B is constant in your code, you can pre-calculate the reciprocal and simply calculate using the last two formulae. This, I am sure will be faster than bit-shifting.
Hope this helps.

If 128-bit unsigned by 63-bit unsigned is good enough, then it can be done in a loop doing at most 63 cycles.
Consider this a proposed solution MSNs' overflow problem by limiting it to 1-bit. We do so by splitting the problem in 2, modular multiplication and adding the results at the end.
In the following example upper corresponds to the most significant 64-bits, lower to the least significant 64-bits and div is the divisor.
unsigned 128_mod(uint64_t upper, uint64_t lower, uint64_t div) {
uint64_t result = 0;
uint64_t a = (~0%div)+1;
upper %= div; // the resulting bit-length determines number of cycles required
// first we work out modular multiplication of (2^64*upper)%div
while (upper != 0){
if(upper&1 == 1){
result += a;
if(result >= div){result -= div;}
}
a <<= 1;
if(a >= div){a -= div;}
upper >>= 1;
}
// add up the 2 results and return the modulus
if(lower>div){lower -= div;}
return (lower+result)%div;
}
The only problem is that, if the divisor is 64-bits then we get overflows of 1-bit (loss of information) giving a faulty result.
It bugs me that I haven't figured out a neat way to handle the overflows.

I don't know how to compile the assembler codes, any help is appreciated to compile and test them.
I solved this problem by comparing against gmplib "mpz_mod()" and summing 1 million loop results. It was a long ride to go from slowdown (seedup 0.12) to speedup 1.54 -- that is the reason I think the C codes in this thread will be slow.
Details inclusive test harness in this thread:
https://www.raspberrypi.org/forums/viewtopic.php?f=33&t=311893&p=1873122#p1873122
This is "mod_256()" with speedup over using gmplib "mpz_mod()", use of __builtin_clzll() for longer shifts was essential:
typedef __uint128_t uint256_t[2];
#define min(x, y) ((x<y) ? (x) : (y))
int clz(__uint128_t u)
{
// unsigned long long h = ((unsigned long long *)&u)[1];
unsigned long long h = u >> 64;
return (h!=0) ? __builtin_clzll(h) : 64 + __builtin_clzll(u);
}
__uint128_t mod_256(uint256_t x, __uint128_t n)
{
if (x[1] == 0) return x[0] % n;
else
{
__uint128_t r = x[1] % n;
int F = clz(n);
int R = clz(r);
for(int i=0; i<128; ++i)
{
if (R>F+1)
{
int h = min(R-(F+1), 128-i);
r <<= h; R-=h; i+=(h-1); continue;
}
r <<= 1; if (r >= n) { r -= n; R=clz(r); }
}
r += (x[0] % n); if (r >= n) r -= n;
return r;
}
}

If you have a recent x86 machine, there are 128-bit registers for SSE2+. I've never tried to write assembly for anything other than basic x86, but I suspect there are some guides out there.

I am 9 years after the battle but here is an interesting O(1) edge case for powers of 2 that is worth mentioning.
#include <stdio.h>
// example with 32 bits and 8 bits.
int main() {
int i = 930;
unsigned char b = (unsigned char) i;
printf("%d", (int) b); // 162, same as 930 % 256
}

Since there is no predefined 128-bit integer type in C, bits of A have to be represented in an array. Although B (64-bit integer) can be stored in an unsigned long long int variable, it is needed to put bits of B into another array in order to work on A and B efficiently.
After that, B is incremented as Bx2, Bx3, Bx4, ... until it is the greatest B less than A. And then (A-B) can be calculated, using some subtraction knowledge for base 2.
Is this the kind of solution that you are looking for?

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

How can I make such bitwise operations faster? - c

Related

Converting a loop from x86 assembly to C language with AND, OR, SHR and SHL instructions and an array

NASM x86 core dump writing on memory [duplicate]

Assembly Power(A,b) function

Seeking maximum bitmap (aka bit array) performance with C/Intel assembly

Fastest way to calculate a 128-bit integer modulo a 64-bit integer

Categories

Resources