Large integer addition with CUDA

I've been developing a cryptographic algorithm on the GPU and am currently stuck on an algorithm for large-integer addition. Large integers are represented in the usual way as an array of 32-bit words.
For example, we can use one thread to add two 32-bit words. For simplicity, let's assume
that the numbers to be added have the same length and that the number of threads per block equals the number of words. Then:
__global__ void add_kernel(int *C, const int *A, const int *B) {
    int x = A[threadIdx.x];
    int y = B[threadIdx.x];
    int z = x + y;
    int carry = (z < x);
    /** do carry propagation in parallel somehow ? */
    ............
    z = z + newcarry; // update the resulting words after carry propagation
    C[threadIdx.x] = z;
}
I am pretty sure that there is a way to do carry propagation via some tricky reduction procedure, but I could not figure it out.
I had a look at the CUDA Thrust extensions, but a big-integer package does not seem to be implemented yet.
Perhaps someone can give me a hint on how to do that in CUDA?

You are right: carry propagation can be done via a prefix-sum computation, but it's a bit tricky to define the binary function for this operation and to prove that it is associative (which is needed for a parallel prefix sum). As a matter of fact, this algorithm is used (theoretically) in the carry-lookahead adder.
Suppose we have two large integers a[0..n-1] and b[0..n-1].
Then we compute, for i = 0..n-1:
s[i] = a[i] + b[i];
carryin[i] = (s[i] < a[i]);
We define two functions:
generate[i] = carryin[i];
propagate[i] = (s[i] == 0xffffffff);
with quite intuitive meanings: generate[i] == 1 means that a carry is generated at
position i, while propagate[i] == 1 means that a carry will be propagated from position
(i - 1) to (i + 1). Our goal is to compute the function carryout[0..n-1], which is used to update the resulting sum s[0..n-1]. carryout can be computed recursively as follows:
carryout[i] = generate[i] OR (propagate[i] AND carryout[i-1])
carryout[0] = generate[0] (there is no carry into position 0)
Here carryout[i] == 1 if a carry is generated at position i, OR it was generated somewhere earlier AND propagated up to position i. Finally, we update the resulting sum:
s[i] = s[i] + carryout[i-1]; for i = 1..n-1
carry = carryout[n-1];
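As a sanity check, here is a minimal sequential (CPU) reference for this recursion, written before any parallelization (my sketch, not code from the answer):

/* Sequential reference: 32-bit words, least significant word at index 0. */
void add_ref(unsigned *s, const unsigned *a, const unsigned *b, int n) {
    unsigned carryout_prev = 0;                    /* no carry into position 0 */
    for (int i = 0; i < n; i++) {
        unsigned sum = a[i] + b[i];
        unsigned generate  = (sum < a[i]);         /* carry produced here */
        unsigned propagate = (sum == 0xffffffffu); /* incoming carry would pass through */
        s[i] = sum + carryout_prev;                /* apply carryout[i-1] */
        carryout_prev = generate | (propagate & carryout_prev); /* carryout[i] */
    }
    /* carryout_prev now holds the overall carry-out of the addition */
}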
Now it is quite straightforward to prove that the carryout operator is indeed binary associative, and hence the parallel prefix-sum computation applies. To implement this on CUDA, we can merge both flags 'generate' and 'propagate' into a single variable, since they are mutually exclusive, i.e.:
cy[i] = (s[i] == -1u ? -1u : 0) | carryin[i];
In other words,
cy[i] = 0xffffffff if propagate[i]
cy[i] = 1 if generate[i]
cy[i] = 0 otherwise
Then one can verify that the following formula computes the prefix sum for the carryout function:
cy[i] = max((int)cy[i], (int)cy[k]) & cy[i];
for all k < i. The example code below shows large addition for 2048-word integers. Here I used CUDA blocks with 512 threads:
// add & output carry flag
#define UADDO(c, a, b) \
    asm volatile("add.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a), "r"(b));
// add with carry & output carry flag
#define UADDC(c, a, b) \
    asm volatile("addc.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a), "r"(b));

#define WS 32

__global__ void bignum_add(unsigned *g_R, const unsigned *g_A, const unsigned *g_B) {

    extern __shared__ unsigned shared[];
    unsigned *r = shared;

    const unsigned N_THIDS = 512;
    unsigned thid = threadIdx.x, thid_in_warp = thid & (WS - 1);
    unsigned cf;

    uint4 a = ((const uint4 *)g_A)[thid],
          b = ((const uint4 *)g_B)[thid];

    UADDO(a.x, a.x, b.x) // adding 128-bit chunks with carry flag
    UADDC(a.y, a.y, b.y)
    UADDC(a.z, a.z, b.z)
    UADDC(a.w, a.w, b.w)
    UADDC(cf, 0, 0)      // save carry-out

    // memory consumption: 49 * N_THIDS / 64
    // use "alternating" data layout for each pair of warps
    volatile short *scan = (volatile short *)(r + 16 + thid_in_warp +
            49 * (thid / 64)) + ((thid / 32) & 1);

    scan[-32] = -1; // put identity element
    if (a.x == -1u && a.x == a.y && a.x == a.z && a.x == a.w)
        // this indicates that the carry will propagate through the number
        cf = -1u;

    // "Hillis-and-Steele-style" reduction
    scan[0] = cf;
    cf = max((int)cf, (int)scan[-2]) & cf;
    scan[0] = cf;
    cf = max((int)cf, (int)scan[-4]) & cf;
    scan[0] = cf;
    cf = max((int)cf, (int)scan[-8]) & cf;
    scan[0] = cf;
    cf = max((int)cf, (int)scan[-16]) & cf;
    scan[0] = cf;
    cf = max((int)cf, (int)scan[-32]) & cf;
    scan[0] = cf;

    int *postscan = (int *)r + 16 + 49 * (N_THIDS / 64);
    if (thid_in_warp == WS - 1) // scan leading carry-outs once again
        postscan[thid >> 5] = cf;

    __syncthreads();

    if (thid < N_THIDS / 32) {
        volatile int *t = (volatile int *)postscan + thid;
        t[-8] = -1; // load identity symbol
        cf = t[0];
        cf = max((int)cf, (int)t[-1]) & cf;
        t[0] = cf;
        cf = max((int)cf, (int)t[-2]) & cf;
        t[0] = cf;
        cf = max((int)cf, (int)t[-4]) & cf;
        t[0] = cf;
    }
    __syncthreads();

    cf = scan[0];
    int ps = postscan[(int)((thid >> 5) - 1)]; // postscan[-1] equals -1
    scan[0] = max((int)cf, ps) & cf; // update carry flags within warps
    cf = scan[-2];

    if (thid_in_warp == 0)
        cf = ps;
    if ((int)cf < 0)
        cf = 0;

    UADDO(a.x, a.x, cf) // propagate carry flag if needed
    UADDC(a.y, a.y, 0)
    UADDC(a.z, a.z, 0)
    UADDC(a.w, a.w, 0)
    ((uint4 *)g_R)[thid] = a;
}
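For reference, here is a host-side version of the scan operator that can be used to sanity-check the kernel (my sketch; op(prev, cur) = max(prev, cur) & cur with identity 0xffffffff, applied sequentially):

#include <stdint.h>

/* Encoding as above: 0xffffffff = propagate, 1 = generate, 0 = neither. */
static uint32_t scan_op(uint32_t prev, uint32_t cur) {
    int32_t m = ((int32_t)prev > (int32_t)cur) ? (int32_t)prev : (int32_t)cur;
    return (uint32_t)m & cur;
}

void carry_scan_ref(uint32_t *cy, int n) {
    uint32_t acc = 0xffffffffu;  /* identity element */
    for (int i = 0; i < n; i++) {
        acc = scan_op(acc, cy[i]);
        cy[i] = acc;             /* inclusive prefix: carry-out of position i */
    }
}

A negative scanned value means the carry-out still depends on the (zero) incoming carry, so it is treated as 0, just like the (int)cf < 0 test in the kernel.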
Note that the macros UADDO / UADDC might no longer be necessary, since CUDA 4.0 provides corresponding intrinsics (however, I am not entirely sure).
Also note that, although the parallel reduction is quite fast, if you need to add several large integers in a row, it might be better to use a redundant representation (as suggested in the comments above): first accumulate the results of the additions in 64-bit words, and then perform one carry propagation at the very end, in one sweep.
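For illustration, a minimal sketch of that redundant scheme (my own layout assumption, not code from the answer): keep each 32-bit digit in a 64-bit lane, so up to about 2^32 additions can be accumulated without any carry handling, then resolve all carries in a single sequential sweep (or one final scan):

#include <stdint.h>

/* acc[i] holds digit i plus all deferred carries in a 64-bit lane. */
void accumulate(uint64_t *acc, const uint32_t *a, int n) {
    for (int i = 0; i < n; i++)
        acc[i] += a[i];              /* no carry propagation needed yet */
}

/* One final sweep turns the redundant form back into 32-bit digits. */
void resolve(uint32_t *out, const uint64_t *acc, int n) {
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t t = acc[i] + carry; /* cannot overflow while < 2^32 adds are deferred */
        out[i] = (uint32_t)t;        /* low 32 bits are the digit */
        carry  = t >> 32;            /* high bits carry into the next word */
    }
}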

I thought I would post my answer as well, in addition to @asm's, so this SO question can be a sort of repository of ideas. Similar to @asm, I detect and store the carry condition as well as the "carry-through" condition, i.e. when the intermediate word result is all 1s (0xF...FFF), so that if a carry were to propagate into this word, it would "carry through" to the next word.
I didn't use any PTX or asm in my code, so I chose to use 64-bit unsigned ints instead of 32-bit ones to achieve the 2048x32-bit capability using 1024 threads.
A larger difference from @asm's code is my parallel carry-propagation scheme. I construct a bit-packed array ("carry") where each bit represents the carry condition generated by the independent intermediate 64-bit adds from each of the 1024 threads. I also construct a bit-packed array ("carry_through") where each bit represents the carry-through condition of the individual 64-bit intermediate results. For 1024 threads, this amounts to 1024/64 = 16 64-bit words of shared memory for each bit-packed array, so total shared-memory usage is 64 + 3 32-bit quantities. With these bit-packed arrays, I perform the following to generate a combined propagated-carry indicator:
carry = carry | (carry_through ^ ((carry & carry_through) + carry_through));
(note that carry is shifted left by one: carry[i] indicates that the result of a[i-1] + b[i-1] generated a carry)
The explanation is as follows:

1. The bitwise AND of carry and carry_through generates the candidates where a carry will interact with a sequence of one or more carry-through conditions.
2. Adding the result of step 1 to carry_through produces a result whose changed bits represent all words that will be affected by the propagation of the carry into the carry-through sequence.
3. Taking the exclusive OR of carry_through and the result from step 2 exposes the affected results, indicated with a 1 bit.
4. Taking the bitwise OR of the result from step 3 and the ordinary carry indicators gives a combined carry condition, which is then used to update all the intermediate results.
Note that the addition in step 2 requires another multi-word add (for big ints composed of more than 64 words). I believe this algorithm works, and it has passed the test cases I have thrown at it.
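Here is a single-word illustration of the formula (my own worked example, not from the answer): word 0 generates a carry (bit 1 of carry, since carry is pre-shifted left by one) and words 1..3 produce all-ones intermediate sums (bits 1..3 of carry_through):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t carry         = 0x02; /* 0b00010: word 0 carried into word 1 */
    uint64_t carry_through = 0x0E; /* 0b01110: words 1..3 are all-ones    */
    carry = carry | (carry_through ^ ((carry & carry_through) + carry_through));
    printf("0x%llX\n", (unsigned long long)carry); /* 0x1E: words 1..4 get +1 */
    return 0;
}

The carry into word 1 ripples through the all-ones words 1..3 and lands in word 4, exactly as the combined indicator says.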
Here is my example code which implements this:
// parallel add of large integers
// requires CC 2.0 or higher
// compile with:
// nvcc -O3 -arch=sm_20 -o paradd2 paradd2.cu
#include <stdio.h>
#include <stdlib.h>

#define MAXSIZE 1024 // the number of 64 bit quantities that can be added
#define LLBITS 64    // the number of bits in a long long
#define BSIZE ((MAXSIZE + LLBITS - 1) / LLBITS) // MAXSIZE when packed into bits
#define nTPB MAXSIZE

// define either GPU or GPUCOPY, not both -- for timing
#define GPU
//#define GPUCOPY

#define LOOPCNT 1000

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)
// perform c = a + b, for unsigned integers of psize*64 bits.
// all work is done in a single threadblock;
// multiple threadblocks handle multiple separate addition problems.
// least significant word is at a[0], etc.
__global__ void paradd(const unsigned size, const unsigned psize, unsigned long long *c, const unsigned long long *a, const unsigned long long *b){

    __shared__ unsigned long long carry_through[BSIZE];
    __shared__ unsigned long long carry[BSIZE+1];
    __shared__ volatile unsigned mcarry;
    __shared__ volatile unsigned mcarry_through;

    unsigned idx = threadIdx.x + (psize * blockIdx.x);
    if ((threadIdx.x < psize) && (idx < size)){

        // handle 64 bit unsigned add first
        unsigned long long cr1 = a[idx];
        unsigned long long lc = cr1 + b[idx];

        // handle carry
        if (threadIdx.x < BSIZE){
            carry[threadIdx.x] = 0;
            carry_through[threadIdx.x] = 0;
        }
        if (threadIdx.x == 0){
            mcarry = 0;
            mcarry_through = 0;
        }
        __syncthreads();

        if (lc < cr1){
            if ((threadIdx.x % LLBITS) != (LLBITS - 1))
                atomicAdd(&(carry[threadIdx.x / LLBITS]), (2ull << (threadIdx.x % LLBITS)));
            else
                atomicAdd(&(carry[(threadIdx.x / LLBITS) + 1]), 1);
        }
        // handle carry-through
        if (lc == 0xFFFFFFFFFFFFFFFFull)
            atomicAdd(&(carry_through[threadIdx.x / LLBITS]), (1ull << (threadIdx.x % LLBITS)));
        __syncthreads();

        if (threadIdx.x < ((psize + LLBITS - 1) / LLBITS)){
            // only 1 warp executing within this if statement
            unsigned long long cr3 = carry_through[threadIdx.x];
            cr1 = carry[threadIdx.x] & cr3;
            // start of sub-add
            unsigned long long cr2 = cr3 + cr1;
            if (cr2 < cr1) atomicAdd((unsigned *)&mcarry, (2u << (threadIdx.x)));
            if (cr2 == 0xFFFFFFFFFFFFFFFFull) atomicAdd((unsigned *)&mcarry_through, (1u << threadIdx.x));
            if (threadIdx.x == 0) {
                unsigned cr4 = mcarry & mcarry_through;
                cr4 += mcarry_through;
                mcarry |= (mcarry_through ^ cr4);
            }
            if (mcarry & (1u << threadIdx.x)) cr2++;
            // end of sub-add
            carry[threadIdx.x] |= (cr2 ^ cr3);
        }
        __syncthreads();

        if (carry[threadIdx.x / LLBITS] & (1ull << (threadIdx.x % LLBITS))) lc++;
        c[idx] = lc;
    }
}
int main() {

    unsigned long long *h_a, *h_b, *h_c, *d_a, *d_b, *d_c, *c;
    unsigned at_once = 256;       // valid range = 1 .. 65535
    unsigned prob_size = MAXSIZE; // valid range = 1 .. MAXSIZE
    unsigned dsize = at_once * prob_size;
    cudaEvent_t t_start_gpu, t_start_cpu, t_end_gpu, t_end_cpu;
    float et_gpu, et_cpu, tot_gpu, tot_cpu;
    tot_gpu = 0;
    tot_cpu = 0;

    if (sizeof(unsigned long long) != (LLBITS/8)) {printf("Word Size Error\n"); return 1;}
    if ((c = (unsigned long long *)malloc(dsize * sizeof(unsigned long long))) == 0) {printf("Malloc Fail\n"); return 1;}

    cudaHostAlloc((void **)&h_a, dsize * sizeof(unsigned long long), cudaHostAllocDefault);
    cudaCheckErrors("cudaHostAlloc1 fail");
    cudaHostAlloc((void **)&h_b, dsize * sizeof(unsigned long long), cudaHostAllocDefault);
    cudaCheckErrors("cudaHostAlloc2 fail");
    cudaHostAlloc((void **)&h_c, dsize * sizeof(unsigned long long), cudaHostAllocDefault);
    cudaCheckErrors("cudaHostAlloc3 fail");

    cudaMalloc((void **)&d_a, dsize * sizeof(unsigned long long));
    cudaCheckErrors("cudaMalloc1 fail");
    cudaMalloc((void **)&d_b, dsize * sizeof(unsigned long long));
    cudaCheckErrors("cudaMalloc2 fail");
    cudaMalloc((void **)&d_c, dsize * sizeof(unsigned long long));
    cudaCheckErrors("cudaMalloc3 fail");
    cudaMemset(d_c, 0, dsize * sizeof(unsigned long long));

    cudaEventCreate(&t_start_gpu);
    cudaEventCreate(&t_end_gpu);
    cudaEventCreate(&t_start_cpu);
    cudaEventCreate(&t_end_cpu);

    for (unsigned loops = 0; loops < LOOPCNT; loops++){
        // create some test cases
        if (loops == 0){
            for (int j = 0; j < at_once; j++)
                for (int k = 0; k < prob_size; k++){
                    int i = (j * prob_size) + k;
                    h_a[i] = 0xFFFFFFFFFFFFFFFFull;
                    h_b[i] = 0;
                }
            h_a[prob_size-1] = 0;
            h_b[prob_size-1] = 1;
            h_b[0] = 1;
        }
        else if (loops == 1){
            for (int i = 0; i < dsize; i++){
                h_a[i] = 0xFFFFFFFFFFFFFFFFull;
                h_b[i] = 0;
            }
            h_b[0] = 1;
        }
        else if (loops == 2){
            for (int i = 0; i < dsize; i++){
                h_a[i] = 0xFFFFFFFFFFFFFFFEull;
                h_b[i] = 2;
            }
            h_b[0] = 1;
        }
        else {
            for (int i = 0; i < dsize; i++){
                h_a[i] = (((unsigned long long)lrand48()) << 33) + (unsigned long long)lrand48();
                h_b[i] = (((unsigned long long)lrand48()) << 33) + (unsigned long long)lrand48();
            }
        }
#ifdef GPUCOPY
        cudaEventRecord(t_start_gpu, 0);
#endif
        cudaMemcpy(d_a, h_a, dsize * sizeof(unsigned long long), cudaMemcpyHostToDevice);
        cudaCheckErrors("cudaMemcpy1 fail");
        cudaMemcpy(d_b, h_b, dsize * sizeof(unsigned long long), cudaMemcpyHostToDevice);
        cudaCheckErrors("cudaMemcpy2 fail");
#ifdef GPU
        cudaEventRecord(t_start_gpu, 0);
#endif
        paradd<<<at_once, nTPB>>>(dsize, prob_size, d_c, d_a, d_b);
        cudaCheckErrors("Kernel Fail");
#ifdef GPU
        cudaEventRecord(t_end_gpu, 0);
#endif
        cudaMemcpy(h_c, d_c, dsize * sizeof(unsigned long long), cudaMemcpyDeviceToHost);
        cudaCheckErrors("cudaMemcpy3 fail");
#ifdef GPUCOPY
        cudaEventRecord(t_end_gpu, 0);
#endif
        cudaEventSynchronize(t_end_gpu);
        cudaEventElapsedTime(&et_gpu, t_start_gpu, t_end_gpu);
        tot_gpu += et_gpu;
        cudaEventRecord(t_start_cpu, 0);

        // also compute the result on the CPU for comparison
        for (int j = 0; j < at_once; j++) {
            unsigned rc = 0;
            for (int n = 0; n < prob_size; n++){
                unsigned i = (j * prob_size) + n;
                c[i] = h_a[i] + h_b[i];
                if (c[i] < h_a[i]) {
                    c[i] += rc;
                    rc = 1;
                }
                else {
                    if ((c[i] += rc) != 0) rc = 0;
                }
                if (c[i] != h_c[i]) {printf("Results mismatch at offset %u, GPU = 0x%llX, CPU = 0x%llX\n", i, h_c[i], c[i]); return 1;}
            }
        }
        cudaEventRecord(t_end_cpu, 0);
        cudaEventSynchronize(t_end_cpu);
        cudaEventElapsedTime(&et_cpu, t_start_cpu, t_end_cpu);
        tot_cpu += et_cpu;
        if ((loops % (LOOPCNT/10)) == 0) printf("*\n");
    }
    printf("\nResults Match!\n");
    printf("Average GPU time = %fms\n", (tot_gpu/LOOPCNT));
    printf("Average CPU time = %fms\n", (tot_cpu/LOOPCNT));
    return 0;
}

Related

Divide 64-bit integers as though the dividend is shifted left 64 bits, without having 128-bit types

Apologies for the confusing title; I'm not sure how to better describe what I'm trying to accomplish. I'm essentially trying to do the reverse of
getting the high half of a 64-bit multiplication in C, for platforms where

int64_t divHi64(int64_t dividend, int64_t divisor) {
    return ((__int128)dividend << 64) / (__int128)divisor;
}

isn't possible due to lacking support for __int128.
This can be done without a multi-word division.
Suppose we want to compute ⌊2⁶⁴·x⁄y⌋. Writing 2⁶⁴ = ⌊2⁶⁴⁄y⌋·y + (2⁶⁴ mod y), we can transform the expression like this:

⌊2⁶⁴·x⁄y⌋ = ⌊2⁶⁴⁄y⌋·x + ⌊((2⁶⁴ mod y)⁄y)·x⌋

The first term is trivially computed as ((-y)/y + 1)*x, as per the question How to compute 2⁶⁴/n in C?
The second term, ((2⁶⁴ % y)/y)*x, is a little bit trickier. I've tried various ways, but all of them need 128-bit multiplication and 128/64-bit division if using only integer operations. That can be done using the algorithms for MulDiv64(a, b, c) = a*b/c in the questions below:
Most accurate way to do a combined multiply-and-divide operation in 64-bit?
How to multiply a 64 bit integer by a fraction in C++ while minimizing error?
(a * b) / c MulDiv and dealing with overflow from intermediate multiplication
How can I multiply and divide 64-bit ints accurately?
However, those may be slow, and if you have such functions you can calculate the whole expression more easily, like MulDiv64(x, UINT64_MAX, y) + x/y + something, without messing with the above transformation.
Using long double seems to be the easiest way if it has 64 bits of precision or more. So the second term can now be computed as ((2⁶⁴ % y)/(long double)y)*x:
uint64_t divHi64(uint64_t x, uint64_t y) {
    uint64_t mod_y = UINT64_MAX % y + 1;
    uint64_t result = ((-y)/y + 1)*x;
    if (mod_y != y)
        result += (uint64_t)((mod_y/(long double)y)*x);
    return result;
}
The overflow check was omitted for simplicity. A slight modification will be needed if you need signed division.
If you're targeting 64-bit Windows with MSVC, which doesn't have __int128, there is now a 128-bit/64-bit divide intrinsic that simplifies the job significantly without requiring a 128-bit integer type. You still need to handle overflow, though, because the div instruction throws an exception in that case:
uint64_t divHi64(uint64_t x, uint64_t y) {
    uint64_t high, remainder;
    uint64_t low = _umul128(UINT64_MAX, y, &high);
    if (x <= high /* && 0 <= low */)
        return _udiv128(x, 0, y, &remainder);
    // overflow case
    errno = EOVERFLOW;
    return 0;
}
The overflow check above can be simplified to checking whether x < y, because if x >= y the result will overflow.
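Using that observation, the routine reduces to the following (my sketch, using the same MSVC intrinsic as above):

#include <stdint.h>
#include <errno.h>
#include <intrin.h> /* MSVC: _udiv128 */

uint64_t divHi64(uint64_t x, uint64_t y) {
    if (x >= y) {        /* quotient would not fit in 64 bits */
        errno = EOVERFLOW;
        return 0;
    }
    uint64_t remainder;
    return _udiv128(x, 0, y, &remainder); /* (x:0) / y cannot overflow now */
}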
See also
Efficient Multiply/Divide of two 128-bit Integers on x86 (no 64-bit)
Efficient computation of 2**64 / divisor via fast floating-point reciprocal
Exhaustive tests on 16/16-bit division show that my solution works correctly for all cases. However, you do need double even though float has more than 16 bits of precision; otherwise a result that is one too small will occasionally be returned. It may be fixed by adding an epsilon value before truncating: (uint64_t)((mod_y/(long double)y)*x + epsilon). That means you'll need __float128 (or the -m128bit-long-double option) in GCC for precise 64/64-bit output if you don't correct the result with epsilon. However, that type is available on 32-bit targets, unlike __int128 which is supported only on 64-bit targets, so life will be a bit easier. Of course you can use the function as-is if just a very close result is needed.
Below is the code I've used for verification:
#include <thread>
#include <iostream>
#include <limits>
#include <climits>
#include <cstdint>
#include <mutex>

std::mutex print_mutex;

#define MAX_THREAD 8
#define NUM_BITS 27
#define CHUNK_SIZE (1ULL << NUM_BITS)

// typedef uint32_t T;
// typedef uint64_t T2;
// typedef double D;
typedef uint64_t T;
typedef unsigned __int128 T2; // the type twice as wide as T
typedef long double D;
// typedef __float128 D;

const D epsilon = 1e-14;

T divHi(T x, T y) {
    T mod_y = std::numeric_limits<T>::max() % y + 1;
    T result = ((-y)/y + 1)*x;
    if (mod_y != y)
        result += (T)((mod_y/(D)y)*x + epsilon);
    return result;
}

void testdiv(T midpoint)
{
    T begin = midpoint - CHUNK_SIZE/2;
    T end   = midpoint + CHUNK_SIZE/2;
    for (T i = begin; i != end; i++)
    {
        T x = i & ((1 << NUM_BITS/2) - 1);
        T y = CHUNK_SIZE/2 - (i >> NUM_BITS/2);
        // if (y == 0)
        //     continue;
        auto q1 = divHi(x, y);
        T2 q2 = ((T2)x << sizeof(T)*CHAR_BIT)/y;
        if (q2 != (T)q2)
        {
            // std::lock_guard<std::mutex> guard(print_mutex);
            // std::cout << "Overflowed: " << x << '&' << y << '\n';
            continue;
        }
        else if (q1 != q2)
        {
            std::lock_guard<std::mutex> guard(print_mutex);
            std::cout << x << '/' << y << ": " << q1 << " != " << (T)q2 << '\n';
        }
    }
    std::lock_guard<std::mutex> guard(print_mutex);
    std::cout << "Done testing [" << begin << ", " << end << "]\n";
}

uint16_t divHi16(uint32_t x, uint32_t y) {
    uint32_t mod_y = std::numeric_limits<uint16_t>::max() % y + 1;
    int result = ((((1U << 16) - y)/y) + 1)*x;
    if (mod_y != y)
        result += (mod_y/(double)y)*x;
    return result;
}

void testdiv16(uint32_t begin, uint32_t end)
{
    for (uint32_t i = begin; i != end; i++)
    {
        uint32_t y = i & 0xFFFF;
        if (y == 0)
            continue;
        uint32_t x = i & 0xFFFF0000;
        uint32_t q2 = x/y;
        if (q2 > 0xFFFF) // overflowed
            continue;
        uint16_t q1 = divHi16(x >> 16, y);
        if (q1 != q2)
        {
            std::lock_guard<std::mutex> guard(print_mutex);
            std::cout << x << '/' << y << ": " << q1 << " != " << q2 << '\n';
        }
    }
}

int main()
{
    std::thread t[MAX_THREAD];
    for (int i = 0; i < MAX_THREAD; i++)
        t[i] = std::thread(testdiv, std::numeric_limits<T>::max()/MAX_THREAD*i);
    for (int i = 0; i < MAX_THREAD; i++)
        t[i].join();

    std::thread t2[MAX_THREAD];
    constexpr uint32_t length = std::numeric_limits<uint32_t>::max()/MAX_THREAD;
    uint32_t begin, end = length;
    for (int i = 0; i < MAX_THREAD - 1; i++)
    {
        begin = end;
        end += length;
        t2[i] = std::thread(testdiv16, begin, end);
    }
    t2[MAX_THREAD - 1] = std::thread(testdiv16, end, UINT32_MAX);
    for (int i = 0; i < MAX_THREAD; i++)
        t2[i].join();
    std::cout << "Done\n";
}

Long long int makes my Sieve of Eratosthenes super slow?

I have a program that requires me to find primes up to 10**10-1 (10,000,000,000). I wrote a Sieve of Eratosthenes to do this, and it worked very well (and accurately) as high as 10**9 (1,000,000,000). I confirmed its accuracy by having it count the number of primes it found, and it matched the value of 50,847,534 in the chart found here. I used unsigned int as the storage type, and it successfully found all the primes in approximately 30 seconds.
However, 10**10 requires that I use a larger storage type: long long int. Once I switched to this, the program runs significantly slower (it's been 3+ hours and it's still working). Here is the relevant code:
typedef unsigned long long ul_long;
typedef unsigned int u_int;

ul_long max = 10000000000;
u_int blocks = 1250000000;
char memField[1250000000];

char mapBit(char place) { //convert 0->0x80, 1->0x40, 2->0x20, and so on
    return 0x80 >> (place);
}

for (u_int i = 2; i*i < max; i++) {
    if (memField[i / 8] & activeBit) { //Use correct memory block
        for (ul_long n = 2 * i; n < max; n += i) {
            char secondaryBit = mapBit(n % 8); //Determine bit position of n
            u_int activeByte = n / 8;          //Determine correct memory block
            if (n < 8) { //Manual override memory block and bit for first block
                secondaryBit = mapBit(n);
                activeByte = 0;
            }
            memField[activeByte] &= ~(secondaryBit); //Set the flag to false
        }
    }
    activeBit = activeBit >> 1; //Check the next bit
    if (activeBit == 0x00) activeBit = 0x80;
}
I figure that since 10**10 is 10x larger than 10**9, it should take about 10 times as long. Where is the flaw in this reasoning? Why did changing to long long cause such significant performance issues, and how can I fix it? I recognize that the numbers get larger, so it should be somewhat slower towards the end, but is there something I'm missing?
Note: I realize long int should technically be large enough, but my limits.h says it isn't, even though I'm compiling 64-bit. That's why I use long long int, in case anyone was wondering. Also, keep in mind that I have no computer-science training, just a hobbyist.
Edit: I just ran it in "Release" as x86-64 with some of the debug statements suggested. From the output, it looks like I hit the u_int bound. I don't know why i is getting that large.
Your program has an infinite loop in for (u_int i = 2; i*i < max; i++): i is an unsigned int, so i*i wraps around at 32 bits and is always less than max. Make i an ul_long.
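A two-line demonstration of the wrap (my example):

#include <stdio.h>

int main(void) {
    unsigned int i = 65536;  /* 2^16 */
    printf("%u\n", i * i);   /* prints 0: 2^32 wraps to 0 in 32-bit unsigned */
    return 0;
}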
Note that you should use the simpler bit pattern 1 to 0x80 for bits 0 to 7.
Here is a complete version:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned long long ul_long;
typedef unsigned int u_int;

#define TESTBIT(a, bit)  (a[(bit) / 8] & (1 << ((bit) & 7)))
#define CLEARBIT(a, bit) (a[(bit) / 8] &= ~(1 << ((bit) & 7)))

ul_long count_primes(ul_long max) {
    size_t blocks = (max + 7) / 8;
    unsigned char *memField = malloc(blocks);
    if (memField == NULL) {
        printf("cannot allocate memory for %llu bytes\n",
               (unsigned long long)blocks);
        return 0;
    }
    memset(memField, 255, blocks);
    CLEARBIT(memField, 0); // 0 is not prime
    CLEARBIT(memField, 1); // 1 is not prime
    // clear bits after max
    for (ul_long i = max + 1; i < blocks * 8ULL; i++) {
        CLEARBIT(memField, i);
    }
    for (ul_long i = 2; i * i < max; i++) {
        if (TESTBIT(memField, i)) { // check if i is prime
            for (ul_long n = 2 * i; n < max; n += i) {
                CLEARBIT(memField, n); // reset all multiples of i
            }
        }
    }
    unsigned int bitCount[256];
    for (int i = 0; i < 256; i++) {
        bitCount[i] = (((i >> 0) & 1) + ((i >> 1) & 1) +
                       ((i >> 2) & 1) + ((i >> 3) & 1) +
                       ((i >> 4) & 1) + ((i >> 5) & 1) +
                       ((i >> 6) & 1) + ((i >> 7) & 1));
    }
    ul_long count = 0;
    for (size_t i = 0; i < blocks; i++) {
        count += bitCount[memField[i]];
    }
    printf("count of primes up to %llu: %llu\n", max, count);
    free(memField);
    return count;
}

int main(int argc, char *argv[]) {
    if (argc > 1) {
        for (int i = 1; i < argc; i++) {
            count_primes(strtoull(argv[i], NULL, 0));
        }
    } else {
        count_primes(10000000000);
    }
    return 0;
}
It completes in 10 seconds for 10^9 and 131 seconds for 10^10:
count of primes up to 1000000000: 50847534
count of primes up to 10000000000: 455052511

Counting number of bits in an unsigned long

So I've written this function to count the number of bits in a long, which for my purposes includes zeros to the right of the MSB and excludes zeros to its left:
int bitCount(unsigned long bits)
{
    int len = 64;
    unsigned long mask = 0x8000000000000000;
    while ((bits & mask) == 0 && len > 0){
        mask >>= 1;
        --len;
    }
    return len;
}
This function works fine for me as far as returning a correct answer, but is there a better (faster or otherwise) way to go about doing this?
If you want to count the number of bits in a long type, I suggest you use ULONG_MAX from the <limits.h> header file, and use the right shift operator >> to count the number of one-bits. This way you don't have to actually know the number of bits beforehand.
Something like
unsigned long value = ULONG_MAX;
unsigned count = 1;
while (value >>= 1)
    ++count;
This works because the right shift fills up with zeroes.
The general answer for the number of bits in any type is CHAR_BIT*sizeof(type). CHAR_BIT, defined in <limits.h> is the (implementation-defined) number of bits in a char. sizeof(type) is specified in a way that yields the number of chars used to represent the type (i.e. sizeof(char) is 1).
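On GCC and Clang there is also a count-leading-zeros builtin that compiles to a single instruction on most targets (my sketch; note that __builtin_clzl is undefined for a zero argument, hence the special case):

#include <limits.h>

int bitCount(unsigned long bits)
{
    /* position of the MSB counting from 1, or 0 when no bit is set */
    return bits ? (int)(CHAR_BIT * sizeof bits) - __builtin_clzl(bits) : 0;
}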
The solutions the other guys proposed are very nice and probably the shortest to write while remaining understandable. Another straightforward approach would be something like this:
int bitCountLinear(long int n) {
    int len = sizeof(long int)*8;
    for (int i = 0; i < len; ++i)
        if ((1UL<<i) > (unsigned long int)n)
            return i;
    return len;
}
The rest might get a bit extreme, but I gave it a try, so I'll share it.
I suspected that there might be arguably faster methods of doing this, e.g. using binary search (even though a length of 64 bits is extremely small), so I gave it a quick try, for you and for the fun of it.
union long_ing_family {
    unsigned long int uli;
    long int li;
};

int bitCountLogN(long int num) {
    union long_ing_family lif;
    lif.li = num;
    unsigned long int n = lif.uli;
    int res;
    int len = sizeof(long int)*8-1;
    int max = len;
    int min = 0;
    if (n == 0) return 0;
    do {
        res = (min + max) / 2;
        if (n < 1UL<<res)
            max = res - 1;
        else if (n >= (1UL<<(res+1)))
            min = res + 1;
        else
            return res+1;
    } while (min < max);
    return min+1; // or max+1
}
I then timed both to see if they have any interesting differences...
#include <stdio.h>

#define REPS 10000000

int bitCountLinear(long int n);
int bitCountLogN(long int num);
unsigned long int timestamp_start(void);
unsigned long int timestamp_stop(void);
union long_ing_family;

int main(void) {
    long int n;
    long int begin, end;
    long int begin_Lin, end_Lin;
    long int begin_Log, end_Log;

    begin_Lin = 0;
    end_Lin = 0;
    begin_Log = 0;
    end_Log = 0;

    for (int i = 0; i < REPS; ++i) {
        begin_Lin += timestamp_start();
        bitCountLinear(i);
        end_Lin += timestamp_stop();
    }
    printf("Linear: %lu\n", (end_Lin-begin_Lin)/REPS);

    for (int i = 0; i < REPS; ++i) {
        begin_Log += timestamp_start();
        bitCountLogN(i);
        end_Log += timestamp_stop();
    }
    printf("Log(n): %lu\n", (end_Log-begin_Log)/REPS);
}

unsigned long int timestamp_start(void) {
    unsigned int cycles_low;
    unsigned int cycles_high;
    asm volatile ("CPUID\n\t"
                  "RDTSCP\n\t"
                  "mov %%edx, %0\n\t"
                  "mov %%eax, %1\n\t" : "=r" (cycles_high), "=r" (cycles_low)
                  :: "%rax", "%rbx", "%rcx", "%rdx");
    return ((unsigned long int)cycles_high << 32) | cycles_low;
}

unsigned long int timestamp_stop(void) {
    unsigned int cycles_low;
    unsigned int cycles_high;
    asm volatile ("RDTSCP\n\t"
                  "mov %%edx, %0\n\t"
                  "mov %%eax, %1\n\t"
                  "CPUID\n\t" : "=r" (cycles_high), "=r" (cycles_low)
                  :: "%rax", "%rbx", "%rcx", "%rdx");
    return ((unsigned long int)cycles_high << 32) | cycles_low;
}
...and, not surprisingly, they didn't. On my machine I get numbers like

Linear: 228
Log(n): 224

which are not considered different, assuming a lot of background noise.
Edit: I realized that I had only tested the fastest cases for the linear approach, so changing the function inputs to

bitCountLinear(0xFFFFFFFFFFFFFFFF-i);

and

bitCountLogN(0xFFFFFFFFFFFFFFFF-i);

gives, on my machine, numbers like

Linear: 415
Log(n): 269

which is clearly a win for the Log(n) method. I didn't expect to see a difference here.
You can count the number of 1 bits first:

int bitCount(unsigned long n)
{
    unsigned long tmp;
    tmp = n - ((n >> 1) & 0x7777777777777777)
            - ((n >> 2) & 0x3333333333333333)
            - ((n >> 3) & 0x1111111111111111);
    tmp = (tmp + (tmp >> 4)) & 0x0F0F0F0F0F0F0F0F;
    return 64 - tmp % 255; // tmp % 255 is the number of 1 bits
}
Take a look at the MIT HAKMEM Count.
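For reference, the classic HAKMEM 169 count for 32-bit words looks like this (the same octal-digit trick that the code above extends to 64 bits):

#include <stdint.h>

unsigned hakmem_popcount32(uint32_t n)
{
    uint32_t tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
    return ((tmp + (tmp >> 3)) & 030707070707) % 63; /* sum of octal digits mod 63 */
}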

Is there a more optimal way to approach some of these functions?

I completed some bit-manipulation exercises from a textbook recently and have a firm grasp of the core ideas behind manipulating bits. My main concern with this post is optimizations to my current code: I have a hunch that there are some functions I could approach better. Do you have any recommendations for the following code?
#include <stdio.h>
#include "funcs.h"

// basically sizeof(int) using bit manipulation
unsigned int int_size(){
    int size = 0;
    for(unsigned int i = ~00u; i > 0; i >>= 1, size++);
    return size;
}

// get a bit at a specific nth index
// index starts with 0 on the most significant bit
unsigned int bit_get(unsigned int data, unsigned int n){
    return (data >> (int_size() - n - 1)) & 1;
}

// set a bit at a specific nth index
// index starts with 0 on the most significant bit
unsigned int bit_set(unsigned int data, unsigned int n){
    return data | (1 << (int_size() - n - 1));
}

// gets the bit width of the data (<32)
unsigned int bit_width(unsigned int data){
    int width = int_size();
    for(; width > 0; width--)
        if((data & (1 << width)) != 0)
            break;
    return width + 1;
}

// print the data contained in an unsigned int
void print_data(unsigned int data){
    printf("%016X = ", data);
    for(int i = 0; i < int_size(); i++)
        printf("%X", bit_get(data, i));
    putchar('\n');
}

// search for pattern in source (where pattern is n wide)
unsigned int bitpat_search(unsigned int source, unsigned int pattern,
                           unsigned int n){
    int right = int_size() - n;
    unsigned int mask = 0;
    for(int i = 0; i < n; i++)
        mask |= 1 << i;
    for(int i = 0; i < right; i++)
        if(((source & (mask << (right - i))) >> (right - i) ^ pattern) == 0)
            return i - bit_width(source);
    return -1;
}

// extract {count} bits from data starting at {start}
unsigned int bitpat_get(unsigned int data, int start, int count){
    if(start < 0 || count < 0 || int_size() <= start || int_size() <= count || bit_width(data) != count)
        return -1;
    unsigned int mask = 1;
    for(int i = 0; i < count; i++)
        mask |= 1 << i;
    mask <<= int_size() - start - count;
    return (data & mask) >> (int_size() - start - count);
}

// set {count} bits (basically the width of {replace}) in {*data} starting at {start}
void bitpat_set(unsigned int *data, unsigned int replace, int start, int count){
    if(start < 0 || count < 0 || int_size() <= start || int_size() <= count || bit_width(replace) != count)
        return;
    unsigned int mask = 1;
    for(int i = 0; i < count; i++)
        mask |= 1 << i;
    *data = ((*data | (mask << (int_size() - start - count))) & ~(mask << (int_size() - start - count))) | (replace << (int_size() - start - count));
}
Because your int_size() function returns the same value each time, you could save some time there:

unsigned int int_size(){
    static unsigned int size = 0;
    if (size == 0)
        for(unsigned int i = ~00u; i > 0; i >>= 1, size++);
    return size;
}
so that it calculates the value only once.
But replacing all calls of this function with sizeof(int)*8 would be much better.
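For portability you can also derive the width from CHAR_BIT instead of hard-coding 8 (a small sketch; INT_SIZE is a name I made up):

#include <limits.h>

/* width of unsigned int in bits, known at compile time */
#define INT_SIZE ((unsigned)(CHAR_BIT * sizeof(unsigned int)))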
I looked through your code and there's nothing that jumps out at me.
Overall, don't sweat the small stuff. If the code runs and works fine, no worries. If you are really concerned about performance, go ahead and run your code through a profiler.
Overall, I will say that the one thing you might be dealing with is the "paranoia" I see in your code regarding the width of an int. I generally use the fixed-length types in stdint.h and give the caller some options regarding what length of ints (i.e. uint8_t, uint16_t, uint32_t, etc.) they want to deal with.
Also, in C99 there are bit-fields, which allow individual bits to be addressed.
unsigned int int_size(){
    return __builtin_popcount((unsigned int) -1) / __builtin_popcount((unsigned char) -1);
}
This should be faster than looping.
Including int_size() in all the others seems like it's going to kill performance, unless the compiler is really good at optimizing that loop out.
You could use a uint32_t instead of an int and then you would know up front the size.
You could also use sizeof(int) to get the size of an int in bytes and multiply by 8. I haven't seen an environment where a byte is defined to be anything other than 8 bits, but the standard does seem to allow for it, saying the size is implementation-defined.

In-place integer multiplication

I'm writing a program (in C) in which I try to calculate powers of big numbers in as short a time as possible. The numbers are represented as vectors of digits, so all operations have to be written by hand.
The program would be much faster without all the allocations and deallocations of intermediate results. Is there any algorithm for doing integer multiplication in place? For example, the function

void BigInt_Times(BigInt *a, const BigInt *b);

would place the result of the multiplication of a and b inside a, without using an intermediate value.
Here, muln() is an n × n → 2n in-place multiplication for unsigned integers: dst holds the n input bytes on entry plus n extra bytes for the product. You can adjust it to operate with 32-bit or 64-bit "digits" instead of 8-bit ones. The modulo operator is left in for clarity.
muln2() is an n × n → n in-place multiplication (as hinted here), also operating on 8-bit "digits".
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <limits.h>

typedef unsigned char uint8;
typedef unsigned short uint16;
#if UINT_MAX >= 0xFFFFFFFF
typedef unsigned uint32;
#else
typedef unsigned long uint32;
#endif
typedef unsigned uint;

void muln(uint8* dst/* n bytes + n extra bytes for product */,
          const uint8* src/* n bytes */,
          uint n)
{
    uint c1, c2;
    memset(dst + n, 0, n);
    for (c1 = 0; c1 < n; c1++)
    {
        uint8 carry = 0;
        for (c2 = 0; c2 < n; c2++)
        {
            uint16 p = dst[c1] * src[c2] + carry + dst[(c1 + n + c2) % (2 * n)];
            dst[(c1 + n + c2) % (2 * n)] = (uint8)(p & 0xFF);
            carry = (uint8)(p >> 8);
        }
        dst[c1] = carry;
    }
    for (c1 = 0; c1 < n; c1++)
    {
        uint8 t = dst[c1];
        dst[c1] = dst[n + c1];
        dst[n + c1] = t;
    }
}

void muln2(uint8* dst/* n bytes */,
           const uint8* src/* n bytes */,
           uint n)
{
    uint c1, c2;
    if (n >= 0xFFFF) abort();
    for (c1 = n - 1; c1 != ~0u; c1--)
    {
        uint16 s = 0;
        uint32 p = 0; // p must be able to store ceil(log2(n)) + 2*8 bits
        for (c2 = c1; c2 != ~0u; c2--)
        {
            p += dst[c2] * src[c1 - c2];
        }
        dst[c1] = (uint8)(p & 0xFF);
        for (c2 = c1 + 1; c2 < n; c2++)
        {
            p >>= 8;
            s += dst[c2] + (uint8)(p & 0xFF);
            dst[c2] = (uint8)(s & 0xFF);
            s >>= 8;
        }
    }
}

int main(void)
{
    uint8 a[4] = { 0xFF, 0xFF, 0x00, 0x00 };
    uint8 b[2] = { 0xFF, 0xFF };
    printf("0x%02X%02X * 0x%02X%02X = ", a[1], a[0], b[1], b[0]);
    muln(a, b, 2);
    printf("0x%02X%02X%02X%02X\n", a[3], a[2], a[1], a[0]);
    a[0] = -2; a[1] = -1;
    b[0] = -3; b[1] = -1;
    printf("0x%02X%02X * 0x%02X%02X = ", a[1], a[0], b[1], b[0]);
    muln2(a, b, 2);
    printf("0x%02X%02X\n", a[1], a[0]);
    return 0;
}
Output:
0xFFFF * 0xFFFF = 0xFFFE0001
0xFFFE * 0xFFFD = 0x0006
I think this is the best we can do in-place. One thing I don't like about muln2() is that it has to accumulate bigger intermediate products and then propagate a bigger carry.
Well, the standard algorithm consists of multiplying every digit (word) of 'a' by every digit of 'b' and summing the products into the appropriate places in the result. The i-th digit of a thus contributes to every digit from i to i+n of the result. So in order to do this in place, you need to calculate the output digits from most significant down to least significant. This is a little bit trickier than going from least to most, but not much...
It doesn't sound like you really need an algorithm. Rather, you need better use of the language's features.
Why not just create the function you indicated in your question? Use it and enjoy! (The function would likely end up returning a reference to a as its result.)
Typically, big-int representations vary in length depending on the value represented; in general, the result is going to be longer than either operand. In particular, for multiplication, the size of the resulting representation is roughly the sum of the sizes of the arguments.
If you are certain that memory management is truly the bottleneck for your particular platform, you might consider implementing a multiply function which updates a third value. In terms of your C-style function prototype above:
void BigInt_Times_Update(const BigInt* a, const BigInt* b, BigInt* target);
That way, you can handle memory management in the same way C++ std::vector<> containers do: your update target only needs to reallocate its heap data when the existing size is too small.
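For illustration, a minimal sketch of that update-target pattern with 32-bit digits (the BigInt fields d, len, and cap are hypothetical, error handling is omitted, and target must not alias a or b):

#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned *d;   /* digits, least significant first (assumes 32-bit unsigned) */
    size_t len;    /* digits in use */
    size_t cap;    /* digits allocated */
} BigInt;

void BigInt_Times_Update(const BigInt *a, const BigInt *b, BigInt *target) {
    size_t need = a->len + b->len;       /* upper bound on product length */
    if (target->cap < need) {            /* reallocate only when too small */
        target->d = realloc(target->d, need * sizeof *target->d);
        target->cap = need;
    }
    memset(target->d, 0, need * sizeof *target->d);
    for (size_t i = 0; i < a->len; i++) { /* schoolbook multiply */
        unsigned carry = 0;
        for (size_t j = 0; j < b->len; j++) {
            unsigned long long p = (unsigned long long)a->d[i] * b->d[j]
                                 + target->d[i + j] + carry;
            target->d[i + j] = (unsigned)p;
            carry = (unsigned)(p >> 32);
        }
        target->d[i + b->len] = carry;   /* this slot is still zero here */
    }
    target->len = need;                  /* caller may trim leading zero digits */
}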
