NEON implementation in ARM

I am a beginner with NEON and wanted to optimize the following code. While it compiles and produces the desired output, I don't see any improvement. AFAIK NEON is helpful for operations on contiguous blocks of data, so I was hoping for some improvement in execution time and cycles. What am I doing wrong?
I'm working with gcc on Ubuntu 12.04 at the -O3 optimization level.
Normal C implementation:
for(i= 0;i<9215;i++)
{
Z[i] = (L[i]>0)?0:1;
}
NEON form:
for(i=0;i<9215;i+=4)
{
int32x4_t l_N = vld1q_s32(&L[i]);
uint32x4_t mask_n=vcltq_s32(l_N,zero_N);
int32x4_t z_n = vbslq_s32(mask_n,one_N,zero_N);
vst1q_s32(&Z[i],z_n);
}

Problems:
You are using a very inefficient algorithm for the computation inside the loop
Your routine suffers from heavy pipeline interlocks, instruction by instruction
void isNonNatural(int32_t * pDst, int32_t *pSrc, int n)
{
int32x4_t vec;
const int32x4_t one = vdupq_n_s32(1);
int32_t a;
unsigned int i;
if (n >= 4)
{
n -= 4;
while (1) {
do {
n -= 4;
vec = vld1q_s32(pSrc); pSrc += 4;
vec = vqsubq_s32(vec, one);
vec = (int32x4_t) vshrq_n_u32((uint32x4_t) vec, 31);
vst1q_s32(pDst, vec); pDst += 4;
} while (n >= 0);
if (n <= -4) return;
// dealing with residuals
pSrc += n; // rewind pointers
pDst += n;
} // iterate for one last time
}
for (i = 0; i < n; ++i) {
a = *pSrc++;
if (a > 0) a = 0; else a = 1;
*pDst++ = a;
}
}
This function above should be somewhat faster than your implementation.
A saturating subtraction by 1 is done so that 0 becomes -1 while 0x80000000 remains 0x80000000.
The elements get shifted by 31 bits so that only the sign bit remains.
If you can live with 0xffffffff instead of 1, you can leave out the typecasting and use vshrq_n_s32 instead. It won't be any faster though.
Pay attention to the residual management.
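For reference, here is a scalar sketch of the same trick (the function name is mine, purely for illustration): a saturating subtract of 1 followed by keeping only the sign bit gives exactly the 0/1 result the question asks for.
#include <stdint.h>

// Scalar model of the vqsubq_s32 + vshrq_n_u32 trick used above.
static inline int32_t is_non_natural_scalar(int32_t x)
{
    int32_t t = (x == INT32_MIN) ? INT32_MIN : x - 1; // saturating x - 1
    return (int32_t)((uint32_t)t >> 31);              // 1 if x <= 0, else 0
}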
Programming NEON is like driving a big truck. You shouldn't drive it like a compact car.
While NEON can compute multiple data elements at once, mostly in a single cycle, it has higher instruction latencies, usually 3~4 cycles. In other words, in the implementation above each instruction has to wait that long for the previous one to return its result.
Virtually the only way to avoid this is unrolling, and a deep one.
void isNonNatural_unroll(int32_t * pDst, int32_t *pSrc, int n)
{
int32x4_t vec1, vec2, vec3, vec4;
const int32x4_t one = vdupq_n_s32(1);
int32_t a;
unsigned int i;
if (n >= 16)
{
n -= 16;
while (1) {
do {
n -= 16;
vec1 = vld1q_s32(pSrc); pSrc += 4;
vec2 = vld1q_s32(pSrc); pSrc += 4;
vec3 = vld1q_s32(pSrc); pSrc += 4;
vec4 = vld1q_s32(pSrc); pSrc += 4;
vec1 = vqsubq_s32(vec1, one);
vec2 = vqsubq_s32(vec2, one);
vec3 = vqsubq_s32(vec3, one);
vec4 = vqsubq_s32(vec4, one);
vec1 = (int32x4_t) vshrq_n_u32((uint32x4_t) vec1, 31);
vec2 = (int32x4_t) vshrq_n_u32((uint32x4_t) vec2, 31);
vec3 = (int32x4_t) vshrq_n_u32((uint32x4_t) vec3, 31);
vec4 = (int32x4_t) vshrq_n_u32((uint32x4_t) vec4, 31);
vst1q_s32(pDst, vec1); pDst += 4;
vst1q_s32(pDst, vec2); pDst += 4;
vst1q_s32(pDst, vec3); pDst += 4;
vst1q_s32(pDst, vec4); pDst += 4;
} while (n >= 0);
if (n <= -16) return;
// dealing with residuals
pSrc += n; // rewind pointers
pDst += n;
} // iterate for one last time
}
if (n & 8)
{
vec1 = vld1q_s32(pSrc); pSrc += 4;
vec2 = vld1q_s32(pSrc); pSrc += 4;
vec1 = vqsubq_s32(vec1, one);
vec2 = vqsubq_s32(vec2, one);
vec1 = (int32x4_t) vshrq_n_u32((uint32x4_t) vec1, 31);
vec2 = (int32x4_t) vshrq_n_u32((uint32x4_t) vec2, 31);
vst1q_s32(pDst, vec1); pDst += 4;
vst1q_s32(pDst, vec2); pDst += 4;
}
if (n & 4)
{
vec1 = vld1q_s32(pSrc); pSrc += 4;
vec1 = vqsubq_s32(vec1, one);
vec1 = (int32x4_t) vshrq_n_u32((uint32x4_t) vec1, 31);
vst1q_s32(pDst, vec1); pDst += 4;
}
n &= 3;
for (i = 0; i < n; ++i) {
a = *pSrc++;
if (a > 0) a = 0; else a = 1;
*pDst++ = a;
}
}
Now this one should be a lot faster than the previous ones since virtually all the latencies are hidden (more than four times as fast), provided the pathetic compilers don't mess it up.
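A usage sketch with the arrays from the question (Z, L and the 9215 element count are the question's own):
isNonNatural_unroll(Z, L, 9215);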


Divide 64-bit integers as though the dividend is shifted left 64 bits, without having 128-bit types

Apologies for the confusing title. I'm not sure how to better describe what I'm trying to accomplish. I'm essentially trying to do the reverse of
getting the high half of a 64-bit multiplication in C for platforms where
int64_t divHi64(int64_t dividend, int64_t divisor) {
return ((__int128)dividend << 64) / (__int128)divisor;
}
isn't possible due to lacking support for __int128.
This can be done without a multi-word division
Suppose we want to compute ⌊2⁶⁴·x/y⌋. We can transform the expression like this:
2⁶⁴·x/y = (⌊2⁶⁴/y⌋ + (2⁶⁴ mod y)/y)·x = ⌊2⁶⁴/y⌋·x + ((2⁶⁴ mod y)·x)/y
The first term is trivially done as ((-y)/y + 1)*x, as per this question: How to compute 2⁶⁴/n in C? (in unsigned 64-bit arithmetic -y equals 2⁶⁴ − y, so (-y)/y + 1 = ⌊2⁶⁴/y⌋).
The second term is equivalent to ((2⁶⁴ % y)/y)·x and is a little bit trickier. I've tried various ways, but all of them need 128-bit multiplication and 128/64-bit division if using only integer operations. That can be done using the algorithms to calculate MulDiv64(a, b, c) = a*b/c in the questions below:
Most accurate way to do a combined multiply-and-divide operation in 64-bit?
How to multiply a 64 bit integer by a fraction in C++ while minimizing error?
(a * b) / c MulDiv and dealing with overflow from intermediate multiplication
How can I multiply and divide 64-bit ints accurately?
However they may be slow, and if you have those functions you can calculate the whole expression more easily, like MulDiv64(x, UINT64_MAX, y) + x/y + something, without messing with the above transformation.
Using long double seems to be the easiest way if it has 64 bits of precision or more. So now it can be done with (2⁶⁴ % y)/(long double)y*x:
uint64_t divHi64(uint64_t x, uint64_t y) {
uint64_t mod_y = UINT64_MAX % y + 1;
uint64_t result = ((-y)/y + 1)*x;
if (mod_y != y)
result += (uint64_t)((mod_y/(long double)y)*x);
return result;
}
The overflow check was omitted for simplification. A slight modification will be needed if you need signed division
If you're targeting 64-bit Windows with MSVC, which doesn't have __int128, there is now a 128-bit/64-bit divide intrinsic which simplifies the job significantly without a 128-bit integer type. You still need to handle overflow, though, because the div instruction throws an exception in that case:
#include <stdint.h>
#include <errno.h>
#include <intrin.h>  // _umul128, _udiv128 (MSVC)

uint64_t divHi64(uint64_t x, uint64_t y) {
uint64_t high, remainder;
uint64_t low = _umul128(UINT64_MAX, y, &high);
if (x <= high /* && 0 <= low */)
return _udiv128(x, 0, y, &remainder);
// overflow case
errno = EOVERFLOW;
return 0;
}
The overflow checking above can be simplified to checking whether x < y, because if x >= y then the result will overflow.
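For illustration, a minimal sketch of that simplified check (assuming the same MSVC intrinsics as above; the function name is mine):
#include <stdint.h>
#include <errno.h>
#include <intrin.h>  // _udiv128 (MSVC, x64)

uint64_t divHi64_simple(uint64_t x, uint64_t y) {
    uint64_t remainder;
    if (x < y)             // quotient of the 128-bit value (x:0) / y fits in 64 bits
        return _udiv128(x, 0, y, &remainder);
    errno = EOVERFLOW;     // x >= y would overflow the 64-bit quotient
    return 0;
}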
See also
Efficient Multiply/Divide of two 128-bit Integers on x86 (no 64-bit)
Efficient computation of 2**64 / divisor via fast floating-point reciprocal
Exhaustive tests on 16/16-bit division show that my solution works correctly for all cases. However you do need double even though float has more than 16 bits of precision, otherwise occasionally a result that is one too small will be returned. It may be fixed by adding an epsilon value before truncating: (uint64_t)((mod_y/(long double)y)*x + epsilon). That means you'll need __float128 (or the -m128bit-long-double option) in gcc for precise 64/64-bit output if you don't correct the result with epsilon. However that type is available on 32-bit targets, unlike __int128 which is supported only on 64-bit targets, so life will be a bit easier. Of course you can use the function as-is if just a very close result is needed.
Below is the code I've used for verifying
#include <thread>
#include <iostream>
#include <limits>
#include <climits>
#include <cstdint>
#include <mutex>
std::mutex print_mutex;
#define MAX_THREAD 8
#define NUM_BITS 27
#define CHUNK_SIZE (1ULL << NUM_BITS)
// typedef uint32_t T;
// typedef uint64_t T2;
// typedef double D;
typedef uint64_t T;
typedef unsigned __int128 T2; // the type twice as wide as T
typedef long double D;
// typedef __float128 D;
const D epsilon = 1e-14;
T divHi(T x, T y) {
T mod_y = std::numeric_limits<T>::max() % y + 1;
T result = ((-y)/y + 1)*x;
if (mod_y != y)
result += (T)((mod_y/(D)y)*x + epsilon);
return result;
}
void testdiv(T midpoint)
{
T begin = midpoint - CHUNK_SIZE/2;
T end = midpoint + CHUNK_SIZE/2;
for (T i = begin; i != end; i++)
{
T x = i & ((1 << NUM_BITS/2) - 1);
T y = CHUNK_SIZE/2 - (i >> NUM_BITS/2);
// if (y == 0)
// continue;
auto q1 = divHi(x, y);
T2 q2 = ((T2)x << sizeof(T)*CHAR_BIT)/y;
if (q2 != (T)q2)
{
// std::lock_guard<std::mutex> guard(print_mutex);
// std::cout << "Overflowed: " << x << '&' << y << '\n';
continue;
}
else if (q1 != q2)
{
std::lock_guard<std::mutex> guard(print_mutex);
std::cout << x << '/' << y << ": " << q1 << " != " << (T)q2 << '\n';
}
}
std::lock_guard<std::mutex> guard(print_mutex);
std::cout << "Done testing [" << begin << ", " << end << "]\n";
}
uint16_t divHi16(uint32_t x, uint32_t y) {
uint32_t mod_y = std::numeric_limits<uint16_t>::max() % y + 1;
int result = ((((1U << 16) - y)/y) + 1)*x;
if (mod_y != y)
result += (mod_y/(double)y)*x;
return result;
}
void testdiv16(uint32_t begin, uint32_t end)
{
for (uint32_t i = begin; i != end; i++)
{
uint32_t y = i & 0xFFFF;
if (y == 0)
continue;
uint32_t x = i & 0xFFFF0000;
uint32_t q2 = x/y;
if (q2 > 0xFFFF) // overflowed
continue;
uint16_t q1 = divHi16(x >> 16, y);
if (q1 != q2)
{
std::lock_guard<std::mutex> guard(print_mutex);
std::cout << x << '/' << y << ": " << q1 << " != " << q2 << '\n';
}
}
}
int main()
{
std::thread t[MAX_THREAD];
for (int i = 0; i < MAX_THREAD; i++)
t[i] = std::thread(testdiv, std::numeric_limits<T>::max()/MAX_THREAD*i);
for (int i = 0; i < MAX_THREAD; i++)
t[i].join();
std::thread t2[MAX_THREAD];
constexpr uint32_t length = std::numeric_limits<uint32_t>::max()/MAX_THREAD;
uint32_t begin, end = length;
for (int i = 0; i < MAX_THREAD - 1; i++)
{
begin = end;
end += length;
t2[i] = std::thread(testdiv16, begin, end);
}
t2[MAX_THREAD - 1] = std::thread(testdiv16, end, UINT32_MAX);
for (int i = 0; i < MAX_THREAD; i++)
t2[i].join();
std::cout << "Done\n";
}

How to mix two bitmaps with AVX2 with 80-20%?

I have 2 bitmaps. I want to mix them in an 80:20 proportion, so I simply multiply the pixel values by 0.8 and 0.2. The code works fine written in C (as a for loop), but using AVX2 instructions results in a bad output image.
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
#define ARRSIZE 5992826
void main(void){
FILE *bmp = fopen("filepath1", "rb"),
*bmpp = fopen("filepath2", "rb"),
*write = fopen("output", "wb");
unsigned char *a = aligned_alloc(32, ARRSIZE),
*b = aligned_alloc(32, ARRSIZE),
*c = aligned_alloc(32, ARRSIZE);
fread(c, 1, 122, bmp);
rewind(bmp);
fread(a, 1, ARRSIZE, bmp);
fread(b, 1, ARRSIZE, bmpp);
__m256i mm_a, mm_b;
__m256d mm_two = _mm256_set1_pd(2),
mm_eight = _mm256_set1_pd(8);
__m256d mm_c, mm_d,
mm_ten = _mm256_set1_pd(10.0);
int i = 122;
for(; i < ARRSIZE; i+=32){
// c[i] = ((a[i] * 0.8) + (b[i] * 0.2));
mm_a = _mm256_loadu_si256((__m256i *)&(a[i]));
mm_b = _mm256_loadu_si256((__m256i *)&(b[i]));
mm_c = _mm256_div_pd((__m256d)mm_a, mm_ten);
mm_d = _mm256_div_pd((__m256d)mm_b, mm_ten);
mm_a = (__m256i)_mm256_floor_pd(_mm256_mul_pd(mm_c, mm_eight));
mm_b = (__m256i)_mm256_floor_pd(_mm256_mul_pd(mm_d, mm_two));
mm_a = _mm256_add_epi8(mm_a, mm_b);
_mm256_storeu_si256((__m256i *)&(c[i]), mm_a);
}
fwrite(c, 1, ARRSIZE, write);
fclose(bmp);
fclose(bmpp);
fclose(write);
free(a);
free(b);
free(c);
}
A problem with the code that you had is that casting between vector types is not a value-preserving conversion, it is a reinterpretation. So (__m256d)mm_a actually means "take these 32 bytes and interpret them as 4 doubles". That can be OK, but if the data is packed RGB888 then reinterpreting it as doubles is not good.
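To illustrate the difference, a standalone sketch (not a fix for the whole loop; the pointer name is arbitrary): a cast merely relabels the bits, while a real conversion widens each value.
#include <immintrin.h>
#include <stdint.h>

// 'src' is assumed to point at at least 32 readable bytes.
void cast_vs_convert(const uint8_t *src)
{
    __m256i bytes = _mm256_loadu_si256((const __m256i *)src);
    // Reinterpretation: the same 32 bytes read as 4 doubles -- garbage for pixel data.
    __m256d reinterpreted = _mm256_castsi256_pd(bytes);
    // Value-preserving conversion: zero-extend 4 bytes to 4 ints, then to 4 doubles.
    __m128i four_ints = _mm_cvtepu8_epi32(_mm_loadu_si128((const __m128i *)src));
    __m256d converted = _mm256_cvtepi32_pd(four_ints);
    (void)reinterpreted; (void)converted;
}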
Proper conversions could be used, but using floating point arithmetic (especially double precision) for this is overkill. Using smaller types makes more of them fit in a vector so that is usually faster since more items can be worked on with an instruction.
Also the 122 byte header should not be put into the aligned arrays, its presence there immediately unaligns the position of the actual pixel data. It can be written to the output file separately.
For example, one strategy for this is to widen to 16bit, use _mm256_mulhi_epu16 to scale by approximately 80% and approximately 20%, add them with _mm256_add_epi16, then narrow to 8bit again. The unpacking to 16bit and later packing back to 8bit works a bit strangely with 256bit vectors, think of it as 2x the 128bit operation side by side. To prevent premature truncation, the 8bit source data can be unpacked with a free shift left by 8, putting the data byte in the high byte of the corresponding word. That way the multiply-high will create 16bit intermediate results, instead of truncating them to 8bit immediately, this way we can round after doing the addition which is more proper (this does cost an extra shift, and optionally an add). For example like this (not tested):
const uint16_t scale_a = (uint16_t)(0x10000 * 0.8);
const uint16_t scale_b = (uint16_t)(0x10000 - scale_a);
__m256i roundoffset = _mm256_set1_epi16(0x80);
__m256i zero = _mm256_setzero_si256();
for(int i = 0; i < ARRSIZE; i += 32) {
// c[i] = ((a[i] * 0.8) + (b[i] * 0.2));
// c[i] = ((a[i] << 8) * scale_a) + ((b[i] << 8) * scale_b) >> 7;
__m256i raw_a = _mm256_loadu_si256((__m256i *)&(a[i]));
__m256i raw_b = _mm256_loadu_si256((__m256i *)&(b[i]));
__m256i data_al = _mm256_unpacklo_epi8(zero, raw_a);
__m256i data_bl = _mm256_unpacklo_epi8(zero, raw_b);
__m256i data_ah = _mm256_unpackhi_epi8(zero, raw_a);
__m256i data_bh = _mm256_unpackhi_epi8(zero, raw_b);
__m256i scaled_al = _mm256_mulhi_epu16(data_al, _mm256_set1_epi16(scale_a));
__m256i scaled_bl = _mm256_mulhi_epu16(data_bl, _mm256_set1_epi16(scale_b));
__m256i scaled_ah = _mm256_mulhi_epu16(data_ah, _mm256_set1_epi16(scale_a));
__m256i scaled_bh = _mm256_mulhi_epu16(data_bh, _mm256_set1_epi16(scale_b));
__m256i suml = _mm256_add_epi16(scaled_al, scaled_bl);
__m256i sumh = _mm256_add_epi16(scaled_ah, scaled_bh);
__m256i roundedl = _mm256_srli_epi16(_mm256_add_epi16(suml, roundoffset), 8);
__m256i roundedh = _mm256_srli_epi16(_mm256_add_epi16(sumh, roundoffset), 8);
__m256i packed = _mm256_packus_epi16(roundedl, roundedh);
_mm256_storeu_si256((__m256i *)&(c[i]), packed);
}
There are quite a lot of shuffle operations in it, which limit the throughput to one iteration every 5 cycles (in the absence of other limiters), which is roughly 1 pixel (as output) per cycle.
A different strategy could be to use _mm256_maddubs_epi16, with a lower-precision approximation of the blend factors. It treats its second operand as signed bytes and does signed saturation, so this time only a 7-bit approximation of the scales fits. Since it operates on 8bit data there is less unpacking, but there is still some unpacking since it requires the data from both images to be interleaved. Maybe like this (also not tested):
const uint8_t scale_a = (uint8_t)(0x80 * 0.8);
const uint8_t scale_b = (uint8_t)(0x80 - scale_a);
__m256i scale = _mm256_set1_epi16((scale_b << 8) | scale_a);
__m256i roundoffset = _mm256_set1_epi16(0x80);
for(int i = 0; i < ARRSIZE; i += 32) {
// c[i] = ((a[i] * 0.8) + (b[i] * 0.2));
// c[i] = (a[i] * scale_a) + (b[i] * scale_b) >> 7;
__m256i raw_a = _mm256_loadu_si256((__m256i *)&(a[i]));
__m256i raw_b = _mm256_loadu_si256((__m256i *)&(b[i]));
__m256i data_l = _mm256_unpacklo_epi8(raw_a, raw_b);
__m256i data_h = _mm256_unpackhi_epi8(raw_a, raw_b);
__m256i blended_l = _mm256_maddubs_epi16(data_l, scale);
__m256i blended_h = _mm256_maddubs_epi16(data_h, scale);
__m256i roundedl = _mm256_srli_epi16(_mm256_add_epi16(blended_l, roundoffset), 7);
__m256i roundedh = _mm256_srli_epi16(_mm256_add_epi16(blended_h, roundoffset), 7);
__m256i packed = _mm256_packus_epi16(roundedl, roundedh);
_mm256_storeu_si256((__m256i *)&(c[i]), packed);
}
With only 3 shuffles, perhaps the throughput could reach 1 iteration per 3 cycles, that would be almost 1.8 pixels per cycle.
Hopefully there are better ways to do it. Neither of these approaches is close to maxing out on multiplications, which seems like it should be the goal. I don't know how to get there though.
Another strategy is to use several rounds of averaging to get close to the desired ratio, but only approximately (13/16 : 3/16 here, rather than 80:20). Maybe something like this (not tested):
for(int i = 0; i < ARRSIZE; i += 32) {
// c[i] = round_somehow((a[i] * 0.8125) + (b[i] * 0.1875));
__m256i raw_a = _mm256_loadu_si256((__m256i *)&(a[i]));
__m256i raw_b = _mm256_loadu_si256((__m256i *)&(b[i]));
__m256i mixed_8_8 = _mm256_avg_epu8(raw_a, raw_b);
__m256i mixed_12_4 = _mm256_avg_epu8(raw_a, mixed_8_8);
__m256i mixed_14_2 = _mm256_avg_epu8(raw_a, mixed_12_4);
__m256i mixed_13_3 = _mm256_avg_epu8(mixed_12_4, mixed_14_2);
_mm256_storeu_si256((__m256i *)&(c[i]), mixed_13_3);
}
But _mm256_avg_epu8 rounds up, and maybe it's bad to stack that so many times. There is no "avg round down" instruction, but avg_down(a, b) == ~avg_up(~a, ~b). That does not result in a huge mess of complements because most of them cancel each other. If there is still rounding up, it makes sense to leave that for the last operation. Always rounding down saves a XOR though. Maybe something like this (not tested):
__m256i ones = _mm256_set1_epi8(-1);
for(int i = 0; i < ARRSIZE; i += 32) {
// c[i] = round_somehow((a[i] * 0.8125) + (b[i] * 0.1875));
__m256i raw_a = _mm256_loadu_si256((__m256i *)&(a[i]));
__m256i raw_b = _mm256_loadu_si256((__m256i *)&(b[i]));
__m256i inv_a = _mm256_xor_si256(ones, raw_a);
__m256i inv_b = _mm256_xor_si256(ones, raw_b);
__m256i mixed_8_8 = _mm256_avg_epu8(inv_a, inv_b);
__m256i mixed_12_4 = _mm256_avg_epu8(inv_a, mixed_8_8);
__m256i mixed_14_2 = _mm256_avg_epu8(inv_a, mixed_12_4);
__m256i mixed_13_3 = _mm256_avg_epu8(_mm256_xor_si256(mixed_12_4, ones),
_mm256_xor_si256(mixed_14_2, ones));
_mm256_storeu_si256((__m256i *)&(c[i]), mixed_13_3);
}

C - find size of initialized data in an integer variable

I need the size of the initialized data stored in an integer variable.
Suppose:
u32 var = 0x0; should return 0
u32 var = 0x12; should return 1
u32 var = 0x1234; should return 2
u32 var = 0x123456; should return 3
u32 var = 0x12345678; should return 4
A log2(x) will give you the exponent of a binary value. Some C implementations have this function already built in. If not, there are some alternatives here: How to write log base(2) in c/c++
The resulting exponent can be divided and rounded in order to give the values you need.
A first attempt (untested) is:
int byteCount(const int x)
{
if (x == 0) return 0; /* Avoid error */
return (int)trunc((log10(x)/log10(2))/8+1);
}
UPDATE:
It seems my code is being taken literally. Here is an optimized version:
int byteCount(const u32 x)
{
if (x == 0) return 0; /* Avoid error */
return (int)trunc((log10(x)/0.301029995663981)/8+1);
}
Do you need to count the number of non-zero bytes?
u8 countNonZeroBytes(u32 n) {
u8 result = n == 0 ? 0 : 1;
while (n >> 8 != 0) {
result++;
n = n >> 8;
}
return result;
}
This should give you the answer as per your requirement.
u8 CountNonZeroBytes(u32 n) {
u32 mask = 0xFF;
u8 i, result = 0;
for (i = 0; i < sizeof(n); i++) {
if (mask & n)
result++;
mask = mask << 8;
}
return result;
}
Here is a version of the "leading zeroes" approach to log2 that doesn't use floating point. The optimizer will do loop unrolling, so it's equivalent to the "four compare" version. It is 4x faster than the floating point version.
u32
bytecnt(u32 val)
{
int bitno;
u32 msk;
u32 bycnt;
bycnt = 0;
for (bitno = 24; bitno >= 0; bitno -= 8) {
msk = 0xFF << bitno;
if (val & msk) {
bycnt = bitno / 8;
bycnt += 1;
break;
}
}
return bycnt;
}
Here is a test program that compares the two algorithms [Note that I'm using Jaime's floating point version for comparison]:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
typedef unsigned int u32;
#define RATIO \
do { \
if (tvslow > tvfast) \
ratio = tvslow / tvfast; \
else \
ratio = tvfast / tvslow; \
printf("%.3fx\n",ratio); \
} while (0)
int opt_f;
// _tvgetf -- get timestamp
double
_tvgetf(void)
{
struct timespec ts;
double val;
#if 1
clock_gettime(CLOCK_REALTIME,&ts);
#else
clock_gettime(CLOCK_MONOTONIC_RAW,&ts);
#endif
val = ts.tv_nsec;
val /= 1e9;
val += ts.tv_sec;
return val;
}
u32
bytecnt(u32 val)
{
int bitno;
u32 msk;
u32 bycnt;
bycnt = 0;
for (bitno = 24; bitno >= 0; bitno -= 8) {
msk = 0xFF << bitno;
if (val & msk) {
bycnt = bitno / 8;
bycnt += 1;
break;
}
}
return bycnt;
}
u32
bytecnt2(u32 val)
{
u32 bycnt;
do {
if (val & (0xFF << 24)) {
bycnt = 4;
break;
}
if (val & (0xFF << 16)) {
bycnt = 3;
break;
}
if (val & (0xFF << 8)) {
bycnt = 2;
break;
}
if (val & (0xFF << 0)) {
bycnt = 1;
break;
}
bycnt = 0;
} while (0);
return bycnt;
}
int byteCount(const int x)
{
if (x == 0) return 0; /* Avoid error */
return (int)trunc((log10(x)/log10(2))/8+1);
}
u32 byteCount2(u32 x)
{
if (x == 0) return 0; /* Avoid error */
return (u32)trunc((log10(x)/log10(2))/8+1);
}
static double l2 = 0;
u32 byteCount3(u32 x)
{
if (x == 0) return 0; /* Avoid error */
return (u32)trunc((log10(x)/l2)/8+1);
}
u32 byteCount4(u32 x)
{
if (x == 0) return 0; /* Avoid error */
return (u32)trunc((log10(x)/0.301029995663981)/8+1);
}
void
test(u32 val)
{
u32 bicnt;
u32 lgcnt;
bicnt = bytecnt(val);
lgcnt = byteCount2(val);
if (bicnt != lgcnt) {
printf("%8.8X: bicnt=%8.8X lgcnt=%8.8X\n",
val,bicnt,lgcnt);
exit(1);
}
}
double
timeit(u32 (*proc)(u32),const char *who)
{
double tvbeg;
double tvdif;
double tvper;
int trycnt;
int trymax;
u32 val;
trymax = 1000000;
trymax *= 10;
tvbeg = _tvgetf();
for (trycnt = 1; trycnt < trymax; ++trycnt) {
for (val = 1; val != 0; val <<= 1)
proc(val);
}
tvdif = _tvgetf();
tvdif -= tvbeg;
tvper = tvdif;
tvper /= trymax;
tvper /= 32;
printf("%.9f %.9f -- %s\n",tvdif,tvper,who);
return tvdif;
}
int
main(int argc,char **argv)
{
char *cp;
u32 val;
double tvfast;
double tvslow;
double ratio;
--argc;
++argv;
l2 = log10(2);
for (; argc > 0; --argc, ++argv) {
cp = *argv;
if (*cp != '-')
break;
switch (cp[1]) {
case 'f':
opt_f = 1;
break;
}
}
// do quick validity test
printf("quick validity test ...\n");
test(0);
for (val = 1; val != 0; val <<= 1)
test(val);
// speed tests
printf("speed tests ...\n");
tvfast = timeit(bytecnt2,"bytecnt2");
tvslow = timeit(bytecnt,"bytecnt");
RATIO;
tvslow = timeit(byteCount2,"byteCount2");
RATIO;
tvslow = timeit(byteCount3,"byteCount3");
RATIO;
tvslow = timeit(byteCount4,"byteCount4");
RATIO;
// do full validity test
if (opt_f) {
for (val = 1; val != 0; ++val)
test(val);
}
return 0;
}
Here is the test output:
quick validity test ...
speed tests ...
1.180300474 0.000000004 -- bytecnt2
1.363260031 0.000000004 -- bytecnt
1.155x
6.759670734 0.000000021 -- byteCount2
5.727x
6.653460503 0.000000021 -- byteCount3
5.637x
6.636421680 0.000000021 -- byteCount4
5.623x
UPDATE:
Responding to a comment from Jaime ("my byteCount proposal is not optimized, for the sake of clarity. For example, you can convert log10(2) into a constant. I think that would have a noticeable increase of performance."):
I've updated the test program to incorporate the changes. But the optimizer had already eliminated the log10(2) in your original code (i.e. there was only one call to log10), so hand-coding it had little to no effect.
Several others did similar loop implementations counting the number of non-zero bytes [which I don't believe is what OP wanted, based on the "sizeof" phrase].
It turns out that the fastest version is also the simplest, most boring, and [IMO] most straightforward. This is something I added: bytecnt2, which is the "four compares" suggested by Paul R.
Doing floating point would be fine with better [or comparable] performance. I'd give it a pass even at 2x [FYI, before getting the results, I assumed that they would be ballpark (e.g. within 10%)].
But, the F.P. implementation is also less straightforward for OP's intended result.
IMO, something that is 4x slower [and more complicated] is a red flag. It's not just a matter of tweaking; it indicates the approach is incorrect: taking an int, converting it into a float [and back again] with some relatively heavyweight functions, for something that simple bit shifting/masking will accomplish.
If you don't mind using gcc extensions, this is a very good solution.
By the way, you should be more clear in your question; your terminology is confusing, and both "size" and "initialized" are used outside their established meaning.
Extra safe/portable version (probably not needed):
size_t leading_zeroes(uint32_t v)
{
if (v == 0) // __builtin_clz is undefined for 0
return sizeof(uint32_t) * CHAR_BIT;
return __builtin_clz(v);
}
size_t trailing_bytes(uint32_t v)
{
return sizeof(uint32_t) - leading_zeroes(v) / CHAR_BIT;
}
Simpler version:
size_t leading_zeroes(uint32_t v)
{
if (v == 0) // __builtin_clz is undefined for 0
return 32;
return __builtin_clz(v);
}
size_t trailing_bytes(uint32_t v)
{
return 4 - leading_zeroes(v) / 8;
}
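A quick usage sketch against the examples from the question (assuming the functions above):
#include <assert.h>

int main(void)
{
    assert(trailing_bytes(0x0)        == 0);
    assert(trailing_bytes(0x12)       == 1);
    assert(trailing_bytes(0x1234)     == 2);
    assert(trailing_bytes(0x123456)   == 3);
    assert(trailing_bytes(0x12345678) == 4);
    return 0;
}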

Fastest sort algorithm for millions of UINT64 RGBZ graphics pixels

I am sorting 10+ million uint64_ts with RGB data from .RAW files and 79% of my C program time is spent in qsort. I am looking for a faster sort for this specific data type.
Being RAW graphical data, the numbers are very random and ~80% unique. No partial sorting or runs of sorted data can be expected. The 4 uint16_ts inside the uint64_t are R, G, B and zero (possibly a small count <= ~20).
I have the simplest comparison function I can think of, using unsigned long longs (you CANNOT just subtract them):
qsort(hpidx, num_pix, sizeof(uint64_t), comp_uint64);
...
int comp_uint64(const void *a, const void *b) {
if(*((uint64_t *)a) > *((uint64_t *)b)) return(+1);
if(*((uint64_t *)a) < *((uint64_t *)b)) return(-1);
return(0);
} // End Comp_uint64().
There was a very interesting "Programming Puzzles & Code Golf" on StackExchange, but they used floats. Then there are QSort, RecQuick, heap, stooge, tree, radix...
The swenson/sort looked interesting but had no (obvious) support for my datatype, uint64_t. And the "quick sort" time was the best. Some sources say the system qsort can be anything, not necessarily "Quick Sort".
A C++ sort bypasses the generic casting of void pointers and realizes great improvements in performance over C. There has to be an optimized method to slam U8s through a 64bit processor at warp speed.
System/compiler info:
I am currently using the GCC that ships with Strawberry Perl:
gcc version 4.9.2 (x86_64-posix-sjlj, built by strawberryperl.com)
Intel 2700K Sandy Bridge CPU, 32 GB DDR3
Windows 7/64 Pro
gcc -D__USE_MINGW_ANSI_STDIO -O4 -ffast-math -m64 -Ofast -march=corei7-avx -mtune=corei7 -Ic:/bin/xxHash-master -Lc:/bin/xxHash-master c:/bin/stddev.c -o c:/bin/stddev.g6.exe
First attempt at a better qsort, QSORT()!
Tried to use Michael Tokarev's inline qsort.
"READY-TO-USE"? From qsort.h documentation
-----------------------------
* Several ready-to-use examples:
*
* Sorting array of integers:
* void int_qsort(int *arr, unsigned n) {
* #define int_lt(a,b) ((*a)<(*b))
* QSORT(int, arr, n, int_lt);
--------------------------------
Change from type "int" to "uint64_t"
compile error on TYPE???
c:/bin/bpbfct.c:586:8: error: expected expression before 'uint64_t'
QSORT(uint64_t, hpidx, num_pix, islt);
I can't find a real, compiling, working example program, just comments with the "general concept"
#define QSORT_TYPE uint64_t
#define islt(a,b) ((*a)<(*b))
uint64_t *QSORT_BASE;
int QSORT_NELT;
hpidx=(uint64_t *) calloc(num_pix+2, sizeof(uint64_t)); // Hash . PIDX
QSORT_BASE = hpidx;
QSORT_NELT = num_pix; // QSORT_LT is function QSORT_LT()
QSORT(uint64_t, hpidx, num_pix, islt);
//QSORT(uint64_t *, hpidx, num_pix, QSORT_LT); // QSORT_LT mal-defined?
//qsort(hpidx, num_pix, sizeof(uint64_t), comp_uint64); // << WORKS
The "ready-to-use" examples use types of int, char * and struct elt. Isn't uint64_t a type?? Try long long
QSORT(long long, hpidx, num_pix, islt);
c:/bin/bpbfct.c:586:8: error: expected expression before 'long'
QSORT(long long, hpidx, num_pix, islt);
Next attempt: RADIXSORT:
Results: RADIX_SORT is RADICAL!
I:\br3\pf.249465>grep "Event" bb12.log | grep -i Sort
<< 1.40 sec average
4) Time=1.411 sec = 49.61%, Event RADIX_SORT , hits=1
4) Time=1.396 sec = 49.13%, Event RADIX_SORT , hits=1
4) Time=1.392 sec = 49.15%, Event RADIX_SORT , hits=1
16) Time=1.414 sec = 49.12%, Event RADIX_SORT , hits=1
I:\br3\pf.249465>grep "Event" bb11.log | grep -i Sort
<< 5.525 sec average = 3.95 times slower
4) Time=5.538 sec = 86.34%, Event QSort , hits=1
4) Time=5.519 sec = 79.41%, Event QSort , hits=1
4) Time=5.519 sec = 79.02%, Event QSort , hits=1
4) Time=5.563 sec = 79.49%, Event QSort , hits=1
4) Time=5.684 sec = 79.83%, Event QSort , hits=1
4) Time=5.509 sec = 79.30%, Event QSort , hits=1
3.94 times faster than whatever sort the out-of-the-box qsort uses!
And, even more importantly, there was actual, working code, not just 80% of what you need given by some guru who assumes you know everything they know and can fill in the other 20%.
Fantastic solution! Thanks Louis Ricci!
I would use Radix Sort with an 8bit radix. For 64bit values a well optimized radix sort will have to iterate over the list 9 times (one to precalculate the counts and offsets and 8 for 64bits/8bits). 9*N time and 2*N space (using a shadow array).
Here's what an optimized radix sort would look like.
typedef union {
struct {
uint32_t c8[256];
uint32_t c7[256];
uint32_t c6[256];
uint32_t c5[256];
uint32_t c4[256];
uint32_t c3[256];
uint32_t c2[256];
uint32_t c1[256];
};
uint32_t counts[256 * 8];
} rscounts_t;
uint64_t * radixSort(uint64_t * array, uint32_t size) {
rscounts_t counts;
memset(&counts, 0, 256 * 8 * sizeof(uint32_t));
uint64_t * cpy = (uint64_t *)malloc(size * sizeof(uint64_t));
uint32_t o8=0, o7=0, o6=0, o5=0, o4=0, o3=0, o2=0, o1=0;
uint32_t t8, t7, t6, t5, t4, t3, t2, t1;
uint32_t x;
// calculate counts
for(x = 0; x < size; x++) {
t8 = array[x] & 0xff;
t7 = (array[x] >> 8) & 0xff;
t6 = (array[x] >> 16) & 0xff;
t5 = (array[x] >> 24) & 0xff;
t4 = (array[x] >> 32) & 0xff;
t3 = (array[x] >> 40) & 0xff;
t2 = (array[x] >> 48) & 0xff;
t1 = (array[x] >> 56) & 0xff;
counts.c8[t8]++;
counts.c7[t7]++;
counts.c6[t6]++;
counts.c5[t5]++;
counts.c4[t4]++;
counts.c3[t3]++;
counts.c2[t2]++;
counts.c1[t1]++;
}
// convert counts to offsets
for(x = 0; x < 256; x++) {
t8 = o8 + counts.c8[x];
t7 = o7 + counts.c7[x];
t6 = o6 + counts.c6[x];
t5 = o5 + counts.c5[x];
t4 = o4 + counts.c4[x];
t3 = o3 + counts.c3[x];
t2 = o2 + counts.c2[x];
t1 = o1 + counts.c1[x];
counts.c8[x] = o8;
counts.c7[x] = o7;
counts.c6[x] = o6;
counts.c5[x] = o5;
counts.c4[x] = o4;
counts.c3[x] = o3;
counts.c2[x] = o2;
counts.c1[x] = o1;
o8 = t8;
o7 = t7;
o6 = t6;
o5 = t5;
o4 = t4;
o3 = t3;
o2 = t2;
o1 = t1;
}
// radix
for(x = 0; x < size; x++) {
t8 = array[x] & 0xff;
cpy[counts.c8[t8]] = array[x];
counts.c8[t8]++;
}
for(x = 0; x < size; x++) {
t7 = (cpy[x] >> 8) & 0xff;
array[counts.c7[t7]] = cpy[x];
counts.c7[t7]++;
}
for(x = 0; x < size; x++) {
t6 = (array[x] >> 16) & 0xff;
cpy[counts.c6[t6]] = array[x];
counts.c6[t6]++;
}
for(x = 0; x < size; x++) {
t5 = (cpy[x] >> 24) & 0xff;
array[counts.c5[t5]] = cpy[x];
counts.c5[t5]++;
}
for(x = 0; x < size; x++) {
t4 = (array[x] >> 32) & 0xff;
cpy[counts.c4[t4]] = array[x];
counts.c4[t4]++;
}
for(x = 0; x < size; x++) {
t3 = (cpy[x] >> 40) & 0xff;
array[counts.c3[t3]] = cpy[x];
counts.c3[t3]++;
}
for(x = 0; x < size; x++) {
t2 = (array[x] >> 48) & 0xff;
cpy[counts.c2[t2]] = array[x];
counts.c2[t2]++;
}
for(x = 0; x < size; x++) {
t1 = (cpy[x] >> 56) & 0xff;
array[counts.c1[t1]] = cpy[x];
counts.c1[t1]++;
}
free(cpy);
return array;
}
EDIT this implementation was based on a JavaScript version Fastest way to sort 32bit signed integer arrays in JavaScript?
Here's the IDEONE for the C radix sort http://ideone.com/JHI0d9
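As a usage note, the call that replaces the qsort line from the question looks like this (hpidx and num_pix are the question's own variables):
// before: qsort(hpidx, num_pix, sizeof(uint64_t), comp_uint64);
hpidx = radixSort(hpidx, num_pix);  // sorts in place and returns the array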
I see a few options, roughly in order of easiest to hardest.
Enable link-time optimization with the -flto switch. This may get the compiler to inline your comparison function. It's too easy not to try.
If LTO has no effect, you can use an inline qsort implementation like Michael Tokarev's inline qsort. This page suggests a 2x improvement, again solely due to the compiler's ability to inline the comparison function.
Use the C++ std::sort. I know your code is in C, but you can make a small module that only sorts and provides a C interface. You're already using a toolchain that has great C++ support.
Try swenson/sort's library. It implements many algorithms so you can use the one that works best on your data. It appears to be inlineable, and they claim to be faster than qsort (see the sketch after this list).
Find another sorting library. Something that can do Louis' Radix Sort is a good suggestion.
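For the swenson/sort option above, here is a hedged sketch of how the question's uint64_t data could plug into that library's macro interface (based on its README; not tested here):
// sort_u64.c -- illustrative only
#include <stdint.h>
#define SORT_NAME u64
#define SORT_TYPE uint64_t
// The comparator must not simply subtract (see the question); return the sign of the comparison.
#define SORT_CMP(x, y) (((x) > (y)) - ((x) < (y)))
#include "sort.h"   // https://github.com/swenson/sort

// Then, with hpidx and num_pix from the question:
//   u64_quick_sort(hpidx, num_pix);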
Note you can also do your comparison with a single branch instead of two. Just find out which is bigger, then subtract.
With some compilers/platforms the following is branch-less and faster, though not much different than OP's original.
int comp_uint64_b(const void *a, const void *b) {
return
(*((uint64_t *)a) > *((uint64_t *)b)) -
(*((uint64_t *)a) < *((uint64_t *)b));
}
Maybe some ?: instead of ifs would make things a tad quicker.
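For completeness, a sketch of that ?: variant (same behavior as comp_uint64, just written with conditionals):
#include <stdint.h>

int comp_uint64_ternary(const void *a, const void *b) {
    const uint64_t va = *(const uint64_t *)a;
    const uint64_t vb = *(const uint64_t *)b;
    return (va > vb) ? +1 : (va < vb) ? -1 : 0;
}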

large integer addition with CUDA

I've been developing a cryptographic algorithm on the GPU and am currently stuck on an algorithm to perform large integer addition. Large integers are represented in the usual way as a bunch of 32-bit words.
For example, we can use one thread to add two 32-bit words. For simplicity, let's assume that the numbers to be added are of the same length and that the number of threads per block equals the number of words. Then:
__global__ void add_kernel(int *C, const int *A, const int *B) {
int x = A[threadIdx.x];
int y = B[threadIdx.x];
int z = x + y;
int carry = (z < x);
/** do carry propagation in parallel somehow ? */
............
z = z + newcarry; // update the resulting words after carry propagation
C[threadIdx.x] = z;
}
I am pretty sure that there is a way to do carry propagation via some tricky reduction procedure but could not figure it out..
I had a look at CUDA thrust extensions but big integer package seems not to be implemented yet.
Perhaps someone can give me a hint how to do that on CUDA ?
You are right, carry propagation can be done via prefix sum computation but it's a bit tricky to define the binary function for this operation and prove that it is associative (needed for parallel prefix sum). As a matter of fact, this algorithm is used (theoretically) in Carry-lookahead adder.
Suppose we have two large integers a[0..n-1] and b[0..n-1].
Then we compute (i = 0..n-1):
s[i] = a[i] + b[i];
carryin[i] = (s[i] < a[i]);
We define two functions:
generate[i] = carryin[i];
propagate[i] = (s[i] == 0xffffffff);
with quite intuitive meaning: generate[i] == 1 means that the carry is generated at
position i while propagate[i] == 1 means that the carry will be propagated from position
(i - 1) to (i + 1). Our goal is to compute the function carryout[0..n-1] used to update the resulting sum s[0..n-1]. carryout can be computed recursively as follows:
carryout[i] = generate[i] OR (propagate[i] AND carryout[i-1])
carryout[0] = 0
Here carryout[i] == 1 if a carry is generated at position i OR it is generated somewhere earlier AND propagated to position i. Finally, we update the resulting sum:
s[i] = s[i] + carryout[i-1]; for i = 1..n-1
carry = carryout[n-1];
Now it is quite straightforward to prove that carryout function is indeed binary associative and hence parallel prefix sum computation applies. To implement this on CUDA, we can merge both flags 'generate' and 'propagate' in a single variable since they are mutually exclusive, i.e.:
cy[i] = (s[i] == -1u ? -1u : 0) | carryin[i];
In other words,
cy[i] = 0xffffffff if propagate[i]
cy[i] = 1 if generate[i]
cy[i] = 0 otherwise
Then, one can verify that the following formula computes prefix sum for carryout function:
cy[i] = max((int)cy[i], (int)cy[k]) & cy[i];
for all k < i. The example code below shows large addition for 2048-word integers. Here I used CUDA blocks with 512 threads:
// add & output carry flag
#define UADDO(c, a, b) \
asm volatile("add.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));
// add with carry & output carry flag
#define UADDC(c, a, b) \
asm volatile("addc.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));
#define WS 32
__global__ void bignum_add(unsigned *g_R, const unsigned *g_A,const unsigned *g_B) {
extern __shared__ unsigned shared[];
unsigned *r = shared;
const unsigned N_THIDS = 512;
unsigned thid = threadIdx.x, thid_in_warp = thid & WS-1;
unsigned ofs, cf;
uint4 a = ((const uint4 *)g_A)[thid],
b = ((const uint4 *)g_B)[thid];
UADDO(a.x, a.x, b.x) // adding 128-bit chunks with carry flag
UADDC(a.y, a.y, b.y)
UADDC(a.z, a.z, b.z)
UADDC(a.w, a.w, b.w)
UADDC(cf, 0, 0) // save carry-out
// memory consumption: 49 * N_THIDS / 64
// use "alternating" data layout for each pair of warps
volatile short *scan = (volatile short *)(r + 16 + thid_in_warp +
49 * (thid / 64)) + ((thid / 32) & 1);
scan[-32] = -1; // put identity element
if(a.x == -1u && a.x == a.y && a.x == a.z && a.x == a.w)
// this indicates that carry will propagate through the number
cf = -1u;
// "Hillis-and-Steele-style" reduction
scan[0] = cf;
cf = max((int)cf, (int)scan[-2]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-4]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-8]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-16]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-32]) & cf;
scan[0] = cf;
int *postscan = (int *)r + 16 + 49 * (N_THIDS / 64);
if(thid_in_warp == WS - 1) // scan leading carry-outs once again
postscan[thid >> 5] = cf;
__syncthreads();
if(thid < N_THIDS / 32) {
volatile int *t = (volatile int *)postscan + thid;
t[-8] = -1; // load identity symbol
cf = t[0];
cf = max((int)cf, (int)t[-1]) & cf;
t[0] = cf;
cf = max((int)cf, (int)t[-2]) & cf;
t[0] = cf;
cf = max((int)cf, (int)t[-4]) & cf;
t[0] = cf;
}
__syncthreads();
cf = scan[0];
int ps = postscan[(int)((thid >> 5) - 1)]; // postscan[-1] equals to -1
scan[0] = max((int)cf, ps) & cf; // update carry flags within warps
cf = scan[-2];
if(thid_in_warp == 0)
cf = ps;
if((int)cf < 0)
cf = 0;
UADDO(a.x, a.x, cf) // propagate carry flag if needed
UADDC(a.y, a.y, 0)
UADDC(a.z, a.z, 0)
UADDC(a.w, a.w, 0)
((uint4 *)g_R)[thid] = a;
}
Note that macros UADDO / UADDC might not be necessary anymore since CUDA 4.0 has corresponding intrinsics (however I am not entirely sure).
Also remark that, though parallel reduction is quite fast, if you need to add several large integers in a row, it might be better to use some redundant representation (which was suggested in comments above), i.e., first accumulate the results of additions in 64-bit words, and then perform one carry propagation at the very end in "one sweep".
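For reference, here is a purely sequential host-side sketch of the generate/propagate scheme described above (function and variable names are illustrative; the CUDA kernel replaces the middle loop with the parallel max-and-AND prefix scan):
#include <stdint.h>
#include <stdlib.h>

// s, a, b are n-word big integers, least significant word first.
void add_bignum_ref(uint32_t *s, const uint32_t *a, const uint32_t *b, size_t n)
{
    uint32_t *cy = malloc(n * sizeof *cy);
    // Phase 1: independent word adds; merged flag per word:
    //   0xffffffff = would propagate an incoming carry (sum is all ones)
    //   1          = generates a carry itself
    //   0          = neither
    for (size_t i = 0; i < n; i++) {
        s[i]  = a[i] + b[i];
        cy[i] = (s[i] == 0xffffffffu ? 0xffffffffu : 0u) | (uint32_t)(s[i] < a[i]);
    }
    // Phase 2: resolve carry-outs from low to high words; the kernel does this
    // step as a parallel prefix scan with cy[i] = max((int)cy[i], (int)cy[i-1]) & cy[i].
    for (size_t i = 1; i < n; i++) {
        int32_t prev = (int32_t)cy[i - 1], cur = (int32_t)cy[i];
        cy[i] = (uint32_t)((prev > cur ? prev : cur) & cur);
    }
    // Phase 3: a resolved flag of +1 in word i-1 means word i receives a carry in
    // (a remaining -1 is an unresolved propagate chain with no carry, so it counts as 0).
    for (size_t i = 1; i < n; i++)
        s[i] += (uint32_t)((int32_t)cy[i - 1] > 0);
    free(cy);
}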
I thought I would post my answer also, in addition to @asm, so this SO question can be a sort of repository of ideas. Similar to @asm, I detect and store the carry condition as well as the "carry-through" condition, i.e. when the intermediate word result is all 1's (0xF...FFF), so that if a carry were to propagate into this word, it would "carry-through" to the next word.
I didn't use any PTX or asm in my code, so I chose to use 64-bit unsigned ints instead of 32-bit, to achieve the 2048x32bit capability, using 1024 threads.
A larger difference from @asm's code is in my parallel carry propagation scheme. I construct a bit-packed array ("carry") where each bit represents the carry condition generated from the independent intermediate 64-bit adds from each of the 1024 threads. I also construct a bit-packed array ("carry_through") where each bit represents the carry_through condition of the individual 64-bit intermediate results. For 1024 threads, this amounts to 1024/64 = 16x64 bit words of shared memory for each bit-packed array, so total shared mem usage is 64+3 32-bit quantities. With these bit-packed arrays, I perform the following to generate a combined propagated carry indicator:
carry = carry | (carry_through ^ ((carry & carry_through) + carry_through));
(note that carry is shifted left by one: carry[i] indicates that the result of a[i-1] + b[i-1] generated a carry)
The explanation is as follows:
1. The bitwise AND of carry and carry_through generates the candidates where a carry will interact with a sequence of one or more carry-through conditions.
2. Adding the result of step 1 to carry_through generates a result whose changed bits represent all the words that will be affected by the propagation of the carry into the carry-through sequence.
3. Taking the exclusive-OR of carry_through and the result from step 2 marks the affected words with a 1 bit.
4. Taking the bitwise OR of the result from step 3 and the ordinary carry indicators gives a combined carry condition, which is then used to update all the intermediate results.
Note that the addition in step 2 requires another multi-word add (for big ints composed of more than 64 words). I believe this algorithm works, and it has passed the test cases I have thrown at it.
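As a small worked illustration of this bit trick (made-up masks, plain host C): a carry entering word 1 while words 1 and 2 are all ones should mark words 1, 2 and 3 for an increment.
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    // bit i of 'carry'         = a carry comes into word i (already shifted left by one)
    // bit i of 'carry_through' = word i of the intermediate sum is all ones
    uint64_t carry         = 0x2;  // carry into word 1
    uint64_t carry_through = 0x6;  // words 1 and 2 are all ones
    uint64_t combined = carry |
        (carry_through ^ ((carry & carry_through) + carry_through));
    printf("0x%llx\n", (unsigned long long)combined);  // prints 0xe: words 1, 2 and 3 get +1
    return 0;
}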
Here is my example code which implements this:
// parallel add of large integers
// requires CC 2.0 or higher
// compile with:
// nvcc -O3 -arch=sm_20 -o paradd2 paradd2.cu
#include <stdio.h>
#include <stdlib.h>
#define MAXSIZE 1024 // the number of 64 bit quantities that can be added
#define LLBITS 64 // the number of bits in a long long
#define BSIZE ((MAXSIZE + LLBITS -1)/LLBITS) // MAXSIZE when packed into bits
#define nTPB MAXSIZE
// define either GPU or GPUCOPY, not both -- for timing
#define GPU
//#define GPUCOPY
#define LOOPCNT 1000
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
// perform c = a + b, for unsigned integers of psize*64 bits.
// all work done in a single threadblock.
// multiple threadblocks are handling multiple separate addition problems
// least significant word is at a[0], etc.
__global__ void paradd(const unsigned size, const unsigned psize, unsigned long long *c, const unsigned long long *a, const unsigned long long *b){
__shared__ unsigned long long carry_through[BSIZE];
__shared__ unsigned long long carry[BSIZE+1];
__shared__ volatile unsigned mcarry;
__shared__ volatile unsigned mcarry_through;
unsigned idx = threadIdx.x + (psize * blockIdx.x);
if ((threadIdx.x < psize) && (idx < size)){
// handle 64 bit unsigned add first
unsigned long long cr1 = a[idx];
unsigned long long lc = cr1 + b[idx];
// handle carry
if (threadIdx.x < BSIZE){
carry[threadIdx.x] = 0;
carry_through[threadIdx.x] = 0;
}
if (threadIdx.x == 0){
mcarry = 0;
mcarry_through = 0;
}
__syncthreads();
if (lc < cr1){
if ((threadIdx.x%LLBITS) != (LLBITS-1))
atomicAdd(&(carry[threadIdx.x/LLBITS]), (2ull<<(threadIdx.x%LLBITS)));
else atomicAdd(&(carry[(threadIdx.x/LLBITS)+1]), 1);
}
// handle carry-through
if (lc == 0xFFFFFFFFFFFFFFFFull)
atomicAdd(&(carry_through[threadIdx.x/LLBITS]), (1ull<<(threadIdx.x%LLBITS)));
__syncthreads();
if (threadIdx.x < ((psize + LLBITS-1)/LLBITS)){
// only 1 warp executing within this if statement
unsigned long long cr3 = carry_through[threadIdx.x];
cr1 = carry[threadIdx.x] & cr3;
// start of sub-add
unsigned long long cr2 = cr3 + cr1;
if (cr2 < cr1) atomicAdd((unsigned *)&mcarry, (2u<<(threadIdx.x)));
if (cr2 == 0xFFFFFFFFFFFFFFFFull) atomicAdd((unsigned *)&mcarry_through, (1u<<threadIdx.x));
if (threadIdx.x == 0) {
unsigned cr4 = mcarry & mcarry_through;
cr4 += mcarry_through;
mcarry |= (mcarry_through ^ cr4);
}
if (mcarry & (1u<<threadIdx.x)) cr2++;
// end of sub-add
carry[threadIdx.x] |= (cr2 ^ cr3);
}
__syncthreads();
if (carry[threadIdx.x/LLBITS] & (1ull<<(threadIdx.x%LLBITS))) lc++;
c[idx] = lc;
}
}
int main() {
unsigned long long *h_a, *h_b, *h_c, *d_a, *d_b, *d_c, *c;
unsigned at_once = 256; // valid range = 1 .. 65535
unsigned prob_size = MAXSIZE ; // valid range = 1 .. MAXSIZE
unsigned dsize = at_once * prob_size;
cudaEvent_t t_start_gpu, t_start_cpu, t_end_gpu, t_end_cpu;
float et_gpu, et_cpu, tot_gpu, tot_cpu;
tot_gpu = 0;
tot_cpu = 0;
if (sizeof(unsigned long long) != (LLBITS/8)) {printf("Word Size Error\n"); return 1;}
if ((c = (unsigned long long *)malloc(dsize * sizeof(unsigned long long))) == 0) {printf("Malloc Fail\n"); return 1;}
cudaHostAlloc((void **)&h_a, dsize * sizeof(unsigned long long), cudaHostAllocDefault);
cudaCheckErrors("cudaHostAlloc1 fail");
cudaHostAlloc((void **)&h_b, dsize * sizeof(unsigned long long), cudaHostAllocDefault);
cudaCheckErrors("cudaHostAlloc2 fail");
cudaHostAlloc((void **)&h_c, dsize * sizeof(unsigned long long), cudaHostAllocDefault);
cudaCheckErrors("cudaHostAlloc3 fail");
cudaMalloc((void **)&d_a, dsize * sizeof(unsigned long long));
cudaCheckErrors("cudaMalloc1 fail");
cudaMalloc((void **)&d_b, dsize * sizeof(unsigned long long));
cudaCheckErrors("cudaMalloc2 fail");
cudaMalloc((void **)&d_c, dsize * sizeof(unsigned long long));
cudaCheckErrors("cudaMalloc3 fail");
cudaMemset(d_c, 0, dsize*sizeof(unsigned long long));
cudaEventCreate(&t_start_gpu);
cudaEventCreate(&t_end_gpu);
cudaEventCreate(&t_start_cpu);
cudaEventCreate(&t_end_cpu);
for (unsigned loops = 0; loops <LOOPCNT; loops++){
//create some test cases
if (loops == 0){
for (int j=0; j<at_once; j++)
for (int k=0; k<prob_size; k++){
int i= (j*prob_size) + k;
h_a[i] = 0xFFFFFFFFFFFFFFFFull;
h_b[i] = 0;
}
h_a[prob_size-1] = 0;
h_b[prob_size-1] = 1;
h_b[0] = 1;
}
else if (loops == 1){
for (int i=0; i<dsize; i++){
h_a[i] = 0xFFFFFFFFFFFFFFFFull;
h_b[i] = 0;
}
h_b[0] = 1;
}
else if (loops == 2){
for (int i=0; i<dsize; i++){
h_a[i] = 0xFFFFFFFFFFFFFFFEull;
h_b[i] = 2;
}
h_b[0] = 1;
}
else {
for (int i = 0; i<dsize; i++){
h_a[i] = (((unsigned long long)lrand48())<<33) + (unsigned long long)lrand48();
h_b[i] = (((unsigned long long)lrand48())<<33) + (unsigned long long)lrand48();
}
}
#ifdef GPUCOPY
cudaEventRecord(t_start_gpu, 0);
#endif
cudaMemcpy(d_a, h_a, dsize*sizeof(unsigned long long), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMemcpy1 fail");
cudaMemcpy(d_b, h_b, dsize*sizeof(unsigned long long), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMemcpy2 fail");
#ifdef GPU
cudaEventRecord(t_start_gpu, 0);
#endif
paradd<<<at_once, nTPB>>>(dsize, prob_size, d_c, d_a, d_b);
cudaCheckErrors("Kernel Fail");
#ifdef GPU
cudaEventRecord(t_end_gpu, 0);
#endif
cudaMemcpy(h_c, d_c, dsize*sizeof(unsigned long long), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMemcpy3 fail");
#ifdef GPUCOPY
cudaEventRecord(t_end_gpu, 0);
#endif
cudaEventSynchronize(t_end_gpu);
cudaEventElapsedTime(&et_gpu, t_start_gpu, t_end_gpu);
tot_gpu += et_gpu;
cudaEventRecord(t_start_cpu, 0);
//also compute result on CPU for comparison
for (int j=0; j<at_once; j++) {
unsigned rc=0;
for (int n=0; n<prob_size; n++){
unsigned i = (j*prob_size) + n;
c[i] = h_a[i] + h_b[i];
if (c[i] < h_a[i]) {
c[i] += rc;
rc=1;}
else {
if ((c[i] += rc) != 0) rc=0;
}
if (c[i] != h_c[i]) {printf("Results mismatch at offset %d, GPU = 0x%llX, CPU = 0x%llX\n", i, h_c[i], c[i]); return 1;}
}
}
cudaEventRecord(t_end_cpu, 0);
cudaEventSynchronize(t_end_cpu);
cudaEventElapsedTime(&et_cpu, t_start_cpu, t_end_cpu);
tot_cpu += et_cpu;
if ((loops%(LOOPCNT/10)) == 0) printf("*\n");
}
printf("\nResults Match!\n");
printf("Average GPU time = %fms\n", (tot_gpu/LOOPCNT));
printf("Average CPU time = %fms\n", (tot_cpu/LOOPCNT));
return 0;
}
