Premultiplying image alpha efficiently

Premultiplying image alpha efficiently - c

Load a 32bit image into a buffer and I then premultiply the color values with the corresponding alpha to use for blending.
The following works but I am wondering if there is a more efficient way of doing this, even if it only results in a good-enough approximation?
image data is a pointer of this type:
typedef struct rgba_pixel
{
uint8_t r;
uint8_t g;
uint8_t b;
uint8_t a;
} rgba_pixel;
rgba_pixel * image_data;
for ( i = 0; i < length; i++ )
{
if ( image_data[i].a == 0 )
image_data[i].r = image_data[i].g = image_data[i].b = 0;
else if ( image_data[i].a < 255 )
{
alpha_factor = image_data[i].a / 255.0;
image_data[i].r = image_data[i].r * alpha_factor;
image_data[i].g = image_data[i].g * alpha_factor;
image_data[i].b = image_data[i].b * alpha_factor;
}
}

Given that your a, r, g and b components are unsigned char, you can improve performance by turning floating point multiplication to integer multiplication and use shr 8 (division by 256) instead of dividing by 255:
for ( i = 0; i < length; i++ )
{
if ( image_data[i].a == 0 )
image_data[i].r = image_data[i].g = image_data[i].b = 0;
else if ( image_data[i].a < 255 )
{
image_data[i].r = (unsigned short)image_data[i].r * image_data[i].a >> 8;
image_data[i].g = (unsigned short)image_data[i].g * image_data[i].a >> 8;
image_data[i].b = (unsigned short)image_data[i].b * image_data[i].a >> 8;
}
}
This will convert 1 fp division and 3 fp multiplications into 3 integer multiplications and 3 bit shifts.
Another improvement which can be done is using union struct for the pixel data:
typedef union rgba_pixel
{
struct {
uint8_t r;
uint8_t g;
uint8_t b;
uint8_t a;
};
uint32_t u32;
} rgba_pixel;
And then assigning zero to r, g and b at once:
//image_data[i].r = image_data[i].g = image_data[i].b = 0;
image_data[i].u32 = 0; //use this instead
According to https://godbolt.org/ with x86-64 gcc 7.2, the latter generates less instructions at -O3. Which of course may or may not be faster in practice.
Another thing to be considered is partial loop unrolling, i.e. processing multiple (for example 4) pixels per loop iteration. If you are guaranteed that your rows are multiples of 4 in width, you do it even without additional checks.

Related

How to do 1024-bit operations using arrays of uint64_t

I am trying to find a way to compute values that are of type uint1024_t (unsigned 1024-bit integer), by defining the 5 basic operations: plus, minus, times, divide, modulus.
The way that I can do that is by creating a structure that will have the following prototype:
typedef struct {
uint64_t chunk[16];
} uint1024_t;
Now since it is complicated to wrap my head around such operations with uint64_t as block size, I have first written some code for manipulating uint8_t. Here is what I came up with:
#define UINT8_HI(x) (x >> 4)
#define UINT8_LO(x) (((1 << 4) - 1) & x)
void uint8_add(uint8_t a, uint8_t b, uint8_t *res, int i) {
uint8_t s0, s1, s2;
uint8_t x = UINT8_LO(a) + UINT8_LO(b);
s0 = UINT8_LO(x);
x = UINT8_HI(a) + UINT8_HI(b) + UINT8_HI(x);
s1 = UINT8_LO(x);
s2 = UINT8_HI(x);
uint8_t result = s0 + (s1 << 4);
uint8_t carry = s2;
res[1 + i] = result;
res[0 + i] = carry;
}
void uint8_multiply(uint8_t a, uint8_t b, uint8_t *res, int i) {
uint8_t s0, s1, s2, s3;
uint8_t x = UINT8_LO(a) * UINT8_LO(b);
s0 = UINT8_LO(x);
x = UINT8_HI(a) * UINT8_LO(b) + UINT8_HI(x);
s1 = UINT8_LO(x);
s2 = UINT8_HI(x);
x = s1 + UINT8_LO(a) * UINT8_HI(b);
s1 = UINT8_LO(x);
x = s2 + UINT8_HI(a) * UINT8_HI(b) + UINT8_HI(x);
s2 = UINT8_LO(x);
s3 = UINT8_HI(x);
uint8_t result = s1 << 4 | s0;
uint8_t carry = s3 << 4 | s2;
res[1 + i] = result;
res[0 + i] = carry;
}
And it seems to work just fine, however I am unable to define the same operations for division, subtraction and modulus...
Furthermore I just can't seem to see how to implement the same principal to my custom uint1024_t structure even though it is pretty much identical with a few lines of code more to manage overflows.
I would really appreciate some help in implementing the 5 basic operations for my structure.
EDIT:
I have answered below with my implementation for resolving this problem.

find a way to compute ... the 5 basic operations: plus, minus, times, divide, modulus.
If uint1024_t used uint32_t, it would be easier.
I would recommend 1) half the width of the widest type uintmax_t, or 2) unsigned, whichever is smaller. E.g. 32-bit.
(Also consider something other than uintN_t to avoid collisions with future versions of C.)
typedef struct {
uint32_t chunk[1024/32];
} u1024;
Example of some untested code to give OP an idea of how using uint32_t simplifies the task.
void u1024_mult(u1024 *product, const u1024 *a, const u1024 *b) {
memset(product, 0, sizeof product[0]);
unsigned n = sizeof product->chunk / sizeof product->chunk[0];
for (unsigned ai = 0; ai < n; ai++) {
uint64_t acc = 0;
uint32_t m = a->chunk[ai];
for (unsigned bi = 0; ai + bi < n; bi++) {
acc += (uint64_t) m * b->chunk[bi] + product->chunk[ai + bi];
product->chunk[ai + bi] = (uint32_t) acc;
acc >>= 32;
}
}
}
+, - are quite similar to the above.
/, % could be combined into one routine that computes the quotient and remainder together.
It is not that hard to post those functions here as it really is the same as grade school math, but instead of base 10, base 232. I am against posting it though as it is fun exercise to do oneself.
I hope the * sample code above inspires rather than answers.

There are some problems with your implementation for uint8_t arrays:
you did not parenthesize the macro arguments in the expansion. This is very error prone as it may cause unexpected operator precedence problems if the arguments are expressions. You should write:
#define UINT8_HI(x) ((x) >> 4)
#define UINT8_LO(x) (((1 << 4) - 1) & (x))
storing the array elements with the most significant part first is counter intuitive. Multi-precision arithmetics usually represents the large values as arrays with the least significant part first.
for a small type such as uint8_t, there is no need to split it into halves as larger types are available. Furthermore, you must propagate the carry from the previous addition. Here is a much simpler implementation for the addition:
void uint8_add(uint8_t a, uint8_t b, uint8_t *res, int i) {
uint16_t result = a + b + res[i + 0]; // add previous carry
res[i + 0] = (uint8_t)result;
res[i + 1] = (uint8_t)(result >> 8); // assuming res has at least i+1 elements and is initialized to 0
}
for the multiplication, you must add the result of multiplying each part of each number to the appropriately chosen parts of the result number, propagating the carry to the higher parts.
Division is more difficult to implement efficiently. I recommend you study an open source multi-precision package such as QuickJS' libbf.c.
To transpose this to arrays of uint64_t, you can use unsigned 128-bit integer types if available on your platform (64-bit compilers gcc, clang and vsc all support such types).
Here is a simple implementation for the addition and multiplication:
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#define NB_CHUNK 16
typedef __uint128_t uint128_t;
typedef struct {
uint64_t chunk[NB_CHUNK];
} uint1024_t;
void uint0124_add(uint1024_t *dest, const uint1024_t *a, const uint1024_t *b) {
uint128_t result = 0;
for (size_t i = 0; i < NB_CHUNK; i++) {
result += (uint128_t)a->chunk[i] + b->chunk[i];
dest->chunk[i] = (uint64_t)result;
result >>= CHAR_BIT * sizeof(uint64_t);
}
}
void uint0124_multiply(uint1024_t *dest, const uint1024_t *a, const uint1024_t *b) {
for (size_t i = 0; i < NB_CHUNK; i++)
dest->chunk[i] = 0;
for (size_t i = 0; i < NB_CHUNK; i++) {
uint128_t result = 0;
for (size_t j = 0, k = i; k < NB_CHUNK; j++, k++) {
result += (uint128_t)a->chunk[i] * b->chunk[j] + dest->chunk[k];
dest->chunk[k] = (uint64_t)result;
result >>= CHAR_BIT * sizeof(uint64_t);
}
}
}
If 128-bit integers are not available, your 1024-bit type could be implemented as an array of 32-bit integers. Here is a flexible implementation with selectable types for the array elements and the intermediary result:
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#if 1 // if platform has 128 bit integers
typedef uint64_t type1;
typedef __uint128_t type2;
#else
typedef uint32_t type1;
typedef uint64_t type2;
#endif
#define TYPE1_BITS (CHAR_BIT * sizeof(type1))
#define NB_CHUNK (1024 / TYPE1_BITS)
typedef struct uint1024_t {
type1 chunk[NB_CHUNK];
} uint1024_t;
void uint0124_add(uint1024_t *dest, const uint1024_t *a, const uint1024_t *b) {
type2 result = 0;
for (size_t i = 0; i < NB_CHUNK; i++) {
result += (type2)a->chunk[i] + b->chunk[i];
dest->chunk[i] = (type1)result;
result >>= TYPE1_BITS;
}
}
void uint0124_multiply(uint1024_t *dest, const uint1024_t *a, const uint1024_t *b) {
for (size_t i = 0; i < NB_CHUNK; i++)
dest->chunk[i] = 0;
for (size_t i = 0; i < NB_CHUNK; i++) {
type2 result = 0;
for (size_t j = 0, k = i; k < NB_CHUNK; j++, k++) {
result += (type2)a->chunk[i] * b->chunk[j] + dest->chunk[k];
dest->chunk[k] = (type1)result;
result >>= TYPE1_BITS;
}
}
}

C Bitwise operations left shift and bitwise OR

Very short, I am having issues understanding the workings of this code, it is much more efficient then my 20 or so lines to get the same outcome. I understand how left shift is supposed to work and the bitwise Or but would appreciate a little guidance to understand how the two come together to make the line in the for loop work.
Code is meant to take in an array of bits(bits) of a given size(count) and return the integer value of the bits.
unsigned binary_array_to_numbers(const unsigned *bits, size_t count) {
unsigned res = 0;
for (size_t i = 0; i < count; i++)
res = res << 1 | bits[i];
return res;
}
EDIT: As requested, My newbie solution that still passed all tests: Added is a sample of possible assignment to bits[]
unsigned binary_array_to_numbers(const unsigned *bits, size_t count)
{
int i, j = 0;
unsigned add = 0;
for (i = count - 1; i >= 0; i--){
if(bits[i] == 1){
if(j >= 1){
j = j * 2;
add = add + j;
}
else{
j++;
add = add + j;
}
}
else {
if( j>= 1){
j = j * 2;
}
else{
j++;
}
}
}
return add;
}
void main(){
const unsigned bits[] = {0,1,1,0};
size_t count = sizeof(bits)/sizeof(bits[0]);
binary_array_to_numbers(bits, count);
}

a breakdown:
every left shift operation on a binary number effectively multiplies
it by 2 0111(7) << 1 = 1110(14)
consider rhubarbdog answer - the operation can be seen as two separate actions. first left-shift (multiply by two) and then OR with the current bit being reviewed
the PC does not distinguish between the value displayed and the binary
representation of the number
lets try and review a case in-which your input is:
bits = {0, 1, 0, 1};
count = 4;
unsigned binary_array_to_numbers(const unsigned *bits, size_t count) {
unsigned res = 0;
for (size_t i = 0; i < count; i++)
res = res << 1 // (a)
res = res | bits[i]; /* (b) according to rhubarbdog answer */
return res;
}
iteration 0:
- bits[i] = 0;
- (a) res = b0; (left shift of 0)
- (b) res = b0; (bitwise OR with 0)
iteration 1:
- bits[i] = 1;
- (a) res = b0; (left shift of 0)
- (b) res = b1; (bitwise OR with 1)
iteration 2:
- bits[i] = 0;
- (a) res = b10; (left shift of 1 - decimal value is 2)
- (b) res = b10; (bitwise OR with 0)
iteration 3:
- bits[i] = 1;
- (a) res = b100; (left shift of 1 - decimal value is 4)
- (b) res = b101; (bitwise OR with 1)
the final result for res is binary(101) and decimal(5) as one would expect
NOTE: the use of unsigned is a must since a signed value will be interpreted as a negative value if the MSB is 1
hope that helps...

consider them as 2 operations i'll re-write res= ... as 2 lines
res = res << 1
res = res | 1
The firs pass res gets set to 1, next time it's shifted *2 then because it's now even +1

large integer addition with CUDA

I've been developing a cryptographic algorithm on the GPU and currently stuck with an algorithm to perform large integer addition. Large integers are represented in a usual way as a bunch of 32-bit words.
For example, we can use one thread to add two 32-bit words. For simplicity, let assume
that the numbers to be added are of the same length and number of threads per block == number of words. Then:
__global__ void add_kernel(int *C, const int *A, const int *B) {
int x = A[threadIdx.x];
int y = B[threadIdx.x];
int z = x + y;
int carry = (z < x);
/** do carry propagation in parallel somehow ? */
............
z = z + newcarry; // update the resulting words after carry propagation
C[threadIdx.x] = z;
}
I am pretty sure that there is a way to do carry propagation via some tricky reduction procedure but could not figure it out..
I had a look at CUDA thrust extensions but big integer package seems not to be implemented yet.
Perhaps someone can give me a hint how to do that on CUDA ?

You are right, carry propagation can be done via prefix sum computation but it's a bit tricky to define the binary function for this operation and prove that it is associative (needed for parallel prefix sum). As a matter of fact, this algorithm is used (theoretically) in Carry-lookahead adder.
Suppose we have two large integers a[0..n-1] and b[0..n-1].
Then we compute (i = 0..n-1):
s[i] = a[i] + b[i]l;
carryin[i] = (s[i] < a[i]);
We define two functions:
generate[i] = carryin[i];
propagate[i] = (s[i] == 0xffffffff);
with quite intuitive meaning: generate[i] == 1 means that the carry is generated at
position i while propagate[i] == 1 means that the carry will be propagated from position
(i - 1) to (i + 1). Our goal is to compute the function carryout[0..n-1] used to update the resulting sum s[0..n-1]. carryout can be computed recursively as follows:
carryout[i] = generate[i] OR (propagate[i] AND carryout[i-1])
carryout[0] = 0
Here carryout[i] == 1 if carry is generated at position i OR it is generated sometimes earlier AND propagated to position i. Finally, we update the resulting sum:
s[i] = s[i] + carryout[i-1]; for i = 1..n-1
carry = carryout[n-1];
Now it is quite straightforward to prove that carryout function is indeed binary associative and hence parallel prefix sum computation applies. To implement this on CUDA, we can merge both flags 'generate' and 'propagate' in a single variable since they are mutually exclusive, i.e.:
cy[i] = (s[i] == -1u ? -1u : 0) | carryin[i];
In other words,
cy[i] = 0xffffffff if propagate[i]
cy[i] = 1 if generate[i]
cy[u] = 0 otherwise
Then, one can verify that the following formula computes prefix sum for carryout function:
cy[i] = max((int)cy[i], (int)cy[k]) & cy[i];
for all k < i. The example code below shows large addition for 2048-word integers. Here I used CUDA blocks with 512 threads:
// add & output carry flag
#define UADDO(c, a, b) \
asm volatile("add.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));
// add with carry & output carry flag
#define UADDC(c, a, b) \
asm volatile("addc.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));
#define WS 32
__global__ void bignum_add(unsigned *g_R, const unsigned *g_A,const unsigned *g_B) {
extern __shared__ unsigned shared[];
unsigned *r = shared;
const unsigned N_THIDS = 512;
unsigned thid = threadIdx.x, thid_in_warp = thid & WS-1;
unsigned ofs, cf;
uint4 a = ((const uint4 *)g_A)[thid],
b = ((const uint4 *)g_B)[thid];
UADDO(a.x, a.x, b.x) // adding 128-bit chunks with carry flag
UADDC(a.y, a.y, b.y)
UADDC(a.z, a.z, b.z)
UADDC(a.w, a.w, b.w)
UADDC(cf, 0, 0) // save carry-out
// memory consumption: 49 * N_THIDS / 64
// use "alternating" data layout for each pair of warps
volatile short *scan = (volatile short *)(r + 16 + thid_in_warp +
49 * (thid / 64)) + ((thid / 32) & 1);
scan[-32] = -1; // put identity element
if(a.x == -1u && a.x == a.y && a.x == a.z && a.x == a.w)
// this indicates that carry will propagate through the number
cf = -1u;
// "Hillis-and-Steele-style" reduction
scan[0] = cf;
cf = max((int)cf, (int)scan[-2]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-4]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-8]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-16]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-32]) & cf;
scan[0] = cf;
int *postscan = (int *)r + 16 + 49 * (N_THIDS / 64);
if(thid_in_warp == WS - 1) // scan leading carry-outs once again
postscan[thid >> 5] = cf;
__syncthreads();
if(thid < N_THIDS / 32) {
volatile int *t = (volatile int *)postscan + thid;
t[-8] = -1; // load identity symbol
cf = t[0];
cf = max((int)cf, (int)t[-1]) & cf;
t[0] = cf;
cf = max((int)cf, (int)t[-2]) & cf;
t[0] = cf;
cf = max((int)cf, (int)t[-4]) & cf;
t[0] = cf;
}
__syncthreads();
cf = scan[0];
int ps = postscan[(int)((thid >> 5) - 1)]; // postscan[-1] equals to -1
scan[0] = max((int)cf, ps) & cf; // update carry flags within warps
cf = scan[-2];
if(thid_in_warp == 0)
cf = ps;
if((int)cf < 0)
cf = 0;
UADDO(a.x, a.x, cf) // propagate carry flag if needed
UADDC(a.y, a.y, 0)
UADDC(a.z, a.z, 0)
UADDC(a.w, a.w, 0)
((uint4 *)g_R)[thid] = a;
}
Note that macros UADDO / UADDC might not be necessary anymore since CUDA 4.0 has corresponding intrinsics (however I am not entirely sure).
Also remark that, though parallel reduction is quite fast, if you need to add several large integers in a row, it might be better to use some redundant representation (which was suggested in comments above), i.e., first accumulate the results of additions in 64-bit words, and then perform one carry propagation at the very end in "one sweep".

I thought I would post my answer also, in addition to #asm, so this SO question can be a sort of repository of ideas. Similar to #asm, I detect and store the carry condition as well as the "carry-through" condition, ie. when the intermediate word result is all 1's (0xF...FFF) so that if a carry were to propagate into this word, it would "carry-through" to the next word.
I didn't use any PTX or asm in my code, so I chose to use 64-bit unsigned ints instead of 32-bit, to achieve the 2048x32bit capability, using 1024 threads.
A larger difference from #asm's code is in my parallel carry propagation scheme. I construct a bit-packed array ("carry") where each bit represents the carry condition generated from the independent intermediate 64-bit adds from each of the 1024 threads. I also construct a bit-packed array ("carry_through") where each bit represents the carry_through condition of the individual 64-bit intermediate results. For 1024 threads, this amounts to 1024/64 = 16x64 bit words of shared memory for each bit-packed array, so total shared mem usage is 64+3 32bit quantites. With these bit packed arrays, I perform the following to generate a combined propagated carry indicator:
carry = carry | (carry_through ^ ((carry & carry_through) + carry_through);
(note that carry is shifted left by one: carry[i] indicates that the result of a[i-1] + b[i-1] generated a carry)
The explanation is as follows:
the bitwise and of carry and carry_through generates the candidates where a carry will
interact with a sequence of one or more carry though conditions
adding the result of step one to carry_through generates a result which
has changed bits which represent all words that will be affected by
the propagation of the carry into the carry_through sequence
taking the exclusive-or of carry_through plus the result from step 2
shows the affected results indicated with a 1 bit
taking the bitwise or of the result from step 3 and the ordinary
carry indicators gives a combined carry condition, which is then
used to update all the intermediate results.
Note that the addition in step 2 requires another multi-word add (for big ints composed of more than 64 words). I believe this algorithm works, and it has passed the test cases I have thrown at it.
Here is my example code which implements this:
// parallel add of large integers
// requires CC 2.0 or higher
// compile with:
// nvcc -O3 -arch=sm_20 -o paradd2 paradd2.cu
#include <stdio.h>
#include <stdlib.h>
#define MAXSIZE 1024 // the number of 64 bit quantities that can be added
#define LLBITS 64 // the number of bits in a long long
#define BSIZE ((MAXSIZE + LLBITS -1)/LLBITS) // MAXSIZE when packed into bits
#define nTPB MAXSIZE
// define either GPU or GPUCOPY, not both -- for timing
#define GPU
//#define GPUCOPY
#define LOOPCNT 1000
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
// perform c = a + b, for unsigned integers of psize*64 bits.
// all work done in a single threadblock.
// multiple threadblocks are handling multiple separate addition problems
// least significant word is at a[0], etc.
__global__ void paradd(const unsigned size, const unsigned psize, unsigned long long *c, const unsigned long long *a, const unsigned long long *b){
__shared__ unsigned long long carry_through[BSIZE];
__shared__ unsigned long long carry[BSIZE+1];
__shared__ volatile unsigned mcarry;
__shared__ volatile unsigned mcarry_through;
unsigned idx = threadIdx.x + (psize * blockIdx.x);
if ((threadIdx.x < psize) && (idx < size)){
// handle 64 bit unsigned add first
unsigned long long cr1 = a[idx];
unsigned long long lc = cr1 + b[idx];
// handle carry
if (threadIdx.x < BSIZE){
carry[threadIdx.x] = 0;
carry_through[threadIdx.x] = 0;
}
if (threadIdx.x == 0){
mcarry = 0;
mcarry_through = 0;
}
__syncthreads();
if (lc < cr1){
if ((threadIdx.x%LLBITS) != (LLBITS-1))
atomicAdd(&(carry[threadIdx.x/LLBITS]), (2ull<<(threadIdx.x%LLBITS)));
else atomicAdd(&(carry[(threadIdx.x/LLBITS)+1]), 1);
}
// handle carry-through
if (lc == 0xFFFFFFFFFFFFFFFFull)
atomicAdd(&(carry_through[threadIdx.x/LLBITS]), (1ull<<(threadIdx.x%LLBITS)));
__syncthreads();
if (threadIdx.x < ((psize + LLBITS-1)/LLBITS)){
// only 1 warp executing within this if statement
unsigned long long cr3 = carry_through[threadIdx.x];
cr1 = carry[threadIdx.x] & cr3;
// start of sub-add
unsigned long long cr2 = cr3 + cr1;
if (cr2 < cr1) atomicAdd((unsigned *)&mcarry, (2u<<(threadIdx.x)));
if (cr2 == 0xFFFFFFFFFFFFFFFFull) atomicAdd((unsigned *)&mcarry_through, (1u<<threadIdx.x));
if (threadIdx.x == 0) {
unsigned cr4 = mcarry & mcarry_through;
cr4 += mcarry_through;
mcarry |= (mcarry_through ^ cr4);
}
if (mcarry & (1u<<threadIdx.x)) cr2++;
// end of sub-add
carry[threadIdx.x] |= (cr2 ^ cr3);
}
__syncthreads();
if (carry[threadIdx.x/LLBITS] & (1ull<<(threadIdx.x%LLBITS))) lc++;
c[idx] = lc;
}
}
int main() {
unsigned long long *h_a, *h_b, *h_c, *d_a, *d_b, *d_c, *c;
unsigned at_once = 256; // valid range = 1 .. 65535
unsigned prob_size = MAXSIZE ; // valid range = 1 .. MAXSIZE
unsigned dsize = at_once * prob_size;
cudaEvent_t t_start_gpu, t_start_cpu, t_end_gpu, t_end_cpu;
float et_gpu, et_cpu, tot_gpu, tot_cpu;
tot_gpu = 0;
tot_cpu = 0;
if (sizeof(unsigned long long) != (LLBITS/8)) {printf("Word Size Error\n"); return 1;}
if ((c = (unsigned long long *)malloc(dsize * sizeof(unsigned long long))) == 0) {printf("Malloc Fail\n"); return 1;}
cudaHostAlloc((void **)&h_a, dsize * sizeof(unsigned long long), cudaHostAllocDefault);
cudaCheckErrors("cudaHostAlloc1 fail");
cudaHostAlloc((void **)&h_b, dsize * sizeof(unsigned long long), cudaHostAllocDefault);
cudaCheckErrors("cudaHostAlloc2 fail");
cudaHostAlloc((void **)&h_c, dsize * sizeof(unsigned long long), cudaHostAllocDefault);
cudaCheckErrors("cudaHostAlloc3 fail");
cudaMalloc((void **)&d_a, dsize * sizeof(unsigned long long));
cudaCheckErrors("cudaMalloc1 fail");
cudaMalloc((void **)&d_b, dsize * sizeof(unsigned long long));
cudaCheckErrors("cudaMalloc2 fail");
cudaMalloc((void **)&d_c, dsize * sizeof(unsigned long long));
cudaCheckErrors("cudaMalloc3 fail");
cudaMemset(d_c, 0, dsize*sizeof(unsigned long long));
cudaEventCreate(&t_start_gpu);
cudaEventCreate(&t_end_gpu);
cudaEventCreate(&t_start_cpu);
cudaEventCreate(&t_end_cpu);
for (unsigned loops = 0; loops <LOOPCNT; loops++){
//create some test cases
if (loops == 0){
for (int j=0; j<at_once; j++)
for (int k=0; k<prob_size; k++){
int i= (j*prob_size) + k;
h_a[i] = 0xFFFFFFFFFFFFFFFFull;
h_b[i] = 0;
}
h_a[prob_size-1] = 0;
h_b[prob_size-1] = 1;
h_b[0] = 1;
}
else if (loops == 1){
for (int i=0; i<dsize; i++){
h_a[i] = 0xFFFFFFFFFFFFFFFFull;
h_b[i] = 0;
}
h_b[0] = 1;
}
else if (loops == 2){
for (int i=0; i<dsize; i++){
h_a[i] = 0xFFFFFFFFFFFFFFFEull;
h_b[i] = 2;
}
h_b[0] = 1;
}
else {
for (int i = 0; i<dsize; i++){
h_a[i] = (((unsigned long long)lrand48())<<33) + (unsigned long long)lrand48();
h_b[i] = (((unsigned long long)lrand48())<<33) + (unsigned long long)lrand48();
}
}
#ifdef GPUCOPY
cudaEventRecord(t_start_gpu, 0);
#endif
cudaMemcpy(d_a, h_a, dsize*sizeof(unsigned long long), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMemcpy1 fail");
cudaMemcpy(d_b, h_b, dsize*sizeof(unsigned long long), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMemcpy2 fail");
#ifdef GPU
cudaEventRecord(t_start_gpu, 0);
#endif
paradd<<<at_once, nTPB>>>(dsize, prob_size, d_c, d_a, d_b);
cudaCheckErrors("Kernel Fail");
#ifdef GPU
cudaEventRecord(t_end_gpu, 0);
#endif
cudaMemcpy(h_c, d_c, dsize*sizeof(unsigned long long), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMemcpy3 fail");
#ifdef GPUCOPY
cudaEventRecord(t_end_gpu, 0);
#endif
cudaEventSynchronize(t_end_gpu);
cudaEventElapsedTime(&et_gpu, t_start_gpu, t_end_gpu);
tot_gpu += et_gpu;
cudaEventRecord(t_start_cpu, 0);
//also compute result on CPU for comparison
for (int j=0; j<at_once; j++) {
unsigned rc=0;
for (int n=0; n<prob_size; n++){
unsigned i = (j*prob_size) + n;
c[i] = h_a[i] + h_b[i];
if (c[i] < h_a[i]) {
c[i] += rc;
rc=1;}
else {
if ((c[i] += rc) != 0) rc=0;
}
if (c[i] != h_c[i]) {printf("Results mismatch at offset %d, GPU = 0x%lX, CPU = 0x%lX\n", i, h_c[i], c[i]); return 1;}
}
}
cudaEventRecord(t_end_cpu, 0);
cudaEventSynchronize(t_end_cpu);
cudaEventElapsedTime(&et_cpu, t_start_cpu, t_end_cpu);
tot_cpu += et_cpu;
if ((loops%(LOOPCNT/10)) == 0) printf("*\n");
}
printf("\nResults Match!\n");
printf("Average GPU time = %fms\n", (tot_gpu/LOOPCNT));
printf("Average CPU time = %fms\n", (tot_cpu/LOOPCNT));
return 0;
}

Rotating Hash for 16 bit

This site gives description of rotating hash as follows.
unsigned rot_hash ( void *key, int len )
{
unsigned char *p = key;
unsigned h = 0;
int i;
for ( i = 0; i < len; i++ )
h = ( h << 4 ) ^ ( h >> 28 ) ^ p[i];
return h;
}
The returned value is 32 bit here. However, I want to return a 16 bit hash value. For that purpose, is it correct to assign h as follows in the loop? Consider h to be declared as a 16 bit integer here.
for ( i = 0; i < len; i++ )
h = ( h << 4 ) ^ ( h >> 12 ) ^ p[i];

It is probably best to keep the big hash, and only truncate on return, like:
for ( i = 0; i < len; i++ )
h = ( h << 4 ) ^ ( h >> 28 ) ^ p[i];
return h & 0xffff;
The shift constants 4 and 28 are probably not the best (in short: because they have a common divisor)
After some experimentation, I came to the following hashfunction, which is aimed at having maximal entropy in the lower bits (such that a power-of-two table size can be used) (this is the one used in Wakkerbot):
unsigned hash_mem(void *dat, size_t len)
{
unsigned char *str = (unsigned char*) dat;
unsigned val=0;
size_t idx;
for(idx=0; idx < len; idx++ ) {
val ^= (val >> 2) ^ (val << 5) ^ (val << 13) ^ str[idx] ^ 0x80001801;
}
return val;
}
The extra perturbance with 0x80001801 is not strictly needed, but helps if the hashed items have long common prefixes. It also helps if these prefixes consist of 0x0 values.

It's hard to talk about "correct" with hashes, because any deterministic result can be considered correct. Perhaps the hash distribution won't be so good, but this hash doesn't seem like the strongest anyway.
With the change you suggest, the number you'll get will still be a 32 bit number, and the high 16 bits won't be zeros.
The easiest thing to do is change nothing, and cast the result to unsigned short.

apply checksum in c to a char

How do you apply checkksum to a char in c?

A checksum reduces a sequence of bits to a shorter sequence, such that a change to the larger sequence results in a random change to the shorter checksum.
A char is already quite small. To produce a checksum you will need a bitfield. Actually, you will need two, since one bitfield alone will be padded at least to a full byte.
struct twochars_checksum {
unsigned sum_a : CHAR_BIT / 2;
unsigned sum_b : CHAR_BIT / 2;
};
void sum_char( char c, struct twochars_checksum *dest, int which ) {
int sum;
sum = c ^ c >> CHAR_BIT / 2; // suboptimal, but passable
if ( which == 0 ) {
dest->sum_a = sum;
} else {
dest->sum_b = sum;
}
}

Suggest to follow similar approach to transfer a byte of data with checksum.
The algorithm for calculating checksum is quite simple and is as follows.
1.Check if the bit is on then add the corresponding bit value ( ie 2 to the power of bit position ) to the check sum.
2.If the bit is off then detect the sum by 1.
Note : You can use your own checksum algorithm by altering function calculate_checksum().
You can include your own processing logic in set_transfer_data().
#include <stdio.h>
typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;
#define NUM_BITS (8)
uint16_t calculate_checksum(const uint8_t data)
{
uint16_t checksum = 0;
uint8_t bit_index = 0;
uint8_t bit_value = 0;
while( bit_index < NUM_BITS )
{
bit_value = 1 << bit_index++;
checksum += ( data & bit_value ) ? bit_value : -1;
}
return ( checksum );
}
uint8_t set_transfer_data( uint32_t *dest_data , const uint8_t src_data , const uint16_t checksum )
{
uint8_t return_value = 0;
*dest_data = checksum << NUM_BITS | src_data ;
return ( return_value );
}
int main()
{
uint8_t return_value = 0;
uint8_t source_data = 0xF3;
uint32_t transfer_data = 0;
uint16_t checksum = 0;
checksum = calculate_checksum( source_data );
printf( "\nChecksum calculated = %x",checksum );
return_value = set_transfer_data( &transfer_data,source_data,checksum );
if( 0 == return_value )
{
printf( "\nChecksum added successfully; transfer_data = %x",
transfer_data );
}
else
{
printf( "\nError adding checksum" );
}
return ( 0 );
}

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Premultiplying image alpha efficiently - c

Related

How to do 1024-bit operations using arrays of uint64_t

C Bitwise operations left shift and bitwise OR

large integer addition with CUDA

Rotating Hash for 16 bit

apply checksum in c to a char

Categories

Resources