Issue working with uint2 and CUDA - c

Recently I started working with CUDA and Ethereum, and I found a small code snippet in a function that produces errors when I try to port it to a CUDA file.
Here is the code snippet:
void keccak_f1600_round(uint2* a, uint r, uint out_size)
{
#if !__ENDIAN_LITTLE__
    for (uint i = 0; i != 25; ++i)
        a[i] = make_uint2(a[i].y, a[i].x);
#endif

    uint2 b[25];
    uint2 t;

    // Theta
    b[0] = a[0] ^ a[5] ^ a[10] ^ a[15] ^ a[20];

#if !__ENDIAN_LITTLE__
    for (uint i = 0; i != 25; ++i)
        a[i] = make_uint2(a[i].y, a[i].x);
#endif
}
The error I am getting concerns the b[0] line and is:
error: no operator "^=" matches these operands operand types are: uint2 ^= uint2
To be honest, I don't have much experience with uint2 and CUDA, so I am asking what I should do to correct this issue.

The exclusive-or operator works with unsigned long long, but not with uint2 (which for CUDA is a built-in struct containing two unsigned ints).
To make the code work, there are several options. Some that come to mind:
you can use reinterpret_cast<unsigned long long &> before each uint2 in the line that does the exclusive-or (see How to use reinterpret_cast in C++?)
you can rewrite the code to use unsigned long long types everywhere you use uint2 now. This probably produces the most maintainable code.
you can rewrite the exclusive-or among uint2 types as a pair of exclusive-or operations on the .x and .y members of the uint2, since each is an unsigned int.
you can define a union type that lets you access the data that is currently uint2 as either a uint2 or an unsigned long long.
you can overload the ^ exclusive-or operator to work with uint2 types (a minimal sketch follows this list).
you can replace the line that produces the error with asm statements that generate the PTX code to perform the exclusive-or for you. See http://docs.nvidia.com/cuda/inline-ptx-assembly/index.html#using-inline-ptx-assembly-in-cuda
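For the operator-overloading option, a minimal sketch could look like the following (this relies on nvcc compiling .cu files as C++; the overloads below are my own, not something CUDA ships):
#include <cuda_runtime.h>

// Element-wise XOR for uint2: XOR the two 32-bit halves independently.
__host__ __device__ inline uint2 operator^(uint2 lhs, uint2 rhs)
{
    return make_uint2(lhs.x ^ rhs.x, lhs.y ^ rhs.y);
}

// Compound-assignment version, so expressions like b[0] ^= a[5]; also compile.
__host__ __device__ inline uint2& operator^=(uint2& lhs, uint2 rhs)
{
    lhs.x ^= rhs.x;
    lhs.y ^= rhs.y;
    return lhs;
}
With these in scope, the line b[0] = a[0] ^ a[5] ^ a[10] ^ a[15] ^ a[20]; compiles unchanged.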

uint2 is simply a struct; you'll need to implement ^ yourself using a[].x and a[].y. I couldn't find where the built-in declarations are, but Are there advantages to using the CUDA vector types? has a good description of their use.

Related

What is the Default addition Operator '+' of __m64

I found that the following code (a C file) compiles successfully on x86_64 with gcc 10.1.0.
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

typedef union {
    __m64 x;
#if defined(__arm__) || defined(__aarch64__)
    int32x2_t d[1];
#endif
    uint8_t i8u[8];
} u_m64;

int main()
{
    u_m64 a, b, c;
    c.x = a.x + b.x;
    return 0;
}
But there are lots of add functions for __m64, like _mm_add_pi16, _mm_hadd_pi16, _mm_add_si64, and so on (the same applies to __m128, __m256, ...). So which one is called by the operator '+'? And how can operator overloading be used in a C file?
Yeah, gcc and clang provide basic operators for builtin SIMD types, which is frankly so beyond stupid that it's not even remotely funny :(
Anyhow, this mechanism doesn't work the same way as operator overloading in C++. What it's actually doing is promoting __m64 to a true intrinsic type (such as int/float), meaning the operators exist at the language level rather than at the overload level. (That's why it works in C.)
In this case I would assume it is calling add (rather than horizontal add).
However, we now hit the biggest problem: the contents of a __m64 are NOT known at compile time!
Within any given __m64, we could be storing any permutation of:
8 x int8
4 x int16
2 x int32
8 x uint8
4 x uint16
2 x uint32
For addition (ignoring the saturated variants) that means the addition operator could be calling any one of these perfectly valid choices:
_mm_add_pi8
_mm_add_pi16
_mm_add_pi32
I don't know which of those instructions gcc/clang ends up calling in this context, but I do know that it's always going to be the wrong instruction 66.66% of the time :(
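If the lane width matters, the safer route is not to rely on '+' at all and to spell out the intrinsic you actually mean. A small sketch of my own (plain MMX intrinsics, nothing from the question assumed):
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Make the element width explicit: _mm_add_pi16 adds four 16-bit lanes. */
    __m64 a = _mm_set_pi16(1, 2, 3, 4);
    __m64 b = _mm_set_pi16(10, 20, 30, 40);
    __m64 c = _mm_add_pi16(a, b);

    int16_t out[4];
    memcpy(out, &c, sizeof out);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);

    _mm_empty();  /* leave MMX state before any x87 floating-point use */
    return 0;
}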

How to implement wrapping signed int addition in C

This is a complete rewrite of the question. Hopefully it is clearer now.
I want to implement in C a function that performs addition of signed ints with wrapping in case of overflow.
I want to target mainly the x86-64 architecture, but of course the more portable the implementation is the better. I'm also concerned mostly about producing decent assembly code through gcc, clang, icc, and whatever is used on Windows.
The goal is twofold:
write correct C code that doesn't fall into the undefined behavior black hole;
write code that gets compiled to decent machine code.
By decent machine code I mean a single leal or a single addl instruction on machines which natively support the operation.
I'm able to satisfy either of the two requirements, but not both.
Attempt 1
The first implementation that comes to mind is
int add_wrap(int x, int y) {
    return (unsigned) x + (unsigned) y;
}
This seems to work with gcc, clang and icc. However, as far as I know, the C standard doesn't specify the cast from unsigned int to signed int, leaving freedom to the implementations (see also here).
Otherwise, if the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
I believe most (all?) major compilers do the expected conversion from unsigned to int, meaning that they take the correct representative modulo 2^N, where N is the number of bits, but this is not mandated by the standard, so it cannot be relied upon (stupid C standard strikes again). Also, while this is the simplest thing to do on two's complement machines, it is impossible on ones' complement machines, because there is one residue class which is not representable: that of 2^(N-1).
Attempt 2
According to the clang docs, one can use __builtin_add_overflow like this
int add_wrap(int x, int y) {
    int res;
    __builtin_add_overflow(x, y, &res);
    return res;
}
and this should do the trick with clang, because the docs clearly say
If possible, the result will be equal to mathematically-correct result and the builtin will return 0. Otherwise, the builtin will return 1 and the result will be equal to the unique value that is equivalent to the mathematically-correct result modulo two raised to the k power, where k is the number of bits in the result type.
The problem is that in the GCC docs they say
These built-in functions promote the first two operands into infinite precision signed type and perform addition on those promoted operands. The result is then cast to the type the third pointer argument points to and stored there.
As far as I know, converting from a wider signed type down to int is implementation-defined, so I don't see any guarantee that this will result in the wrapping behavior.
As you can see [here][godbolt], GCC also generates the expected code, but I wanted to be sure that this is not by chance and is indeed part of the specification of __builtin_add_overflow.
icc also seems to produce something reasonable.
This produces decent assembly, but relies on intrinsics, so it's not really standard-compliant C.
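A possible middle ground (my own sketch, assuming __has_builtin is available, as it is in clang and recent gcc) is to feature-test the builtin and fall back to Attempt 1 otherwise:
/* Sketch: use __builtin_add_overflow when the compiler advertises it,
   otherwise fall back to the unsigned-arithmetic version from Attempt 1. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_add_overflow)
#    define HAVE_BUILTIN_ADD_OVERFLOW 1
#  endif
#endif

int add_wrap(int x, int y) {
#ifdef HAVE_BUILTIN_ADD_OVERFLOW
    int res;
    __builtin_add_overflow(x, y, &res);  /* result is reduced modulo 2^N on overflow */
    return res;
#else
    return (unsigned) x + (unsigned) y;  /* conversion back to int is implementation-defined */
#endif
}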
Attempt 3
Follow the suggestions of those pedantic guys from the SEI CERT C Coding Standard.
In their CERT INT32-C recommendation they explain how to check in advance for potential overflow. Here is what comes out from following their advice:
#include <limits.h>

int add_wrap(int x, int y) {
    if ((x > 0) && (y > INT_MAX - x))
        return (x + INT_MIN) + (y + INT_MIN);
    else if ((x < 0) && (y < INT_MIN - x))
        return (x - INT_MIN) + (y - INT_MIN);
    else
        return x + y;
}
The code performs the correct checks and compiles to leal with gcc, but not with clang or icc.
The whole CERT INT32-C recommendation is complete garbage, because it tries to transform C into a "safe" language by forcing programmers to perform checks that should be part of the definition of the language in the first place. And in doing so it also forces the programmer to write code which the compiler can no longer optimize, so what is the reason to use C anymore?!
Edit
The contrast is between compatibility and the decency of the generated assembly.
For instance, with both gcc and clang, the following two functions, which are supposed to do the same thing, get compiled to different assembly.
f is bad in both cases, g is good in both cases (addl+jo or addl+cmovnol). I don't know whether jo is better than cmovnol, but the function g is consistently better than f.
#include <limits.h>

signed int f(signed int si_a, signed int si_b) {
    signed int sum;
    if (((si_b > 0) && (si_a > (INT_MAX - si_b))) ||
        ((si_b < 0) && (si_a < (INT_MIN - si_b)))) {
        return 0;
    } else {
        return si_a + si_b;
    }
}

signed int g(signed int si_a, signed int si_b) {
    signed int sum;
    if (__builtin_add_overflow(si_a, si_b, &sum)) {
        return 0;
    } else {
        return sum;
    }
}
A bit like @Andrew's answer, but without the memcpy().
Use a union to avoid the need for memcpy(). With C2x, we are sure that int is 2's complement.
int add_wrap(int x, int y) {
    union {
        unsigned un;
        int in;
    } u = {.un = (unsigned) x + (unsigned) y};
    return u.in;
}
For those who like 1-liners, use a compound literal.
int add_wrap2(int x, int y) {
    return (union { unsigned un; int in; }) {.un = (unsigned) x + (unsigned) y}.in;
}
I'm not so sure because of the rules for casting from unsigned to signed
You exactly quoted the rules: if you convert from an unsigned value to a signed one, then the result is implementation-defined or a signal is raised. In simple words, what happens is decided by your compiler.
For example, the gcc 9.2.0 compiler has the following in its documentation about implementation-defined behavior of integers:
The result of, or the signal raised by, converting an integer to a signed integer type when the value cannot be represented in an object of that type (C90 6.2.1.2, C99 and C11 6.3.1.3).
For conversion to a type of width N, the value is reduced modulo 2^N to be within range of the type; no signal is raised.
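As a small illustration of that documented rule (assuming a platform where int and unsigned are 32 bits wide):
#include <stdio.h>

int main(void) {
    unsigned u = 4294967295u;  /* UINT_MAX on a 32-bit unsigned */
    int i = (int) u;           /* out of range for int: gcc reduces it modulo 2^32 */
    printf("%d\n", i);         /* prints -1 under gcc's documented behaviour */
    return 0;
}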
I had to do something similar; however, I was working with known width types from stdint.h and needed to handle wrapping 32-bit signed integer operations. The implementation below works because stdint types are required to be 2's complement. I was trying to emulate the behaviour in Java, so I had some Java code generate a bunch of test cases and have tested on clang, gcc and MSVC.
#include <stdint.h>

inline int32_t add_wrap_i32(int32_t a, int32_t b)
{
    const int64_t a_widened = a;
    const int64_t b_widened = b;
    const int64_t sum = a_widened + b_widened;
    return (int32_t)(sum & INT64_C(0xFFFFFFFF));
}

inline int32_t sub_wrap_i32(int32_t a, int32_t b)
{
    const int64_t a_widened = a;
    const int64_t b_widened = b;
    const int64_t difference = a_widened - b_widened;
    return (int32_t)(difference & INT64_C(0xFFFFFFFF));
}

inline int32_t mul_wrap_i32(int32_t a, int32_t b)
{
    const int64_t a_widened = a;
    const int64_t b_widened = b;
    const int64_t product = a_widened * b_widened;
    return (int32_t)(product & INT64_C(0xFFFFFFFF));
}
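For instance, a usage sketch (it assumes the functions above are in scope, and that the narrowing cast wraps as it does on typical two's-complement implementations):
#include <stdint.h>
#include <stdio.h>

int main(void) {
    printf("%ld\n", (long) add_wrap_i32(INT32_MAX, 1));  /* wraps around: -2147483648 */
    printf("%ld\n", (long) add_wrap_i32(-5, 3));         /* ordinary case: -2 */
    return 0;
}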
It seems ridiculous, but I think that the recommended method is to use memcpy. Apparently all modern compilers optimize the memcpy away and it ends up doing just what you're hoping in the first place -- preserving the bit pattern from the unsigned addition.
#include <string.h>

int add_wrap(int a, int b) {
    unsigned u = (unsigned) a + b;        /* the addition is done in unsigned, so it wraps */
    int result;
    memcpy(&result, &u, sizeof(result));  /* preserve the bit pattern in an int */
    return result;
}
On x86 clang with optimization, this is a single instruction if the destination is a register.

How to understand the following code in C?

I'm using the PCG Random Number Generation package and cannot understand the following code.
time(NULL) ^ (intptr_t)&printf
which is an argument to the function that generates the seed for randomization:
void pcg32_srandom(uint64_t seed, uint64_t seq)
In the main function, it is used as follows:
pcg32_srandom(time(NULL) ^ (intptr_t)&printf, 54u);
BTW, I also want to ask why "54u" is written that way?
I am not sure how much you know about random number generators, but often a random number generator is initialized by passing a number to it called a "seed". In this case, the seed is chosen to be the time returned by the time function XORed with the address of the printf function. I think this is not a very good random seed, and I wouldn't trust it with any important cryptographic tasks.
In C, when you write 54u it tells the compiler that you are writing an unsigned number. The u is not actually needed in the code you posted.
Let's break it down...
/* original: pcg32_srandom(time(NULL) ^ (intptr_t)&printf, 54u); */
uint64_t a = time(NULL);
uint64_t b = (intptr_t)&printf;
uint64_t c = a ^ b;
uint64_t d = 54u;
printf("a=%llx\n", (unsigned long long) a); /* no guessing about length */
printf("b=%llx\n", (unsigned long long) b); /* (thank you #chux) */
printf("c=%llx\n", (unsigned long long) c);
printf("d=%llx\n", (unsigned long long) d);
pcg32_srandom( a ^ b, d);
So... the ^ is the bitwise xor operator (edit: I originally wrote or).
Adding the printfs should help you trace what the code is doing. Apparently the code is XOR-ing something derived from time with the address of the printf function (which is clever; I haven't seen that before).
The u on the 54u is probably the original author being cautious: when doing bit manipulation you usually don't want signed numbers.
This has some background: http://soundsoftware.ac.uk/c-pitfall-unsigned
We can see the API for pcg32_srandom() here:
http://www.pcg-random.org/using-pcg-c-basic.html
and these variants for the global RNG:
void pcg32_srandom(uint64_t initstate, uint64_t initseq)
uint32_t pcg32_random()
uint32_t pcg32_boundedrand(uint32_t bound)
So it looks like they're trying to come up with a seed for the random number generator "initstate" and, for some reason, want to use 54u as the "initseq".
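Putting it together, a usage sketch (I am assuming the pcg-c-basic package with a header named pcg_basic.h; adjust the include to whatever the package actually installs):
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "pcg_basic.h"   /* assumed header name from pcg-c-basic */

int main(void)
{
    /* Seed the global generator as in the question: time XORed with the
       address of printf as initstate, 54u as initseq. */
    pcg32_srandom(time(NULL) ^ (intptr_t)&printf, 54u);

    for (int i = 0; i < 3; ++i)
        printf("%u\n", (unsigned) pcg32_random());                    /* full 32-bit values */
    printf("die roll: %u\n", (unsigned) (1 + pcg32_boundedrand(6)));  /* uniform in 1..6 */
    return 0;
}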

Why are a and b not swapped in this code?

Here's the code:
#include <stdio.h>

union
{
    unsigned u;
    double d;
} a, b;

int main(void)
{
    printf("Enter a, b:");
    scanf("%lf %lf", &a.d, &b.d);
    if (a.d > b.d)
    {
        a.u ^= b.u ^= a.u ^= b.u;
    }
    printf("a=%g, b=%g\n", a.d, b.d);
    return 0;
}
The a.u^=b.u^=a.u^=b.u; statement should have swapped a and b if a>b, but it seems that whatever I enter, the output will always be exactly my input.
a.u^=b.u^=a.u^=b.u; causes undefined behaviour by writing to a.u twice without a sequence point. See here for discussion of this code.
You could write:
unsigned tmp;
tmp = a.u;
a.u = b.u;
b.u = tmp;
which will swap a.u and b.u. However this may not achieve the goal of swapping the two doubles, if double is a larger type than unsigned on your system (a common scenario).
It's likely that double is 64 bits, while unsigned is only 32 bits. When you swap the unsigned members of the unions, you're only getting half of the doubles.
If you change d to float, or change u to unsigned long long, it will probably work, since they're likely to be the same size.
You're also causing UB by writing to the variables twice without a sequence point. The proper way to write the XOR swap is with multiple statements.
b.u ^= a.u;
a.u ^= b.u;
b.u ^= a.u;
For more about why not to use XOR for swapping, see Why don't people use xor swaps?
In a typical environment the sizes of the data types 'unsigned' and 'double' are different; that is why the variables do not appear to change.
Also, you cannot use an XOR swap on floating-point variables directly, because they are represented completely differently in memory.
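If the goal is simply to swap the two doubles when a > b, the well-defined way needs no union and no XOR trick, just an ordinary temporary:
#include <stdio.h>

int main(void)
{
    double a, b;
    printf("Enter a, b:");
    if (scanf("%lf %lf", &a, &b) != 2)
        return 1;

    if (a > b) {
        double tmp = a;  /* plain three-step swap: well-defined for any type */
        a = b;
        b = tmp;
    }
    printf("a=%g, b=%g\n", a, b);
    return 0;
}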

CUDA __float_as_int in acosf implementation

CUDA C's math functions implementation (cuda/math_functions.h) of acosf contains the passage:
if (__float_as_int(a) < 0) {
    t1 = CUDART_PI_F - t1;
}
where a and t1 are floats and CUDART_PI_F is a float previously set to a numerical value close to the mathematical constant Pi.
I am trying to understand what the conditional (if-clause) is testing for, and what the C equivalent of it or of the function/macro __float_as_int(a) would be. I searched for the implementation of __float_as_int() but without success. It seems that __float_as_int() is a built-in macro or function of NVIDIA's NVCC. Looking at the PTX that NVCC produces for the above passage:
.reg .u32 %r<4>;
.reg .f32 %f<46>;
.reg .pred %p<4>;
// ...
mov.b32 %r1, %f1;
mov.s32 %r2, 0;
setp.lt.s32 %p2, %r1, %r2;
selp.f32 %f44, %f43, %f41, %p2;
it becomes clear that __float_as_int() is not a float-to-int rounding. (That would have yielded a cvt.s32.f32.) Instead it assigns the float %f1 as a bit-copy (b32) to %r1 (notice: %r1 is of type u32 (unsigned int)!!) and then compares %r1 as if it were an s32 (signed int, confusing!!) with %r2 (whose value is 0).
To me this looks a little odd. But obviously it is correct.
Can someone explain what's going on, and especially what __float_as_int() is doing in the context of the if-clause testing for a negative value (< 0)? And can someone provide a C equivalent of the if-clause and/or the __float_as_int() macro?
__float_as_int reinterprets the bits of a float as an int. An int is < 0 when its most significant bit is set. For a float this means the sign bit is set, which does not necessarily mean the number is negative (it can be 'negative zero', for example). It can be faster than checking whether the float is < 0.0.
A C function could look like:
int __float_as_int(float in) {
    union fi { int i; float f; } conv;
    conv.f = in;
    return conv.i;
}
In some other versions of this header, __cuda___signbitf is used instead.
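To see the difference the answer describes, here is a small host-side sketch (it uses memcpy rather than a union, just another way to reinterpret the bits):
#include <stdio.h>
#include <string.h>

/* Host-side stand-in for __float_as_int: copy the float's bit pattern into an int. */
static int float_as_int(float f)
{
    int i;
    memcpy(&i, &f, sizeof i);
    return i;
}

int main(void)
{
    printf("%d\n", float_as_int(-0.0f) < 0);  /* 1: sign bit is set           */
    printf("%d\n", -0.0f < 0.0f);             /* 0: negative zero equals zero */
    printf("%d\n", float_as_int(-1.5f) < 0);  /* 1: ordinary negative number  */
    return 0;
}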
