What is the Default addition Operator '+' of __m64

What is the Default addition Operator '+' of __m64 - c

I found that the following code(C Files) can be compiled successfully in x86_64, gcc 10.1.0.
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
typedef union{
__m64 x;
#if defined(__arm__) || defined(__aarch64__)
int32x2_t d[1];
#endif
uint8_t i8u[8];
}u_m64;
int main()
{
u_m64 a, b, c;
c.x = a.x + b.x;
return 0;
}
But there are lots of add function for __m64, like "_mm_add_pi16, _mm_hadd_pi16", "_mm_add_si64" and so on(The same applies to __mm128, __mm256...). So which one is called by the operate '+' ? And how can a 'Operator Overloading' be used in a C Files?

Yeah, gcc and clang provide basic operators for builtin SIMD types, which is frankly so beyond stupid that it's not even remotely funny :(
Anyhow, this mechanism isn't working in the same way as operator overloading in C++. What it's actually doing, is promoting __m64 to be a true intrinsic type (such as int/float), meaning the operators are at a language level, rather than overload level. (That's why it works in C).
In this case I would assume it is calling add (rather than horizontal add).
However, we now hit the biggest problem! - The contents of __m64 are NOT known at compile time!
Within any given __m64, we could be storing any permutation of:
8 x int8
4 x int16
2 x int32
8 x uint8
4 x uint16
2 x uint32
For addition (ignoring the saturated variants) that means the addition operator could be calling any one these perfectly valid choices:
_mm_add_pi8
_mm_add_pi16
_mm_add_pi32
I don't know which of those instructions gcc/clang ends up calling in this context, however I do know that it's always going to be the wrong instruction 66.66% of the time :(

Related

Issue working with uint2 and CUDA

Recently I started working with CUDA and Ethereum and I found a little bit of code snipet on a function that when I try to port to a cuda file I get some errors.
Here is the code snippet:
void keccak_f1600_round(uint2* a, uint r, uint out_size)
{
#if !__ENDIAN_LITTLE__
for (uint i = 0; i != 25; ++i)
a[i] = make_uint2(a[i].y, a[i].x);
#endif
uint2 b[25];
uint2 t;
// Theta
b[0] = a[0] ^ a[5] ^ a[10] ^ a[15] ^ a[20];
#if !__ENDIAN_LITTLE__
for (uint i = 0; i != 25; ++i)
a[i] = make_uint2(a[i].y, a[i].x);
#endif
}
The error I am getting concern the b[0] line and is:
error: no operator "^=" matches these operands operand types are: uint2 ^= uint2
TO be honest I don't have a lot of experience with uint2 and cuda and that is why I am asking what should I do to correct this issue.

The exclusive-or operator works with unsigned long long, but not with uint2 (which for CUDA is a built-in struct containing two unsigned ints).
To make the code work, there are several options. Some that come to my mind:
you can use reinterpret-cast<unsigned long long &> before each uint2 in the line that does the exclusive-or (see How to use reinterpret_cast in C++?)
you can rewrite the code to use unsigned long long types everywhere you use uint2 now. This probably produces the most maintainable code.
you can rewrite the line for the exclusive-or among uint2 types, as a pair of exclusive-or lines using the .x and .y members of the uint2, as each is an unsigned int type.
you can define a union type to allow access to the data that is currently type uint2, as either a uint2 or a unsigned long long.
you can overload the ^ exclusive-or operator to work with uint2 types.
you can replace the line that produces the error with asm statements to generate the PTX code to perform the exclusive-or for you. See http://docs.nvidia.com/cuda/inline-ptx-assembly/index.html#using-inline-ptx-assembly-in-cuda

uint2 is simply a struct, you'll need to implement ^ using a[].x and a[].y. I couldn't find where the builtin declarations are but Are there advantages to using the CUDA vector types? has a good description of their use.

C programming: testing oversized shift

Negative and oversized bit shifts are undefined behaviour. But I need to do positive/negative shifts in many places of my code, so I wrote a generic function for that:
uint64_t shift_bit(uint64_t b, int i)
{
// detect oversized shift
assert(-63 <= i && i <= 63);
return i >= 0 ? b << i : b >> -i;
}
That takes care of negative shifts. But what about oversized shift?
Is a C compiler allowed to optimize away the assert()? (Or to replace it with assert(true), which has the same effect).
Is there a way to code around it?
PS: Please base your answers only on what the standard guarantees, not on what your specific compiler does. My program needs to be 100% portable, as it gets compiled for many different platforms, and with many different compilers.

assert() is replaced to "nothing" on release.
#ifdef NDEBUG
#define assert(exp) ((void)0)
#else
#define assert(exp) /*implementation*/
#endif

The difference in these 2 snippets of code

Here are 2 snippets of code, one is a macro and one is a function. They seem to do the same thing but after running them it seems that they exhibit different behavior and I don't know why. Could anyone help me please? Thanks!
#define ROL(a, offset) ((((Lane)a) << ((offset) % LANE_BIT_SIZE)) ^ (((Lane)a) >> (LANE_BIT_SIZE-((offset) % LANE_BIT_SIZE))))
Lane rotateLeft(Lane lane, int rotateCount)
{
return ((Lane)lane << (rotateCount % LANE_BIT_SIZE)) ^ ((Lane)lane >> (LANE_BIT_SIZE - (rotateCount % LANE_BIT_SIZE))) ;
}
Note: the Lane type is just an unsigned int and LANE_BIT_SIZE is a number representing the size of Lane in terms of No. of bits.

Think of using a macro as substituting the body of the macro into the place you're using it.
As an example, suppose you were to define a macro: #define quadruple(a) ((a) * (a) * (a) * (a))
... then you were to use that macro like so:
int main(void) {
int x = 1;
printf("%d\n", quadruple(x++));
}
What would you expect to happen here? Substituting the macro into the code results in:
int main(void) {
int x = 1;
printf("%d\n", ((x++) * (x++) * (x++) * (x++)));
}
As it turns out, this code uses undefined behaviour because it modifies x multiple times in the same expression. That's no good! Do you suppose this could account for your difference in behaviour?

one is macro and other one is function, the simple understanding gives difference in the way it will be called.
As in case of function CONTEXT SWITCHING will be there, you code flow will be changed to the calling function and will return eventually so there will be very small delay in execution when compared to MACRO.
other than that there should not be any other difference.
Please try by declaring the function as inline function then both should be same.

Lane may be promoted to a type with more bits, e.g. when it's an unsigned char or unsigned short, or when it is used in a larger assignment with mixed types. The <<operation will then shift the higher bits into the additional bits of the larger type.
With the function call these bits will be just cut off, because it returns a Lane, while the macro gives you the full result of the promoted type, including the additional bits - beside the other problems of macros, like multiple evaluations of the arguments.

Here are 2 snippets of code, one is a macro and one is a function.
They seem to do the same thing but after running them it seems that
they exhibit different behavior and I don't know why.
No they are doing the same thing.
ROL(a, offset); //does a*(2^offset)
rotateLeft(Lane lane, int rotateCount); //does lane*(2^rotateCount)
The only difference is that ROL is implemented through a macro , and rotateLeft() is a function.
Differences between Macros and functions
Macros are executed in the preprocessing stage of the compiler ,
whereas function executes , when it is called at runtime execution.
As a result Macros execute faster than functions , but when called
multiple times , the macro text is substitutes same code redundantly, and they end up consuming more "code" memory than an implementation using functions.
Unlike a function , there is no Type Enforcement in a macro.

How to store the contents of a __m128d simd vector as doubles without accessing it as a union?

The code i want to optimize is basically a simple but large arithmetic formula, it should be fairly simple to analyze the code automatically to compute the independent multiplications/additions in parallel, but i read that autovectorization only works for loops.
I've read multiple times now that access of single elements in a vector via union or some other way should be avoided at all costs, instead should be replaced by a _mm_shuffle_pd (i'm working on doubles only)...
I don't seem to figure out how I can store the content of a __m128d vector as doubles without accessing it as a union. Also, does an operation like this give any performance gain when compared to scalar code?
union {
__m128d v;
double d[2];
} vec;
union {
__m128d v;
double d[2];
} vec2;
vec.v = index1;
vec2.v = index2;
temp1 = _mm_mul_pd(temp1, _mm_set_pd(bvec[vec.d[1]], bvec[vec2[1]]));
also, the two unions look ridiculously ugly, but when using
union dvec {
__m128d v;
double d[2];
} vec;
Trying to declare the indexX as dvec, the compiler complained dvec is undeclared.

Unfortunately if you look at MSDN it says the following:
You should not access the __m128d fields directly. You can, however, see these types in the debugger. A variable of type __m128 maps to the XMM[0-7] registers.
I'm no expert in SIMD, however this tells me that what you're doing won't work as it's just not designed to.
EDIT:
I've just found this, and it says:
Use __m128, __m128d, and __m128i only on the left-hand side of an assignment, as a return value, or as a parameter. Do not use it in other arithmetic expressions such as "+" and ">>".
It also says:
Use __m128, __m128d, and __m128i objects in aggregates, such as unions (for example, to access the float elements) and structures.
So maybe you can use them, but only in unions. Seems contradictory to what MSDN says, however.
EDIT2:
Here is another interesting resource that describes with examples on how to use these SIMD types
In the above link, you'll find this line:
#include <math.h>
#include <emmintrin.h>
double in1_min(__m128d x)
{
return x[0];
}
In the above we use a new extension in gcc 4.6 to access the high and low parts via indexing. Older versions of gcc require using a union and writing to an array of two doubles. This is cumbersome, and extra slow when optimization is turned off.

_mm_cvtsd_f64 + _mm_unpackhi_pd
For doubles:
#include <assert.h>
#include <x86intrin.h>
int main(void) {
__m128d x = _mm_set_pd(1.5, 2.5);
/* _mm_cvtsd_f64 + _mm_unpackhi_pd */
assert(_mm_cvtsd_f64(x) == 2.5);
assert(_mm_cvtsd_f64(_mm_unpackhi_pd(x, x)) == 1.5);
}
For floats, I have posted the following examples at How to convert a hex float to a float in C/C++ using _mm_extract_ps SSE GCC instrinc function
_mm_cvtss_f32 + _mm_shuffle_ps
_MM_EXTRACT_FLOAT
For ints you can use _mm_extract_epi32:
#include <assert.h>
#include <x86intrin.h>
int main(void) {
__m128i x = _mm_set_epi32(1, 2, 3, 4);
assert(_mm_extract_epi32(x, 3) == 4);
assert(_mm_extract_epi32(x, 2) == 3);
assert(_mm_extract_epi32(x, 1) == 1);
assert(_mm_extract_epi32(x, 0) == 1);
}
GitHub upstream.
Compile and run examples with:
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
./main.out
Tested on Ubuntu 19.04 amd64.

There is a double _mm_cvtsd_f64 (__m128d a) function in defined in "emmintrin.h" to access the lower double of an sse vector of two doubles.
From the Intel Intrinsics guide:
Synopsis
double _mm_cvtsd_f64 (__m128d a)
include "emmintrin.h"
Instruction: movsd
CPUID Feature Flag: SSE2
Description:
Copy the lower double-precision (64-bit) floating-point element of a to dst.
Operation
dst[63:0] := a[63:0]

Can the C preprocessor perform integer arithmetic?

As the questions says, is the C preprocessor able to do it?
E.g.:
#define PI 3.1416
#define OP PI/100
#define OP2 PI%100
Is there any way OP and/or OP2 get calculated in the preprocessing phase?

Integer arithmetic? Run the following program to find out:
#include "stdio.h"
int main() {
#if 1 + 1 == 2
printf("1+1==2\n");
#endif
#if 1 + 1 == 3
printf("1+1==3\n");
#endif
}
Answer is "yes", there is a way to make the preprocessor perform integer arithmetic, which is to use it in a preprocessor condition.
Note however that your examples are not integer arithmetic. I just checked, and gcc's preprocessor fails if you try to make it do float comparisons. I haven't checked whether the standard ever allows floating point arithmetic in the preprocessor.
Regular macro expansion does not evaluate integer expressions, it leaves it to the compiler, as can be seen by preprocessing (-E in gcc) the following:
#define ONEPLUSONE (1 + 1)
#if ONEPLUSONE == 2
int i = ONEPLUSONE;
#endif
Result is int i = (1 + 1); (plus probably some stuff to indicate source file names and line numbers and such).

The code you wrote doesn't actually make the preprocessor do any calculation. A #define does simple text replacement, so with this defined:
#define PI 3.1416
#define OP PI/100
This code:
if (OP == x) { ... }
becomes
if (3.1416/100 == x) { ... }
and then it gets compiled. The compiler in turn may choose to take such an expression and calculate it at compile time and produce a code equivalent to this:
if (0.031416 == x) { ... }
But this is the compiler, not the preprocessor.
To answer your question, yes, the preprocessor CAN do some arithmetic. This can be seen when you write something like this:
#if (3.141/100 == 20)
printf("yo");
#elif (3+3 == 6)
printf("hey");
#endif

YES, I mean: it can do arithmetic :)
As demonstrated in 99 bottles of beer.

Yes, it can be done with the Boost Preprocessor. And it is compatible with pure C so you can use it in C programs with C only compilations. Your code involves floating point numbers though, so I think that needs to be done indirectly.
#include <boost/preprocessor/arithmetic/div.hpp>
BOOST_PP_DIV(11, 5) // expands to 2
#define KB 1024
#define HKB BOOST_PP_DIV(A,2)
#define REM(A,B) BOOST_PP_SUB(A, BOOST_PP_MUL(B, BOOST_PP_DIV(A,B)))
#define RKB REM(KB,2)
int div = HKB;
int rem = RKB;
This preprocesses to (check with gcc -S)
int div = 512;
int rem = 0;
Thanks to this thread.

Yes.
I can't believe that no one has yet linked to a certain obfuscated C contest winner. The guy implemented an ALU in the preprocessor via recursive includes. Here is the implementation, and here is something of an explanation.
Now, that said, you don't want to do what that guy did. It's fun and all, but look at the compile times in his hint file (not to mention the fact that the resulting code is unmaintainable). More commonly, people use the pre-processor strictly for text replacement, and evaluation of constant integer arithmetic happens either at compile time or run time.
As others noted however, you can do some arithmetic in #if statements.

Be carefull when doing arithmetic: add parenthesis.
#define SIZE4 4
#define SIZE8 8
#define TOTALSIZE SIZE4 + SIZE8
If you ever use something like:
unsigned int i = TOTALSIZE/4;
and expect i to be 3, you would get 4 + 2 = 6 instead.
Add parenthesis:
#define TOTALSIZE (SIZE4 + SIZE8)

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

What is the Default addition Operator '+' of __m64 - c

Related

Issue working with uint2 and CUDA

C programming: testing oversized shift

The difference in these 2 snippets of code

How to store the contents of a __m128d simd vector as doubles without accessing it as a union?

Can the C preprocessor perform integer arithmetic?

Categories

Resources