An intrinsic to shift left/right based on a mask - c

I'm trying to combine a left and right shift and I was wondering if there's a built-in intrinsic for it that I don't know about.
uint32_t x = whatever;
if (value < 0) {
    x <<= -value;   // shift left by the magnitude of a negative count
} else {
    x >>= value;
}
//The way I would do it
_a = _mm256_sllv_epi32(x, shiftValues);
_b = _mm256_srlv_epi32(x, shiftValues);
_value = some gather / blend function to merge _a and _b
I'm limiting myself to mm256 intrinsics since that's all my processor supports, but if there's an mm512 intrinsic for this problem I'll still make a note of it. Thanks.
I have looked through the Intel intrinsics list for something of the sort, but the naming scheme isn't the most "aha! that's a thing", at least when I read "sllv".
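For what it's worth, the sllv/srlv-plus-blend merge sketched above could look like this (a sketch only, assuming signed per-lane counts where negative means shift left by the magnitude; the helper name is mine):
#include <immintrin.h>

// Per-lane: shift left by -count when count < 0, else logical right by count.
// AVX2. Counts with |count| >= 32 produce 0, matching sllv/srlv behaviour.
static inline __m256i shift_by_sign(__m256i x, __m256i counts)
{
    const __m256i zero = _mm256_setzero_si256();
    __m256i neg   = _mm256_cmpgt_epi32(zero, counts);  // all-ones where count < 0
    __m256i mag   = _mm256_abs_epi32(counts);          // |count| per lane
    __m256i left  = _mm256_sllv_epi32(x, mag);
    __m256i right = _mm256_srlv_epi32(x, mag);
    return _mm256_blendv_epi8(right, left, neg);       // pick left where count < 0
}
With this, the three-step sketch above collapses to a single call, e.g. _value = shift_by_sign(_x, shiftValues);.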

Keep float number in range [duplicate]

Is there a more efficient way to clamp real numbers than using if statements or ternary operators?
I want to do this both for doubles and for a 32-bit fixpoint implementation (16.16). I'm not asking for code that can handle both cases; they will be handled in separate functions.
Obviously, I can do something like:
double clampedA;
double a = calculate();
clampedA = a > MY_MAX ? MY_MAX : a;
clampedA = clampedA < MY_MIN ? MY_MIN : clampedA;
or
double a = calculate();
double clampedA = a;
if (clampedA > MY_MAX)
    clampedA = MY_MAX;
else if (clampedA < MY_MIN)
    clampedA = MY_MIN;
The fixpoint version would use functions/macros for comparisons.
This is done in a performance-critical part of the code, so I'm looking for as efficient a way to do it as possible (which I suspect would involve bit manipulation).
EDIT: It has to be standard/portable C, platform-specific functionality is not of any interest here. Also, MY_MIN and MY_MAX are the same type as the value I want clamped (doubles in the examples above).
Both GCC and clang generate beautiful assembly for the following simple, straightforward, portable code:
double clamp(double d, double min, double max) {
    const double t = d < min ? min : d;
    return t > max ? max : t;
}
> gcc -O3 -march=native -Wall -Wextra -Wc++-compat -S -fverbose-asm clamp_ternary_operator.c
GCC-generated assembly:
maxsd %xmm0, %xmm1 # d, min
movapd %xmm2, %xmm0 # max, max
minsd %xmm1, %xmm0 # min, max
ret
> clang -O3 -march=native -Wall -Wextra -Wc++-compat -S -fverbose-asm clamp_ternary_operator.c
Clang-generated assembly:
maxsd %xmm0, %xmm1
minsd %xmm1, %xmm2
movaps %xmm2, %xmm0
ret
Three instructions (not counting the ret), no branches. Excellent.
This was tested with GCC 4.7 and clang 3.2 on Ubuntu 13.04 with a Core i3 M 350.
On a side note, the straightforward C++ code calling std::min and std::max generated the same assembly.
This is for doubles. And for int, both GCC and clang generate assembly with five instructions (not counting the ret) and no branches. Also excellent.
I don't currently use fixed-point, so I will not give an opinion on fixed-point.
Old question, but I was working on this problem today (with doubles/floats).
The best approach is to use SSE MINSS/MAXSS for floats and SSE2 MINSD/MAXSD for doubles. These are branchless and take one clock cycle each, and are easy to use thanks to compiler intrinsics. They confer more than an order of magnitude increase in performance compared with clamping with std::min/max.
You may find that surprising. I certainly did! Unfortunately VC++ 2010 uses simple comparisons for std::min/max even when /arch:SSE2 and /FP:fast are enabled. I can't speak for other compilers.
Here's the necessary code to do this in VC++:
#include <xmmintrin.h>
float minss(float a, float b)
{
    // Branchless SSE min.
    _mm_store_ss(&a, _mm_min_ss(_mm_set_ss(a), _mm_set_ss(b)));
    return a;
}

float maxss(float a, float b)
{
    // Branchless SSE max.
    _mm_store_ss(&a, _mm_max_ss(_mm_set_ss(a), _mm_set_ss(b)));
    return a;
}

float clamp(float val, float minval, float maxval)
{
    // Branchless SSE clamp.
    // return minss(maxss(val, minval), maxval);
    _mm_store_ss(&val, _mm_min_ss(_mm_max_ss(_mm_set_ss(val), _mm_set_ss(minval)), _mm_set_ss(maxval)));
    return val;
}
The double precision code is the same except with xxx_sd instead.
Edit: Initially I wrote the clamp function as commented. But looking at the assembler output I noticed that the VC++ compiler wasn't smart enough to cull the redundant move. One less instruction. :)
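For reference, the double-precision version built from the same pattern might look like this (a sketch, SSE2, so emmintrin.h; the function name is mine):
#include <emmintrin.h>

double clamp_sd(double val, double minval, double maxval)
{
    // Branchless SSE2 clamp, double precision.
    _mm_store_sd(&val, _mm_min_sd(_mm_max_sd(_mm_set_sd(val), _mm_set_sd(minval)),
                                  _mm_set_sd(maxval)));
    return val;
}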
If your processor has a fast instruction for absolute value (as the x86 does), you can do a branchless min and max which will be faster than an if statement or ternary operation.
min(a,b) = (a + b - abs(a-b)) / 2
max(a,b) = (a + b + abs(a-b)) / 2
If one of the terms is zero (as is often the case when you're clamping) the code simplifies a bit further:
max(a,0) = (a + abs(a)) / 2
When you're combining both operations, you can merge the two /2 divisions into a single /4 (or *0.25) to save a step.
The following code is over 3x faster than ternary on my Athlon II X2, when using the optimization for FMIN=0.
#include <math.h>   // fabs

double clamp(double value)
{
    double temp = value + FMAX - fabs(value - FMAX);
#if FMIN == 0
    return (temp + fabs(temp)) * 0.25;
#else
    return (temp + (2.0 * FMIN) + fabs(temp - (2.0 * FMIN))) * 0.25;
#endif
}
Ternary operator is really the way to go, because most compilers are able to compile them into a native hardware operation that uses a conditional move instead of a branch (and thus avoids the mispredict penalty and pipeline bubbles and so on). Bit-manipulation is likely to cause a load-hit-store.
In particular, PPC and x86 with SSE2 have a hardware op that could be expressed as an intrinsic something like this:
double fsel(double a, double b, double c) {
    return a >= 0 ? b : c;
}
The advantage is that it does this inside the pipeline, without causing a branch. In fact, if your compiler uses the intrinsic, you can use it to implement your clamp directly:
inline double clamp(double a, double min, double max)
{
    a = fsel(a - min, a, min);
    return fsel(a - max, max, a);
}
I strongly suggest you avoid bit-manipulation of doubles using integer operations. On most modern CPUs there is no direct means of moving data between double and int registers other than by taking a round trip to the dcache. This will cause a data hazard called a load-hit-store which basically empties out the CPU pipeline until the memory write has completed (usually around 40 cycles or so).
The exception to this is if the double values are already in memory and not in a register: in that case there is no danger of a load-hit-store. However your example indicates you've just calculated the double and returned it from a function which means it's likely to still be in XMM1.
For the 16.16 representation, the simple ternary is unlikely to be bettered speed-wise.
And for doubles, because you need it standard/portable C, bit-fiddling of any kind will end badly.
Even if a bit-fiddle was possible (which I doubt), you'd be relying on the binary representation of doubles. THIS (and their size) IS IMPLEMENTATION-DEPENDENT.
Possibly you could "guess" this using sizeof(double) and then comparing the layout of various double values against their common binary representations, but I think you're on a hiding to nothing.
The best rule is TELL THE COMPILER WHAT YOU WANT (ie ternary), and let it optimise for you.
EDIT: Humble pie time. I just tested quinmars' idea (below), and it works - if you have IEEE-754 floats. This gave a speedup of about 20% on the code below. Obviously non-portable, but I think there may be a standardised way of asking your compiler if it uses IEEE-754 float formats with an #if...?
double FMIN = 3.13;
double FMAX = 300.44;
double FVAL[10] = {-100, 0.23, 1.24, 3.00, 3.5, 30.5, 50, 100.22, 200.22, 30000};
uint64_t Lfmin = *(uint64_t *)&FMIN;
uint64_t Lfmax = *(uint64_t *)&FMAX;
DWORD start = GetTickCount();
for (int j = 0; j < 10000000; ++j)
{
    uint64_t *pfvalue = (uint64_t *)&FVAL[0];
    for (int i = 0; i < 10; ++i, ++pfvalue)
        *pfvalue = (*pfvalue < Lfmin) ? Lfmin : (*pfvalue > Lfmax) ? Lfmax : *pfvalue;
}
volatile DWORD hacktime = GetTickCount() - start;
for (int j = 0; j < 10000000; ++j)
{
    double *pfvalue = &FVAL[0];
    for (int i = 0; i < 10; ++i, ++pfvalue)
        *pfvalue = (*pfvalue < FMIN) ? FMIN : (*pfvalue > FMAX) ? FMAX : *pfvalue;
}
volatile DWORD normaltime = GetTickCount() - (start + hacktime);
The bits of an IEEE 754 floating-point number are ordered in such a way that if you compare the bits interpreted as integers, you get the same results as if you compared the floats directly (strictly speaking, as sign-magnitude integers - see the Kahan quote below - so negative values need extra care). So if you find or know a way to clamp integers, you can use it for (IEEE 754) floats as well. Sorry, I don't know a faster way.
If you have the floats stored in arrays you can consider using some CPU extensions like SSE3, as rkj said. You can take a look at liboil, it does all the dirty work for you. It keeps your program portable and uses faster CPU instructions if possible. (I'm not sure, though, how OS/compiler-independent liboil is.)
Rather than testing and branching, I normally use this format for clamping:
clampedA = fmin(fmax(a,MY_MIN),MY_MAX);
Although I have never done any performance analysis on the compiled code.
Realistically, no decent compiler will generate different code for an if() statement and a ?: expression. The code is simple enough that they'll be able to spot the possible paths. That said, your two examples are not identical. The equivalent code using ?: would be
a = (a > MAX) ? MAX : ((a < MIN) ? MIN : a);
as that avoids the a < MIN test when a > MAX. Now that could make a difference, as the compiler otherwise would have to spot the relation between the two tests.
If clamping is rare, you can test the need to clamp with a single test:
if (abs(a - (MAX+MIN)/2) > ((MAX-MIN)/2)) ...
E.g. with MIN=6 and MAX=10, this will first shift a down by 8, then check if it lies between -2 and +2. Whether this saves anything depends a lot on the relative cost of branching.
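For the doubles case, that fast path might look like this (a sketch using the question's MY_MIN/MY_MAX, with fabs rather than the integer abs):
#include <math.h>

// Clamp with a single range test on the common (in-range) path;
// only rare out-of-range values pay for the second comparison.
double clamp_rare(double a)
{
    if (fabs(a - (MY_MAX + MY_MIN) * 0.5) > (MY_MAX - MY_MIN) * 0.5)
        a = (a > MY_MAX) ? MY_MAX : MY_MIN;
    return a;
}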
Here's a possibly faster implementation similar to @Roddy's answer:
typedef int64_t i_t;
typedef double f_t;
static inline
i_t i_tmin(i_t x, i_t y) {
return (y + ((x - y) & -(x < y))); // min(x, y)
}
static inline
i_t i_tmax(i_t x, i_t y) {
return (x - ((x - y) & -(x < y))); // max(x, y)
}
f_t clip_f_t(f_t f, f_t fmin, f_t fmax)
{
#ifndef TERNARY
assert(sizeof(i_t) == sizeof(f_t));
//assert(not (fmin < 0 and (f < 0 or is_negative_zero(f))));
//XXX assume IEEE-754 compliant system (lexicographically ordered floats)
//XXX break strict-aliasing rules
const i_t imin = *(i_t*)&fmin;
const i_t imax = *(i_t*)&fmax;
const i_t i = *(i_t*)&f;
const i_t iclipped = i_tmin(imax, i_tmax(i, imin));
#ifndef INT_TERNARY
return *(f_t *)&iclipped;
#else /* INT_TERNARY */
return i < imin ? fmin : (i > imax ? fmax : f);
#endif /* INT_TERNARY */
#else /* TERNARY */
return fmin > f ? fmin : (fmax < f ? fmax : f);
#endif /* TERNARY */
}
See Compute the minimum (min) or maximum (max) of two integers without branching and Comparing floating point numbers
The IEEE float and double formats were designed so that the numbers are “lexicographically ordered”, which – in the words of IEEE architect William Kahan – means “if two floating-point numbers in the same format are ordered (say x < y), then they are ordered the same way when their bits are reinterpreted as Sign-Magnitude integers.”
A test program:
/** gcc -std=c99 -fno-strict-aliasing -O2 -lm -Wall *.c -o clip_double && clip_double */
#include <assert.h>
#include <iso646.h> // not, and
#include <math.h> // isnan()
#include <stdbool.h> // bool
#include <stdint.h> // int64_t
#include <stdio.h>
static
bool is_negative_zero(f_t x)
{
return x == 0 and 1/x < 0;
}
static inline
f_t range(f_t low, f_t f, f_t hi)
{
return fmax(low, fmin(f, hi));
}
static const f_t END = 0./0.;
#define TOSTR(f, fmin, fmax, ff) ((f) == (fmin) ? "min" : \
((f) == (fmax) ? "max" : \
(is_negative_zero(ff) ? "-0.": \
((f) == (ff) ? "f" : #f))))
static int test(f_t p[], f_t fmin, f_t fmax, f_t (*fun)(f_t, f_t, f_t))
{
assert(isnan(END));
int failed_count = 0;
for ( ; ; ++p) {
const f_t clipped = fun(*p, fmin, fmax), expected = range(fmin, *p, fmax);
if(clipped != expected and not (isnan(clipped) and isnan(expected))) {
failed_count++;
fprintf(stderr, "error: got: %s, expected: %s\t(min=%g, max=%g, f=%g)\n",
TOSTR(clipped, fmin, fmax, *p),
TOSTR(expected, fmin, fmax, *p), fmin, fmax, *p);
}
if (isnan(*p))
break;
}
return failed_count;
}
int main(void)
{
int failed_count = 0;
f_t arr[] = { -0., -1./0., 0., 1./0., 1., -1., 2,
2.1, -2.1, -0.1, END};
f_t minmax[][2] = { -1, 1, // min, max
0, 2, };
for (int i = 0; i < (sizeof(minmax) / sizeof(*minmax)); ++i)
failed_count += test(arr, minmax[i][0], minmax[i][1], clip_f_t);
return failed_count & 0xFF;
}
In console:
$ gcc -std=c99 -fno-strict-aliasing -O2 -lm *.c -o clip_double && ./clip_double
It prints:
error: got: min, expected: -0. (min=-1, max=1, f=0)
error: got: f, expected: min (min=-1, max=1, f=-1.#INF)
error: got: f, expected: min (min=-1, max=1, f=-2.1)
error: got: min, expected: f (min=-1, max=1, f=-0.1)
I tried the SSE approach to this myself, and the assembly output looked quite a bit cleaner, so I was encouraged at first, but after timing it thousands of times, it was actually quite a bit slower. It does indeed look like the VC++ compiler isn't smart enough to know what you're really intending, and it appears to move things back and forth between the XMM registers and memory when it shouldn't. That said, I don't know why the compiler isn't smart enough to use the SSE min/max instructions on the ternary operator when it seems to use SSE instructions for all floating point calculations anyway. On the other hand, if you're compiling for PowerPC, you can use the fsel intrinsic on the FP registers, and it's way faster.
As pointed out above, the fmin/fmax functions work well (in gcc, with -ffast-math). Although gfortran has patterns to use IA instructions corresponding to max/min, g++ does not. In icc one must instead use std::min/max, because icc doesn't allow short-cutting the specification of how fmin/fmax work with non-finite operands.
My 2 cents in C++. Probably no different from using ternary operators, and hopefully no branching code is generated:
#include <algorithm>

template <typename T>
inline T clamp(T val, T lo, T hi) {
    return std::max(lo, std::min(hi, val));
}
If I understand properly, you want to limit a value "a" to a range between MY_MIN and MY_MAX. The type of "a" is a double. You did not specify the type of MY_MIN or MY_MAX.
The simple expression:
clampedA = (a > MY_MAX)? MY_MAX : (a < MY_MIN)? MY_MIN : a;
should do the trick.
I think there may be a small optimization to be made if MY_MAX and MY_MIN happen to be integers:
int b = (int)a;
clampedA = (b > MY_MAX)? (double)MY_MAX : (b < MY_MIN)? (double)MY_MIN : a;
By changing to integer comparisons, it is possible you might get a slight speed advantage.
If you want to use fast absolute value instructions, check out this snippet of code I found in minicomputer, which clamps a float to the range [0,1]:
clamped = 0.5*(fabs(x)-fabs(x-1.0f) + 1.0f);
(I simplified the code a bit). We can think about it as taking two values, one reflected to be >0
fabs(x)
and the other reflected about 1.0 to be <1.0
1.0-fabs(x-1.0)
And we take the average of them. If it is in range, then both values will be the same as x, so their average will again be x. If it is out of range, then one of the values will be x, and the other will be x flipped over the "boundary" point, so their average will be precisely the boundary point.
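Wrapped up as a self-contained helper (a sketch; the function name is mine):
#include <math.h>

// Branchless clamp of x to [0, 1] via the reflect-and-average identity above.
static inline float clamp01(float x)
{
    return 0.5f * (fabsf(x) - fabsf(x - 1.0f) + 1.0f);
}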

GCC access high/low machine words in double machine word types (including asm)

I use various double-machine-word types in GCC, e.g. (u)int128_t on x86_64 and (u)int64_t on i386, ARM, etc.
I am looking for a correct/portable/clean way of accessing and manipulating the individual machine words (mostly in assembler). E.g. on 32-bit machines I want to directly access the high/low 32-bit parts of an int64_t, which gcc uses internally, without stupid error-prone code like the below. Similarly, for the "native" 128-bit types I want to access the 64-bit parts gcc is using (not needed for the example below, as "add" is simple enough, but in general).
Consider the 32-bit ASM path in the following code, which adds two int128_t values (the type may be "native" to gcc, "native" to the machine, or "half native" to the machine); it's horrendous and hard to maintain (and slower).
#define BITS 64
#if defined(USENATIVE)
// USE "NATIVE" 128bit GCC TYPE
typedef __int128_t int128_t;
typedef __uint128_t uint128_t;
typedef int128_t I128;
#define HIGH(x) x
#define HIGHVALUE(x) ((uint64_t)(x >> BITS))
#define LOW(x) x
#define LOWVALUE(x) (x & UMYINTMAX)
#else
typedef struct I128 {
int64_t high;
uint64_t low;
} I128;
#define HIGH(x) x.high
#define HIGHVALUE(x) x.high
#define LOW(x) x.low
#define LOWVALUE(x) x.low
#endif
#define HIGHHIGH(x) (HIGHVALUE(x) >> (BITS / 2))
#define HIGHLOW(x) (HIGHVALUE(x) & 0xFFFFFFFF)
#define LOWHIGH(x) (LOWVALUE(x) >> (BITS / 2))
#define LOWLOW(x) (LOWVALUE(x) & 0xFFFFFFFF)
inline I128 I128add(I128 a, const I128 b) {
#if defined(USENATIVE)
return a + b;
#elif defined(USEASM) && defined(X86_64)
__asm(
"ADD %[blo], %[alo]\n"
"ADC %[bhi], %[ahi]"
: [alo] "+g" (a.low), [ahi] "+g" (a.high)
: [blo] "g" (b.low), [bhi] "g" (b.high)
: "cc"
);
return a;
#elif defined(USEASM) && defined(X86_32)
// SLOWER DUE TO ALL THE CRAP
int32_t ahihi = HIGHHIGH(a), bhihi = HIGHHIGH(b);
uint32_t ahilo = HIGHLOW(a), bhilo = HIGHLOW(b);
uint32_t alohi = LOWHIGH(a), blohi = LOWHIGH(b);
uint32_t alolo = LOWLOW(a), blolo = LOWLOW(b);
__asm(
"ADD %[blolo], %[alolo]\n"
"ADC %[blohi], %[alohi]\n"
"ADC %[bhilo], %[ahilo]\n"
"ADC %[bhihi], %[ahihi]\n"
: [alolo] "+r" (alolo), [alohi] "+r" (alohi), [ahilo] "+r" (ahilo), [ahihi] "+r" (ahihi)
: [blolo] "g" (blolo), [blohi] "g" (blohi), [bhilo] "g" (bhilo), [bhihi] "g" (bhihi)
: "cc"
);
a.high = ((int64_t)ahihi << (BITS / 2)) + ahilo;
a.low = ((uint64_t)alohi << (BITS / 2)) + alolo;
return a;
#else
// this seems faster than adding to a directly
I128 r = {a.high + b.high, a.low + b.low};
// check for overflow of low 64 bits, add carry to high
// avoid conditionals
r.high += r.low < a.low || r.low < b.low;
return r;
#endif
}
Please note that I don't use C/ASM much; in fact, this is my first attempt at inline ASM. Being used to Java/C#/JS/PHP etc. means that something very obvious to a routine C dev may not be apparent to me (besides the obvious insecure quirkiness in code style ;)). Also, all this may be called something else entirely, because I had a very hard time finding anything online about the subject (non-native speaker as well).
Thanks a lot!
Edit 1
After much digging I have found the following theoretical solution, which works, but is unnecessarily slow (slower than the much longer gcc output!) because it forces everything to memory, whereas I am looking for a generic solution (reg/mem/possibly imm). I have also found that if you use an "r" constraint on e.g. a 64-bit int on a 32-bit machine, gcc will actually put the value in 2 registers (e.g. eax and ebx). The problem is not being able to reliably access the second part. I am sure there is some hidden operand modifier that's just hard to find, to tell gcc I want to access that second part.
uint32_t t1, t2;
__asm(
"MOV %[blo], %[t1]\n"
"MOV 4+%[blo], %[t2]\n"
"ADD %[t1], %[alo]\n"
"ADC %[t2], 4+%[alo]\n"
"MOV %[bhi], %[t1]\n"
"MOV 4+%[bhi], %[t2]\n"
"ADC %[t1], %[ahi]\n"
"ADC %[t2], 4+%[ahi]\n"
: [alo] "+o" (a.low), [ahi] "+o" (a.high), [t1] "=&r" (t1), [t2] "=&r" (t2)
: [blo] "o" (b.low), [bhi] "o" (b.high)
: "cc"
);
return a;
I have heard that this site isn't really for "code review", but since this is an interesting aspect to me, I think I can offer some advice:
In the 32-bit version, you could do the HIGHHIGH and co. with some clever overlaying of arrays of int/uint, instead of shifting and anding. Using a union is one way to do this, the other being pointer "magic". Since the assembler part of the code isn't particularly portable in the first place, using non-portable code in form of type punning casts or non-portable unions isn't really that big a deal.
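For illustration, such a union overlay might look like this (a sketch; little-endian layout assumed, and the type/field names are mine):
#include <stdint.h>

// Deliberately non-portable: relies on type punning and on the target
// storing the low machine word first (little-endian).
typedef union {
    int64_t whole;
    struct {
        uint32_t lo;
        int32_t  hi;
    } parts;
} word64;
Overlaying this on a.high, for example, turns the HIGHHIGH/HIGHLOW shift-and-mask macros into plain member reads.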
Edit: relying on position within the words may also be fine. So for example, just passing in the addresses of the input and the output in registers, and then using 4(%0) and 4(%1) to do the remaining parts, is definitely an option.
It will of course get more interesting if you have to do this for multiply and divide... I'm not sure I'd like to go there...

Unset the most significant bit in a word (int32) [C]

How can I unset the most significant set bit of a word (e.g. 0x00556844 -> 0x00156844)? There is __builtin_clz in gcc, but it just counts the zeroes, which is not what I need. Also, how should I replace __builtin_clz for MSVC or the Intel C compiler?
Currently my code is
int msb = 1<< ((sizeof(int)*8)-__builtin_clz(input)-1);
int result = input & ~msb;
UPDATE: OK, if you say this code is rather fast, then how should I make it portable? This version is for GCC; what about MSVC and ICC?
Just round down to the nearest power of 2 and then XOR that with the original value, e.g. using flp2() from Hacker's Delight:
uint32_t flp2(uint32_t x) // round x down to nearest power of 2
{
    x = x | (x >> 1);
    x = x | (x >> 2);
    x = x | (x >> 4);
    x = x | (x >> 8);
    x = x | (x >> 16);
    return x - (x >> 1);
}

uint32_t clr_msb(uint32_t x) // clear most significant set bit in x
{
    uint32_t msb = flp2(x); // get MS set bit in x
    return x ^ msb;         // XOR MS set bit to clear it
}
If you are truly concerned with performance, the best way to clear the msb has recently changed for x86 with the addition of BMI instructions.
In x86 assembly:
clear_msb:
bsrq %rdi, %rax
bzhiq %rax, %rdi, %rax
retq
Now to rewrite in C and let the compiler emit these instructions while gracefully degrading for non-x86 architectures or older x86 processors that don't support BMI instructions.
Compared to the assembly code, the C version is really ugly and verbose. But at least it meets the objective of portability. And if you have the necessary hardware and compiler directives (-mbmi, -mbmi2) to match, you're back to the beautiful assembly code after compilation.
As written, bsr() relies on a GCC/Clang builtin. If targeting other compilers you can replace with equivalent portable C code and/or different compiler-specific builtins.
#include <inttypes.h>
#include <stdio.h>
uint64_t bsr(const uint64_t n)
{
return 63 - (uint64_t)__builtin_clzll(n);
}
uint64_t bzhi(const uint64_t n,
const uint64_t index)
{
const uint64_t leading = (uint64_t)1 << index;
const uint64_t keep_bits = leading - 1;
return n & keep_bits;
}
uint64_t clear_msb(const uint64_t n)
{
return bzhi(n, bsr(n));
}
int main(void)
{
uint64_t i;
for (i = 0; i < (uint64_t)1 << 16; ++i) {
printf("%" PRIu64 "\n", clear_msb(i));
}
return 0;
}
Both assembly and C versions lend themselves naturally to being replaced with 32-bit instructions, as the original question was posed.
You can do
unsigned resetLeadingBit(uint32_t x) {
    return x & ~(0x80000000U >> __builtin_clz(x));
}
For MSVC there is _BitScanReverse, which is 31-__builtin_clz().
Actually it's the other way around: BSR is the natural x86 instruction, and the gcc intrinsic is implemented as 31-BSR.
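Putting those two comments together, a portability shim might look like this (a sketch; the helper names are mine, and x == 0 is undefined for both builtins):
#include <stdint.h>

#if defined(_MSC_VER)
#include <intrin.h>
static inline unsigned bsr32(uint32_t x)   // index of the highest set bit
{
    unsigned long idx;
    _BitScanReverse(&idx, x);
    return (unsigned)idx;
}
#else
static inline unsigned bsr32(uint32_t x)
{
    return 31u - (unsigned)__builtin_clz(x);
}
#endif

static inline uint32_t clear_top_bit(uint32_t x)
{
    return x & ~(UINT32_C(1) << bsr32(x));
}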

SIMD code for exponentiation

I am using SIMD to compute a fast exponentiation result, and I am comparing the timing with non-SIMD code. The exponentiation is implemented using the square-and-multiply algorithm.
Ordinary (non-SIMD) version of the code:
b = 1;
for (i=WPE-1; i>=0; --i){
ew = e[i];
for(j=0; j<BPW; ++j){
b = (b * b) % p;
if (ew & 0x80000000U) b = (b * a) % p;
ew <<= 1;
}
}
SIMD version:
B.data[0] = B.data[1] = B.data[2] = B.data[3] = 1U;
P.data[0] = P.data[1] = P.data[2] = P.data[3] = p;
for (i=WPE-1; i>=0; --i) {
EW.data[0] = e1[i]; EW.data[1] = e2[i]; EW.data[2] = e3[i]; EW.data[3] = e4[i];
for (j=0; j<BPW;++j){
B.v *= B.v; B.v -= (B.v / P.v) * P.v;
EWV.v = _mm_srli_epi32(EW.v,31);
M.data[0] = (EWV.data[0]) ? a1 : 1U;
M.data[1] = (EWV.data[1]) ? a2 : 1U;
M.data[2] = (EWV.data[2]) ? a3 : 1U;
M.data[3] = (EWV.data[3]) ? a4 : 1U;
B.v *= M.v; B.v -= (B.v / P.v) * P.v;
EW.v = _mm_slli_epi32(EW.v,1);
}
}
The issue is that though it computes correctly, the SIMD version takes more time than the non-SIMD version.
Please help me find the reasons. Any suggestions on SIMD coding are also welcome.
Thanks & regards,
Anup.
All operations in the inner loop should be SIMD operations, not only two. The time taken to set up the arguments for your two SIMD functions costs more than your original example gains (and the original is most likely optimized by the compiler anyway).
A SIMD loop for 32 bit int data typically looks something like this:
for (i = 0; i < N; i += 4)
{
    // load input vector(s) with data at array index i..i+3
    __m128i va = _mm_load_si128((const __m128i *)&A[i]);
    __m128i vb = _mm_load_si128((const __m128i *)&B[i]);
    // process vectors using SIMD instructions (i.e. no scalar code)
    __m128i vc = _mm_add_epi32(va, vb);
    // store result vector(s) at array index i..i+3
    _mm_store_si128((__m128i *)&C[i], vc);
}
If you find that you need to move between scalar code and SIMD code within the loop then you probably won't gain anything from SIMD optimisation.
Much of the skill in SIMD programming comes from finding ways to make your algorithm work with the limited number of supported instructions and data types that a given SIMD architecture provides. You will often need to exploit a priori knowledge of your data set to get the best possible performance, e.g. if you know for certain that your 32 bit integer values actually have a range that fits within 16 bits then that would make the multiplication part of your algorithm a lot easier to implement.
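For instance (a sketch of that last point, not from the original answer): SSE2 has no full 32x32 -> 32 multiply - _mm_mullo_epi32 only arrived with SSE4.1 - but it does have 16-bit multiplies, so values known to fit in 16 bits can be multiplied directly:
#include <emmintrin.h>

// Eight independent 16-bit products per instruction; the results are
// exact whenever the true products fit in 16 bits.
static inline __m128i mul_u16(__m128i a, __m128i b)
{
    return _mm_mullo_epi16(a, b);
}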

Computing high 64 bits of a 64x64 int product in C

I would like my C function to efficiently compute the high 64 bits of the product of two 64 bit signed ints. I know how to do this in x86-64 assembly, with imulq and pulling the result out of %rdx. But I'm at a loss for how to write this in C at all, let alone coax the compiler to do it efficiently.
Does anyone have any suggestions for writing this in C? This is performance sensitive, so "manual methods" (like Russian Peasant, or bignum libraries) are out.
This dorky inline assembly function I wrote works and is roughly the codegen I'm after:
static long mull_hi(long inp1, long inp2) {
long output = -1;
__asm__("movq %[inp1], %%rax;"
"imulq %[inp2];"
"movq %%rdx, %[output];"
: [output] "=r" (output)
: [inp1] "r" (inp1), [inp2] "r" (inp2)
:"%rax", "%rdx");
return output;
}
If you're using a relatively recent GCC on x86_64:
int64_t mulHi(int64_t x, int64_t y) {
return (int64_t)((__int128_t)x*y >> 64);
}
At -O1 and higher, this compiles to what you want:
_mulHi:
0000000000000000 movq %rsi,%rax
0000000000000003 imulq %rdi
0000000000000006 movq %rdx,%rax
0000000000000009 ret
I believe that clang and VC++ also have support for the __int128_t type, so this should also work on those platforms, with the usual caveats about trying it yourself.
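For the unsigned version, the same pattern applies (a sketch):
#include <stdint.h>

uint64_t mulHiU(uint64_t x, uint64_t y) {
    return (uint64_t)(((__uint128_t)x * y) >> 64);
}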
The general answer is that x * y can be broken down into (a*2^32 + b) * (c*2^32 + d), where a and c are the high-order 32-bit halves. Expanding gives a*c*2^64 + (a*d + b*c)*2^32 + b*d.
You multiply the terms as 32-bit numbers stored as long long (or better yet, uint64_t), remembering that each high-order factor scales the result by a further 32 bits. Then you do the adds, remembering to detect carry. Keep track of the sign. Naturally, you need to do the adds in pieces.
For code implementing the above, see my other answer.
With regard to your assembly solution, don't hard-code the mov instructions! Let the compiler do it for you. Here's a modified version of your code:
static long mull_hi(long inp1, long inp2) {
long output;
__asm__("imulq %2"
: "=d" (output)
: "a" (inp1), "r" (inp2));
return output;
}
Helpful reference: Machine Constraints
Since you did a pretty good job solving your own problem with the machine code, I figured you deserved some help with the portable version. I would leave an #ifdef in so you just use the assembly when compiling with gnu on x86.
Anyway, here is an implementation based on my general answer. I'm pretty sure this is correct, but no guarantees, I just banged this out last night. You probably should get rid of the statics positive_result[] and result_negative - those are just artefacts of my unit test.
#include <stdlib.h>
#include <stdio.h>
// stdint.h doesn't help much here because we need to call llabs()
typedef unsigned long long uint64_t;
typedef signed long long int64_t;
#define B32 0xffffffffUL
static uint64_t positive_result[2]; // used for testing
static int result_negative; // used for testing
static void mixed(uint64_t *result, uint64_t innerTerm)
{
// the high part of innerTerm is actually the easy part
result[1] += innerTerm >> 32;
// the low order a*d might carry out of the low order result
uint64_t was = result[0];
result[0] += (innerTerm & B32) << 32;
if (result[0] < was) // carry!
++result[1];
}
static uint64_t negate(uint64_t *result)
{
uint64_t t = result[0] = ~result[0];
result[1] = ~result[1];
if (++result[0] < t)
++result[1];
return result[1];
}
uint64_t higherMul(int64_t sx, int64_t sy)
{
uint64_t x, y, result[2] = { 0 }, a, b, c, d;
x = (uint64_t)llabs(sx);
y = (uint64_t)llabs(sy);
a = x >> 32;
b = x & B32;
c = y >> 32;
d = y & B32;
// the highest and lowest order terms are easy
result[1] = a * c;
result[0] = b * d;
// now have the mixed terms ad + bc to worry about
mixed(result, a * d);
mixed(result, b * c);
// now deal with the sign
positive_result[0] = result[0];
positive_result[1] = result[1];
result_negative = sx < 0 ^ sy < 0;
return result_negative ? negate(result) : result[1];
}
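A quick sanity check one might run against the compiler's own 128-bit multiply (GCC/Clang on x86_64; this harness is mine, not part of the answer):
#include <stdio.h>

int main(void)
{
    int64_t x = 0x123456789abcdefLL;
    int64_t y = -0x0fedcba987654321LL;
    uint64_t hi  = higherMul(x, y);
    uint64_t ref = (uint64_t)(((__int128)x * y) >> 64);
    printf("higherMul: %016llx, __int128: %016llx\n",
           (unsigned long long)hi, (unsigned long long)ref);
    return hi != ref;
}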
Wait, you have a perfectly good, optimized assembly solution already working for this, and you want to back it out and try to write it in an environment that doesn't support 128 bit math? I'm not following.
As you're obviously aware, this operation is a single instruction on x86-64. Obviously nothing you do is going to make it work any better.
If you really want portable C, you'll need to do something like DigitalRoss's code above and hope that your optimizer figures out what you're doing.
If you need architecture portability but are willing to limit yourself to gcc platforms, there are __int128_t (and __uint128_t) types in the compiler intrinsics that will do what you want.
