There is a loop:
long *a = new long[32];
long *b = new long[32];
double *c = new double[32];
double d = 3.14159268;
// set a, b and c arrays
// .....
for (int i = 0; i < 32; i++) {
    d += (a[i] % b[i]) / c[i];
}
How can I implement this loop using the Intel C++ vectorization capabilities (e.g. #pragma simd or SSE instructions)?
If I write:
#pragma simd reduction(+:d)
for (int i = 0; i < 32; i++) {
    d += (a[i] % b[i]) / c[i];
}
then the speed does not increase :(
The Intel 64 and IA-32 architectures do not have a vectorized integer divide or remainder/modulo instruction, so there is no way to vectorize general remainder operations in hardware while using integer arithmetic.
There are some floating-point vector divide instructions. The double-precision divide (DIVPD) is not truly vectorized in processors I checked; it takes twice as long as a single-precision divide, so the hardware implements it by using one divider serially (and not even pipelined to any significant degree).
If single-precision suffices, you might be able to get some boost from using the single-precision vector divide (DIVPS), but you would have to deal with floating-point rounding and take care to ensure you got the desired result. Using the approximate-reciprocal instruction (RCPPS) with a Newton-Raphson refinement step might be faster than using DIVPS but will require even more care in the design.
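For illustration, here is a minimal sketch of the RCPPS + Newton-Raphson idea for the division part only (the function name is mine; it assumes single precision is acceptable and that the reduced accuracy of the approximate reciprocal, even after one refinement step, is tolerable):

#include <xmmintrin.h>  /* SSE intrinsics: _mm_rcp_ps, _mm_mul_ps, ... */

/* Approximate a / b for 4 packed floats: RCPPS gives ~12-bit reciprocals,
   one Newton-Raphson step roughly doubles that precision, but the result
   is still not correctly rounded like a true divide. */
static inline __m128 div_approx_ps(__m128 a, __m128 b)
{
    __m128 r = _mm_rcp_ps(b);                                            /* r ~= 1/b */
    r = _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(2.0f), _mm_mul_ps(b, r)));  /* r' = r * (2 - b*r) */
    return _mm_mul_ps(a, r);                                             /* a * (1/b) */
}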
I am working on a spiking neural network project in C where spikes are boolean values. Right now I have built a custom bit matrix type to represent the spike matrices.
I frequently need the dot product of the bit matrix and a matrix of single precision floats of the same size, so I was wondering how I should speed things up?
I also need to do pointwise multiplication of the float matrix and the bit matrix later.
My plan right now was just to loop through with an if statement and a bitshift. I want to speed this up.
float current = 0;
for (int i = 0; i < n_elem; i++, bit_vec >>= 1) {
    if (bit_vec & 1)
        current += weights[i];
}
I don't necessarily need to use a bit vector, it could be represented in other ways too. I have seen other answers here, but they are hardware specific and I am looking for something that can be portable.
I am not using any BLAS functions either, mostly because I am never operating on two floats. Should I be?
Thanks.
The bit_vec >>= 1 and current += weights[i] instructions cause a loop-carried dependency that will certainly prevent the compiler from generating a fast implementation (and also prevent the processor from executing it efficiently).
You can solve this by unrolling the loop. Additionally, most mainstream compilers are not smart enough to optimize out the condition and use a blend instruction available on most architectures. Conditional branches are slow, especially when they cannot easily be predicted (which is certainly your case). You can use a multiplication to help the compiler generate better instructions. Here is the result:
enum { blockSize = 4 };
float current[blockSize] = {0.f};
int i;
for (i = 0; i < n_elem - blockSize + 1; i += blockSize, bit_vec >>= blockSize)
    for (int j = 0; j < blockSize; ++j)
        current[j] += weights[i + j] * ((bit_vec >> j) & 1);
for (; i < n_elem; ++i, bit_vec >>= 1)
    if (bit_vec & 1)
        current[0] += weights[i];
float sum = 0.f;
for (int j = 0; j < blockSize; ++j)
    sum += current[j];
This code should be faster assuming n_elem is relatively big. It should still be far from efficient, since compilers like GCC and Clang fail to auto-vectorize it. This is sad since it would be several times faster with SIMD instructions (like SSE, AVX, Neon, etc.). That being said, this is exactly why people use non-portable code: to use efficient instructions manually, since compilers often fail to do that in non-trivial cases.
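For reference, here is a non-portable sketch of what that manual vectorization could look like with AVX2 intrinsics (the function name and the uint64_t type of bit_vec are my assumptions, and n_elem is assumed to be a multiple of 8 to keep the example short):

#include <immintrin.h>  /* AVX2 intrinsics */
#include <stdint.h>

/* Expand 8 mask bits into a per-lane all-ones/all-zeros vector and use it
   to zero out the corresponding weights before accumulating. */
float dot_bits_avx2(const float *weights, uint64_t bit_vec, int n_elem)
{
    __m256 acc = _mm256_setzero_ps();
    const __m256i bit_sel = _mm256_setr_epi32(1, 2, 4, 8, 16, 32, 64, 128);

    for (int i = 0; i < n_elem; i += 8, bit_vec >>= 8) {
        /* broadcast the low 8 bits, isolate one bit per lane, compare to build a mask */
        __m256i bits = _mm256_set1_epi32((int)(bit_vec & 0xFF));
        __m256i mask = _mm256_cmpeq_epi32(_mm256_and_si256(bits, bit_sel), bit_sel);
        /* zero out weights whose bit is 0, then accumulate */
        __m256 w = _mm256_and_ps(_mm256_loadu_ps(weights + i), _mm256_castsi256_ps(mask));
        acc = _mm256_add_ps(acc, w);
    }
    /* horizontal sum of the 8 accumulator lanes */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}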
I am trying to find the most efficient way to multiply two 2-dimensional arrays (single precision) in C and started with the naive idea of implementing it by following the arithmetic rules:
for (i = 0; i < n; i++) {
    sum += a[i] * b[i];
}
It worked, but was probably not the fastest routine on earth. Switching to pointer arithmetic and doing some loop unrolling, the speed improved. However, when applying SIMD the speed dropped again.
To be more precise: Compiled with Intel oneAPI with -O3 on an Intel Core i5-4690, 3.5 GHz, I see the following results:
Naive implementation: Approx. 800 MFlop/s
Using Pointer - Loop unrolling: Up to 5 GFlop/s
Applying SIMD: 3.5 - 5 GFlop/s
The speed of course varied with the size of the vectors and between the different test runs, so the figures above are only indicative, but they still raise the question of why the SIMD routine does not give a significant push:
float hsum_float_avx(float *pt_a, float *pt_b) {
    __m256 AVX2_vect1, AVX2_vect2, res_mult, hsum;
    float sumAVX;
    // load unaligned memory into two vectors
    AVX2_vect1 = _mm256_loadu_ps(pt_a);
    AVX2_vect2 = _mm256_loadu_ps(pt_b);
    // multiply the two vectors
    res_mult = _mm256_mul_ps(AVX2_vect1, AVX2_vect2);
    // calculate horizontal sum of resulting vector
    hsum = _mm256_hadd_ps(res_mult, res_mult);
    hsum = _mm256_add_ps(hsum, _mm256_permute2f128_ps(hsum, hsum, 0x1));
    // store result
    _mm_store_ss(&sumAVX, _mm_hadd_ps(_mm256_castps256_ps128(hsum), _mm256_castps256_ps128(hsum)));
    return sumAVX;
}
There must be something wrong, but I cannot find it - therefore any hint would be highly appreciated.
If your compiler supports OpenMP 4.0 or later, I'd use that to request that the compiler vectorize the original loop (which it might already be doing if you use a high enough optimization level; but OpenMP lets you give hints about things like alignment etc. to improve the results). That has the advantage over AVX intrinsics that it'll work on other architectures like ARM, or with other x86 SIMD instruction sets (assuming you tell the compiler to target them), with just a simple recompilation instead of having to rewrite your code:
float sum = 0.0f;
#pragma omp simd reduction(+:sum)
for (i = 0; i < n; i++) {
sum += a[i] * b[i];
}
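For reference: with GCC or Clang the pragma takes effect with -fopenmp or -fopenmp-simd (the latter enables only the SIMD pragmas, without threading); the classic Intel compilers use -qopenmp or -qopenmp-simd. Assuming GCC and a hypothetical source file dot.c, a compile line might look like:

gcc -O3 -march=native -fopenmp-simd dot.c -o dot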
Just found the following line in some old src code:
int e = (int)fmod(matrix[i], n);
where matrix is an array of int, and n is a size_t
I'm wondering why the use of fmod rather than % where we have integer arguments, i.e. why not:
int e = (matrix[i]) % n;
Could there possibly be a performance reason for choosing fmod over % or is it just a strange bit of code?
Could there possibly be a performance reason for choosing fmod over % or is it just a strange bit of code?
fmod might be a bit faster on architectures with a high-latency IDIV instruction that takes (say) ~50 cycles or more, so the cost of fmod's function call and the int <-> double conversions can be amortized.
According to Agner Fog's instruction tables, IDIV on the AMD K10 architecture takes 24-55 cycles. Comparing with modern Intel Haswell, its latency range is listed as 22-29 cycles; however, if there are no dependency chains, the reciprocal throughput is much better on Intel: 8-11 clock cycles.
fmod might be a tiny bit faster than the integer division on selected architectures.
Note however that if n has a known non-zero value at compile time, matrix[i] % n would be compiled as a multiplication with a small adjustment, which should be much faster than both the integer modulus and the floating-point modulus.
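For illustration, here is a minimal sketch of that transformation for an unsigned x % 3 (the magic constant is the standard one for 32-bit unsigned division by 3; signed operands need a small extra adjustment):

#include <stdint.h>

/* x % 3 without a divide: compute x / 3 via multiply-and-shift, then subtract back */
static inline uint32_t mod3(uint32_t x)
{
    uint32_t q = (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);  /* q == x / 3 */
    return x - q * 3;
}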
Another interesting difference is the behavior for n == 0 and for INT_MIN % -1. The integer modulus operation invokes undefined behavior in these cases, which results in abnormal program termination on many current architectures. Conversely, the floating-point modulus does not have these corner cases: fmod returns NaN for n == 0 and handles INT_MIN and -1 without overflow. Converting a NaN back to int is not well defined either, but it does not usually cause abnormal program termination. This might be the reason why the original programmer chose this surprising solution.
Experimentally (and quite counter-intuitively), fmod is faster than % - at least on an AMD Phenom(tm) II X4 955 with 6400 bogomips. Here are two programs that use either of the techniques, both compiled with the same compiler (GCC) and the same options (cc -O3 foo.c -lm), and run on the same hardware:
#include <math.h>
#include <stdio.h>

int main()
{
    int volatile a = 10, b = 12;
    int i, sum = 0;
    for (i = 0; i < 1000000000; i++)
        sum += a % b;
    printf("%d\n", sum);
    return 0;
}
Running time: 9.07 sec.
#include <math.h>
#include <stdio.h>

int main()
{
    int volatile a = 10, b = 12;
    int i, sum = 0;
    for (i = 0; i < 1000000000; i++)
        sum += (int)fmod(a, b);
    printf("%d\n", sum);
    return 0;
}
Running time: 8.04 sec.
I'm looking for a way to truncate a float into an int in a fast and portable (IEEE 754) way. The reason is that in this function 50% of the time is spent in the cast:
float fm_sinf(float x) {
    /* F_1_PI and F_PI are float constants for 1/pi and pi, defined elsewhere */
    const float a = 0.00735246819687011731341356165096815f;
    const float b = -0.16528911397014738207016302002888890f;
    const float c = 0.99969198629596757779830113868360584f;
    float r, x2;
    int k;
    /* bring x in range */
    k = (int)(F_1_PI * x + copysignf(0.5f, x)); /* <-- 50% of time is spent in cast */
    x -= k * F_PI;
    /* if x is in an odd pi count we must flip */
    r = 1 - 2 * (k & 1); /* trick for r = (k % 2) == 0 ? 1 : -1; */
    x2 = x * x;
    return r * x * (c + x2 * (b + a * x2));
}
The slowness of float->int casts mainly occurs when using x87 FPU instructions on x86. To do the truncation, the rounding mode in the FPU control word needs to be changed to round-to-zero and back, which tends to be very slow.
When using SSE instead of x87 instructions, a truncation is available without control word changes. You can do this using compiler options (like -mfpmath=sse -msse -msse2 in GCC) or by compiling the code as 64-bit.
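For example, assuming GCC targeting 32-bit x86 and a hypothetical source file fm_sinf.c, a compile line might look like:

gcc -O2 -msse2 -mfpmath=sse -c fm_sinf.c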
The SSE3 instruction set has the FISTTP instruction to convert to integer with truncation without changing the control word. A compiler may generate this instruction if instructed to assume SSE3.
Alternatively, the C99 lrint() function will convert to integer with the current rounding mode (round-to-nearest unless you changed it). You can use this if you remove the copysignf term. Unfortunately, this function is still not ubiquitous after more than ten years.
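A minimal sketch of that lrint variant, assuming the default round-to-nearest mode is in effect and defining the pi constants that the question's code takes from elsewhere:

#include <math.h>   /* lrintf */

#define F_PI   3.14159265358979323846f
#define F_1_PI (1.0f / F_PI)

/* same polynomial as in the question; the range reduction uses lrintf(),
   which rounds to nearest, so the copysignf(0.5f, x) term is not needed */
float fm_sinf_lrint(float x) {
    const float a = 0.00735246819687011731341356165096815f;
    const float b = -0.16528911397014738207016302002888890f;
    const float c = 0.99969198629596757779830113868360584f;
    float r, x2;
    int k;
    k = (int)lrintf(F_1_PI * x);   /* bring x in range */
    x -= k * F_PI;
    r = 1 - 2 * (k & 1);           /* flip sign in odd pi counts */
    x2 = x * x;
    return r * x * (c + x2 * (b + a * x2));
}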
I found a fast truncate method by Sree Kotay which provides exactly the optimization that I needed.
To be portable you would have to add some directives and learn a couple of assembler dialects, but you could theoretically use some inline assembly to move portions of the floating-point register into eax/rax and ebx/rbx and convert what you need by hand. The floating-point specification is a pain to deal with, but I am pretty certain that if you do it with assembly you will be way faster, as your needs are very specific and the system method is probably more generic and less efficient for your purpose.
You could skip the conversion to int altogether by using frexpf to get the mantissa and exponent, and inspect the raw mantissa (use a union) at the appropriate bit position (calculated using the exponent) to determine (the quadrant dependent) r.
I wrote some code recently (ISO/ANSI C), and was surprised at the poor performance it achieved. Long story short, it turned out that the culprit was the floor() function. Not only was it slow, but it did not vectorize (with the Intel compiler, aka ICL).
Here are some benchmarks for performing floor for all cells in a 2D matrix:
VC: 0.10
ICL: 0.20
Compare that to a simple cast:
VC: 0.04
ICL: 0.04
How can floor() be that much slower than a simple cast?! It does essentially the same thing (apart from negative numbers).
2nd question: Does someone know of a super-fast floor() implementation?
PS: Here is the loop that I was benchmarking:
void Floor(float *matA, int *intA, const int height, const int width, const int width_aligned)
{
    float *rowA = NULL;
    int *intRowA = NULL;
    int row, col;
    for (row = 0; row < height; ++row) {
        rowA = matA + row * width_aligned;
        intRowA = intA + row * width_aligned;
#pragma ivdep
        for (col = 0; col < width; ++col) {
            /*intRowA[col] = floor(rowA[col]);*/
            intRowA[col] = (int)(rowA[col]);
        }
    }
}
A couple of things make floor slower than a cast and prevent vectorization.
The most important one:
floor can modify the global state. If you pass a value that is too huge to be represented as an integer in float format, the errno variable gets set to EDOM. Special handling for NaNs is done as well. All this behavior is for applications that want to detect the overflow case and handle the situation somehow (don't ask me how).
Detecting these problematic conditions is not simple and makes up more than 90% of the execution time of floor. The actual rounding is cheap and could be inlined/vectorized. Also, it's a lot of code, so inlining the whole floor function would make your program run slower.
Some compilers have special compiler flags that allow the compiler to optimize away some of the rarely used C-standard rules. For example, GCC can be told that you're not interested in errno at all. To do so, pass -fno-math-errno or -ffast-math. ICC and VC may have similar compiler flags.
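For example, with GCC (the file name is hypothetical):

gcc -O3 -fno-math-errno floor_bench.c -o floor_bench -lm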
Btw - You can roll your own floor-function using simple casts. You just have to handle the negative and positive cases differently. That may be a lot faster if you don't need the special handling of overflows and NaNs.
If you are going to convert the result of the floor() operation to an int, and if you aren't worried about overflow, then the following code is much faster than (int)floor(x):
inline int int_floor(double x)
{
    int i = (int)x;       /* truncate */
    return i - (i > x);   /* convert trunc to floor */
}
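A quick sanity check of the trick (values chosen for illustration):

/* truncation of -1.5 gives -1; since -1 > -1.5, the correction subtracts 1 */
int a = int_floor(-1.5);  /* -2 */
/* truncation of 2.25 gives 2; 2 > 2.25 is false, so no correction */
int b = int_floor(2.25);  /*  2 */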
Branchless floor and ceiling (to better utilize the pipeline), no error check:
int f(double x)
{
    return (int)x - (x < (int)x); // as dgobbi above, needs less-than for floor
}

int c(double x)
{
    return (int)x + (x > (int)x);
}
or, implementing ceiling via floor:
int c(double x)
{
    return -(f(-x));
}
The actual fastest implementation for a large array on modern x86 CPUs would be:
change the MXCSR FP rounding mode to round towards -Infinity (aka floor). In C, this should be possible with fenv stuff, or _mm_getcsr / _mm_setcsr.
loop over the array doing _mm_cvtps_epi32 on SIMD vectors, converting 4 floats to 32-bit integer using the current rounding mode. (And storing the result vectors to the destination.)
cvtps2dq xmm0, [rdi] is a single micro-fused uop on any Intel or AMD CPU since K10 or Core 2. (https://agner.org/optimize/) Same for the 256-bit AVX version, with YMM vectors.
restore the current rounding mode to the normal IEEE default mode, using the original value of the MXCSR. (round-to-nearest, with even as a tiebreak)
This allows loading + converting + storing 1 SIMD vector of results per clock cycle, just as fast as with truncation. (SSE2 has a special FP->int conversion instruction for truncation, exactly because it's very commonly needed by C compilers. In the bad old days with x87, even (int)x required changing the x87 rounding mode to truncation and then back. Use cvttps2dq for packed float->int with truncation (note the extra t in the mnemonic), or, for scalar with the result going from XMM to integer registers, cvttss2si or cvttsd2si for scalar float or double.)
With some loop unrolling and/or good optimization, this should be possible without bottlenecking on the front-end, just 1-per-clock store throughput assuming no cache-miss bottlenecks. (And on Intel before Skylake, also bottlenecked on 1-per-clock packed-conversion throughput.) i.e. 16, 32, or 64 bytes per cycle, using SSE2, AVX, or AVX512.
Without changing the current rounding mode, you need SSE4.1 roundps to round a float to the nearest integer float using your choice of rounding modes. Or you could use one of the tricks shown in other answers that work for floats with a small enough magnitude to fit in a signed 32-bit integer, since that's your ultimate destination format anyway.
(With the right compiler options, like -fno-math-errno, and the right -march or -msse4 options, compilers can inline floor using roundps, or the scalar and/or double-precision equivalent, e.g. roundsd xmm1, xmm0, 1, but this costs 2 uops and has 1 per 2 clock throughput on Haswell for scalar or vectors. Actually, gcc8.2 will inline roundsd for floor even without any fast-math options, as you can see on the Godbolt compiler explorer. But that's with -march=haswell. It's unfortunately not baseline for x86-64, so you need to enable it if your machine supports it.)
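A minimal sketch of those steps (the function name is mine; it assumes SSE2 and, to keep it short, an element count that is a multiple of 4):

#include <emmintrin.h>   /* SSE2: _mm_cvtps_epi32, _mm_storeu_si128; pulls in _mm_getcsr/_mm_setcsr */
#include <stddef.h>
#include <stdint.h>

/* floor an array of floats into ints: switch MXCSR to round-down,
   convert whole vectors with cvtps2dq, then restore the old mode */
void floor_ps_to_epi32(const float *src, int32_t *dst, size_t n)
{
    unsigned int old_csr = _mm_getcsr();
    _MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);          /* round towards -Infinity */

    for (size_t i = 0; i < n; i += 4) {
        __m128 v  = _mm_loadu_ps(src + i);
        __m128i r = _mm_cvtps_epi32(v);             /* honours the current rounding mode */
        _mm_storeu_si128((__m128i *)(dst + i), r);
    }

    _mm_setcsr(old_csr);                            /* back to the original MXCSR (round-to-nearest) */
}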
Yes, floor() is extremely slow on all platforms since it has to implement a lot of behaviour from the IEEE fp spec. You can't really use it in inner loops.
I sometimes use a macro to approximate floor():
#define PSEUDO_FLOOR( V ) ((V) >= 0 ? (int)(V) : (int)((V) - 1))
It does not behave exactly as floor(): for example, floor(-1) == -1 but PSEUDO_FLOOR(-1) == -2, but it's close enough for most uses.
An actually branchless version that requires a single conversion between floating point and integer domains would shift the value x to all positive or all negative range, then cast/truncate and shift it back.
#include <limits.h>  /* ULONG_MAX */

long fast_floor(double x)
{
    const unsigned long offset = ~(ULONG_MAX >> 1);
    return (long)((unsigned long)(x + offset) - offset);
}

long fast_ceil(double x)
{
    const unsigned long offset = ~(ULONG_MAX >> 1);
    return (long)((unsigned long)(x - offset) + offset);
}
As pointed out in the comments, this implementation relies on the temporary value x +- offset not overflowing.
On 64-bit platforms, the original code using an int64_t intermediate value results in a three-instruction kernel; the same is available for an int32_t reduced-range floor/ceil, where |x| < 0x40000000:
#include <stdint.h>  /* int64_t */

inline int floor_x64(double x) {
    return (int)((int64_t)(x + 0x80000000UL) - 0x80000000LL);
}

inline int floor_x86_reduced_range(double x) {
    return (int)(x + 0x40000000) - 0x40000000;
}
They do not do the same thing. floor() is a function. Therefore, using it incurs a function call, allocating a stack frame, copying of parameters and retrieving the result.
Casting is not a function call, so it uses faster mechanisms (I believe that it may use registers to process the values).
Probably floor() is already optimized.
Can you squeeze more performance out of your algorithm? Maybe switching rows and columns would help? Can you cache common values? Are all your compiler's optimizations on? Can you switch operating systems? Compilers?
Jon Bentley's Programming Pearls has a great review of possible optimizations.
Fast double round
double custom_round(double x)
{
    // valid for non-negative x: round half up using the fractional part
    return double((x - int(x) >= 0.5) ? (int(x) + 1) : int(x));
}
Terminal log
test custom_1 8.3837
test native_1 18.4989
test custom_2 8.36333
test native_2 18.5001
test custom_3 8.37316
test native_3 18.5012
Test
#include <cmath>
#include <ctime>
#include <iostream>
#include <limits>

using namespace std;

void test(const char* name, double (*f)(double))
{
    int it = std::numeric_limits<int>::max();
    clock_t begin = clock();
    for (int i = 0; i < it; i++)
    {
        f(double(i) / 1000.0);
    }
    clock_t end = clock();
    cout << "test " << name << " " << double(end - begin) / CLOCKS_PER_SEC << endl;
}
int main(int argc, char **argv)
{
    test("custom_1", custom_round);
    test("native_1", std::round);
    test("custom_2", custom_round);
    test("native_2", std::round);
    test("custom_3", custom_round);
    test("native_3", std::round);
    return 0;
}
Result
Type casting and using your brain is ~3 times faster than using native functions.