Enabling HVX SIMD in Hexagon DSP by using instruction intrinsics

Enabling HVX SIMD in Hexagon DSP by using instruction intrinsics - c

I was using Hexagon-SDK 3.0 to compile my sample application for HVX DSP architecture. There are many tools related to Hexagon-LLVM available to use located folder at:
~/Qualcomm/HEXAGON_Tools/7.2.12/Tools/bin
I wrote a small example to calculate the product of two arrays to makes sure I can utilize the HVX hardware acceleration. However, when I generate my assembly, either with -S , or, with -S -emit-llvm I don't find any definition of HVX instructions such as vmem, vX, etc. My C application is executing on hexagon-sim for now till I manage to find a way to run in on the board as well.
As far as I understood, I need to define my HVX part of the code in C Intrinsics, but was not able to adapt the existing examples to match my own needs. It would be great if somebody could demonstrate how this process can be done. Also in the Hexagon V62 Programmer's Reference Manual many of the intrinsic instructions are not defined.
Here is my small app in pure C:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#if defined(__hexagon__)
#include "hexagon_standalone.h"
#include "subsys.h"
#endif
#include "io.h"
#include "hvx.cfg.h"
#define KERNEL_SIZE 9
#define Q 8
#define PRECISION (1<<Q)
double vectors_dot_prod2(const double *x, const double *y, int n)
{
double res = 0.0;
int i = 0;
for (; i <= n-4; i+=4)
{
res += (x[i] * y[i] +
x[i+1] * y[i+1] +
x[i+2] * y[i+2] +
x[i+3] * y[i+3]);
}
for (; i < n; i++)
{
res += x[i] * y[i];
}
return res;
}
int main (int argc, char* argv[])
{
int n;
long long start_time, total_cycles;
/* -----------------------------------------------------*/
/* Allocate memory for input/output */
/* -----------------------------------------------------*/
//double *res = memalign(VLEN, 4 *sizeof(double));
const double *x = memalign(VLEN, n *sizeof(double));
const double *y = memalign(VLEN, n *sizeof(double));
if ( *x == NULL || *y == NULL ){
printf("Error: Could not allocate Memory for image\n");
return 1;
}
#if defined(__hexagon__)
subsys_enable();
SIM_ACQUIRE_HVX;
#if LOG2VLEN == 7
SIM_SET_HVX_DOUBLE_MODE;
#endif
#endif
/* -----------------------------------------------------*/
/* Call fuction */
/* -----------------------------------------------------*/
RESET_PMU();
start_time = READ_PCYCLES();
vectors_dot_prod2(x,y,n);
total_cycles = READ_PCYCLES() - start_time;
DUMP_PMU();
printf("Array product of x[i] * y[i] = %f\n",vectors_dot_prod2(x,y,4));
#if defined(__hexagon__)
printf("AppReported (HVX%db-mode): Array product of x[i] * y[i] =%f\n", VLEN, vectors_dot_prod2(x,y,4));
#endif
return 0;
}
I compile it using hexagon-clang:
hexagon-clang -v -O2 -mv60 -mhvx-double -DLOG2VLEN=7 -I../../common/include -I../include -DQDSP6SS_PUB_BASE=0xFE200000 -o arrayProd.o -c arrayProd.c
Then link it with subsys.o (is found in DSK and already compiled) and -lhexagon to generate my executable:
hexagon-clang -O2 -mv60 -o arrayProd.exe arrayProd.o subsys.o -lhexagon
Finally, run it using the sim:
hexagon-sim -mv60 arrayProd.exe

A bit late, but might still be useful.
Hexagon Vector eXtensions are not emitted automatically and current instruction set (as of 8.0 SDK) only supports integer manipulation, so compiler will not emit anything for the C code containing "double" type (it is similar to SSE programming, you have to manually pack xmm registers and use SSE intrinsics to do what you need).
You need to define what your application really requires.
E.g., if you are writing something 3D-related and really need to calculate double (or float) dot products, you might convert yout floats to 16.16 fixed point and then use instructions (i.e., C intrinsics) like
Q6_Vw_vmpyio_VwVh and Q6_Vw_vmpye_VwVuh to emulate fixed-point multiplication.
To "enable" HVX you should use HVX-related types defined in
#include <hexagon_types.h>
#include <hexagon_protos.h>
The instructions like 'vmem' and 'vmemu' are emitted automatically for statements like
// I assume 64-byte mode, no `-mhvx-double`. For 128-byte mode use 32 int array
int values[16] = { 1, 2, 3, ..... };
/* The following line compiles to
{
r4 = __address_of_values
v1 = vmem(r4 + #0)
}
You can get the exact code by using '-S' switch, as you already do
*/
HVX_Vector v = *(HVX_Vector*)values;
Your (fixed-point) version of dot_product may read out 16 integers at a time, multiply all 16 integers in a couple of instructions (see HVX62 programming manual, there is a tip to implement 32-bit integer multiplication from 16-bit one),
then shuffle/deal/ror data around and sum up rearranged vectors to get dot product (this way you may calculate 4 dot products almost at once and if you preload 4 HVX registers - that is 16 4D vectors - you may calculate 16 dot products in parallel).
If what you are doing is really just byte/int image processing, you might use specific 16-bit and 8-bit hardware dot products in Hexagon instruction set, instead of emulating doubles and floats.

Related

Why does my running clock measuring function give me 0 clocks?

I'm doing some exercise projects in a C book, and I was asked to write a program that uses clock function in C library to measure how long it takes qsort function to sort an array that's reversed from a sorted state. So I wrote below:
/*
* Write a program that uses the clock function to measure how long it takes qsort to sort
* an array of 1000 integers that are originally in reverse order. Run the program for arrays
* of 10000 and 100000 integers as well.
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define SIZE 10000
int compf(const void *, const void *);
int main(void)
{
int arr[SIZE];
clock_t start_clock, end_clock;
for (int i = 0; i < SIZE; ++i) {
arr[i] = SIZE - i;
}
start_clock = clock();
qsort(arr, SIZE, sizeof(arr[0]), compf);
end_clock = clock();
printf("start_clock: %ld\nend_clock: %ld\n", start_clock, end_clock);
printf("Measured seconds used: %g\n", (end_clock - start_clock) / (double)CLOCKS_PER_SEC);
return EXIT_SUCCESS;
}
int compf(const void *p, const void *q)
{
return *(int *)p - *(int *)q;
}
But running the program gives me the results below:
start_clock: 0
end_clock: 0
Measured clock used: 0
How can it be my system used 0 clock to sort an array? What am I doing wrong?
I'm using GCC included in mingw-w64 which is x86_64-8.1.0-release-win32-seh-rt_v6-rev0.
Also I'm compiling with arguments -g -Wall -Wextra -pedantic -std=c99 -D__USE_MINGW_ANSI_STDIO given to gcc.exe.

3 possible answers to your issue:
what is clock_t? Is it just a normal data type like int? Or is it some sort of struct? Make sure you are using it correctly for its data type
What is this running on? If your clock isn't already running you need to start it on, for instance, most microcontrollers. If you try pulling from it without starting, it will just be 0 at all times since the clock is not moving
Is your code fast enough that it's not registering? Is it actually taking 0 seconds (rounded down) to run? 1 full second is a very long time in the coding world, you can run millions of lines of code in less than a second. Make sure your timing process can handle small numbers (ie. you can register 1 micro-second of timing), or your code is running slow enough to register on your clock speed

Why is iterating through an array backwards faster than forward in C

I'm studying for an exam and am trying to follow this problem:
I have the following C code to do some array initialisation:
int i, n = 61440;
double x[n];
for(i=0; i < n; i++) {
x[i] = 1;
}
But the following runs faster (0.5s difference in 1000 iterations):
int i, n = 61440;
double x[n];
for(i=n-1; i >= 0; i--) {
x[i] = 1;
}
I first thought that it was due to the loop accessing the n variable, thus having to do more reads (as suggested here for example: Why is iterating through an array backwards faster than forwards). But even if I change the n in the first loop to a hard coded value, or vice versa move the 0 in the bottom loop to a variable, the performance remains the same. I also tried to change the loops to only do half the work (go from 0 to < 30720, or from n-1 to >= 30720), to eliminate any special treatment of the 0 value, but the bottom loop is still faster
I assume it is because of some compiler optimisations? But everything I look up for the generated machine code suggests, that < and >= ought to be equal.
Thankful for any hints or advice! Thank you!
Edit: Makefile, for compiler details (this is part of a multi threading exercise, hence the OpenMP, though for this case it's all running on 1 core, without any OpenMP instructions in the code)
#CC = gcc
CC = /opt/rh/devtoolset-2/root/usr/bin/gcc
OMP_FLAG = -fopenmp
CFLAGS = -std=c99 -O2 -c ${OMP_FLAG}
LFLAGS = -lm
.SUFFIXES : .o .c
.c.o:
${CC} ${CFLAGS} -o $# $*.c
sblas:sblas.o
${CC} ${OMP_FLAG} -o $# $#.o ${LFLAGS}
Edit2: I redid the experiment with n * 100, getting the same results:
Forward: ~170s
Backward: ~120s
Similar to the previous values of 1.7s and 1.2s, just times 100
Edit3: Minimal Example - changes described above where all localized to the vector update method. This is the default forward version, which takes longer than the backwards version for(i = limit - 1; i >= 0; i--)
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
void vector_update(double a[], double b[], double x[], int limit);
/* SBLAS code */
void *main() {
int n = 1024*60;
int nsteps = 1000;
int k;
double a[n], b[n], x[n];
double vec_update_start;
double vec_update_time = 0;
for(k = 0; k < nsteps; k++) {
// Loop over whole program to get reasonable execution time
// (simulates a time-steping code)
vec_update_start = omp_get_wtime();
vector_update(a, b, x, n);
vec_update_time = vec_update_time + (omp_get_wtime() - vec_update_start);
}
printf( "vector update time = %f seconds \n \n", vec_update_time);
}
void vector_update(double a[], double b[], double x[] ,int limit) {
int i;
for (i = 0; i < limit; i++ ) {
x[i] = 0.0;
a[i] = 3.142;
b[i] = 3.142;
}
}
Edit4: the CPU is AMD quad-core Opteron 8378. The machine uses 4 of those, but I'm using only one on the main processor (core ID 0 in the AMD architecture)

It's not the backward iteration but the comparison with zero which causes the loop in the second case run faster.
for(i=n-1; i >= 0; i--) {
Comparison with zero can be done with a single assembly instruction whereas comparison with any other number takes multiple instructions.

The main reason is that your compiler isn't very good at optimising. In theory there's no reason that a better compiler couldn't have converted both versions of your code into the exact same machine code instead of letting one be slower.
Everything beyond that depends on what the resulting machine code is and what it's running on. This can include differences in RAM and/or CPU speeds, differences in cache behaviour, differences in hardware prefetching (and number of prefetchers), differences in instruction costs and instruction pipelining, differences in speculation, etc. Note that (in theory) this doesn't exclude the possibility that (on most computers but not on your computer) the machine code your compiler generates for forward loop is faster than the machine code it generates for backward loop (your sample size isn't large enough to be statistically significant, unless you're working on embedded systems or game consoles where all computers that run the code are identical).

Keep float number in range [duplicate]

Is there a more efficient way to clamp real numbers than using if statements or ternary operators?
I want to do this both for doubles and for a 32-bit fixpoint implementation (16.16). I'm not asking for code that can handle both cases; they will be handled in separate functions.
Obviously, I can do something like:
double clampedA;
double a = calculate();
clampedA = a > MY_MAX ? MY_MAX : a;
clampedA = a < MY_MIN ? MY_MIN : a;
or
double a = calculate();
double clampedA = a;
if(clampedA > MY_MAX)
clampedA = MY_MAX;
else if(clampedA < MY_MIN)
clampedA = MY_MIN;
The fixpoint version would use functions/macros for comparisons.
This is done in a performance-critical part of the code, so I'm looking for an as efficient way to do it as possible (which I suspect would involve bit-manipulation)
EDIT: It has to be standard/portable C, platform-specific functionality is not of any interest here. Also, MY_MIN and MY_MAX are the same type as the value I want clamped (doubles in the examples above).

Both GCC and clang generate beautiful assembly for the following simple, straightforward, portable code:
double clamp(double d, double min, double max) {
const double t = d < min ? min : d;
return t > max ? max : t;
}
> gcc -O3 -march=native -Wall -Wextra -Wc++-compat -S -fverbose-asm clamp_ternary_operator.c
GCC-generated assembly:
maxsd %xmm0, %xmm1 # d, min
movapd %xmm2, %xmm0 # max, max
minsd %xmm1, %xmm0 # min, max
ret
> clang -O3 -march=native -Wall -Wextra -Wc++-compat -S -fverbose-asm clamp_ternary_operator.c
Clang-generated assembly:
maxsd %xmm0, %xmm1
minsd %xmm1, %xmm2
movaps %xmm2, %xmm0
ret
Three instructions (not counting the ret), no branches. Excellent.
This was tested with GCC 4.7 and clang 3.2 on Ubuntu 13.04 with a Core i3 M 350.
On a side note, the straightforward C++ code calling std::min and std::max generated the same assembly.
This is for doubles. And for int, both GCC and clang generate assembly with five instructions (not counting the ret) and no branches. Also excellent.
I don't currently use fixed-point, so I will not give an opinion on fixed-point.

Old question, but I was working on this problem today (with doubles/floats).
The best approach is to use SSE MINSS/MAXSS for floats and SSE2 MINSD/MAXSD for doubles. These are branchless and take one clock cycle each, and are easy to use thanks to compiler intrinsics. They confer more than an order of magnitude increase in performance compared with clamping with std::min/max.
You may find that surprising. I certainly did! Unfortunately VC++ 2010 uses simple comparisons for std::min/max even when /arch:SSE2 and /FP:fast are enabled. I can't speak for other compilers.
Here's the necessary code to do this in VC++:
#include <mmintrin.h>
float minss ( float a, float b )
{
// Branchless SSE min.
_mm_store_ss( &a, _mm_min_ss(_mm_set_ss(a),_mm_set_ss(b)) );
return a;
}
float maxss ( float a, float b )
{
// Branchless SSE max.
_mm_store_ss( &a, _mm_max_ss(_mm_set_ss(a),_mm_set_ss(b)) );
return a;
}
float clamp ( float val, float minval, float maxval )
{
// Branchless SSE clamp.
// return minss( maxss(val,minval), maxval );
_mm_store_ss( &val, _mm_min_ss( _mm_max_ss(_mm_set_ss(val),_mm_set_ss(minval)), _mm_set_ss(maxval) ) );
return val;
}
The double precision code is the same except with xxx_sd instead.
Edit: Initially I wrote the clamp function as commented. But looking at the assembler output I noticed that the VC++ compiler wasn't smart enough to cull the redundant move. One less instruction. :)

If your processor has a fast instruction for absolute value (as the x86 does), you can do a branchless min and max which will be faster than an if statement or ternary operation.
min(a,b) = (a + b - abs(a-b)) / 2
max(a,b) = (a + b + abs(a-b)) / 2
If one of the terms is zero (as is often the case when you're clamping) the code simplifies a bit further:
max(a,0) = (a + abs(a)) / 2
When you're combining both operations you can replace the two /2 into a single /4 or *0.25 to save a step.
The following code is over 3x faster than ternary on my Athlon II X2, when using the optimization for FMIN=0.
double clamp(double value)
{
double temp = value + FMAX - abs(value-FMAX);
#if FMIN == 0
return (temp + abs(temp)) * 0.25;
#else
return (temp + (2.0*FMIN) + abs(temp-(2.0*FMIN))) * 0.25;
#endif
}

Ternary operator is really the way to go, because most compilers are able to compile them into a native hardware operation that uses a conditional move instead of a branch (and thus avoids the mispredict penalty and pipeline bubbles and so on). Bit-manipulation is likely to cause a load-hit-store.
In particular, PPC and x86 with SSE2 have a hardware op that could be expressed as an intrinsic something like this:
double fsel( double a, double b, double c ) {
return a >= 0 ? b : c;
}
The advantage is that it does this inside the pipeline, without causing a branch. In fact, if your compiler uses the intrinsic, you can use it to implement your clamp directly:
inline double clamp ( double a, double min, double max )
{
a = fsel( a - min , a, min );
return fsel( a - max, max, a );
}
I strongly suggest you avoid bit-manipulation of doubles using integer operations. On most modern CPUs there is no direct means of moving data between double and int registers other than by taking a round trip to the dcache. This will cause a data hazard called a load-hit-store which basically empties out the CPU pipeline until the memory write has completed (usually around 40 cycles or so).
The exception to this is if the double values are already in memory and not in a register: in that case there is no danger of a load-hit-store. However your example indicates you've just calculated the double and returned it from a function which means it's likely to still be in XMM1.

For the 16.16 representation, the simple ternary is unlikely to be bettered speed-wise.
And for doubles, because you need it standard/portable C, bit-fiddling of any kind will end badly.
Even if a bit-fiddle was possible (which I doubt), you'd be relying on the binary representation of doubles. THIS (and their size) IS IMPLEMENTATION-DEPENDENT.
Possibly you could "guess" this using sizeof(double) and then comparing the layout of various double values against their common binary representations, but I think you're on a hiding to nothing.
The best rule is TELL THE COMPILER WHAT YOU WANT (ie ternary), and let it optimise for you.
EDIT: Humble pie time. I just tested quinmars idea (below), and it works - if you have IEEE-754 floats. This gave a speedup of about 20% on the code below. IObviously non-portable, but I think there may be a standardised way of asking your compiler if it uses IEEE754 float formats with a #IF...?
double FMIN = 3.13;
double FMAX = 300.44;
double FVAL[10] = {-100, 0.23, 1.24, 3.00, 3.5, 30.5, 50 ,100.22 ,200.22, 30000};
uint64 Lfmin = *(uint64 *)&FMIN;
uint64 Lfmax = *(uint64 *)&FMAX;
DWORD start = GetTickCount();
for (int j=0; j<10000000; ++j)
{
uint64 * pfvalue = (uint64 *)&FVAL[0];
for (int i=0; i<10; ++i)
*pfvalue++ = (*pfvalue < Lfmin) ? Lfmin : (*pfvalue > Lfmax) ? Lfmax : *pfvalue;
}
volatile DWORD hacktime = GetTickCount() - start;
for (int j=0; j<10000000; ++j)
{
double * pfvalue = &FVAL[0];
for (int i=0; i<10; ++i)
*pfvalue++ = (*pfvalue < FMIN) ? FMIN : (*pfvalue > FMAX) ? FMAX : *pfvalue;
}
volatile DWORD normaltime = GetTickCount() - (start + hacktime);

The bits of IEEE 754 floating point are ordered in a way that if you compare the bits interpreted as an integer you get the same results as if you would compare them as floats directly. So if you find or know a way to clamp integers you can use it for (IEEE 754) floats as well. Sorry, I don't know a faster way.
If you have the floats stored in an arrays you can consider to use some CPU extensions like SSE3, as rkj said. You can take a look at liboil it does all the dirty work for you. Keeps your program portable and uses faster cpu instructions if possible. (I'm not sure tho how OS/compiler-independent liboil is).

Rather than testing and branching, I normally use this format for clamping:
clampedA = fmin(fmax(a,MY_MIN),MY_MAX);
Although I have never done any performance analysis on the compiled code.

Realistically, no decent compiler will make a difference between an if() statement and a ?: expression. The code is simple enough that they'll be able to spot the possible paths. That said, your two examples are not identical. The equivalent code using ?: would be
a = (a > MAX) ? MAX : ((a < MIN) ? MIN : a);
as that avoid the A < MIN test when a > MAX. Now that could make a difference, as the compiler otherwise would have to spot the relation between the two tests.
If clamping is rare, you can test the need to clamp with a single test:
if (abs(a - (MAX+MIN)/2) > ((MAX-MIN)/2)) ...
E.g. with MIN=6 and MAX=10, this will first shift a down by 8, then check if it lies between -2 and +2. Whether this saves anything depends a lot on the relative cost of branching.

Here's a possibly faster implementation similar to #Roddy's answer:
typedef int64_t i_t;
typedef double f_t;
static inline
i_t i_tmin(i_t x, i_t y) {
return (y + ((x - y) & -(x < y))); // min(x, y)
}
static inline
i_t i_tmax(i_t x, i_t y) {
return (x - ((x - y) & -(x < y))); // max(x, y)
}
f_t clip_f_t(f_t f, f_t fmin, f_t fmax)
{
#ifndef TERNARY
assert(sizeof(i_t) == sizeof(f_t));
//assert(not (fmin < 0 and (f < 0 or is_negative_zero(f))));
//XXX assume IEEE-754 compliant system (lexicographically ordered floats)
//XXX break strict-aliasing rules
const i_t imin = *(i_t*)&fmin;
const i_t imax = *(i_t*)&fmax;
const i_t i = *(i_t*)&f;
const i_t iclipped = i_tmin(imax, i_tmax(i, imin));
#ifndef INT_TERNARY
return *(f_t *)&iclipped;
#else /* INT_TERNARY */
return i < imin ? fmin : (i > imax ? fmax : f);
#endif /* INT_TERNARY */
#else /* TERNARY */
return fmin > f ? fmin : (fmax < f ? fmax : f);
#endif /* TERNARY */
}
See Compute the minimum (min) or maximum (max) of two integers without branching and Comparing floating point numbers
The IEEE float and double formats were
designed so that the numbers are
“lexicographically ordered”, which –
in the words of IEEE architect William
Kahan means “if two floating-point
numbers in the same format are ordered
( say x < y ), then they are ordered
the same way when their bits are
reinterpreted as Sign-Magnitude
integers.”
A test program:
/** gcc -std=c99 -fno-strict-aliasing -O2 -lm -Wall *.c -o clip_double && clip_double */
#include <assert.h>
#include <iso646.h> // not, and
#include <math.h> // isnan()
#include <stdbool.h> // bool
#include <stdint.h> // int64_t
#include <stdio.h>
static
bool is_negative_zero(f_t x)
{
return x == 0 and 1/x < 0;
}
static inline
f_t range(f_t low, f_t f, f_t hi)
{
return fmax(low, fmin(f, hi));
}
static const f_t END = 0./0.;
#define TOSTR(f, fmin, fmax, ff) ((f) == (fmin) ? "min" : \
((f) == (fmax) ? "max" : \
(is_negative_zero(ff) ? "-0.": \
((f) == (ff) ? "f" : #f))))
static int test(f_t p[], f_t fmin, f_t fmax, f_t (*fun)(f_t, f_t, f_t))
{
assert(isnan(END));
int failed_count = 0;
for ( ; ; ++p) {
const f_t clipped = fun(*p, fmin, fmax), expected = range(fmin, *p, fmax);
if(clipped != expected and not (isnan(clipped) and isnan(expected))) {
failed_count++;
fprintf(stderr, "error: got: %s, expected: %s\t(min=%g, max=%g, f=%g)\n",
TOSTR(clipped, fmin, fmax, *p),
TOSTR(expected, fmin, fmax, *p), fmin, fmax, *p);
}
if (isnan(*p))
break;
}
return failed_count;
}
int main(void)
{
int failed_count = 0;
f_t arr[] = { -0., -1./0., 0., 1./0., 1., -1., 2,
2.1, -2.1, -0.1, END};
f_t minmax[][2] = { -1, 1, // min, max
0, 2, };
for (int i = 0; i < (sizeof(minmax) / sizeof(*minmax)); ++i)
failed_count += test(arr, minmax[i][0], minmax[i][1], clip_f_t);
return failed_count & 0xFF;
}
In console:
$ gcc -std=c99 -fno-strict-aliasing -O2 -lm *.c -o clip_double && ./clip_double
It prints:
error: got: min, expected: -0. (min=-1, max=1, f=0)
error: got: f, expected: min (min=-1, max=1, f=-1.#INF)
error: got: f, expected: min (min=-1, max=1, f=-2.1)
error: got: min, expected: f (min=-1, max=1, f=-0.1)

I tried the SSE approach to this myself, and the assembly output looked quite a bit cleaner, so I was encouraged at first, but after timing it thousands of times, it was actually quite a bit slower. It does indeed look like the VC++ compiler isn't smart enough to know what you're really intending, and it appears to move things back and forth between the XMM registers and memory when it shouldn't. That said, I don't know why the compiler isn't smart enough to use the SSE min/max instructions on the ternary operator when it seems to use SSE instructions for all floating point calculations anyway. On the other hand, if you're compiling for PowerPC, you can use the fsel intrinsic on the FP registers, and it's way faster.

As pointed out above, fmin/fmax functions work well (in gcc, with -ffast-math). Although gfortran has patterns to use IA instructions corresponding to max/min, g++ does not. In icc one must use instead std::min/max, because icc doesn't allow short-cutting the specification of how fmin/fmax work with non-finite operands.

My 2 cents in C++. Probably not any different than use ternary operators and hopefully no branching code is generated
template <typename T>
inline T clamp(T val, T lo, T hi) {
return std::max(lo, std::min(hi, val));
}

If I understand properly, you want to limit a value "a" to a range between MY_MIN and MY_MAX. The type of "a" is a double. You did not specify the type of MY_MIN or MY_MAX.
The simple expression:
clampedA = (a > MY_MAX)? MY_MAX : (a < MY_MIN)? MY_MIN : a;
should do the trick.
I think there may be a small optimization to be made if MY_MAX and MY_MIN happen to be integers:
int b = (int)a;
clampedA = (b > MY_MAX)? (double)MY_MAX : (b < MY_MIN)? (double)MY_MIN : a;
By changing to integer comparisons, it is possible you might get a slight speed advantage.

If you want to use fast absolute value instructions, check out this snipped of code I found in minicomputer, which clamps a float to the range [0,1]
clamped = 0.5*(fabs(x)-fabs(x-1.0f) + 1.0f);
(I simplified the code a bit). We can think about it as taking two values, one reflected to be >0
fabs(x)
and the other reflected about 1.0 to be <1.0
1.0-fabs(x-1.0)
And we take the average of them. If it is in range, then both values will be the same as x, so their average will again be x. If it is out of range, then one of the values will be x, and the other will be x flipped over the "boundary" point, so their average will be precisely the boundary point.

CUDA c program strange behaviour - kernel works until I add more operations

I am puzzled by the following program (code below). It works fine and gives correct results when the two lines in the kernel defining specsin and speccos are given by (note the second term, which is sin(t)):
specsin+=sin(pi*t/my_tau)*sin(t)*sin(my_omega*(t+my_a0*my_a0/4.0/pi*(2.0*pi*t-my_tau*sin(2*pi*t/my_tau))));
speccos+=sin(pi*t/my_tau)*sin(t)*cos(my_omega*(t+my_a0*my_a0/4.0/pi*(2.0*pi*t-my_tau*sin(2*pi*t/my_tau))));
Once I change this second sin(t) term to sin(t+0.0*my_a0*my_a0), which shouldn't change the result, I get all zeros instead of correct answer.
Can it be that I ran out of kernel memory?
#include <stdio.h>
__global__ void Calculate_Spectrum(float * d_Detector_Data, int numCols, int numRows,
const float omega_min, float dOmega,
const float a0_min, float da0,
const float tau_min, float dtau, float dt)
{
int Global_x = blockIdx.x * blockDim.x + threadIdx.x;
int Global_y = blockIdx.y * blockDim.y + threadIdx.y;
int Position1D = Global_y * numCols + Global_x;
float my_omega=omega_min + Global_x * dOmega;
float my_a0=a0_min + Global_y*da0;
float my_tau=tau_min;
int total_time_steps=int(my_tau/dt);
float specsin=0.0;
float speccos=0.0;
float t=0.0;
float pi=3.14159265359;
for(int n=0; n<total_time_steps; n++)
{
t=n*dt;
specsin+=sin(pi*t/my_tau)*sin(t+0.0*my_a0*my_a0)*sin(my_omega*(t+my_a0*my_a0/4.0/pi*(2.0*pi*t-my_tau*sin(2*pi*t/my_tau))));
speccos+=sin(pi*t/my_tau)*sin(t+0.0*my_a0*my_a0)*cos(my_omega*(t+my_a0*my_a0/4.0/pi*(2.0*pi*t-my_tau*sin(2*pi*t/my_tau))));
}
d_Detector_Data[Position1D]=(specsin*specsin+speccos*speccos)*dt*dt*my_a0*my_a0*my_omega*my_omega/4.0/pi/pi;
}
int main(int argc, char ** argv)
{
const int omega_bins = 1024;
const int a0_bins = 512;
const int tau_bins = 1;
const float omega_min = 0.5;
const float omega_max = 1.1;
const float a0_min = 0.05;
const float a0_max = 1.0;
const float tau_min = 1200;
const float tau_max = 600;
const int steps_per_period=20; // for integrating
float dt=1.0/steps_per_period;
int TotalSize = omega_bins * a0_bins * tau_bins;
float dOmega=(omega_max-omega_min)/(omega_bins-1);
float da0=(a0_max-a0_min)/(a0_bins-1);
float dtau=0.;
float * d_Detector_Data;
int * d_Global_x;
int * d_Global_y;
float h_Detector_Data[TotalSize];
// allocate GPU memory
cudaMalloc((void **) &d_Detector_Data, TotalSize*sizeof(float));
Calculate_Spectrum<<<dim3(1,a0_bins,1), dim3(omega_bins,1,1)>>>(d_Detector_Data, omega_bins, a0_bins, omega_min, dOmega, a0_min, da0, tau_min, dtau, dt);
cudaMemcpy(h_Detector_Data, d_Detector_Data, TotalSize*sizeof(float), cudaMemcpyDeviceToHost);
FILE * SaveFile;
char TempStr[255];
sprintf(TempStr, "result.dat");
SaveFile = fopen(TempStr, "w");
int counter=0;
for(int j=0; j<a0_bins;j++)
{
for(int i=0; i<omega_bins; i++)
{
fprintf(SaveFile,"%e\t", h_Detector_Data[counter]);
counter++;
}
fprintf(SaveFile, "\n");
}
fclose(SaveFile);
// free GPU memory
return 0;
}

I believe this is due to a register limitation.
In order to launch a kernel, the total registers per thread must not exceed the maximum limit (i.e. the Maximum number of 32-bit registers per thread which the compiler should guarantee) and the registers per thread times the number of threads requested must not exceed the maximum limit (the Number of 32-bit registers per multiprocessor).
In the cases where you are getting incorrect results, I believe your kernel is not launching for this reason (too many registers requested in total). You're not doing any cuda error checking, but if you did, I believe you could confirm this.
You can work around this using any method that reduces the total to under the limit. Obviously reducing the threads per block is a direct way to do this. Other things like specifying the -G switch to the compiler also affect code generation and therefore may affect register per thread. Another way to work around this is to instruct the compiler to limit its usage of registers to some maximum amount per thread. This is documented in the nvcc manual, the usage is like this:
nvcc -maxrregcount=xx ... (rest of compile command)
Where xx is the number of registers per thread to limit the usage. If you limit it to let's say, 20 per thread, then even with 1024 threads per block, I will still only be using roughly 20K registers, and this will fit within any device that supports 1024 threads per block (cc 2.0 and above).
You can also get register usage statistics by asking the compiler to generate those:
nvcc -Xptxas -v ... (rest of compile command)
which will cause the compiler to generate a variety of statistics about resource usage, including the number of register per thread used/expected.
Note that resource usage, including register usage, may affect occupancy, which has implications for overall application performance. Therefore limiting register usage may not only allow the kernel to run, but may also allow multiple threadblocks to be resident on an SM, which generally suggests your occupancy is improved, which may improve the performance of your application.
Apparently the usage of 0.0 vs. 0.0f has some subtle effect on compiler behavior, which is showing up in code generation. I would also surmise that you may be right on the boundary of what is acceptable, so perhaps a small change in registers used per thread may be affecting what will run. You can investigate this further using the printout of resource usage statistics from the compiler that I referenced above, and/or possibly by inspecting the PTX code (an intermediate code, somthing like assembly code) generated:
nvcc -ptx ....
If you choose to inspect the PTX, you will want to refer to the PTX manual.

When should prefetch be used on modern machines? [duplicate]

Can anyone give an example or a link to an example which uses __builtin_prefetch in GCC (or just the asm instruction prefetcht0 in general) to gain a substantial performance advantage? In particular, I'd like the example to meet the following criteria:
It is a simple, small, self-contained example.
Removing the __builtin_prefetch instruction results in performance degradation.
Replacing the __builtin_prefetch instruction with the corresponding memory access results in performance degradation.
That is, I want the shortest example showing __builtin_prefetch performing an optimization that couldn't be managed without it.

Here's an actual piece of code that I've pulled out of a larger project. (Sorry, it's the shortest one I can find that had a noticable speedup from prefetching.)
This code performs a very large data transpose.
This example uses the SSE prefetch instructions, which may be the same as the one that GCC emits.
To run this example, you will need to compile this for x64 and have more than 4GB of memory. You can run it with a smaller datasize, but it will be too fast to time.
#include <iostream>
using std::cout;
using std::endl;
#include <emmintrin.h>
#include <malloc.h>
#include <time.h>
#include <string.h>
#define ENABLE_PREFETCH
#define f_vector __m128d
#define i_ptr size_t
inline void swap_block(f_vector *A,f_vector *B,i_ptr L){
// To be super-optimized later.
f_vector *stop = A + L;
do{
f_vector tmpA = *A;
f_vector tmpB = *B;
*A++ = tmpB;
*B++ = tmpA;
}while (A < stop);
}
void transpose_even(f_vector *T,i_ptr block,i_ptr x){
// Transposes T.
// T contains x columns and x rows.
// Each unit is of size (block * sizeof(f_vector)) bytes.
//Conditions:
// - 0 < block
// - 1 < x
i_ptr row_size = block * x;
i_ptr iter_size = row_size + block;
// End of entire matrix.
f_vector *stop_T = T + row_size * x;
f_vector *end = stop_T - row_size;
// Iterate each row.
f_vector *y_iter = T;
do{
// Iterate each column.
f_vector *ptr_x = y_iter + block;
f_vector *ptr_y = y_iter + row_size;
do{
#ifdef ENABLE_PREFETCH
_mm_prefetch((char*)(ptr_y + row_size),_MM_HINT_T0);
#endif
swap_block(ptr_x,ptr_y,block);
ptr_x += block;
ptr_y += row_size;
}while (ptr_y < stop_T);
y_iter += iter_size;
}while (y_iter < end);
}
int main(){
i_ptr dimension = 4096;
i_ptr block = 16;
i_ptr words = block * dimension * dimension;
i_ptr bytes = words * sizeof(f_vector);
cout << "bytes = " << bytes << endl;
// system("pause");
f_vector *T = (f_vector*)_mm_malloc(bytes,16);
if (T == NULL){
cout << "Memory Allocation Failure" << endl;
system("pause");
exit(1);
}
memset(T,0,bytes);
// Perform in-place data transpose
cout << "Starting Data Transpose... ";
clock_t start = clock();
transpose_even(T,block,dimension);
clock_t end = clock();
cout << "Done" << endl;
cout << "Time: " << (double)(end - start) / CLOCKS_PER_SEC << " seconds" << endl;
_mm_free(T);
system("pause");
}
When I run it with ENABLE_PREFETCH enabled, this is the output:
bytes = 4294967296
Starting Data Transpose... Done
Time: 0.725 seconds
Press any key to continue . . .
When I run it with ENABLE_PREFETCH disabled, this is the output:
bytes = 4294967296
Starting Data Transpose... Done
Time: 0.822 seconds
Press any key to continue . . .
So there's a 13% speedup from prefetching.
EDIT:
Here's some more results:
Operating System: Windows 7 Professional/Ultimate
Compiler: Visual Studio 2010 SP1
Compile Mode: x64 Release
Intel Core i7 860 # 2.8 GHz, 8 GB DDR3 # 1333 MHz
Prefetch : 0.868
No Prefetch: 0.960
Intel Core i7 920 # 3.5 GHz, 12 GB DDR3 # 1333 MHz
Prefetch : 0.725
No Prefetch: 0.822
Intel Core i7 2600K # 4.6 GHz, 16 GB DDR3 # 1333 MHz
Prefetch : 0.718
No Prefetch: 0.796
2 x Intel Xeon X5482 # 3.2 GHz, 64 GB DDR2 # 800 MHz
Prefetch : 2.273
No Prefetch: 2.666

Binary search is a simple example that could benefit from explicit prefetching. The access pattern in a binary search looks pretty much random to the hardware prefetcher, so there is little chance that it will accurately predict what to fetch.
In this example, I prefetch the two possible 'middle' locations of the next loop iteration in the current iteration. One of the prefetches will probably never be used, but the other will (unless this is the final iteration).
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
int binarySearch(int *array, int number_of_elements, int key) {
int low = 0, high = number_of_elements-1, mid;
while(low <= high) {
mid = (low + high)/2;
#ifdef DO_PREFETCH
// low path
__builtin_prefetch (&array[(mid + 1 + high)/2], 0, 1);
// high path
__builtin_prefetch (&array[(low + mid - 1)/2], 0, 1);
#endif
if(array[mid] < key)
low = mid + 1;
else if(array[mid] == key)
return mid;
else if(array[mid] > key)
high = mid-1;
}
return -1;
}
int main() {
int SIZE = 1024*1024*512;
int *array = malloc(SIZE*sizeof(int));
for (int i=0;i<SIZE;i++){
array[i] = i;
}
int NUM_LOOKUPS = 1024*1024*8;
srand(time(NULL));
int *lookups = malloc(NUM_LOOKUPS * sizeof(int));
for (int i=0;i<NUM_LOOKUPS;i++){
lookups[i] = rand() % SIZE;
}
for (int i=0;i<NUM_LOOKUPS;i++){
int result = binarySearch(array, SIZE, lookups[i]);
}
free(array);
free(lookups);
}
When I compile and run this example with DO_PREFETCH enabled, I see a 20% reduction in runtime:
$ gcc c-binarysearch.c -DDO_PREFETCH -o with-prefetch -std=c11 -O3
$ gcc c-binarysearch.c -o no-prefetch -std=c11 -O3
$ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./with-prefetch
Performance counter stats for './with-prefetch':
356,675,702 L1-dcache-load-misses # 41.39% of all L1-dcache hits
861,807,382 L1-dcache-loads
8.787467487 seconds time elapsed
$ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./no-prefetch
Performance counter stats for './no-prefetch':
382,423,177 L1-dcache-load-misses # 97.36% of all L1-dcache hits
392,799,791 L1-dcache-loads
11.376439030 seconds time elapsed
Notice that we are doing twice as many L1 cache loads in the prefetch version. We're actually doing a lot more work but the memory access pattern is more friendly to the pipeline. This also shows the tradeoff. While this block of code runs faster in isolation, we have loaded a lot of junk into the caches and this may put more pressure on other parts of the application.

I learned a lot from the excellent answers provided by #JamesScriven and #Mystical. However, their examples give only a modest boost - the objective of this answer is to present a (I must confess somewhat artificial) example, where prefetching has a bigger impact (about factor 4 on my machine).
There are three possible bottle-necks for the modern architectures: CPU-speed, memory-band-width and memory latency. Prefetching is all about reducing the latency of the memory-accesses.
In a perfect scenario, where latency corresponds to X calculation-steps, we would have a oracle, which would tell us which memory we would access in X calculation-steps, the prefetching of this data would be launched and it would arrive just in-time X calculation-steps later.
For a lot of algorithms we are (almost) in this perfect world. For a simple for-loop it is easy to predict which data will be needed X steps later. Out-of-order execution and other hardware tricks are doing a very good job here, concealing the latency almost completely.
That is the reason, why there is such a modest improvement for #Mystical's example: The prefetcher is already pretty good - there is just not much room for improvement. The task is also memory-bound, so probably not much band-width is left - it could be becoming the limiting factor. I could see at best around 8% improvement on my machine.
The crucial insight from the #JamesScriven example: neither we nor the CPU knows the next access-address before the the current data is fetched from memory - this dependency is pretty important, otherwise out-of-order execution would lead to a look-forward and the hardware would be able to prefetch the data. However, because we can speculate about only one step there is not that much potential. I was not able to get more than 40% on my machine.
So let's rig the competition and prepare the data in such a way that we know which address is accessed in X steps, but make it impossible for hardware to find it out due to dependencies on not yet accessed data (see the whole program at the end of the answer):
//making random accesses to memory:
unsigned int next(unsigned int current){
return (current*10001+328)%SIZE;
}
//the actual work is happening here
void operator()(){
//set up the oracle - let see it in the future oracle_offset steps
unsigned int prefetch_index=0;
for(int i=0;i<oracle_offset;i++)
prefetch_index=next(prefetch_index);
unsigned int index=0;
for(int i=0;i<STEP_CNT;i++){
//use oracle and prefetch memory block used in a future iteration
if(prefetch){
__builtin_prefetch(mem.data()+prefetch_index,0,1);
}
//actual work, the less the better
result+=mem[index];
//prepare next iteration
prefetch_index=next(prefetch_index); #update oracle
index=next(mem[index]); #dependency on `mem[index]` is VERY important to prevent hardware from predicting future
}
}
Some remarks:
data is prepared in such a way, that the oracle is alway right.
maybe surprisingly, the less CPU-bound task the bigger the speed-up: we are able to hide the latency almost completely, thus the speed-up is CPU-time+original-latency-time/CPU-time.
Compiling and executing leads:
>>> g++ -std=c++11 prefetch_demo.cpp -O3 -o prefetch_demo
>>> ./prefetch_demo
#preloops time no prefetch time prefetch factor
...
7 1.0711102260000001 0.230566831 4.6455521002498408
8 1.0511602149999999 0.22651144600000001 4.6406494398521474
9 1.049024333 0.22841439299999999 4.5926367389641687
....
to a speed-up between 4 and 5.
Listing of prefetch_demp.cpp:
//prefetch_demo.cpp
#include <vector>
#include <iostream>
#include <iomanip>
#include <chrono>
const int SIZE=1024*1024*1;
const int STEP_CNT=1024*1024*10;
unsigned int next(unsigned int current){
return (current*10001+328)%SIZE;
}
template<bool prefetch>
struct Worker{
std::vector<int> mem;
double result;
int oracle_offset;
void operator()(){
unsigned int prefetch_index=0;
for(int i=0;i<oracle_offset;i++)
prefetch_index=next(prefetch_index);
unsigned int index=0;
for(int i=0;i<STEP_CNT;i++){
//prefetch memory block used in a future iteration
if(prefetch){
__builtin_prefetch(mem.data()+prefetch_index,0,1);
}
//actual work:
result+=mem[index];
//prepare next iteration
prefetch_index=next(prefetch_index);
index=next(mem[index]);
}
}
Worker(std::vector<int> &mem_):
mem(mem_), result(0.0), oracle_offset(0)
{}
};
template <typename Worker>
double timeit(Worker &worker){
auto begin = std::chrono::high_resolution_clock::now();
worker();
auto end = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e9;
}
int main() {
//set up the data in special way!
std::vector<int> keys(SIZE);
for (int i=0;i<SIZE;i++){
keys[i] = i;
}
Worker<false> without_prefetch(keys);
Worker<true> with_prefetch(keys);
std::cout<<"#preloops\ttime no prefetch\ttime prefetch\tfactor\n";
std::cout<<std::setprecision(17);
for(int i=0;i<20;i++){
//let oracle see i steps in the future:
without_prefetch.oracle_offset=i;
with_prefetch.oracle_offset=i;
//calculate:
double time_with_prefetch=timeit(with_prefetch);
double time_no_prefetch=timeit(without_prefetch);
std::cout<<i<<"\t"
<<time_no_prefetch<<"\t"
<<time_with_prefetch<<"\t"
<<(time_no_prefetch/time_with_prefetch)<<"\n";
}
}

From the documentation:
for (i = 0; i < n; i++)
{
a[i] = a[i] + b[i];
__builtin_prefetch (&a[i+j], 1, 1);
__builtin_prefetch (&b[i+j], 0, 1);
/* ... */
}

Pre-fetching data can be optimized to the Cache Line size, which for most modern 64-bit processors is 64 bytes to for example pre-load a uint32_t[16] with one instruction.
For example on ArmV8 I discovered through experimentation casting the memory pointer to a uint32_t 4x4 matrix vector (which is 64 bytes in size) halved the required instructions required as before I had to increment by 8 as it was only loading half the data, even though my understanding was that it fetches a full cache line.
Pre-fetching an uint32_t[32] original code example...
int addrindex = &B[0];
__builtin_prefetch(&V[addrindex]);
__builtin_prefetch(&V[addrindex + 8]);
__builtin_prefetch(&V[addrindex + 16]);
__builtin_prefetch(&V[addrindex + 24]);
After...
int addrindex = &B[0];
__builtin_prefetch((uint32x4x4_t *) &V[addrindex]);
__builtin_prefetch((uint32x4x4_t *) &V[addrindex + 16]);
For some reason int datatype for the address index/offset gave better performance. Tested with GCC 8 on Cortex-a53. Using an equivalent 64 byte vector on other architectures might give the same performance improvement if you find it is not pre-fetching all the data like in my case. In my application with a one million iteration loop, it improved performance by 5% just by doing this. There were further requirements for the improvement.
the 128 megabyte "V" memory allocation had to be aligned to 64 bytes.
uint32_t *V __attribute__((__aligned__(64))) = (uint32_t *)(((uintptr_t)(__builtin_assume_aligned((unsigned char*)aligned_alloc(64,size), 64)) + 63) & ~ (uintptr_t)(63));
Also, I had to use C operators instead of Neon Intrinsics, since they require regular datatype pointers (in my case it was uint32_t *) otherwise the new built in prefetch method had a performance regression.
My real world example can be found at https://github.com/rollmeister/veriumMiner/blob/main/algo/scrypt.c in the scrypt_core() and its internal function which are all easy to read. The hard work is done by GCC8. Overall improvement to performance was 25%.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight