int u1, u2;
int i, k, l;
__m128i simda, simdb;
unsigned long elm1[20], _mulpre[16][20], res1[40], res2[40]; /* all 64 bits wide */
/* res1, res2 are initialized to zero */
l = 60;
while (l)
{
for (i = 0; i < 20; i += 2)
{
u1 = (elm1[i] >> l) & 15;
u2 = (elm1[i + 1] >> l) & 15;
for (k = 0; k < 20; k += 2)
{
simda = _mm_load_si128 ((__m128i *) &_mulpre[u1][k]);
simdb = _mm_load_si128 ((__m128i *) &res1[i + k]);
simdb = _mm_xor_si128 (simda, simdb);
_mm_store_si128 ((__m128i *)&res1[i + k], simdb);
simda = _mm_load_si128 ((__m128i *)&_mulpre[u2][k]);
simdb = _mm_load_si128 ((__m128i *)&res2[i + k]);
simdb = _mm_xor_si128 (simda, simdb);
_mm_store_si128 ((__m128i *)&res2[i + k], simdb);
}
}
l -= 4;
/* here all res1, res2 values are left-shifted by 4 bits */
}
The above code is called many times in my program (the profiler shows it accounts for 98% of the run time).
EDIT: In the inner loop, the res1[i + k] values are loaded many times for the same (i + k) values. I tried the following: inside the while loop, I loaded all the res1 values into an array of SIMD registers, updated those array elements inside the innermost for loop, and once both for loops were done I stored the values back to res1 and res2. But the computation time increased. Any idea where I went wrong? The idea seemed correct.
Any suggestion to make it faster is welcome.
Unfortunately the most obvious optimisations are probably already being done by the compiler:
You can pull &_mulpre[u1] and &_mulpre[u2] out of the inner loop.
You can pull &res1[i] out of the inner loop.
Using different variables for the two inner operations, and reordering them, might allow for better pipelining (the sketch after this list combines these ideas).
Possibly swapping the outer loops would improve cache locality on elm1.
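For concreteness, here is a minimal sketch of what the first three points look like when combined. This is not the poster's exact code: it assumes all four arrays are 16-byte aligned (as the aligned loads require) and it leaves out the 4-bit left shift of res1/res2 done at the end of each while iteration.
#include <emmintrin.h> /* SSE2 intrinsics */
static void accumulate(unsigned long elm1[20], unsigned long _mulpre[16][20],
                       unsigned long res1[40], unsigned long res2[40])
{
    for (int l = 60; l > 0; l -= 4) {
        for (int i = 0; i < 20; i += 2) {
            int u1 = (int)((elm1[i]     >> l) & 15);
            int u2 = (int)((elm1[i + 1] >> l) & 15);
            /* base pointers hoisted once per i instead of recomputed for every k */
            const __m128i *row1 = (const __m128i *) _mulpre[u1];
            const __m128i *row2 = (const __m128i *) _mulpre[u2];
            __m128i *dst1 = (__m128i *) &res1[i];
            __m128i *dst2 = (__m128i *) &res2[i];
            for (int k = 0; k < 10; ++k) { /* 10 x 128 bits = 20 x 64 bits */
                /* separate variables give two independent dependency chains */
                __m128i a = _mm_xor_si128(_mm_load_si128(&row1[k]),
                                          _mm_load_si128(&dst1[k]));
                __m128i b = _mm_xor_si128(_mm_load_si128(&row2[k]),
                                          _mm_load_si128(&dst2[k]));
                _mm_store_si128(&dst1[k], a);
                _mm_store_si128(&dst2[k], b);
            }
        }
        /* the question's "shift res1, res2 left by 4 bits" step goes here */
    }
}
The compiler may well be generating something equivalent already, so measure before and after.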
Well, you could always call it fewer times :-)
The total input & output data looks relatively small; depending on your design and expected input it might be feasible to just cache computations or do lazy evaluation instead of computing everything up front.
There is very little you can do with a routine such as this, since loads and stores will be the dominant factor (you're doing 2 loads + 1 store, i.e. 3 memory operations, for a single computational instruction).
l = 60;
while (l)
{
for (i = 0; i < 20; i += 2)
{
u1 = (elm1[i] >> l) & 15;
u2 = (elm1[i + 1] >> l) & 15;
for (k = 0; k < 20; k += 2)
{
_mm_stream_si128 ((__m128i *)&res1[i + k],
                  _mm_xor_si128 (
                      _mm_load_si128 ((__m128i *) &_mulpre[u1][k]),
                      _mm_load_si128 ((__m128i *) &res1[i + k])
                  ));
_mm_stream_si128 ((__m128i *)&res2[i + k],
                  _mm_xor_si128 (
                      _mm_load_si128 ((__m128i *)&_mulpre[u2][k]),
                      _mm_load_si128 ((__m128i *)&res2[i + k])
                  ));
}
}
l -= 4;
/* here all res1, res2 values are left-shifted by 4 bits */
}
Do remember you are using intrinsics; juggling fewer __m128i values will speed up your program.
Try _mm_stream_si128(); it might speed up the storing process.
Try prefetching; a sketch follows.
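For example, a rough software-prefetch sketch against the question's globals; the helper name prefetch_next_rows and the one-row-ahead distance are guesses, so measure before trusting it.
#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_T0 */
extern unsigned long elm1[20], _mulpre[16][20]; /* the question's globals */
/* Call at the top of the i loop body so the table rows needed by
   row i + 2 are already on their way into cache when we get there. */
static void prefetch_next_rows(int i, int l)
{
    if (i + 3 < 20) {
        int next_u1 = (int)((elm1[i + 2] >> l) & 15);
        int next_u2 = (int)((elm1[i + 3] >> l) & 15);
        _mm_prefetch((const char *) &_mulpre[next_u1][0], _MM_HINT_T0);
        _mm_prefetch((const char *) &_mulpre[next_u2][0], _MM_HINT_T0);
    }
}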
I am looking to optimize a loop using loop unswitching and SIMD so that I can speed up execution time.
for (b_idx = 0; b_idx < e_idx; b_idx++) {
if (fxp < 0) {
fxp += LUT[b_idx];
x += ytmp;
y -= xtmp;
} else {
fxp -= LUT[b_idx];
x -= ytmp;
y += xtmp;
}
xtmp = x >> (b_idx + 1);
ytmp = y >> (b_idx + 1);
}
The inner loop contains an if condition that needs branch prediction. A branch misprediction can introduce big slowdowns, and it is most certainly the bottleneck of the algorithm.
In this case, you can easily remove the condition by noting that the two branches differ only by a sign.
for (b_idx = 0; b_idx < e_idx; b_idx++) {
int sgn = (fxp < 0) * 2 - 1; // +1 or -1
fxp += sgn * LUT[b_idx];
x += sgn * ytmp;
y -= sgn * xtmp;
xtmp = x >> (b_idx + 1);
ytmp = y >> (b_idx + 1);
}
But as the algorithm is incremental (iteration b_idx+1 depends on iteration b_idx), the loop cannot be parallelized.
You could consider moving the if condition out of the loop so it is not re-evaluated so many times.
What is the best way to manipulate indexing in Armadillo? I was under the impression that it heavily used template expressions to avoid temporaries, but I'm not seeing these speedups.
Is direct array indexing still the best way to approach calculations that rely on consecutive elements within the same array?
Keep in mind that I hope to parallelise these calculations in the future with TBB::parallel_for (in which case, from a maintainability perspective, it may be simpler to use direct accessing). These calculations happen in a tight loop, and I hope to make them as optimal as possible.
ElapsedTimer timer;
int n = 768000;
int numberOfLoops = 5000;
arma::Col<double> directAccess1(n);
arma::Col<double> directAccess2(n);
arma::Col<double> directAccessResult1(n);
arma::Col<double> directAccessResult2(n);
arma::Col<double> armaAccess1(n);
arma::Col<double> armaAccess2(n);
arma::Col<double> armaAccessResult1(n);
arma::Col<double> armaAccessResult2(n);
std::valarray<double> valArrayAccess1(n);
std::valarray<double> valArrayAccess2(n);
std::valarray<double> valArrayAccessResult1(n);
std::valarray<double> valArrayAccessResult2(n);
// Prefill
for (int i = 0; i < n; i++) {
directAccess1[i] = i;
directAccess2[i] = n - i;
armaAccess1[i] = i;
armaAccess2[i] = n - i;
valArrayAccess1[i] = i;
valArrayAccess2[i] = n - i;
}
timer.Start();
for (int j = 0; j < numberOfLoops; j++) {
for (int i = 1; i < n; i++) {
directAccessResult1[i] = -directAccess1[i] / (directAccess1[i] + directAccess1[i - 1]) * directAccess2[i - 1];
directAccessResult2[i] = -directAccess1[i] / (directAccess1[i] + directAccess1[i]) * directAccess2[i];
}
}
timer.StopAndPrint("Direct Array Indexing Took");
std::cout << std::endl;
timer.Start();
for (int j = 0; j < numberOfLoops; j++) {
armaAccessResult1.rows(1, n - 1) = -armaAccess1.rows(1, n - 1) / (armaAccess1.rows(1, n - 1) + armaAccess1.rows(0, n - 2)) % armaAccess2.rows(0, n - 2);
armaAccessResult2.rows(1, n - 1) = -armaAccess1.rows(1, n - 1) / (armaAccess1.rows(1, n - 1) + armaAccess1.rows(1, n - 1)) % armaAccess2.rows(1, n - 1);
}
timer.StopAndPrint("Arma Array Indexing Took");
std::cout << std::endl;
timer.Start();
for (int j = 0; j < numberOfLoops; j++) {
for (int i = 1; i < n; i++) {
valArrayAccessResult1[i] = -valArrayAccess1[i] / (valArrayAccess1[i] + valArrayAccess1[i - 1]) * valArrayAccess2[i - 1];
valArrayAccessResult2[i] = -valArrayAccess1[i] / (valArrayAccess1[i] + valArrayAccess1[i]) * valArrayAccess2[i];
}
}
timer.StopAndPrint("Valarray Array Indexing Took:");
std::cout << std::endl;
In VS release mode (/O2, to avoid Armadillo's array-indexing checks), they produce the following timings:
Started Performance Analysis!
Direct Array Indexing Took: 37.294 seconds elapsed
Arma Array Indexing Took: 39.4292 seconds elapsed
Valarray Array Indexing Took:: 37.2354 seconds elapsed
Your direct code is already quite optimal, so expression templates are not going to help here.
However, you may want to make sure the optimization level in your compiler actually enables auto-vectorization (-O3 in gcc). Secondly, you can get a bit of extra speed by defining ARMA_NO_DEBUG before including the Armadillo header, as sketched below. This turns off all run-time checks (such as bounds checks for element access), so it is not recommended until you have completely debugged your program.
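A minimal example of both points; the file name bench.cpp and the compiler invocation are just placeholders for your own build setup.
// Build with auto-vectorization enabled, e.g.:  g++ -O3 bench.cpp -o bench
#define ARMA_NO_DEBUG // must come before the header; disables bounds checks etc.
#include <armadillo>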
I have a question about reducing the number of memory accesses in a loop. Consider the following code (this is not my actual code, which I cannot reproduce here because it is too long):
for(k=0;k<n;k++)
{
y[k] = x[0]*2 + z[1];
}
As you can see, in each iteration the same memory locations (x[0], z[1]) are accessed. I was wondering whether there is any way to reduce the number of memory accesses when the same location is read several times.
Thanks in advance.
Simply fetch the values before the loop:
i = x[0];
j = z[1];
for(k=0;k<n;k++)
{
y[k] = i*2 + j;
}
Of course the compiler will optimize this (if it can) even if you don't change anything, but it helps to write more readable and intuitive code. You don't need to fetch the values on every iteration, and the code you write should make that obvious.
Forget micro-optimizations; write more intuitive and readable code!
As rightly pointed out in the comments, the right-hand expression is completely independent of the loop, so:
i = x[0]*2 + z[1];
for(k=0;k<n;k++)
{
y[k] = i;
}
Here is what you can do.
float value = x[0]*2 + z[1];
for(k=0;k<n;k++)
{
y[k] = value;
}
hope this helps.
v = x[0]*2 + z[1];
for(k=0;k<n;++k) y[k] = v;
This assumes that x[0] and z[1] do NOT alias y[0..n-1].
If y has a type shorter than int (e.g. char), you can try the following trick:
unsigned char value = (unsigned char)(x[0]*2 + z[1]);
unsigned int value32 = value | (value << 8) | (value << 16) | ((unsigned int)value << 24);
unsigned int k;
// Go in blocks of 4 bytes (this type-punning assumes y is suitably aligned;
// memcpy is the strictly portable way to do it)
for(k = 0; k < n - n%4; k += 4) {
    *(unsigned int *)&y[k] = value32;
}
// Finish the remaining elements one at a time
for(; k < n; k++) {
    y[k] = value;
}
The compiler will optimize this anyway, but in case you use a broken compiler without optimizations, you can put both values in register integers and then work with them, like this:
register int x0 = x[0]*2;
register int z1 = z[1];
int y0 = x0 + z1;
int k;
for(k=0;k<n;k++)
{
y[k] = y0;
}
This does not guarantee that x[0] and z[1] will end up in registers, but it at least hints to the compiler that they should.
Can anyone spot any way to improve the speed of the following bilinear resizing algorithm?
I need to improve speed, as this is critical, while keeping good image quality. It is expected to be used on mobile devices with low-speed CPUs.
The algorithm is used mainly for up-scaling. Any other, faster bilinear algorithm would also be appreciated. Thanks.
void resize(int* input, int* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
int a, b, c, d, x, y, index;
float x_ratio = ((float)(sourceWidth - 1)) / targetWidth;
float y_ratio = ((float)(sourceHeight - 1)) / targetHeight;
float x_diff, y_diff, blue, red, green ;
int offset = 0 ;
for (int i = 0; i < targetHeight; i++)
{
for (int j = 0; j < targetWidth; j++)
{
x = (int)(x_ratio * j) ;
y = (int)(y_ratio * i) ;
x_diff = (x_ratio * j) - x ;
y_diff = (y_ratio * i) - y ;
index = (y * sourceWidth + x) ;
a = input[index] ;
b = input[index + 1] ;
c = input[index + sourceWidth] ;
d = input[index + sourceWidth + 1] ;
// blue element
blue = (a&0xff)*(1-x_diff)*(1-y_diff) + (b&0xff)*(x_diff)*(1-y_diff) +
(c&0xff)*(y_diff)*(1-x_diff) + (d&0xff)*(x_diff*y_diff);
// green element
green = ((a>>8)&0xff)*(1-x_diff)*(1-y_diff) + ((b>>8)&0xff)*(x_diff)*(1-y_diff) +
((c>>8)&0xff)*(y_diff)*(1-x_diff) + ((d>>8)&0xff)*(x_diff*y_diff);
// red element
red = ((a>>16)&0xff)*(1-x_diff)*(1-y_diff) + ((b>>16)&0xff)*(x_diff)*(1-y_diff) +
((c>>16)&0xff)*(y_diff)*(1-x_diff) + ((d>>16)&0xff)*(x_diff*y_diff);
output [offset++] =
0x000000ff | // alpha
((((int)red) << 24)&0xff000000) |
((((int)green) << 16)&0xff0000) |
((((int)blue) << 8)&0xff00);
}
}
}
Off the top of my head:
Stop using floating-point, unless you're certain your target CPU has it in hardware with good performance.
Make sure memory accesses are cache-optimized, i.e. clumped together.
Use the fastest data types possible. Sometimes this means smallest, sometimes it means "most native, requiring least overhead".
Investigate if signed/unsigned for integer operations have performance costs on your platform.
Investigate if look-up tables rather than computations gain you anything (but these can blow the caches, so be careful).
And, of course, do lots of profiling and measurements.
In-Line Cache and Lookup Tables
Cache your computations in your algorithm.
Avoid duplicate computations (like (1-y_diff) or (x_ratio * j))
Go through all the lines of your algorithm, and try to identify patterns of repetitions. Extract these to local variables. And possibly extract to functions, if they are short enough to be inlined, to make things more readable.
Use a lookup table
It's quite likely that, if you can spare some memory, you can implement a "store" for your RGB values and simply "fetch" them based on the inputs that produced them. Maybe you don't need to store all of them, but you could experiment and see if some come back often. Alternatively, you could "fudge" your colors and thus end up with fewer values to store for more lookup inputs.
If you know the boundaries of your inputs, you can calculate the complete domain space and figure out what makes sense to cache. For instance, if you can't cache the whole R, G, B values, maybe you can at least pre-compute the shifts ((b>>16) and so forth), which are most likely deterministic in your case.
Use the Right Data Types for Performance
If you can avoid double and float variables, use int. On most architectures, int is the fastest type for computations because of the memory model. You can still achieve decent precision by simply shifting your units (i.e. use 1026 as an int instead of 1.026 as a double or float). It's quite likely that this trick would be enough for you.
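For instance, here is a minimal 16.16 fixed-point sketch for the per-column coordinate math; the names column_lut, x_int and x_frac8 are made up, and it assumes sourceWidth fits in 15 bits so the products stay inside a 32-bit int.
/* Precompute the source column and an 8-bit blend weight for every target
   column, using integer math only. */
static void column_lut(int sourceWidth, int targetWidth,
                       int x_int[], int x_frac8[])
{
    int x_ratio_fp = (int)((((long long)(sourceWidth - 1)) << 16) / targetWidth);
    for (int j = 0; j < targetWidth; j++) {
        int xv = x_ratio_fp * j;        /* 16.16 fixed point          */
        x_int[j]   = xv >> 16;          /* same as (int)(x_ratio * j) */
        x_frac8[j] = (xv >> 8) & 0xFF;  /* x_diff scaled to 0..255    */
    }
}
In the blend you would then weight with x_frac8 and (256 - x_frac8) and shift the products back down, much like the lookup-table answer further below does.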
x = (int)(x_ratio * j) ;
y = (int)(y_ratio * i) ;
x_diff = (x_ratio * j) - x ;
y_diff = (y_ratio * i) - y ;
index = (y * sourceWidth + x) ;
This could surely use some optimization: you computed x_ratio * (j - 1) only a few iterations earlier, so all you really need here is to accumulate the coordinate, e.g. xf += x_ratio (see the sketch below).
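A rough sketch of that accumulation (xf and walk_row are made-up names); note that repeatedly adding a float drifts a tiny bit compared with the multiply, which is usually acceptable for resizing.
/* Drop-in idea for the question's inner j loop: accumulate the source x
   coordinate instead of recomputing x_ratio * j every pixel. */
static void walk_row(float x_ratio, int targetWidth)
{
    float xf = 0.0f;                 /* running value of x_ratio * j    */
    for (int j = 0; j < targetWidth; j++) {
        int   x      = (int)xf;      /* replaces (int)(x_ratio * j)     */
        float x_diff = xf - x;       /* replaces (x_ratio * j) - x      */
        (void)x; (void)x_diff;       /* the a/b/c/d blend would go here */
        xf += x_ratio;               /* one add per pixel instead of a multiply */
    }
}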
My random guess (use a profiler instead of letting people guess!):
The compiler has to generate code that works when input and output overlap, which means it has to generate lots of redundant loads and stores. Add restrict to the input and output parameters to remove that safety constraint (see the sketch below).
You could also try using a=b; and c=d; instead of loading them again.
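A sketch of the restrict suggestion (C99 restrict; most C++ compilers spell it __restrict). Only the signature changes, the body stays the same.
/* Promise the compiler that input and output never alias, so it may keep
   pixel values in registers across the stores. */
void resize(int* restrict input, int* restrict output,
            int sourceWidth, int sourceHeight,
            int targetWidth, int targetHeight);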
Here is my version; steal some ideas from it. My C-fu is quite weak, so some lines are pseudocode, but you can fix them.
void resize(int* input, int* output,
int sourceWidth, int sourceHeight,
int targetWidth, int targetHeight
) {
// Let's create some lookup tables!
// you can move them into 2-dimensional arrays to
// group together values used at the same time to help processor cache
int sx[targetWidth]; // target->source X lookup
int sy[targetHeight]; // target->source Y lookup
int mx[targetWidth]; // left pixel's multiplier
int my[targetHeight]; // bottom pixel's multiplier
// we don't have to calc indexes every time, find out when
int reloadPixels[targetWidth];
int shiftPixels[targetWidth]; // holds 0 or 2, so keep it an int
int shiftReloadPixels[targetWidth]; // can be combined if necessary
int v; // temporary value
for (int j = 0; j < targetWidth; j++){
// (8bit + targetBits + sourceBits) should be < max int
v = 256 * j * (sourceWidth-1) / (targetWidth-1);
sx[j] = v / 256;
mx[j] = v % 256;
reloadPixels[j] = j ? ( sx[j-1] != sx[j] ? 1 : 0)
: 1; // always load first pixel
// if no reload -> then no shift either
shiftPixels[j] = j ? ( sx[j-1] + 1 == sx[j] ? 2 : 0)
: 0; // nothing to shift at first pixel
shiftReloadPixels[j] = reloadPixels[j] | shiftPixels[j];
}
for (int i = 0; i < targetHeight; i++){
v = 256 * i * (sourceHeight-1) / (targetHeight-1);
sy[i] = v / 256;
my[i] = v % 256;
}
int shiftReload;
int srcIndex;
int srcRowIndex;
int offset = 0;
int lm, rm, tm, bm; // left / right / top / bottom multipliers
int a, b, c, d;
int leftOutput, rightOutput;
for (int i = 0; i < targetHeight; i++){
srcRowIndex = sy[ i ] * sourceWidth;
tm = my[i];
bm = 255 - tm;
for (int j = 0; j < targetWidth; j++){
// too many ifs can be too slow; measure.
// always true for first pixel in a row
if( shiftReload = shiftReloadPixels[ j ] ){
srcIndex = srcRowIndex + sx[j];
if( shiftReload & 2 ){
a = b;
c = d;
}else{
a = input[ srcIndex ];
c = input[ srcIndex + sourceWidth ];
}
b = input[ srcIndex + 1 ];
d = input[ srcIndex + 1 + sourceWidth ];
}
lm = mx[j];
rm = 255 - lm;
// WTF?
// Input AA RR GG BB
// Output RR GG BB AA
if( j ){
leftOutput = rightOutput ^ 0xFFFFFF00;
}else{
leftOutput =
// blue element
((( ( (a&0xFF)*tm
+ (c&0xFF)*bm )*lm
) & 0xFF0000 ) >> 8)
// green element
| ((( ( ((a>>8)&0xFF)*tm
+ ((c>>8)&0xFF)*bm )*lm
) & 0xFF0000 )) // no need to shift
// red element
| ((( ( ((a>>16)&0xFF)*tm
+ ((c>>16)&0xFF)*bm )*lm
) & 0xFF0000 ) << 8 )
;
}
rightOutput =
// blue element
((( ( (b&0xFF)*tm
+ (d&0xFF)*bm )*lm
) & 0xFF0000 ) >> 8)
// green element
| ((( ( ((b>>8)&0xFF)*tm
+ ((d>>8)&0xFF)*bm )*lm
) & 0xFF0000 )) // no need to shift
// red element
| ((( ( ((b>>16)&0xFF)*tm
+ ((d>>16)&0xFF)*bm )*lm
) & 0xFF0000 ) << 8 )
;
output[offset++] =
// alpha
0x000000ff
| leftOutput
| rightOutput
;
}
}
}
I have this statement in my C program and I want to optimize it. By optimization I am particularly thinking of bitwise operators (but any other suggestion is also fine).
uint64_t h_one = hash[0];
uint64_t h_two = hash[1];
for ( int i=0; i<k; ++i )
{
k_hash[i] = ( h_one + i * h_two ) % size; // suggest some optimization for this line
}
Any suggestion will be of great help.
Edit:
As of now, size can be any int; that is not a problem, and we can round it up to the next prime (though probably not to a power of two, since for larger values the next power of 2 grows quickly and would waste a lot of memory).
h_two is a 64-bit int (basically a chunk of 64 bits).
so essentially you're doing
k_0 = h_1 mod s
k_1 = (h_1 + h_2) mod s = (k_0 + h_2) mod s
k_2 = (h_1 + 2*h_2) mod s = (k_1 + h_2) mod s
...
k_n = (k_(n-1) + h_2) mod s
Depending on overflow issues (which shouldn't differ from the original if size is less than half of 2**64), this could be faster (less easy to parallelize though):
uint64_t h_one = hash[0];
uint64_t h_two = hash[1];
k_hash[0] = h_one % size;
for ( int i=1; i<k; ++i )
{
k_hash[i] = ( k_hash[i-1] + h_two ) % size;
}
Note there is a possibility that your compiler already came to this form, depending on which optimization flags you use.
Of course this only eliminated one multiplication. If you want to eliminate or reduce the modulo, I guess that based on h_two % size and h_one % size you can predetermine the steps at which you have to explicitly apply % size, something like this:
uint64_t h_one = hash[0]%size;
uint64_t h_two = hash[1]%size;
k_hash[0] = h_one;
step = (size-(h_one))/(h_two)-1;
for ( int i=1; i<k; ++i )
{
k_hash[i] = ( k_hash[i-1] + h_two );
if(i==step)
{
k_hash[i] %= size;
}
}
Note I'm not sure of the formula (I didn't test it); it's more a general idea. This would greatly depend on how good your branch prediction is (and how big a performance hit a misprediction is). Also it's only likely to help if step is big.
Edit: or, more simply (and probably with the same performance), thanks to Mystical:
uint64_t h_one = hash[0]%size;
uint64_t h_two = hash[1]%size;
k_hash[0] = h_one;
for ( int i=1; i<k; ++i )
{
k_hash[i] = ( k_hash[i-1] + h_two );
if(k_hash[i] >= size)
{
k_hash[i] -= size;
}
}
If size is a power of two, then a bitwise AND with size - 1 replaces "% size":
k_hash[i] = (h_one + i * h_two) & (size - 1);
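A quick sanity check of that identity (the values here are arbitrary, just to illustrate):
#include <assert.h>
#include <stdint.h>
int main(void)
{
    uint64_t size = 1024;                 /* must be a power of two */
    uint64_t h    = 0x0123456789abcdefULL;
    /* x % 2^n keeps exactly the low n bits */
    assert(h % size == (h & (size - 1)));
    return 0;
}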