I am working on C code optimization, and there are many calculations that add 1.0 to an expression returning a double value, like
val = 1.0 + u[8] / c_sq + (u[8] * u[8]) / (2.0 * c_sq * c_sq) - u_sq / (2.0 * c_sq);
so I was curious to know whether there is any optimization technique to improve this piece of code.
That single line of code, taken on its own without any context whatsoever, is what it is. It cannot be optimized any further as long as the compiler you are using is at least doing CSE on the 2.0 * c_sq. Otherwise, that's pretty much all you can do outside of domain-specific optimizations that aren't apparent by just that code.
On typical current processors, division is time-consuming; it can take dozens of CPU cycles. You can rearrange the expression to eliminate two divisions, per this answer. You can also make some other minor improvements (which the compiler might have caught):
double t = u[8];
double v = 2*c_sq;
val = 1 + (t*(v+t) - u_sq*c_sq) / (v*c_sq);
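If you want to convince yourself the rearrangement is equivalent (it is just the original expression put over the common denominator 2*c_sq*c_sq), a quick standalone check with made-up sample values for c_sq, u[8], and u_sq is:

#include <stdio.h>

int main(void) {
    /* Arbitrary sample values; in the real code these come from the surroundings. */
    double c_sq = 1.0 / 3.0, u8 = 0.01, u_sq = 0.0004;

    double orig = 1.0 + u8 / c_sq + (u8 * u8) / (2.0 * c_sq * c_sq)
                      - u_sq / (2.0 * c_sq);

    double t = u8;
    double v = 2 * c_sq;
    double fast = 1 + (t * (v + t) - u_sq * c_sq) / (v * c_sq);

    /* The two results agree apart from a possible last-bit rounding difference. */
    printf("orig = %.17g\nfast = %.17g\n", orig, fast);
    return 0;
}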
I am trying to find the most efficient way to multiply two 2-dimensional arrays (single precision) in C, and I started with the naive idea of implementing it by following the arithmetic rules:
for (i = 0; i < n; i++) {
    sum += a[i] * b[i];
}
It worked, but it was probably not the fastest routine on earth. After switching to pointer arithmetic and doing some loop unrolling, the speed improved. However, when applying SIMD, the speed dropped again.
To be more precise: compiled with Intel oneAPI with -O3 on an Intel Core i5-4690 at 3.5 GHz, I see the following results:
Naive implementation: Approx. 800 MFlop/s
Using Pointer - Loop unrolling: Up to 5 GFlop/s
Applying SIMD: 3.5 - 5 GFlop/s
The speed of course varied with the size of the vectors and between the different test runs, so the figures above are indicative rather than exact, but they still raise the question of why the SIMD routine does not give a significant push:
#include <immintrin.h>

float hsum_float_avx(float *pt_a, float *pt_b) {
    __m256 AVX2_vect1, AVX2_vect2, res_mult, hsum;
    float sumAVX;
    // load unaligned memory into two vectors
    AVX2_vect1 = _mm256_loadu_ps(pt_a);
    AVX2_vect2 = _mm256_loadu_ps(pt_b);
    // multiply the two vectors
    res_mult = _mm256_mul_ps(AVX2_vect1, AVX2_vect2);
    // calculate horizontal sum of resulting vector
    hsum = _mm256_hadd_ps(res_mult, res_mult);
    hsum = _mm256_add_ps(hsum, _mm256_permute2f128_ps(hsum, hsum, 0x1));
    // store result
    _mm_store_ss(&sumAVX, _mm_hadd_ps(_mm256_castps256_ps128(hsum), _mm256_castps256_ps128(hsum)));
    return sumAVX;
}
There must be something wrong, but I cannot find it - therefore any hint would be highly appreciated.
If your compiler supports OpenMP 4.0 or later, I'd use that to request that the compiler vectorize the original loop (which it may already be doing at a high enough optimization level, but OpenMP lets you give hints about things like alignment to improve the results). That has the advantage over AVX intrinsics that it will work on other architectures like ARM, or with other x86 SIMD instruction sets (assuming you tell the compiler to target them), with just a simple recompilation instead of having to rewrite your code:
float sum = 0.0f;
#pragma omp simd reduction(+:sum)
for (i = 0; i < n; i++) {
    sum += a[i] * b[i];
}
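If it helps, a self-contained version of that loop with a possible compile line might look like the sketch below (the function name is illustrative; -fopenmp-simd tells GCC to honor just the SIMD pragmas without the OpenMP runtime):

#include <stddef.h>

/* Dot product of two float arrays, vectorized via OpenMP SIMD.
   Compile with something like: gcc -O3 -fopenmp-simd -march=native -c dot.c */
float dot_omp(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}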
The following 3 lines give imprecise results with "gcc -Ofast -march=skylake":
int32_t i = -5;
const double sqr_N_min_1 = (double)i * i;
1. - ((double)i * i) / sqr_N_min_1
Obviously, sqr_N_min_1 becomes 25., and in the 3rd line (-5 * -5) / 25 should become 1., so that the overall result from the 3rd line is exactly 0.. Indeed, this is true with the compiler options "gcc -O3 -march=skylake".
But with "-Ofast" the last line yields -2.081668e-17 instead of 0., and with other i than -5 (e.g. 6 or 7) it gives other very small positive or negative deviations from 0..
My question is: Where exactly is the source of this imprecision?
To investigate this, I wrote a small test program in C:
#include <stdint.h> /* int32_t */
#include <stdio.h>
#define MAX_SIZE 10
double W[MAX_SIZE];
int main( int argc, char *argv[] )
{
    volatile int32_t n = 6; /* try 6 7 or argv[1][0]-'0' */
    double *w = W;
    int32_t i = 1 - n;
    const int32_t end = n - 1;
    const double sqr_N_min_1 = (double)i * i;
    /* Here is the crucial part. The loop avoids the compiler replacing it with constants: */
    do {
        *w++ = 1. - ((double)i * i) / sqr_N_min_1;
    } while ( (i+=2) <= end );
    /* Then, show the results (only the 1st and last output line matters): */
    w = W;
    i = 1 - n;
    do {
        fprintf( stderr, "%e\n", *w++ );
    } while ( (i+=2) <= end );
    return( 0 );
}
Godbolt shows me the assembly produced by "x86-64 gcc 9.3" with the options "-Ofast -march=skylake" vs. "-O3 -march=skylake". Please inspect the five columns of the website (1. source code, 2. assembly with "-Ofast", 3. assembly with "-O3", 4. output of the 1st assembly, 5. output of the 2nd assembly):
Godbolt site with five columns
As you can see, the differences in the assemblies are obvious, but I can't figure out exactly where the imprecision comes from. So the question is: which assembler instruction(s) are responsible for this?
A follow-up question is: Is there a possibility to avoid this imprecision with "-Ofast -march=skylake" by reformulating the C-program?
Comments and another answer have pointed out the specific transformation that's happening in your case, with a reciprocal and an FMA instead of a division.
Is there a possibility to avoid this imprecision with "-Ofast -march=skylake" by reformulating the C-program?
Not in general.
-Ofast is (currently) a synonym for -O3 -ffast-math.
See https://gcc.gnu.org/wiki/FloatingPointMath
Part of -ffast-math is -funsafe-math-optimizations, which as the name implies, can change numerical results. (With the goal of allowing more optimizations, like treating FP math as associative to allow auto-vectorizing the sum of an array with SIMD, and/or unrolling with multiple accumulators, or even just rearranging a sequence of operations within one expression to combine two separate constants.)
This is exactly the kind of speed-over-accuracy optimization you're asking for by using that option. If you don't want that, don't enable all of the -ffast-math sub-options, only the safe ones like -fno-math-errno / -fno-trapping-math. (See How to force GCC to assume that a floating-point expression is non-negative?)
There's no way of formulating your source to avoid all possible problems.
Possibly you could use volatile tmp vars all over the place to defeat optimization between statements, but that would make your code slower than regular -O3 with the default -fno-fast-math. And even then, calls to library functions like sin or log may resolve to versions that assume the args are finite, not NaN or infinity, because of -ffinite-math-only.
GCC issue with -Ofast? points out another effect: isnan() is optimized into a compile-time 0.
From the comments, it seems that, for -O3, the compiler computes 1. - ((double)i * i) / sqr_N_min_1:
Convert i to double and square it.
Divide that by sqr_N_min_1.
Subtract that from 1.
and, for -Ofast, computes it:
Prior to the loop, calculate the reciprocal of sqr_N_min_1.
Convert i to double and square it.
Compute the fused multiply-subtract of 1 minus the square times the reciprocal.
The latter improves speed because it calculates the division only once, and multiplication is much faster than division in the target processors. On top of that, the fused operation is faster than a separate multiplication and subtraction.
The error occurs because the reciprocal operation introduces a rounding error that is not present in the original expression (1/25 is not exactly representable in a binary format, while 25/25 of course is). This is why the compiler does not make this optimization when it is attempting to provide strict floating-point semantics.
Additionally, simply multiplying the reciprocal by 25 would erase the error. (This is somewhat by “chance,” as rounding errors vary in complicated ways. 1./25*25 produces 1, but 1./49*49 does not.) But the fused operation produces a more accurate result (it produces the result as if the product were computed exactly, with rounding occurring only after the subtraction), so it preserves the error.
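To make that concrete, the -Ofast transformation can be reproduced by hand with fma() from <math.h> (a sketch for i = -5, i.e. n = 6; the compiler's generated code differs in detail, but the rounding effect is the same):

#include <math.h>
#include <stdio.h>

int main(void) {
    double sqr = 25.0;              /* sqr_N_min_1 for n = 6, i = -5 */
    double x   = 25.0;              /* (double)i * i */

    double strict = 1. - x / sqr;   /* what -O3 computes: exactly 0.0 */

    /* What -Ofast effectively does: hoist the reciprocal out of the loop,
       then fuse the multiply and subtract. 1.0/25.0 is not exactly
       representable, and the fused operation preserves that tiny error. */
    double recip = 1.0 / sqr;
    double fast  = fma(-x, recip, 1.0);

    /* strict prints 0, fast prints a small nonzero value like -2.081668e-17 */
    printf("strict = %e\nfast   = %e\n", strict, fast);
    return 0;
}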
I have to raise 10 to the power of a double a lot of times.
Is there a more efficient way to do this than with the math library pow(10,double)? If it matters, my doubles are always negative between -5 and -11.
I assume pow(double,double) uses a more general algorithm than is required for pow(10,double) and might therefore not be the fastest method. Given some of the answers below, that might have been an incorrect assumption.
As for the why, it is for logarithmic interpolation.
I have a table of x and y values.
My object has a known x value (which is almost always a double).
double Dbeta(struct Data *diffusion, double per){
    double frac;
    int i = 1;   /* not declared in the original snippet; assumed to start at 1 so x[i-1] is valid */
    while (per > diffusion->x[i]) {
        i++;
    }
    frac = (per - diffusion->x[i-1]) / (diffusion->x[i] - diffusion->x[i-1]);
    return pow(10, log10DB[i-1] + frac * (log10DB[i] - log10DB[i-1]));
}
This function is called a lot of times.
I have been told to look into profiling, so that is what I will do first.
I have just been told I could have used natural logarithms instead of base 10, which is obviously right. (my stupidity sometimes amazes even myself.)
After replacing everything with natural logarithms, everything runs a bit faster. With profiling (which is a new word I learned today) I found out that 39% of the time is spent in the exp function, so for those who wondered whether it was in fact this part that was bottlenecking my code: it was.
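For what it's worth, a sketch of the natural-log variant mentioned in this update could look like the following (lnDB is an assumed name for a table holding log(DB[i]) precomputed once; the struct layout mirrors the original snippet):

#include <math.h>

double Dbeta_ln(struct Data *diffusion, double per, const double *lnDB)
{
    int i = 1;   /* assumed starting index, as in the original function */
    while (per > diffusion->x[i]) {
        i++;
    }
    double frac = (per - diffusion->x[i-1]) /
                  (diffusion->x[i] - diffusion->x[i-1]);
    /* Interpolate in log space, then use exp() instead of pow(10, ...). */
    return exp(lnDB[i-1] + frac * (lnDB[i] - lnDB[i-1]));
}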
For pow(10.0, n) it should be faster to set c = log(10.0), which you can compute once, then use exp(c*n), which should be significantly faster than pow(10.0, n) (which is basically doing that same thing internally, except it would be calculating log(10.0) over and over instead of just once). Beyond that, there probably isn't much else you can do.
Yes, the pow function is slow (roughly 50x the cost of a multiply, for those asking for benchmarks).
By some log/exponents trickery, we can express 10^x as
10^x = exp(log(10^x)) = exp(x * log(10)).
So you can implement 10^x with exp(x * M_LN10), which should be more efficient than pow.
If double accuracy isn't critical, use the float version of the function expf (or powf), which should be more efficient than the double version.
If rough accuracy is OK, precompute a table over the [-11, -5] range and do a quick lookup with linear interpolation.
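Putting the first two suggestions together as small helpers (the names are just illustrative; M_LN10 is ln(10) from <math.h>, though it may need a feature-test macro such as _GNU_SOURCE or _USE_MATH_DEFINES on some setups):

#include <math.h>

/* 10^x computed as exp(x * ln(10)). */
static inline double pow10_via_exp(double x)
{
    return exp(x * M_LN10);
}

/* Single-precision variant, if double accuracy is not critical. */
static inline float pow10_via_expf(float x)
{
    return expf(x * (float)M_LN10);
}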
Some benchmarks (using glibc 2.31):
Benchmark                   Time
--------------------------------
pow(10, x)                  15.54 ns
powf(10, x)                  7.18 ns
expf(x * (float)M_LN10)      3.45 ns
From the execution-time perspective, is using the modulus operator more beneficial, or the manual way of doing it, if I am supposed to do the modulus operation a large number of times, about 10^6 times?
Manually doing (number % mod_number) :
while (number >= mod_number) {
    number = number - mod_number;
}
Doing the same thing using % operator :
number = number % mod_number;
From what I have tested, manually doing it gives better time performance.
How is the modulus operator defined? I know the outputs for negative numbers are implementation-defined; I am asking about the working of the operator, i.e., its complexity, so that I can justify the better manual performance.
Note : The question is specifically for implementation in C.
The code snippet:
for (j = 0; j < idx; j++) {
    num = mark[j];
    dif = k - num;
    if (dif < 0) dif = (-1 * dif) + 100;
    many = count[num];
    prev = ap[dif][k];
    ap[dif][k] = ap[dif][k] + ap[dif][num];
    // the manual way here works faster than %
    if (ap[dif][k] >= mod) ap[dif][k] -= mod;
    ap[dif][k] += many;
    if (ap[dif][k] >= mod) ap[dif][k] -= mod;
    sum = (sum + ap[dif][k]);
    if (sum >= mod) sum -= mod;
    sum = sum - prev;
}
The above loop is executed 2*(10^5)*t times, with 'idx' gradually increasing up to 100 for each 't'. I used t = 10.
I would be very surprised if the loop were more efficient when number is many times larger than mod_number. Any CPU you're likely to use has a built-in division operation that returns both the quotient and the remainder in constant time, and this will be used to implement the % operator. Your loop takes O(number/mod_number) time.
I suggest you take a look at the generated assembly code for the two versions and you'll see this.
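For instance, a tiny test file makes this easy to see (file and function names are just illustrative):

/* mod.c -- compile with: gcc -O2 -S mod.c, then inspect mod.s */
unsigned mod_op(unsigned number, unsigned mod_number)
{
    return number % mod_number;   /* typically lowered to a single div instruction on x86 */
}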
It depends on the implementation. It is pointless to discuss performance without a given system in mind.
The modulus operator will likely be implemented through the CPU's division instruction, which on most CPUs is relatively slow in comparison to other CPU instructions. However, it seems highly unlikely that a loop like the one in your example will be more efficient.
More likely, the performance difference you are experiencing is either related to wrong optimization settings or incorrect benchmarking.
In my experience, using the modulus operator should give you better performance. The people who write C compilers will have considered how to optimize the operation they are implementing.
But your test results show otherwise, which may depend on the code you have written. It would be easier to find out why if you showed your code...
The example you have shown (not that while loop at the top, the snippet at the bottom) is a case where the "divisor" is only subtracted at most once. That is essentially the one case in which "repeated" subtraction (0 or 1 times, a special case of repeated subtraction) can be (and commonly is, but not necessarily) faster than division-based modulo. Obviously it depends on how fast division is on the target, how fast a test/branch (or test/predicated instruction) is on the target, and in the case of branches it even depends on how predictable the branch will be.
A compiler is unlikely to make that optimization (but it's not impossible), because it only makes sense if it is known that the subtraction will only happen at most once (or perhaps more than one, if division is especially slow on the target, but some lowish bound is still needed), which is in general a hard thing to find out for a compiler.
To give some real-life numbers: on Haswell, 32-bit signed division (and therefore also modulo) would take 22 to 29 cycles, and a branch misprediction might take up to 20 cycles, but that's a worst case and the branch should not be mispredicted all the time. Also, you could avoid the branch (if it's badly predicted) and do something like this (not tested, just to give you some idea):
sub   eax, edx          ; eax = number - mod_number
lea   edx, [eax + edx]  ; edx = the original number (lea does not touch the flags)
cmovl eax, edx          ; if the subtraction went negative, keep the original value
Which should only take about 4 cycles, independent of any predictability. Using a branch may be faster if it can be predicted well.
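In C, the same idea can be written as a conditional subtract, which compilers will usually turn into that kind of branchless sub/cmov sequence (a sketch, not benchmarked against the original code; it assumes the value is already below 2*mod, as in the loop above):

#include <stdint.h>

/* Reduce x into [0, mod), assuming 0 <= x < 2*mod. */
static inline int32_t reduce_once(int32_t x, int32_t mod)
{
    int32_t r = x - mod;
    return (r < 0) ? x : r;   /* typically compiles to sub + cmov, no branch */
}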
I have several variables listed below:
int cpu_time_b = 6;
float clock_cycles_a = 2 * pow(10, 10);
float cpi_a = 2.0;
int cycle_time_a = 250;
float cpi_b = 1.2;
int cycle_time_b = 500;
I am working out the clock rate of b with the following calculation:
(((1.2*clock_cycles_a)/cpu_time_b)/(1 * pow(10, 9)))
Clearly the answer should be 4; however, my program is outputting 6000000204800000000.0 as the answer.
I think that overflow is possibly happening here. Is this the case and if so, how could I fix the problem?
All calculations should be set up so that comparable numbers are "reduced" together. In your example, it seems that only
cpu_time_b
is truly variable (it is undefined in the scope of your snippet). All the other variables appear to be constants. All constants should be computed before compilation, especially if they are susceptible to causing overflow.
clock_cycles_a
cancels the denominator. pow is time-consuming (which may not be critical here) and not always that precise. You multiply by 2 explicitly when you declare clock_cycles_a and then use 1.2 below, etc. Reducing the whole thing, keeping only the actual variable, gives:
24.0 / cpu_time_b
which makes me deduce that cpu_time_b should be 6?
Finally, while you show the equation, we have no idea what you do with the result. Do you store it in a variable of the wrong type? printf it with the wrong format? etc.?
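As a quick illustration of that reduction (a sketch assuming cpu_time_b is 6 and that the result is printed as a double with %f):

#include <stdio.h>

int main(void) {
    int cpu_time_b = 6;                       /* the only real variable */
    /* 1.2 * 2e10 / 1e9 folds to 24.0, so the whole expression reduces to: */
    double clock_rate_b = 24.0 / cpu_time_b;  /* = 4.0 */
    printf("%f\n", clock_rate_b);             /* prints 4.000000 */
    return 0;
}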