I have a simple function that sums elements from an array and returns the total:
// Called with jump == 0
int performance(int jump, int *array, int size) {
    int currentIndex = 0;
    int total = 0;
    // For i in 1...500_000_000
    for (int i = 0; i < 500000000; i++) {
        currentIndex = (currentIndex + jump) % size;
        total += array[currentIndex];
    }
    return total;
}
I noticed a weird behavior: the presence of % size has a very large performance impact (~10x slower) even though jump is 0, so it is constantly accessing the same array element (0). Just removing % size improves performance a lot.
I would have thought the modulo computation itself was making this difference, but if I replace my sum line with total += array[currentIndex] % size; (thus also computing a modulo), the performance difference is almost unnoticeable.
I am compiling this with -O3 with clang on an arm64 machine.
What could be causing this?
It sounds normal for sdiv+msub latency to be about 10x add latency.
Even if this inlined for a compile-time-constant size that wasn't a power of two, that's still a multiplicative inverse and an msub (multiply-subtract) to get the remainder, so a dep chain of at least two multiplies and a shift.
Maybe an extra few instructions on the critical path for a signed remainder with a constant size (even if positive), since the operands are also signed int. e.g. -4 % 3 has to produce -1 in C.
See
How many CPU cycles are needed for each assembly instruction?
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
say I replace my sum line with total += array[currentIndex] % size; (thus also computing a modulo)
That remainder isn't part of a loop-carried dependency chain. (https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-introduction-to-dataflow-graphs/)
Multiple remainder calculations can be in flight in parallel, since the next array[idx] load address only depends on a += jump add instruction.
If you don't bottleneck on throughput limits, those remainder results could potentially be ready with 1/clock throughput, with OoO exec overlapping dep chains between iterations. The only latency bottlenecks are the loop counter/index and total += ..., both of which are just integer add which has 1 cycle latency.
So really, the bottleneck is likely going to be on throughput (of the whole loop body), not those latency bottlenecks, unless you're testing on an extremely wide CPU that can get a lot done every cycle. (Surprised you don't get more slowdown from introducing the % at all. Unless total is getting optimized away after inlining if you're not using the result.)
Related
I was wondering why a simple loop such as this one can't hit my CPU clock speed (4.2 GHz):
float sum = 0;
for (int i = 0; i < 1000000; i += 1) {
    sum = sum * 1 + 1;
}
Intuitively I would expect to achieve this in less than 1 ms (more like 0.238 ms), doing 4.2 billion iterations per second. But I get about 3 ms, which is about 333 million iterations per second.
I assume doing the math is 2 cycles, one for the multiplication and another for the sum. So let's say I'm doing 666 million operations... still seems slow. Then I assumed that the loop comparison takes a cycle and the loop counter takes another cycle...
So I created the following code to remove the loop...
void listOfSums() {
    float internalSum = 0;
    internalSum = internalSum * 1 + 1;
    internalSum = internalSum * 1 + 1;
    internalSum = internalSum * 1 + 1;
    internalSum = internalSum * 1 + 1;
    // ... repeated 100k times ...
}
To my surprise it's slower: now it takes 10 ms, leading to only 100 million iterations per second.
Given that modern CPUs use pipelining, out-of-order execution, branch prediction... it seems that I'm unable to saturate the 4.2 GHz speed by just doing two floating-point operations inside a loop.
Is it then safe to assume that the 4.2 GHz is only achievable with SIMD fully saturating the CPU core with tasks, and that a simple loop will get you about 1/6 of the GHz floating-point performance? I've tried different processors and 1/6 seems to be in the ballpark (Intel, iPhone, iPad).
What exactly is the bottleneck? The CPU's ability to parse the instructions? Can it only be surpassed with SIMD?
It is typical that a current processor can issue one or more floating-point additions in each processor cycle and can issue one or more floating-point multiplications in each cycle. It is also typical that a floating-point addition or multiplication will take four cycles to complete. This means, once you have started four floating-point additions—one in cycle n, one in cycle n+1, one in cycle n+2, and one in cycle n+3—the processor may be completing one addition per cycle—one in cycle n+4 (while a new one also starts in cycle n+4), one in n+5, and so on.
However, in order to start a floating-point operation, the inputs to that operation must be ready. Once sum * 1 has been started in cycle n, its result might not be ready until cycle n+4. So the addition of 1 will start in cycle n+4. And that addition will not complete until cycle n+8. And then the multiplication in the next iteration that uses that result cannot start until cycle n+8. Thus, with the nominal loop structure shown, one floating-point addition or multiplication will be completed every four cycles.
If instead you try:
float sum0 = 0;
float sum1 = 0;
float sum2 = 0;
float sum3 = 0;
for (int i = 0; i < 1000000; i += 1)
{
    sum0 = sum0 * 1 + 1;
    sum1 = sum1 * 1 + 1;
    sum2 = sum2 * 1 + 1;
    sum3 = sum3 * 1 + 1;
}
then you may find four times as many floating-point operations are completed in the same time.
These details vary from processor model to processor model. Some processors might start working on certain instructions before all of their inputs are ready, some might offer early forwarding of results directly to other instruction execution units before the result is delivered to the regular result register, and so on, so the obtained performance is hugely dependent on processor characteristics.
Regarding the listOfSums example, the code grossly exceeds the size of L1 cache, and therefore the processor must load each instruction from memory before executing it, which greatly reduces the performance.
Given this piece of code:
int x[2][128];
int i;
int sum = 0;
for (i = 0; i < 128; i++) {
    sum += x[0][i] * x[1][i];
}
Assuming we execute this under the following conditions:
sizeof(int) = 4.
Array x begins at memory address 0x0 and is stored in row-major order.
In each case below, the cache is initially empty.
The only memory accesses are to the entries of the array x. All other variables are stored in registers.
Given these assumptions,
estimate the miss rates for the following cases: Assume the cache is
512 bytes, direct-mapped, with 16-byte cache blocks.
Given this info, I know that there are 32 sets (obtained from 512/16) in this cache, and each 16-byte block holds 4 ints, so the first block loaded holds x[0][i] through x[0][i+3].
But for the second access, x[1][i], how do I know whether this load will overwrite the first load of x[0][i], x[0][i+1], x[0][i+2], x[0][i+3]? Or will x[1][i], x[1][i+1], x[1][i+2], x[1][i+3] be stored in a different set from the first load? I am confused about how the loading into the cache will be done for this piece of code.
What would be the miss rate of this?
Any help is appreciated please :)
In general it's impossible to predict what will happen in the cache system solely by looking at the C code. In order to do that you at least need to see the generated machine code.
Remember that the compiler is allowed to do all kinds of optimization tricks as long as the final result and side effects are the same.
So in principle a clever compiler could turn the code into:
for (i = 0; i < 128; i += 4) {
    regA = x[0][i];
    regB = x[0][i+1];
    regC = x[0][i+2];
    regD = x[0][i+3];
    sum += regA * x[1][i];
    sum += regB * x[1][i+1];
    sum += regC * x[1][i+2];
    sum += regD * x[1][i+3];
}
which would change the cache usage completely. Besides that, there may be optimization tricks at the HW level that you can't even see from the machine code.
Anyway - if we assume a "direct non-optimized" compilation then you will have 2 cache misses every time you do sum += x[0][i] * x[1][i];
The reason is that the distance between x[0][i] and x[1][i] is 128 * 4 = 512 bytes, which is exactly the cache size. Consequently, x[0][i] and x[1][i] map to the same cache line, which means the data read after the first cache miss will be overwritten by the data read after the second cache miss.
So there won't be any cache hits at all. You'll get 2 * 128 = 256 misses and a 100% miss rate.
I am making a simple C program to know the way of associativity of my CPU.
I know:
My cache size is 32Kb (L1) and the line size is 64 bytes. From there I know there are 500 lines.
My approach is to access the first 8192 elements of an integer array (32 KB) and see where it takes longer; if it takes longer at every x-th iteration, then x is the associativity.
However, the result I get shows nothing:
Here is my C code:
void run_associativity_test() {
    int j = 1;
    // 8192 * 4 bytes (int) is 32 KB
    while (j <= 8192 * 2) {
        get_element_access_time(j);
        j = j + 1;
    }
}

double get_element_access_time(int index) {
    struct timespec start_t, end_t;
    double start, end, delta;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_t);
    arr[index] += 1;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end_t);
    start = 1000000000 * start_t.tv_sec + start_t.tv_nsec;
    end = 1000000000 * end_t.tv_sec + end_t.tv_nsec;
    delta = end - start;
    if (mode == 2 || mode == 3) {
        printf("%d, %lf\n", index, delta);
    }
    return delta;
}
Is my approach wrong? How should I do it?
Also, I found a paper here that explains how to measure associativity, although I couldn't understand it very well. I would be thankful if someone could briefly explain the method in the paper for measuring associativity.
Thanks!
This might be more of a comment than an answer, but it's too big to post it as a comment.
I know: My cache size is 32Kb (L1) and the line size is 64 bytes. From
there I know there are 500 lines.
The size of the cache is 2^15 bytes. So there are 2^15/2^6 = 2^9 = 512 cache lines.
while (j <= 8192 * 2) {
I thought the size of the array is 8192 ints, not (8192 * 2) + 1 ints.
get_element_access_time(j);
j = j + 1;
A cache line can hold 16 ints. Accessing the elements of the array sequentially would result in a miss ratio of at most 1/16, depending on the L1D prefetcher. It's difficult to estimate the number of ways in the L1D cache using that access pattern. I think the best way to do that is to thrash the same cache set.
Let's forget about the L1D prefetcher for the moment. Also let's only consider L1D caches that use bits 6-11 of the memory address or a subset thereof as a cache set index. For example, if the cache was 8-way associative, then there would be 2^9/2^3 = 64 sets, which means that all of the bits 6-11 are used for the index.
How to check whether the cache is 8-way associative? By accessing the same 8 cache lines that would map to the same cache set many times (such as a million or more times). If the associativity of the cache is at least 8, the execution time should be better than if the associativity is less than 8. That's because in the former case there would be only 8 misses (to the 8 cache lines) but in the latter case there would be many misses since not all cache lines can exist at the same time in the L1D cache. To make your measurements as accurate as possible, we would like to maximize the L1D miss penalty. One possible way to do that is by writing to the L1D instead of reading. This forces the L1D to write back all evicted cache lines, which will hopefully have a measurable impact on performance. Another way to do this is to maximize the number of L2D misses.
It's relatively easy to write a program that exhibits such an access pattern. Once you know whether the associativity is smaller than 8 or not, you can further close in on the associativity by similarly testing other, smaller ranges of associativities. Note that you only need to write to one of the elements in a cache line. Also, it's important to make sure to flush each write out of the write buffer of the core. Otherwise, many writes might just be performed on the write buffer rather than the cache. Essentially this can be done by using the volatile keyword (I think?) or store fences.
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_t);
arr[index] += 1;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end_t);
This doesn't make any sense. The resolution of the timer is not high enough to precisely measure the latency of a single memory write operation. So you should measure the execution time of all the accesses together.
The L1D prefetcher may interfere with the measurements, potentially making the cache appear to have a higher associativity than it really is. Switch it off if possible.
If the L1D cache uses bits other than 6-11 to index the cache, virtual memory comes into play, which would make it much more complicated to accurately estimate associativity.
I am trying to optimize a code in C, specifically a critical loop which takes almost 99.99% of total execution time. Here is that loop:
#pragma omp parallel shared(NTOT,i) num_threads(4)
{
    #pragma omp for private(dx,dy,d,j,V,E,F,G) reduction(+:dU) nowait
    for (j = 1; j <= NTOT; j++) {
        if (j == i) continue;
        dx = (X[j][0]-X[i][0])*a;
        dy = (X[j][1]-X[i][1])*a;
        d = sqrt(dx*dx+dy*dy);
        V = (D/(d*d*d))*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);
        E = dS[0]*dx+dS[1]*dy;
        F = spin[2*j-2]*dx+spin[2*j-1]*dy;
        G = -3*(D/(d*d*d*d*d))*E*F;
        dU += (V+G);
    }
}
All variables are local. The loop takes 0.7 second for NTOT=3600 which is a large amount of time, especially when I have to do this 500,000 times in the whole program, resulting in 97 hours spent in this loop. My question is if there are other things to be optimized in this loop?
My computer's processor is an Intel core i5 with 4 CPU(4X1600Mhz) and 3072K L3 cache.
Optimize for hardware or software?
Soft:
Getting rid of time-consuming exceptions such as divide-by-zero:
d = sqrt(dx*dx+dy*dy + 0.001f );
V = (D/(d*d*d))*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);
You could also try John Carmack, Terje Mathisen and Gary Tarolli's "Fast inverse square root" for the
D/(d*d*d)
part. You get rid of division too.
float qrsqrt=q_rsqrt(dx*dx+dy*dy + easing);
qrsqrt=qrsqrt*qrsqrt*qrsqrt * D;
with sacrificing some precision.
There is another division also to be gotten rid of:
(D/(d*d*d*d*d))
such as
qrsqrt_to_the_power2 * qrsqrt_to_the_power3 * D
Here is the fast inverse sqrt:
float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                        // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );                // what ?
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );    // 1st iteration
    // y = y * ( threehalfs - ( x2 * y * y ) );  // 2nd iteration, this can be removed
    return y;
}
To overcome big arrays' non-caching behaviour, you can do the computation in smaller patches/groups, especially when it is a many-to-many O(N*N) algorithm. Such as:
get 256 particles.
compute 256 x 256 relations.
save 256 results on variables.
select another 256 particles as target(saving the first 256 group in place)
do same calculations but this time 1st group vs 2nd group.
save first 256 results again.
move to 3rd group
repeat.
do the same until all particles have been tested against the first 256 particles.
Now get second group of 256.
iterate until all groups of 256 are complete.
Your CPU has a big cache, so you could try 32k particles versus 32k particles directly. But L1 may not be big enough, so I would stick with 512 vs 512 (or 500 vs 500 to avoid cache-line conflicts; this is going to depend on the architecture) if I were you.
Hard:
SSE, AVX, GPGPU, FPGA .....
As @harold commented, SSE should be the starting point for comparison: you should vectorize, or at least parallelize through 4-wide packed vector instructions, which have the advantage of optimal memory fetching and pipelining. When you need 3x-10x more performance (on top of an SSE version using all cores), you will need an OpenCL/CUDA-capable GPU (equally priced as the i5) and the OpenCL (or CUDA) API; you could also learn OpenGL, but it seems harder (maybe DirectX is easier).
Trying SSE is easiest and should give about 3x over the fast inverse square root I mentioned above. An equally priced GPU should give at least another 3x over SSE for thousands of particles. Going over 100k particles, a whole GPU can achieve 80x the performance of a single CPU core for this type of algorithm when you optimize it enough (making it less dependent on main memory). OpenCL gives you the ability to keep your arrays in fast local memory, so you can use terabytes/s of bandwidth.
I would always do random pausing
to pin down exactly which lines were most costly.
Then, after fixing something I would do it again, to find another fix, and so on.
That said, some things look suspicious.
People will say the compiler's optimizer should fix these, but I never rely on that if I can help it.
X[i], X[j], spin[2*j-1(and 2)] look like candidates for pointers. There is no need to do this index calculation and then hope the optimizer can remove it.
You could define a variable d2 = dx*dx+dy*dy and then say d = sqrt(d2). Then wherever you have d*d you can instead write d2.
I suspect a lot of samples will land in the sqrt function, so I would try to figure a way around using that.
I do wonder if some of these quantities like (dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]) could be calculated in a separate unrolled loop outside this loop. In some cases two loops can be faster than one if the compiler can save some registers.
I cannot believe that 3600 iterations of an O(1) loop can take 0.7 seconds. Perhaps you meant the double loop with 3600 * 3600 iterations? Otherwise I can suggest checking if optimization is enabled, and how long threads spawning takes.
General
Your inner loop is very simple and it contains only a few operations. Note that divisions and square roots are roughly 15-30 times slower than additions, subtractions and multiplications. You are doing three of them, so most of the time is eaten by them.
First of all, you can compute reciprocal square root in one operation instead of computing square root, then getting reciprocal of it. Second, you should save the result and reuse it when necessary (right now you divide by d twice). This would result in one problematic operation per iteration instead of three.
invD = rsqrt(dx*dx+dy*dy);
V = (D * (invD*invD*invD))*(...);
...
G = -3*(D * (invD*invD*invD*invD*invD))*E*F;
dU += (V+G);
In order to further reduce time taken by rsqrt, I advise vectorizing it. I mean: compute rsqrt for two or four input values at once with SSE. Depending on size of your arguments and desired precision of result, you can take one of the routines from this question. Note that it contains a link to a small GitHub project with all the implementations.
Indeed you can go further and vectorize the whole loop with SSE (or even AVX), that is not hard.
OpenCL
If you are ready to use some big framework, then I suggest using OpenCL. Your loop is very simple, so you won't have any problems porting it to OpenCL (except for some initial adaptation to OpenCL).
Then you can use CPU implementations of OpenCL, e.g. from Intel or AMD. Both of them would automatically use multithreading. Also, they are likely to automatically vectorize your loop (e.g. see this article). Finally, there is a chance that they would find a good implementation of rsqrt automatically, if you use native_rsqrt function or something like that.
Also, you would be able to run your code on GPU. If you use single precision, it may result in significant speedup. If you use double precision, then it is not so clear: modern consumer GPUs are often slow with double precision, because they lack the necessary hardware.
Minor optimisations:
(d * d * d) is calculated twice. Store d*d and use it for d^3 and d^5
Replace 2 * x with x << 1;
I want to rewrite this simple routine as SSE2 code (preferably in nasm), and I am not totally sure how to do it. Two things are unclear: how to express the calculations (the inner loop, and those from the outer loop too), and how to call the C function SetPixelInDibInt(i, j, palette[n]); from statically linked asm code.
void DrawMandelbrotD(double ox, double oy, double lx, int N_ITER)
{
    double ly = lx * double(CLIENT_Y)/double(CLIENT_X);
    double dx = lx / CLIENT_X;
    double dy = ly / CLIENT_Y;
    double ax = ox - lx * 0.5 + dx * 0.5;
    double ay = oy - ly * 0.5 + dy * 0.5;
    static double re, im, re_n, im_n, c_re, c_im, rere, imim;
    int n;

    for (int j = 0; j < CLIENT_Y; j += 1)
    {
        for (int i = 0; i < CLIENT_X; i += 1)
        {
            c_re = ax + i * dx;
            c_im = ay + j * dy;
            re = c_re;
            im = c_im;
            rere = re*re;
            imim = im*im;
            n = 1;
            for (int k = 0; k < N_ITER; k++)
            {
                im = (re+re)*im + c_im;
                re = rere - imim + c_re;
                rere = re*re;
                imim = im*im;
                if ( (rere + imim) > 4.0 )
                    break;
                n++;
            }
            SetPixelInDibInt(i, j, palette[n]);
        }
    }
}
Could someone help? I would prefer not to see other code implementations, just a nasm/SSE translation of the above - it would be most helpful in my case.
Intel has a complete implementation as an AVX example. See below.
What makes Mandelbrot tricky is that the early-out condition for each point in the set (i.e. pixel) is different. You could keep a pair or quad of pixels iterating until the magnitudes of all of them exceed 2.0 (or you hit max iterations). To do otherwise would require tracking which pixel's point is in which vector element.
Anyway, a simplistic implementation to operate on a vector of 2 (or 4 with AVX) doubles at a time would have its throughput limited by the latency of the dependency chains. You'd need to do multiple dependency chains in parallel to keep both of Haswell's FMA units fed. So you'd duplicate your variables, and interleave operations for two iterations of the outer loop inside the inner loop.
Keeping track of which pixels are being calculated would be a little tricky. I think it might take less overhead to use one set of registers for one row of pixels, and another set of registers for another row. (So you can always just move 4 pixels to the right, rather than checking whether the other dep chain is already processing that vector.)
I suspect that only checking the loop exit condition every 4 iterations or so might be a win. Getting code to branch based on a packed vector comparison is slightly more expensive than in the scalar case. The extra FP add required is also expensive. (Haswell can do two FMAs per cycle (latency = 5). The lone FP add unit is on the same port as one of the FMA units. The two FP mul units are on the same ports that can run FMA.)
The loop condition can be checked with a packed-compare to generate a mask of zeros and ones, and a (V)PTEST of that register with itself to see if it's all zero. (edit: movmskps then test+jcc is fewer uops, but maybe higher latency.) Then obviously je or jne as appropriate, depending on whether you did a FP compare that leaves zeros when you should exit, or zeros when you shouldn't. NAN shouldn't be possible, but there's no reason not to choose your comparison op such that a NAN will result in the exit condition being true.
const __m256d const_four = _mm256_set1_pd(4.0); // outside the loop
__m256d cmp_result = _mm256_cmp_pd(mag_squared, const_four, _CMP_LE_OQ); // vcmppd: an element is all-ones if it's still <= 4.0
if (_mm256_testz_si256(_mm256_castpd_si256(cmp_result), _mm256_castpd_si256(cmp_result)))
    break;  // no element is still below the threshold
There MIGHT be some way to use PTEST directly on a packed-double, with some bit-hack AND-mask that will pick out bits that will be set iff the FP value is > 4.0. Like maybe some bits in the exponent? Maybe worth considering. I found a forum post about it, but didn't try it out.
Hmm, oh crap, this doesn't record WHEN the loop condition failed, for each vector element separately, for the purpose of coloring the points outside the Mandelbrot set. Maybe test for any element hitting the condition (instead of all), record the result, and then set that element (and c for that element) to 0.0 so it won't trigger the exit condition again. Or maybe scheduling pixels into vector elements is the way to go after all. This code might do fairly well on a hyperthreaded CPU, since there will be a lot of branch mispredicts with every element separately triggering the early-out condition.
That might waste a lot of your throughput, and given that 4 uops per cycle is doable, but only 2 of them can be FP mul/add/FMA, there's room for a significant amount of integer code to schedule points into vector elements. (On Sandybridge/Ivybridge, without FMA, FP throughput is lower. But there are only 3 ports that can handle integer ops, and 2 of those are the ports for the FP mul and FP add units.)
Since you don't have to read any source data, there's only 1 memory access stream for each dep chain, and it's a write stream. (And it's low bandwidth, since most points take a lot of iterations before you're ready to write a single pixel value.) So the number of hardware prefetch streams isn't a limiting factor for the number of dep chains to run in parallel. Cache misses latency should be hidden by write buffers.
I can write some code if anyone's still interested in this (just post a comment). I stopped at the high-level design stage since this is an old question, though.
==============
I also found that Intel already used the Mandelbrot set as an example for one of their AVX tutorials. They use the mask-off-vector-elements method for the loop condition. (using the mask generated directly by vcmpps to AND). Their results indicate that AVX (single-precision) gave a 7x speedup over scalar float, so apparently it's not common for neighbouring pixels to hit the early-out condition at very different numbers of iterations. (at least for the zoom / pan they tested with.)
They just let the FP results keep accumulating for elements that fail the early-out condition. They just stop incrementing the counter for that element. Hopefully most systems default to having the control word set to zero out denormals, if denormals still take extra cycles.
Their code is silly in one way, though: They track the iteration count for each vector element with a floating-point vector, and then convert it to int at the end before use. It'd be faster, and not occupy an FP execution unit, to use packed-integers for that. Oh, I know why they do that: AVX (without AVX2) doesn't support 256bit integer vector ops. They could have used packed 16bit int loop counters, but that could overflow. (And they'd have to compress the mask down from 256b to 128b).
They also test for all elements being > 4.0 with movmskps and then test that, instead of using ptest. I guess the test / jcc can macro-fuse, and run on a different execution unit than FP vector ops, so it's maybe not even slower. Oh, and of course AVX (without AVX2) doesn't have 256bit PTEST. Also, PTEST is 2 uops, so actually movmskps + test / jcc is fewer uops than ptest + jcc. (PTEST is 1 fused-domain uop on SnB, but still 2 unfused uops for the execution ports. On IvB/HSW, 2 uops even in the fused domain.) So it looks like movmskps is the optimal way, unless you can take advantage of the bitwise AND that's part of PTEST, or need to test more than just the high bit of each element. If a branch is unpredictable, ptest might be lower latency, and thus be worth it by catching mispredicts a cycle sooner.