The line of code
next += val;
slows the code down by roughly 10x. I have checked the generated assembly, not just the result. Why does this one line reduce performance by 10x?
Here is the result:
➜ ~ clang-13 1.c -O3
➜ ~ ./a.out
rand_read_1
sum = 2624b18779c40, time = 0.19s
rand_read_2
sum = 2624b18779c40, time = 1.24s
CPU: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Here is the code:
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <unistd.h>
#include <string.h>
#include <assert.h>
#include <stdlib.h>
#define CCR_MULTIPLY_64 6364136223846793005
#define CCR_ADD_64 1
static inline uint64_t my_rand64(uint64_t *r)
{
    *r = *r * CCR_MULTIPLY_64 + CCR_ADD_64;
    return *r;
}
#define NUM 10000000UL
uint64_t rand_read_1(uint64_t *ptr, uint64_t nr_words)
{
    uint64_t i, next, val = 0;
    uint64_t sum;

    next = 0;
    sum = 0;
    for (i = 0; i < NUM; i++) {
        my_rand64(&next);
        next %= nr_words;
        val = ptr[next];
        sum += val ^ next;
        // printf("next1:%ld\n", next);
    }
    return sum;
}

uint64_t rand_read_2(uint64_t *ptr, uint64_t nr_words)
{
    uint64_t i, next, val, next2 = 0;
    uint64_t sum;

    next = 0;
    sum = 0;
    for (i = 0; i < NUM; i++) {
        my_rand64(&next);
        next %= nr_words;
        val = ptr[next];
        sum += val ^ next;
        next += val;
    }
    return sum;
}
#define SIZE (1024*1024*1024)
static uint64_t get_ns(void)
{
    struct timespec val;
    uint64_t v;
    int ret;

    ret = clock_gettime(CLOCK_REALTIME, &val);
    if (ret != 0) {
        perror("clock_gettime");
        exit(1);
    }
    v = (uint64_t) val.tv_sec * 1000000000LL;
    v += (uint64_t) val.tv_nsec;
    return v;
}
int main(int argc, char *argv[])
{
    uint64_t *ptr;
    uint64_t sum;
    uint64_t t0, t1, td, t2;

    ptr = (uint64_t *)malloc(SIZE);
    assert(ptr);
    memset(ptr, 0, SIZE);

    t0 = get_ns();
    printf("rand_read_1\n");
    sum = rand_read_1(ptr, SIZE/8);
    t1 = get_ns();
    td = t1 - t0;
    printf("sum = %lx, time = %.2fs\n", sum, td/1E9);

    printf("rand_read_2\n");
    sum = rand_read_2(ptr, SIZE/8);
    t2 = get_ns();
    td = t2 - t1;
    printf("sum = %lx, time = %.2fs\n", sum, td/1E9);
    return 0;
}
The method of benchmarking is a bit dodgy, but this is a real effect.
next += val; changes something fundamental about the structure of the code: it makes each memory read depend on the result of the previous read. Without that line, the reads are independent (there is a shorter loop-carried dependency chain through my_rand64, which the memory read is not a part of).
Essentially with that line it's a latency benchmark, and without that line, it's a throughput benchmark. Latency and throughput differing by a factor of 10 is reasonable for memory reads.
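One way to convince yourself that it is the dependency chain, and not the extra add itself, is to keep next += val; but run two independent chains in the same loop, so loads from different chains can overlap again. A rough sketch (rand_read_2chains is my name; it reuses the question's my_rand64 and NUM):

uint64_t rand_read_2chains(uint64_t *ptr, uint64_t nr_words)
{
    uint64_t next_a = 0, next_b = 1;          // two chains with different seeds
    uint64_t sum_a = 0, sum_b = 0;

    for (uint64_t i = 0; i < NUM / 2; i++) {
        my_rand64(&next_a);
        my_rand64(&next_b);
        next_a %= nr_words;
        next_b %= nr_words;
        uint64_t val_a = ptr[next_a];
        uint64_t val_b = ptr[next_b];
        sum_a += val_a ^ next_a;
        sum_b += val_b ^ next_b;
        next_a += val_a;                      // each chain still depends on its own load,
        next_b += val_b;                      // but the two chains can overlap in the CPU
    }
    return sum_a + sum_b;
}

Each chain is still a latency chain, but with two of them in flight the time per lookup should drop roughly in half, which is a strong hint that latency, not the extra instruction, is what you are paying for.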
At the assembly level, without that line the asm looks like this when compiled with Clang
.LBB2_3: # =>This Inner Loop Header: Depth=1
imul rcx, r15
add rcx, 1
mov edx, ecx
and edx, 134217727
xor rdx, qword ptr [r14 + 8*rdx]
mov esi, r15d
imul esi, ecx
add rdx, rbx
add esi, 1
and esi, 134217727
mov rbx, qword ptr [r14 + 8*rsi]
xor rbx, rsi
add rbx, rdx
mov rcx, rsi
add rax, -2
jne .LBB2_3
uiCA estimates 9.16 cycles per iteration (the loop was unrolled by a factor of 2, so this corresponds to about 4.5 cycles per iteration of the original loop), but it does not take cache misses into account.
With that line, the assembly looks nearly the same, but that doesn't mean it runs in nearly the same way:
.LBB2_6: # =>This Inner Loop Header: Depth=1
imul ecx, r15d
add ecx, 1
and ecx, 134217727
mov rdx, qword ptr [r14 + 8*rcx]
mov rsi, rcx
xor rsi, rdx
add rsi, rbx
add edx, ecx
imul edx, r15d
add edx, 1
and edx, 134217727
mov rcx, qword ptr [r14 + 8*rdx]
mov rbx, rdx
xor rbx, rcx
add rbx, rsi
add rdx, rcx
mov rcx, rdx
add rax, -2
jne .LBB2_6
Now uiCA estimates 24.11 cycles per iteration (this loop was also unrolled by 2x), again without taking cache misses into account.
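As a rough sanity check against the measured numbers: at 2.20 GHz, ~4.5 cycles per iteration is only about 2 ns, while rand_read_1's 0.19 s over 10 million iterations is about 19 ns per lookup, so even the fast version is limited by the memory system, just with many independent misses in flight at once. rand_read_2's 1.24 s is about 124 ns per lookup, roughly one full DRAM access per iteration, which is what you would expect when each load address depends on the previous load's data.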
Some notes on how not to do benchmarking (benchmarking is very hard):
malloc is expensive and you allocate a lot of memory here. However, the OS may not actually back that memory with physical pages until it notices the memory being used. So unless you force the OS to actually commit the memory before the benchmarking starts, you'll largely be benchmarking how slow that lazy allocation is and nothing else, which is a gigantic performance killer next to your little algorithm.
Additionally, you may allocate more memory than your OS can handle, and since you pass a byte count to malloc, nothing guarantees that you allocate a whole multiple of sizeof(uint64_t), which is a latent bug.
You could do something like this instead (might have to reduce SIZE first):
ptr = (uint64_t *)malloc(SIZE * sizeof(uint64_t));
assert(ptr);

for (size_t i = 0; i < SIZE; i++)
{
    volatile uint64_t tmp = 123;  // some garbage value
    ptr[i] = tmp;                 // the compiler has to load 123 from stack to heap
}
// actual heap allocation should now be done
The memset to zero probably does not count as "touching" the heap memory, and the compiler may even optimize the malloc + memset pair into a single calloc. When I check the optimized code (gcc, x86_64 Linux), this is indeed what happens, so it does not achieve the "touching the heap" effect described above.
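If you prefer memset, a non-zero fill value should be enough to defeat that transformation and still touch every page. A minimal sketch (the fill byte is arbitrary):

ptr = (uint64_t *)malloc(SIZE);
assert(ptr);
memset(ptr, 0xAB, SIZE);   // non-zero fill cannot be turned into calloc,
                           // so every byte (and therefore every page) really gets written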
printf and the stdout buffering behind it are expensive and should never sit inside the timed region; otherwise you might just end up benchmarking how slow printf is. For example, changing your code to
printf("rand_read_1\n");
t0 = get_ns();
sum = rand_read_1(ptr, SIZE/8);
t1 = get_ns();
gave vastly different results.
The last td = t2 - t1; is meaningless, since it measures everything unrelated to your algorithm that happened since the previous measurement, including the printf call and any number of context switches by the OS.
With all these bug fixes applied, main() might look like this instead:
int main(int argc, char *argv[])
{
    uint64_t *ptr;
    uint64_t sum;
    uint64_t t0, t1, td, t2;

    ptr = malloc(SIZE * sizeof(uint64_t));
    assert(ptr);
    for (size_t i = 0; i < SIZE; i++)
    {
        volatile uint64_t tmp = 123;  // some garbage value
        ptr[i] = tmp;                 // the compiler has to load 123 from stack to heap
    }

    printf("rand_read_1\n");
    t0 = get_ns();
    sum = rand_read_1(ptr, SIZE/8);
    t1 = get_ns();
    td = t1 - t0;
    printf("sum = %lx, time = %.2fs\n", sum, td/1E9);

    printf("rand_read_2\n");
    t0 = get_ns();
    sum = rand_read_2(ptr, SIZE/8);
    t1 = get_ns();
    td = t1 - t0;
    printf("sum = %lx, time = %.2fs\n", sum, td/1E9);
    return 0;
}
In addition, I would advise executing method 1, then method 2, then method 1, then method 2 again, to ensure that the benchmark isn't biased by the first algorithm loading values into the cache for the second one to reuse, or by the eventual context switch that will happen some time after launching your program. A sketch of such an interleaving follows.
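That could look roughly like this, replacing the two timed blocks in the main() above (a sketch; four rounds is an arbitrary choice, and the printf calls stay outside the timed regions):

for (int round = 0; round < 4; round++) {
    t0 = get_ns();
    sum = rand_read_1(ptr, SIZE/8);
    t1 = get_ns();
    printf("round %d: rand_read_1 sum = %lx, time = %.2fs\n", round, sum, (t1 - t0)/1E9);

    t0 = get_ns();
    sum = rand_read_2(ptr, SIZE/8);
    t1 = get_ns();
    printf("round %d: rand_read_2 sum = %lx, time = %.2fs\n", round, sum, (t1 - t0)/1E9);
}

Comparing the later rounds gives each method the same warm cache and fully committed memory.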
From there on you can start to measure and consider the performance of your actual algorithm, which is not what you are currently doing. I would suggest that you post a separate question about that.
Related
Currently, from research and various attempts, I'm pretty sure that the only solution to this problem is to use assembly. I'm posting this question to show an existing problem, and maybe get attention from compiler developers, or get some hits from searches about similar problems.
If anything changes in the future, I will accept it as an answer.
There is a closely related question for MSVC.
In x86_64 machines, it is faster to use div/idiv with a 32-bit operand than a 64-bit operand. When the dividend is 64-bit and the divisor is 32-bit, and when you know that the quotient will fit in 32 bits, you don't have to use the 64-bit div/idiv. You can split the 64-bit dividend into two 32-bit registers, and even with this overhead, performing a 32-bit div on two 32-bit registers will be faster than doing a 64-bit div with a full 64-bit register.
The compiler will produce a 64-bit div for this function, and that is correct, because with a 32-bit div, if the quotient does not fit in 32 bits, a hardware exception occurs.
uint32_t div_c(uint64_t a, uint32_t b) {
return a / b;
}
However, if the quotient is known to fit in 32 bits, doing a full 64-bit division is unnecessary. I used __builtin_unreachable to tell the compiler this, but it doesn't make a difference.
uint32_t div_c_ur(uint64_t a, uint32_t b) {
uint64_t q = a / b;
if (q >= 1ull << 32) __builtin_unreachable();
return q;
}
For both div_c and div_c_ur, the output from gcc is,
mov rax, rdi
mov esi, esi
xor edx, edx
div rsi
ret
clang does an interesting optimization of checking the dividend size, but it still uses a 64-bit div when the dividend is 64-bit.
mov rax, rdi
mov ecx, esi
mov rdx, rdi
shr rdx, 32
je .LBB0_1
xor edx, edx
div rcx
ret
.LBB0_1:
xor edx, edx
div ecx
ret
I had to write straight assembly to achieve what I wanted. I couldn't find any other way to do this.
__attribute__((naked, sysv_abi))
uint32_t div_asm(uint64_t, uint32_t) {__asm__(
"mov eax, edi\n\t"
"mov rdx, rdi\n\t"
"shr rdx, 32\n\t"
"div esi\n\t"
"ret\n\t"
);}
Was it worth it? At least perf reports 49.47% overhead from div_c while 24.88% overhead from div_asm, so on my computer (Tiger Lake), div r32 is about 2 times faster than div r64.
This is the benchmark code.
#include <stdint.h>
#include <stdio.h>

__attribute__((noinline))
uint32_t div_c(uint64_t a, uint32_t b) {
    uint64_t q = a / b;
    if (q >= 1ull << 32) __builtin_unreachable();
    return q;
}

__attribute__((noinline, naked, sysv_abi))
uint32_t div_asm(uint64_t, uint32_t) {__asm__(
    "mov eax, edi\n\t"
    "mov rdx, rdi\n\t"
    "shr rdx, 32\n\t"
    "div esi\n\t"
    "ret\n\t"
);}

static uint64_t rdtscp() {
    uint32_t _;
    return __builtin_ia32_rdtscp(&_);
}

int main() {
#define n 500000000ll
    uint64_t c;

    c = rdtscp();
    for (int i = 1; i <= n; ++i) {
        volatile uint32_t _ = div_c(i + n * n, i + n);
    }
    printf("  c%15lu\n", rdtscp() - c);

    c = rdtscp();
    for (int i = 1; i <= n; ++i) {
        volatile uint32_t _ = div_asm(i + n * n, i + n);
    }
    printf("asm%15lu\n", rdtscp() - c);
}
Every idea in this answer is based on comments by Nate Eldredge, through which I discovered some of the power of gcc's extended inline assembly. Even though I still have to write assembly, it is possible to create a custom as-if-intrinsic function.
static inline uint32_t divqd(uint64_t a, uint32_t b) {
    if (__builtin_constant_p(b)) {
        return a / b;
    }
    uint32_t lo = a;
    uint32_t hi = a >> 32;
    __asm__("div %2" : "+a" (lo), "+d" (hi) : "rm" (b));
    return lo;
}
__builtin_constant_p returns 1 if b can be evaluated at compile time. The +a and +d constraints mean the values are read from and written to the a and d registers (eax and edx). rm specifies that the input b can be either a register or a memory operand.
To check that inlining and constant propagation are done smoothly:
uint32_t divqd_r(uint64_t a, uint32_t b) {
return divqd(a, b);
}
divqd_r:
mov rdx, rdi
mov rax, rdi
shr rdx, 32
div esi
ret
uint32_t divqd_m(uint64_t a) {
extern uint32_t b;
return divqd(a, b);
}
divqd_m:
mov rdx, rdi
mov rax, rdi
shr rdx, 32
div DWORD PTR b[rip]
ret
uint32_t divqd_c(uint64_t a) {
return divqd(a, 12345);
}
divqd_c:
movabs rdx, 6120523590596543007
mov rax, rdi
mul rdx
shr rdx, 12
mov eax, edx
ret
and the results are satisfying (https://godbolt.org/z/47PE4ovMM).
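As an aside (my own extension, not something from those comments): the same constraint pattern also exposes the remainder, since div r/m32 leaves it in edx and the +d output already captures it. A sketch, with divqd_rem being my name:

static inline uint32_t divqd_rem(uint64_t a, uint32_t b, uint32_t *rem) {
    if (__builtin_constant_p(b)) {
        *rem = a % b;
        return a / b;
    }
    uint32_t lo = a;
    uint32_t hi = a >> 32;
    __asm__("div %2" : "+a" (lo), "+d" (hi) : "rm" (b));
    *rem = hi;   // div leaves the remainder in edx
    return lo;   // and the quotient in eax
}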
Why is my SIMD vector4 length function 3x slower than a naive vector length method?
SIMD vector4 length function:
__extern_always_inline float vec4_len(const float *v) {
    __m128 vec1 = _mm_load_ps(v);
    __m128 xmm1 = _mm_mul_ps(vec1, vec1);
    __m128 xmm2 = _mm_hadd_ps(xmm1, xmm1);
    __m128 xmm3 = _mm_hadd_ps(xmm2, xmm2);
    return sqrtf(_mm_cvtss_f32(xmm3));
}
Naive implementation:
sqrtf(V[0] * V[0] + V[1] * V[1] + V[2] * V[2] + V[3] * V[3])
The SIMD version took 16110 ms for 1000000000 iterations. The naive version was ~3 times faster, taking only 4746 ms.
#include <math.h>
#include <time.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>
static float vec4_len(const float *v) {
    __m128 vec1 = _mm_load_ps(v);
    __m128 xmm1 = _mm_mul_ps(vec1, vec1);
    __m128 xmm2 = _mm_hadd_ps(xmm1, xmm1);
    __m128 xmm3 = _mm_hadd_ps(xmm2, xmm2);
    return sqrtf(_mm_cvtss_f32(xmm3));
}

int main() {
    float A[4] __attribute__((aligned(16))) = {3, 4, 0, 0};
    struct timespec t0 = {};
    clock_gettime(CLOCK_MONOTONIC, &t0);

    double sum_len = 0;
    for (uint64_t k = 0; k < 1000000000; ++k) {
        A[3] = k;
        sum_len += vec4_len(A);
        // sum_len += sqrtf(A[0] * A[0] + A[1] * A[1] + A[2] * A[2] + A[3] * A[3]);
    }

    struct timespec t1 = {};
    clock_gettime(CLOCK_MONOTONIC, &t1);

    fprintf(stdout, "%f\n", sum_len);
    fprintf(stdout, "%ldms\n", (((t1.tv_sec - t0.tv_sec) * 1000000000) + (t1.tv_nsec - t0.tv_nsec)) / 1000000);
    return 0;
}
I compile with GCC (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0 and run with the following command on an Intel(R) Core(TM) i7-8550U CPU, first with the vec4_len version and then with the plain C one:
gcc -Wall -Wextra -O3 -msse -msse3 sse.c -lm && ./a.out
SSE version output:
499999999500000128.000000
13458ms
Plain C version output:
499999999500000128.000000
4441ms
The most obvious problem is using an inefficient dot-product (with haddps which costs 2x shuffle uops + 1x add uop) instead of shuffle + add. See Fastest way to do horizontal float vector sum on x86 for what to do after _mm_mul_ps that doesn't suck as much. But still this is just not something x86 can do very efficiently.
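For reference, a shuffle-based horizontal sum along the lines of that linked answer might look like this (a sketch; _mm_movehdup_ps needs SSE3, which the question already enables, and hsum_ps / vec4_len_shuffle are my names):

#include <math.h>
#include <pmmintrin.h>   // SSE3 for _mm_movehdup_ps

// Horizontal sum of the 4 floats in v using shuffles instead of haddps.
static inline float hsum_ps(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);        // [v1, v1, v3, v3]
    __m128 sums = _mm_add_ps(v, shuf);       // [v0+v1, ., v2+v3, .]
    shuf = _mm_movehl_ps(shuf, sums);        // bring the v2+v3 lane down to lane 0
    sums = _mm_add_ss(sums, shuf);           // lane 0 = v0+v1+v2+v3
    return _mm_cvtss_f32(sums);
}

static inline float vec4_len_shuffle(const float *v) {
    __m128 x = _mm_load_ps(v);
    return sqrtf(hsum_ps(_mm_mul_ps(x, x)));
}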
But anyway, the real problem is your benchmark loop.
A[3] = k; and then using _mm_load_ps(A) creates a store-forwarding stall, if it compiles naively instead of to a vector shuffle. A store + reload can be efficiently forwarded with ~5 cycles of latency if the load only loads data from a single store instruction, and no data outside that. Otherwise it has to do a slower scan of the whole store buffer to assemble bytes. This adds about 10 cycles of latency to the store-forwarding.
I'm not sure how much impact this has on throughput, but could be enough to stop out-of-order exec from overlapping enough loop iterations to hide the latency and only bottleneck on sqrtss shuffle throughput.
(Your Coffee Lake CPU has 1 per 3 cycle sqrtss throughput, so surprisingly SQRT throughput is not your bottleneck; see footnote 1. Instead it will be shuffle throughput or something else.)
See Agner Fog's microarch guide and/or optimization manual.
What does "store-buffer forwarding" mean in the Intel developer's manual?
How does store to load forwarding happens in case of unaligned memory access?
Can modern x86 implementations store-forward from more than one prior store?
Why would a compiler generate this assembly? quotes Intel's optimization manual re: store forwarding. (In that question, an old gcc version stored the 2 dword halves of an 8-byte struct separately, then copied the struct with a qword load/store. Super braindead.)
Plus you're biasing this even more against SSE by letting the compiler hoist the computation of V[0] * V[0] + V[1] * V[1] + V[2] * V[2] out of the loop.
That part of the expression is loop-invariant, so the compiler only has to do (float)k squared, add, and a scalar sqrt every loop iteration. (And convert that to double to add to your accumulator).
(#StaceyGirl's deleted answer pointed this out; looking over the code of the inner loops in it was a great start on writing this answer.)
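Concretely, the scalar version's loop is allowed to be reduced to something like this (my sketch of the effect, not the compiler's literal output, using the A[] array from the posted code):

float base = A[0]*A[0] + A[1]*A[1] + A[2]*A[2];    // loop-invariant, computed once
for (uint64_t k = 0; k < 1000000000; ++k) {
    float kf = (float)k;                           // the only per-iteration input
    sum_len += (double)sqrtf(base + kf * kf);      // square, add, sqrt, widen, accumulate
}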
Extra inefficiency in A[3] = k in the vector version
GCC9.1's inner loop from Kamil's Godbolt link looks terrible, and seems to include a loop-carried store/reload to merge a new A[3] into the 8-byte A[2..3] pair, further limiting the CPU's ability to overlap multiple iterations.
I'm not sure why gcc thought this was a good idea. It would maybe help on CPUs that split vector loads into 8-byte halves (like Pentium M or Bobcat) to avoid store-forwarding stalls. But that's not a sane tuning for "generic" modern x86-64 CPUs.
.L18:
pxor xmm4, xmm4
mov rdx, QWORD PTR [rsp+8] ; reload A[2..3]
cvtsi2ss xmm4, rbx
mov edx, edx ; truncate RDX to 32-bit
movd eax, xmm4 ; float bit-pattern of (float)k
sal rax, 32
or rdx, rax ; merge the float bit-pattern into A[3]
mov QWORD PTR [rsp+8], rdx ; store A[2..3] again
movaps xmm0, XMMWORD PTR [rsp] ; vector load: store-forwarding stall
mulps xmm0, xmm0
haddps xmm0, xmm0
haddps xmm0, xmm0
ucomiss xmm3, xmm0
movaps xmm1, xmm0
sqrtss xmm1, xmm1
ja .L21 ; call sqrtf to set errno if needed; flags set by ucomiss.
.L17:
add rbx, 1
cvtss2sd xmm1, xmm1
addsd xmm2, xmm1 ; total += (double)sqrtf
cmp rbx, 1000000000
jne .L18 ; }while(k<1000000000);
This insanity isn't present in the scalar version.
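One way to avoid both the store-forwarding stall and this store/reload merging is to splice (float)k into the vector in registers instead of storing it to A[3]. A sketch (it assumes SSE4.1 for insertps, which the question's -msse3 build does not enable, and vec4_len_reg is my name):

#include <math.h>
#include <stdint.h>
#include <smmintrin.h>   // SSE4.1 for _mm_insert_ps

// base holds A[0..2] in lanes 0-2, loaded once outside the loop; lane 3 is
// replaced with (float)k each iteration without touching memory.
static inline float vec4_len_reg(__m128 base, uint64_t k) {
    __m128 kf = _mm_set_ss((float)k);            // (float)k in lane 0
    __m128 v  = _mm_insert_ps(base, kf, 0x30);   // copy kf lane 0 into lane 3 of base
    __m128 sq = _mm_mul_ps(v, v);
    __m128 s1 = _mm_hadd_ps(sq, sq);             // or the shuffle-based hsum above
    __m128 s2 = _mm_hadd_ps(s1, s1);
    return sqrtf(_mm_cvtss_f32(s2));
}

Calling it as vec4_len_reg(_mm_load_ps(A), k) hoists the load of the constant lanes out of the loop entirely.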
Either way, gcc did manage to avoid the inefficiency of a full uint64_t -> float conversion (which x86 doesn't have in hardware until AVX512). It was presumably able to prove that using a signed 64-bit -> float conversion would always work because the high bit can't be set.
Footnote 1: But sqrtps has the same 1 per 3 cycle throughput as scalar, so you're only getting 1/4 of your CPU's sqrt throughput capability by doing 1 vector at a time horizontally, instead of doing 4 lengths for 4 vectors in parallel.
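To actually use that throughput you would process four vectors per iteration from structure-of-arrays data, roughly like this (a sketch; the SoA layout and the vec4_len_x4 name are my assumptions, not something from the question):

#include <xmmintrin.h>   // SSE: _mm_sqrt_ps does 4 square roots at once

// xs/ys/zs/ws each hold one component of 4 consecutive vectors (SoA layout).
// Returns the 4 lengths packed in one __m128.
static inline __m128 vec4_len_x4(const float *xs, const float *ys,
                                 const float *zs, const float *ws) {
    __m128 x = _mm_load_ps(xs), y = _mm_load_ps(ys);
    __m128 z = _mm_load_ps(zs), w = _mm_load_ps(ws);
    __m128 sum = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)),
                            _mm_add_ps(_mm_mul_ps(z, z), _mm_mul_ps(w, w)));
    return _mm_sqrt_ps(sum);
}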
A developer can use the __builtin_expect builtin to help the compiler understand in which direction a branch is likely to go.
In the future, we may get a standard attribute for this purpose, but as of today at least all of clang, icc and gcc support the non-standard __builtin_expect instead.
However, icc seems to generate oddly terrible code when you use it (see footnote 1). That is, code that uses the builtin is strictly worse than the code without it, regardless of which direction the prediction is made.
Take for example the following toy function:
int foo(int a, int b)
{
    do {
        a *= 77;
    } while (b-- > 0);
    return a * 77;
}
Out of the three compilers, icc is the only one that compiles this to the optimal scalar loop of 3 instructions:
foo(int, int):
..B1.2: # Preds ..B1.2 ..B1.1
imul edi, edi, 77 #4.6
dec esi #5.12
jns ..B1.2 # Prob 82% #5.18
imul eax, edi, 77 #6.14
ret
Both gcc and clang manage to miss the easy solution and use 5 instructions.
On the other hand, when you use likely or unlikely macros on the loop condition, icc goes totally braindead:
#define likely(x) __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)
int foo(int a, int b)
{
    do {
        a *= 77;
    } while (likely(b-- > 0));
    return a * 77;
}
This loop is functionally equivalent to the previous loop (since __builtin_expect just returns its first argument), yet icc produces some awful code:
foo(int, int):
mov eax, 1 #9.12
..B1.2: # Preds ..B1.2 ..B1.1
xor edx, edx #9.12
test esi, esi #9.12
cmovg edx, eax #9.12
dec esi #9.12
imul edi, edi, 77 #8.6
test edx, edx #9.12
jne ..B1.2 # Prob 95% #9.12
imul eax, edi, 77 #11.15
ret #11.15
The function has doubled in size to 10 instructions, and (worse yet!) the critical loop has more than doubled to 7 instructions with a long critical dependency chain involving a cmov and other weird stuff.
The same is true if you use the unlikely hint and also across all icc versions (13, 14, 17) that godbolt supports. So the code generation is strictly worse, regardless of the hint, and regardless of the actual runtime behavior.
Neither gcc nor clang suffer any degradation when hints are used.
What's up with that?
Footnote 1: At least in the first and subsequent examples I tried.
To me it seems to be an ICC bug. This code (available on godbolt)
int c;
do
{
    a *= 77;
    c = b--;
}
while (likely(c > 0));
which simply uses an auxiliary local variable c, produces output without the edx = !!(esi > 0) pattern:
foo(int, int):
..B1.2:
mov eax, esi
dec esi
imul edi, edi, 77
test eax, eax
jg ..B1.2
still not optimal (it could do without eax), though.
I don't know if the official ICC policy about __builtin_expect is full support or just compatibility support.
This question seems better suited for the Official ICC forum.
I've tried posting this topic there, but I'm not sure I've done a good job (I've been spoiled by SO).
If they answer me I'll update this answer.
EDIT
I got an answer on the Intel forum: they have recorded this issue in their tracking system.
As of today, it seems to be a bug.
Don't let the instructions deceive you. What matters is performance.
Consider this rather crude test:
#include "stdafx.h"
#include <windows.h>
#include <iostream>
int foo(int a, int b) {
    do { a *= 7; } while (b-- > 0);
    return a * 7;
}

int fooA(int a, int b) {
    __asm {
        mov esi, b
        mov edi, a
        mov eax, a
    B1:
        imul edi, edi, 7
        dec esi
        jns B1
        imul eax, edi, 7
    }
}

int fooB(int a, int b) {
    __asm {
        mov esi, b
        mov edi, a
        mov eax, 1
    B1:
        xor edx, edx
        test esi, esi
        cmovg edx, eax
        dec esi
        imul edi, edi, 7
        test edx, edx
        jne B1
        imul eax, edi, 7
    }
}
int main() {
    DWORD start = GetTickCount();
    int j = 0;
    for (int aa = -10; aa < 10; aa++) {
        for (int bb = -500; bb < 15000; bb++) {
            j += foo(aa, bb);
        }
    }
    std::cout << "foo compiled (/Od)\n" << "j = " << j << "\n"
              << GetTickCount() - start << "ms\n\n";

    start = GetTickCount();
    j = 0;
    for (int aa = -10; aa < 10; aa++) {
        for (int bb = -500; bb < 15000; bb++) {
            j += fooA(aa, bb);
        }
    }
    std::cout << "optimal scalar\n" << "j = " << j << "\n"
              << GetTickCount() - start << "ms\n\n";

    start = GetTickCount();
    j = 0;
    for (int aa = -10; aa < 10; aa++) {
        for (int bb = -500; bb < 15000; bb++) {
            j += fooB(aa, bb);
        }
    }
    std::cout << "use likely \n" << "j = " << j << "\n"
              << GetTickCount() - start << "ms\n\n";

    std::cin.get();
    return 0;
}
produces output:
foo compiled (/Od)
j = -961623752
4422ms
optimal scalar
j = -961623752
1656ms
use likely
j = -961623752
1641ms
This is naturally entirely CPU dependent (tested here on Haswell i7), but both asm loops generally are very nearly identical in performance when tested over a range of inputs. A lot of this has to do with the selection and ordering of instructions being conducive to leveraging instruction pipelining (latency), branch prediction, and other hardware optimizations in the CPU.
The real lesson when you're optimizing is that you need to profile - it's extremely difficult to do this by inspection of the raw assembly.
Even given a challenging test where likely(b-- > 0) is false more than a third of the time:
for (int aa = -10000000; aa < 10000000; aa++) {
    for (int bb = -3; bb < 9; bb++) {
        j += fooX(aa, bb);
    }
}
results in:
foo compiled (/Od) : 1844ms
optimal scalar : 906ms
use likely : 1187ms
Which isn't bad. What you have to keep in mind is that the compiler will generally do its best without your interference. Using __builtin_expect and the like should really be restricted to cases where you have existing code that you have profiled and that you have specifically identified as being both hotspots and as having pipeline or prediction issues. This trivial example is an ideal case where the compiler will almost certainly do the right thing without help from you.
By including __builtin_expect you're asking the compiler to compile in a different way - more complex in terms of raw instruction count, but more intelligent in that it structures the assembly to help the CPU make better branch predictions. In a case of pure register play like this example there's not much at stake, but if it improves prediction in a more complex loop, maybe saving you a bad misprediction, cache misses, and related collateral damage, then it's probably worth using.
I think it's pretty clear here, at least, that when the branch actually is likely then we very nearly recover the full performance of the optimal loop (which I think is impressive). In cases where the "optimal loop" is rather more complex and less trivial we can expect that the codegen would indeed improve branch prediction rates (which is what this is really all about). I think this is really a case of if you don't need it, don't use it.
On the topic of likely vs unlikely generating the same assembly, this doesn't imply that the compiler is broken - it just means that the same codegen is effective regardless of whether the branch is mostly taken or mostly not taken - as long as it is mostly something, it's good (in this case). The codegen is designed to optimise use of the instruction pipeline and to assist branch prediction, which it does. While we saw some reduction in performance with the mixed case above, pushing the loop to mostly unlikely recovers performance.
for (int aa = -10000000; aa < 10000000; aa++) {
    for (int bb = -30; bb < 1; bb++) {
        j += fooX(aa, bb);
    }
}
foo compiled (/Od) : 2453ms
optimal scalar : 1968ms
use likely : 2094ms
I'd like to start converting a little NASM project {synth.asm, synth_core.nh} to C, to learn a bit more about that little soft-synthesizer.
The problem is that my asm knowledge is very rusty, and I'm wondering where to start. I thought a decompiler could help me out, but I haven't found anything open-source able to convert these simple NASM listings to C.
Another alternative would be doing the asm-to-C conversion manually, but I'm struggling to understand even one of the simplest functions :(
ie:
;distortion_machine
;---------------------------
;float a
;float b
;---------------------------
;ebp: distort definition
;edi: stackptr
;ecx: length
section distcode code align=1
distortion_machine:
pusha
add ecx, ecx
.sampleloop:
fld dword [edi]
fld dword [ebp+0]
fpatan
fmul dword [ebp+4]
fstp dword [edi]
scasd
loop .sampleloop
popa
add esi, byte 8
ret
broken attempt:
void distortion_machine(???) {    // pusha; saving all registers
    int ecx = ecx + ecx;          // add ecx, ecx; this doesn't make sense
    while (???) {                 // .sampleloop; what's the condition?
        float a = [edi];          // fld dword [edi]; docs says edi is stackptr, what's the meaning?
        float b = [ebp+0];        // fld dword [ebp+0]; docs says ebp is distort definition, is that an input parameter?
        float c = atan(a,b);      // fpatan;
        float d = c*[ebp+4];      // fmul dword [ebp+4];
                                  // scasd; what's doing this instruction?
    }
    return ???;
    // popa; restoring all registers
    // add esi, byte 8;
}
I guess the above NASM listing is a very simple loop distorting an audio buffer, but I don't understand which values are the inputs and which are the outputs; I don't even understand the loop condition :')
Any help with the above routine and how to progress with this little educational project would be really appreciated.
There's a bit of guesswork here:
;distortion_machine
;---------------------------
;float a << input is 2 arrays of floats, a and b, successive on stack
;float b
;---------------------------
;ebp: distort definition << 2 floats that control distortion
;edi: stackptr << what it says
;ecx: length << of each input array (a and b)
section distcode code align=1
distortion_machine:
pusha ; << save all registers
add ecx, ecx ; << 2 arrays, so double for element count of both
.sampleloop:
fld dword [edi] ; << Load next float from stack
fld dword [ebp+0] ; << Load first float of distortion control
fpatan ; << Distort with partial atan.
fmul dword [ebp+4] ; << Scale by multiplying with second distortion float
fstp dword [edi] ; << Store back to same location
scasd ; << Funky way to increment stack pointer
loop .sampleloop ; << decrement ecx and jump if not zero
popa ; << restore registers
add esi, byte 8 ; << See call site. si purpose here isn't stated
ret
It's a real guess, but esi may be a separate argument stack pointer, and the addresses of a and b have been pushed there. This code ignores them by making assumptions about the data stack layout, but it still needs to remove those pointers from the arg stack.
Approximate C:
struct distortion_control {
    float level;
    float scale;
};

// Input: float vectors a and b stored consecutively in buf.
void distort(struct distortion_control *c, float *buf, unsigned buf_size) {
    buf_size *= 2;
    do {  // Note both this and the assembly misbehave if buf_size==0
        *buf = atan2f(*buf, c->level) * c->scale;
        ++buf;
    } while (--buf_size);
}
In a C re-implementation, you'd probably want to be more explicit and fix the zero-size buffer bug. It wouldn't cost much:
void distort(struct distortion_control *c, float *a, float *b, unsigned size) {
    for (unsigned n = size; n; --n, ++a) *a = atan2f(*a, c->level) * c->scale;
    for (unsigned n = size; n; --n, ++b) *b = atan2f(*b, c->level) * c->scale;
}
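A hypothetical call site for that re-implementation, just to make the expected data layout explicit (the names and sizes here are made up):

struct distortion_control ctl = { .level = 0.8f, .scale = 1.25f };
float a[256], b[256];
/* ... fill a and b with samples ... */
distort(&ctl, a, b, 256);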
Consider the following test programs:
Loop value on the stack
int main( void ) {
    int iterations = 1000000000;
    while ( iterations > 0 )
        -- iterations;
}
Loop value on the stack (dereferenced)
int main( void ) {
    int iterations = 1000000000;
    int * p = & iterations;
    while ( * p > 0 )
        -- * p;
}
Loop value on the heap
#include <stdlib.h>
int main( void ) {
    int * p = malloc( sizeof( int ) );
    * p = 1000000000;
    while ( *p > 0 )
        -- * p;
}
Compiling them with -O0, I get the following execution times:
case1.c
real 0m2.698s
user 0m2.690s
sys 0m0.003s
case2.c
real 0m2.574s
user 0m2.567s
sys 0m0.000s
case3.c
real 0m2.566s
user 0m2.560s
sys 0m0.000s
[edit] The following is the average over 10 executions:
case1.c
2.70364
case2.c
2.57091
case3.c
2.57000
Why is the execution time longer for the first test case, which seems to be the simplest?
My current setup is an x86 virtual machine (Arch Linux). I get these results both with gcc (4.8.0) and clang (3.3).
[edit 1] The generated assembly is almost identical, except that the second and third cases have more instructions than the first.
[edit 2] These timings are reproducible (on my system); each execution has the same order of magnitude.
[edit 3] I don't really care about the performance of a non-optimized program, but I don't understand why it would be slower, and I'm curious.
It's hard to say whether this is the reason, since I'm doing some guessing and you haven't given many specifics (like which target you're using). But what I see when I compile without optimizations for an x86 target is the following sequences for decrementing the iterations variable:
Case 1:
L3:
sub DWORD PTR [esp+12], 1
L2:
cmp DWORD PTR [esp+12], 0
jg L3
Case 2:
L3:
mov eax, DWORD PTR [esp+12]
mov eax, DWORD PTR [eax]
lea edx, [eax-1]
mov eax, DWORD PTR [esp+12]
mov DWORD PTR [eax], edx
L2:
mov eax, DWORD PTR [esp+12]
mov eax, DWORD PTR [eax]
test eax, eax
jg L3
One big difference that you see in case 1 is that the instruction at L3 reads and writes the memory location. It is followed immediately by an instruction that reads the same memory location that was just written. This sort of instruction sequence (the same memory location written and then immediately used by the next instruction) often causes some sort of pipeline stall in modern CPUs.
You'll note that the write followed immediately by a read of the same location is not present in case 2.
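For scale: roughly 2.7 s for 10^9 iterations is about 2.7 ns per iteration, i.e. a handful of cycles at typical clock speeds, which is in the right range for a store-forwarding round trip plus a compare on the loop's critical path. The difference between the cases (~0.13 s) works out to only a fraction of a cycle per iteration on average, so whatever the cause, it is a small partial stall rather than a full pipeline flush on every iteration.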
Again - this answer is a bit of informed speculation.