Question
I am testing a simple code which calculates Mandelbrot fractal. I have been checking its performance depending on the number of iterations in the function that checks if a point belongs to the Mandelbrot set or not.
The surprising thing is that I am getting a big difference in times after adding the -fPIC flag. From what I read the overhead is usually negligible and the highest overhead I came across was about 6%. I measured around 30% overhead. Any advice will be appreciated!
Details of my project
I use the -O3 flag, gcc 4.7.2, Ubuntu 12.04.2, x86_64.
The results look as follow
#iter C (fPIC) C C/C(fPIC)
1 0.01 0.01 1.00
100 0.04 0.03 0.75
200 0.06 0.04 0.67
500 0.15 0.1 0.67
1000 0.28 0.19 0.68
2000 0.56 0.37 0.66
4000 1.11 0.72 0.65
8000 2.21 1.47 0.67
16000 4.42 2.88 0.65
32000 8.8 5.77 0.66
64000 17.6 11.53 0.66
Commands I use:
gcc -O3 -fPIC fractalMain.c fractal.c -o ffpic
gcc -O3 fractalMain.c fractal.c -o f
Code: fractalMain.c
#include <time.h>
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"
int main()
{
int iterNumber[] = {1, 100, 200, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000};
int it;
for(it = 0; it < 11; ++it)
{
clock_t start = clock();
fractal(iterNumber[it]);
clock_t end = clock();
double millis = (end - start)*1000 / CLOCKS_PER_SEC/(double)1000;
printf("Iter: %d, time: %lf \n", iterNumber[it], millis);
}
return 0;
}
Code: fractal.h
#ifndef FRACTAL_H
#define FRACTAL_H
void fractal(int iter);
#endif
Code: fractal.c
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"
void multiplyComplex(double a_re, double a_im, double b_re, double b_im, double* res_re, double* res_im)
{
*res_re = a_re*b_re - a_im*b_im;
*res_im = a_re*b_im + a_im*b_re;
}
void sqComplex(double a_re, double a_im, double* res_re, double* res_im)
{
multiplyComplex(a_re, a_im, a_re, a_im, res_re, res_im);
}
bool isInSet(double P_re, double P_im, double C_re, double C_im, int iter)
{
double zPrev_re = P_re;
double zPrev_im = P_im;
double zNext_re = 0;
double zNext_im = 0;
double* p_zNext_re = &zNext_re;
double* p_zNext_im = &zNext_im;
int i;
for(i = 1; i <= iter; ++i)
{
sqComplex(zPrev_re, zPrev_im, p_zNext_re, p_zNext_im);
zNext_re = zNext_re + C_re;
zNext_im = zNext_im + C_im;
if(zNext_re*zNext_re+zNext_im*zNext_im > 4)
{
return false;
}
zPrev_re = zNext_re;
zPrev_im = zNext_im;
}
return true;
}
bool isMandelbrot(double P_re, double P_im, int iter)
{
return isInSet(0, 0, P_re, P_im, iter);
}
void fractal(int iter)
{
int noIterations = iter;
double xMin = -1.8;
double xMax = 1.6;
double yMin = -1.3;
double yMax = 0.8;
int xDim = 512;
int yDim = 384;
double P_re, P_im;
int nop;
int x, y;
for(x = 0; x < xDim; ++x)
for(y = 0; y < yDim; ++y)
{
P_re = (double)x*(xMax-xMin)/(double)xDim+xMin;
P_im = (double)y*(yMax-yMin)/(double)yDim+yMin;
if(isMandelbrot(P_re, P_im, noIterations))
nop = x+y;
}
printf("%d", nop);
}
Story behind the comparison
It might look a bit artificial to add the -fPIC flag when building executable (as per one of the comments). So a few words of explanation: first I only compiled the program as executable and wanted to compare to my Lua code, which calls the isMandelbrot function from C. So I created a shared object to call it from lua - and had big time differences. But couldn't understand why they were growing with number of iterations. In the end found out that it was because of the -fPIC. When I create a little c program which calls my lua script (so effectively I do the same thing, only don't need the .so) - the times are very similar to C (without -fPIC). So I have checked it in a few configurations over the last few days and it consistently shows two sets of very similar results: faster without -fPIC and slower with it.
It turns out that when you compile without the -fPIC option multiplyComplex, sqComplex, isInSet and isMandelbrot are inlined automatically by the compiler. If you define those functions as static you will likely get the same performance when compiling with -fPIC because the compiler will be free to perform inlining.
The reason why the compiler is unable to automatically inline the helper functions has to do with symbol interposition. Position independent code is required to access all global data indirectly, i.e. through the global offset table. The very same constraint applies to function calls, which have to go through the procedure linkage table. Since a symbol might get interposed by another one at runtime (see LD_PRELOAD), the compiler cannot simply assume that it is safe to inline a function with global visibility.
The very same assumption can be made if you compile without -fPIC, i.e. the compiler can safely assume that a global symbol defined in the executable cannot be interposed because the lookup scope begins with the executable itself which is then followed by all other libraries, including the preloaded ones.
For a more thorough understanding have a look at the following paper.
As other people already pointed out -fPIC forces GCC to disable many optimizations e.g. inlining and cloning. I'd like to point out several ways to overcome this:
replace -fPIC with -fPIE if you are compiling main executable (not libraries) as this allows compiler to assume that interposition is not possible;
use -fvisibility=hidden and __attribute__((visibility("default"))) to export only necessary functions from the library and hide the rest; this would allow GCC to optimize hidden functions more aggressively;
use private symbol aliases (__attribute__((alias ("__f")));) to refer to library functions from within the library; this would again untie GCC's hands
previous suggestion can be automated with -fno-semantic-interposition flag that was added in recent GCC versions
It's interesting to note that Clang is different from GCC as it allows all optimizations by default regardless of -fPIC (can be overridden with -fsemantic-interposition to obtain GCC-like behavior).
As others have discussed in the comment section of your opening post, compiling with -flto should help to reduce the difference in run-times you are seeing for this particular case, since the link time optimisations of gcc will likely figure out that it's actually ok to inline a couple of functions ;)
In general, link time optimisations could lead to massive reductions in code size (~6%) link to paper on link time optimisations in gold, and thus run time as well (more of your program fits in the cache). Also note that -fPIC is mostly viewed as a feature that enables tighter security and is always enabled in android. This question on SO briefly discusses as well. Also, just to let you know, -fpic is the faster version of -fPIC, so if you must use -fPIC try -fpic instead - link to gcc docs. For x86 it might not make a difference, but you need to check this for yourself/ask on gcc-help.
Related
I wanted to create a formal comparison between C and Julia performance. For this purpose I wanted to compare different sorting algorithms, starting with the bubble. In Julia I wrote it like:
using BenchmarkTools
function bubble_sort(v::AbstractArray{T}) where T<:Real
for _ in 1:length(v)-1
for i in 1:length(v)-1
if v[i] > v[i+1]
v[i], v[i+1] = v[i+1], v[i]
end
end
end
return v
end
v = rand(Int32, 100_000)
#timed bubble_sort(_v)
In the case of C code (I don't know to program in C so I apologize for the code):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
static void swap(int *xp, int *yp){
int temp = *xp;
*xp = *yp;
*yp = temp;
}
void bubble_sort(int arr[], int n){
int i, j;
for (j = 0; j < n - 1; j++){
for (i = 0; i < n - 1; i++){
if (arr[i] > arr[i+1]){
swap(&arr[i], &arr[i+1]);
}
}
}
}
int main(){
int arr_sz = 100000;
int arr[arr_sz], i;
for (i = 0; i < arr_sz; i++){
arr[i] = rand();
}
double cpu_time_used;
clock_t begin = clock();
bubble_sort(arr, arr_sz);
clock_t end = clock();
cpu_time_used = ((double) (end - begin)) / CLOCKS_PER_SEC;
printf("time %f\n", cpu_time_used);
return 0;
}
The performance difference is (in my computer):
Julia
C
20s
~50s
I suppose that I have a big mistake in the C code, but I am not able to find it out, or is just Julia faster in loops?
Update: performance optimization
Changed the type to int32 in Julia so it is the same as C
swap method as static (+1s improvement on average)
compiling optimization (detailed bellow)
Instead of gcc main.c, I've used different optimization flags, as also the clang compiler. Results:
Time (s)
Julia
19.13
gcc -O main.c
47.58
gcc -O1 main.c
15.98
gcc -O2 main.c
19.52
gcc -O3 main.c
19.20
gcc -Os main.c
17.72
clang -O0 main.c
51.59
clang -O1 main.c
16.78
clang -O2 main.c
13.53
clang -O3 main.c
13.57
clang -Ofast main.c
12.39
clang -Os main.c
18.85
clang -Oz main.c
15.64
clang -Og main.c
16.37
It seems like this question may have been reopened after you discovered that your initial measurements were taken on code compiled for debugging, rather than fully optimised, with different sets of data, different compiler platforms and different integer representations.
I suppose that I have a big mistake in the C code, but I am not able to find it out, or is just Julia faster in loops?
I can answer this somewhat cringy (in my opinion) double-barreled question with a quote from the C standard: "The semantic descriptions in this International Standard describe the behavior of an abstract machine in which issues of optimization are irrelevant." In short, there's no speed in C; that's an attribute that occurs in implementations of C. We can't reproduce your speed without having your implementation (including your hardware, for example).
It's very possible that Julia has similar clauses in her spec. The gist is: some nifty optimisations may determine that your sorted arrays don't have any necessary side-effects and so those may theoretically be optimised away. I'd expect both programs to output somewhere near 0.0 in that case. This is your perfectly optimal compiler; one that spots code that has no actual impact upon the logic of the program, and optimises away the dead code.
We haven't always had loop-invariant code motion, and so it stands to reason that there may be a fifth element here: your compilers version. You'll probably get different statistics if the underlying llvm is different, for example:
LLVM 11 tends to take 2x longer to compile code with optimizations, and as a result produces code that runs 10-20% faster (with occasional outliers in either direction), compared to LLVM 2.7 which is more than 10 years old.
-- source
Perhaps one day you'll update this question with output that reads 0.0s for both programs. Then this question has truly lost its point.
It's hard to tell what further is being asked here, #cbk. The comments section managed to reduce the runtime for the C program significantly with those four improvements. The question kinda doesn't even make sense here anymore, because it largely cancels itself out by answering itself at the end.
Perhaps this is just one of those cases where a newcomer ought to have answered their own question with an answer (you can do that), rather than rotting the question with edits that answer it... Nonetheless, it's a question that now shows up unanswered in the list of questions. I'd vote to close for the reason "The question should include more details", but I suspect then you might include an example of compilation halted after assembly generation, when OP seems to have glossed over the solution, the details we need are more along the lines of "What didn't you understand about these comments that answer your original question?" and yet the question has varied so much in apparent meaning... Are we gonna have a close/reopen war?
I have the following simplified ReLU simulation code that I am trying to optimize. The code uses a ternary operation which is perhaps coming in the way of automatic vectorization by the compiler. How can I vectorize this code?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mkl.h>
void relu(double* &Amem, double* Z_curr, int bo)
{
for (int i=0; i<bo; ++i) {
Amem[i] = Z_curr[i] > 0 ? Z_curr[i] : Z_curr[i]*0.1;
}
}
int main()
{
int i, j;
int batch_size = 16384;
int output_dim = 21;
// double* Amem = new double[batch_size*output_dim];
// double* Z_curr = new double[batch_size*output_dim];
double* Amem = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
double* Z_curr = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
memset(Amem, 0, sizeof(double)*batch_size*output_dim);
for (i=0; i<batch_size*output_dim; ++i) {
Z_curr[i] = -1+2*((double)rand())/RAND_MAX;
}
relu(Amem, Z_curr, batch_size*output_dim);
}
To compile it, if you have MKL then use the following, otherwise plain g++ -O3.
g++ -O3 ex.cxx -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5
So far, I have tried adding -march=skylake-avx512 as a compiler option, but it does not vectorize the loop as I found using option -fopt-info-vec-all for compilation:
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
ex.cxx:6:6: note: vectorized 0 loops in function.
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
and this is the time it takes currently at my end:
time ./a.out
real 0m0.034s
user 0m0.026s
sys 0m0.009s
There is usually no benefit to pass a pointer by reference (unless you want to modify the pointer itself). Furthermore, you can help your compiler using the (non-standard) __restrict keyword, telling it that no aliasing happens between input and output (of course, this will likely give wrong results, if e.g., Amem == Z_curr+1 -- but Amem == Z_curr should (in this case) be fine).
void relu(double* __restrict Amem, double* Z_curr, int bo)
Using that alone, clang actually is capable of vectorizing your loop using vcmpltpd and masked moves (for some reasons, only using 256bit registers).
If you simplify your expression to std::max(Z_curr[i], 0.1*Z_curr[i]) even gcc easily is capable of vectorizing it: https://godbolt.org/z/eTv4PnMWb
Generally, I would suggest compiling crucial routines of your code with different compilers and different compile options (sometimes trying -ffast-math can show you ways to simplify your expressions) and have a look at the generated code. For portability you could then re-translate the generated code into intrinsics (or leave it as is, if every compiler you care about gives good-enough results).
For completeness, here is a possible manually vectorized implementation using intrinsics:
void relu_avx512(double* __restrict Amem, double* Z_curr, int bo)
{
int i;
for (i=0; i<=bo-8; i+=8)
{
__m512d z = _mm512_loadu_pd(Z_curr+i);
__mmask8 positive = _mm512_cmplt_pd_mask (_mm512_setzero_pd(), z);
__m512d res = _mm512_mask_mul_pd(z, positive, z, _mm512_set1_pd(0.9));
_mm512_storeu_pd(Amem+i, res);
}
// remaining elements in scalar loop
for (; i<bo; ++i) {
Amem[i] = 0.0 < Z_curr[i] ? Z_curr[i] : Z_curr[i]*0.1;;
}
}
Godbolt: https://godbolt.org/z/s6br5KEEc (if you compile this with -O2 or -O3 on clang, it will heavily unroll the cleanup loop, even though it can't have more than 7 iterations. Theoretically, you could do the remaining elements with a masked or overlapping store (or maybe you have use-cases where the size is guaranteed to be a multiple of 8 and you can leave it away).
I am trying to understand pure functions, and have been reading through the Wikipedia article on that topic. I wrote the minimal sample program as follows:
#include <stdio.h>
static int a = 1;
static __attribute__((pure)) int pure_function(int x, int y)
{
return x + y;
}
static __attribute__((pure)) int impure_function(int x, int y)
{
a++;
return x + y;
}
int main(void)
{
printf("pure_function(0, 0) = %d\n", pure_function(0, 0));
printf("impure_function(0, 0) = %d\n", impure_function(0, 0));
return 0;
}
I compiled this program with gcc -O2 -Wall -Wextra, expecting that an error, or at least a warning, should have been issued for decorating impure_function() with __attribute__((pure)). However, I received no warnings or errors, and the program also ran without issues.
Isn't marking impure_function() with __attribute__((pure)) incorrect? If so, why does it compile without any errors or warnings, even with the -Wextra and -Wall flags?
Thanks in advance!
Doing this is incorrect and you are responsible for using the attribute correctly.
Look at this example:
static __attribute__((pure)) int impure_function(int x, int y)
{
extern int a;
a++;
return x + y;
}
void caller()
{
impure_function(1, 1);
}
Code generated by GCC (with -O1) for the function caller is:
caller():
ret
As you can see, the impure_function call was completely removed because compiler treats it as "pure".
GCC can mark the function as "pure" internally automatically if it sees its definition:
static __attribute__((noinline)) int pure_function(int x, int y)
{
return x + y;
}
void caller()
{
pure_function(1, 1);
}
Generated code:
caller():
ret
So there is no point in using this attribute on functions that are visible to the compiler. It is supposed to be used when definition is not available, for example when function is defined in another DLL. That means that when it is used in a proper place the compiler won't be able to perform a sanity check anyway. Implementing a warning thus is not very useful (although not meaningless).
I don't think there is anything stopping GCC developers from implementing such warning, except time that must be spend.
A pure function is a hint for the optimizing compiler. Probably, gcc don't care about pure functions when you pass just -O0 to it (the default optimizations). So if f is pure (and defined outside of your translation unit, e.g. in some outside library), the GCC compiler might optimize y = f(x) + f(x); into something like
{
int tmp = f(x); /// tmp is a fresh variable, not appearing elsewhere
y = tmp + tmp;
}
but if f is not pure (which is the usual case: think of f calling printf or malloc), such an optimization is forbidden.
Standard math functions like sin or sqrt are pure (except for IEEE rounding mode craziness, see http://floating-point-gui.de/ and Fluctuat for more), and they are complex enough to compute to make such optimizations worthwhile.
You might compile your code with gcc -O2 -Wall -fdump-tree-all to guess what is happening inside the compiler. You could add the -fverbose-asm -S flags to get a generated *.s assembler file.
You could also read the Bismon draft report (notably its section ยง1.4). It might give some intuitions related to your question.
In your particular case, I am guessing that gcc is inlining your calls; and then purity matters less.
If you have time to spend, you might consider writing your own GCC plugin to make such a warning. You'll spend months in writing it! These old slides might still be useful to you, even if the details are obsolete.
At the theoretical level, be aware of Rice's theorem. A consequence of it is that perfect optimization of pure functions is probably impossible.
Be aware of the GCC Resource Center, located in Bombay.
I'm learning the basics of SIMD so I was given a simple code snippet to see the principle at work with SSE and SSE2.
I recently installed minGW to compile C code in windows with gcc instead of using the visual studio compiler.
The objective of the example is to add two floats and then multiply by a third one.
The headers included are the following (which I guess are used to be able to use the SSE intrinsics):
#include <time.h>
#include <stdio.h>
#include <xmmintrin.h>
#include <pmmintrin.h>
#include <time.h>
#include <sys/time.h> // for timing
Then I have a function to check what time it is, to compare time between calculations:
double now(){
struct timeval t; double f_t;
gettimeofday(&t, NULL);
f_t = t.tv_usec; f_t = f_t/1000000.0; f_t +=t.tv_sec;
return f_t;
}
The function to do the calculation in the "scalar" sense is the following:
void run_scalar(){
unsigned int i;
for( i = 0; i < N; i++ ){
rs[i] = (a[i]+b[i])*c[i];
}
}
Here is the code for the sse2 function:
void run_sse2(){
unsigned int i;
__m128 *mm_a = (__m128 *)a;
__m128 *mm_b = (__m128 *)b;
__m128 *mm_c = (__m128 *)c;
__m128 *mm_r = (__m128 *)rv;
for( i = 0; i <N/4; i++)
mm_r[i] = _mm_mul_ps(_mm_add_ps(mm_a[i],mm_b[i]),mm_c[i]);
}
The vectors are defined the following way (N is the size of the vectors and it is defined elsewhere) and a function init() is called to initialize them:
float a[N] __attribute__((aligned(16)));
float b[N] __attribute__((aligned(16)));
float c[N] __attribute__((aligned(16)));
float rs[N] __attribute__((aligned(16)));
float rv[N] __attribute__((aligned(16)));
void init(){
unsigned int i;
for( i = 0; i < N; i++ ){
a[i] = (float)rand () / RAND_MAX / N;
b[i] = (float)rand () / RAND_MAX / N;
c[i] = (float)rand () / RAND_MAX / N;
}
}
Finally here is the main that calls the functions and prints the results and computing time.
int main(){
double t;
init();
t = now();
run_scalar();
t = now()-t;
printf("S = %10.9f Temps du code scalaire : %f seconde(s)\n",1e5*sum(rs),t);
t = now();
run_sse2();
t = now()-t;
printf("S = %10.9f Temps du code vectoriel 2: %f seconde(s)\n",1e5*sum(rv),t);
}
For sum reason if I compile this code with a command line of "gcc -o vec vectorial.c -msse -msse2 -msse3" or "mingw32-gcc -o vec vectorial.c -msse -msse2 -msse3"" it compiles without any problems, but for some reason I can't run it in my windows machine, in the command prompt I get an "access denied" and a big message appears on the screen saying "This app can't run on your PC, to find a version for your PC, check with the software publisher".
I don't really understand what is going on, neither do I have much experience with MinGW or C (just an introductory course to C++ done on Linux machines). I've tried playing around with different headers because I thought maybe I was targeting a different processor than the one on my PC but couldn't solve the issue. Most of the info I found was confusing.
Can someone help me understand what is going on? Is it a problem in the minGW configuration that is compiling in targeting a Linux platform? Is it something in the code that doesn't have the equivalent in windows?
I'm trying to run it on a 64 bit Windows 8.1 pc
Edit: Tried the configuration suggested in the site linked below. The output remains the same.
If I try to run through MSYS I get a "Bad File number"
If I try to run throught the command prompt I get Access is Denied.
I'm guessing there's some sort of bug arising from permissions. Tried turning off the antivirus and User Account control but still no luck.
Any ideas?
There is nothing wrong with your code, besides, you did not provide the definition of sum() or N which is, however, not a problem. The switches -msse -msse2 appear to be not required.
I was able to compile and run your code on Linux (Ubuntu x86_64, compiled with gcc 4.8.2 and 4.6.3, on Atom D2700 and AMD Athlon LE-1640) and Windows7/64 (compiled with gcc 4.5.3 (32bit) and 4.8.2 (64bit), on Core i3-4330 and Core i7-4960X). It was running without problem.
Are you sure your CPU supports the required instructions? What exactly was the error code you got? Which MinGW configuration did you use? Out of curiosity, I used the one available at http://win-builds.org/download.html which was very straight-forward.
However, using the optimization flag -O3 created the best result -- with the scalar loop! Also useful are -m64 -mtune=native -s.
I have
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <stdio.h>
#include <math.h>
int fib(int n) {
return n < 2 ? n : fib(n-1) + fib(n-2);
}
double clock_now()
{
struct timeval now;
gettimeofday(&now, NULL);
return (double)now.tv_sec + (double)now.tv_usec/1.0e6;
}
#define NITER 5
and in my main(), I'm doing a simple benchmark like this:
printf("hi\n");
double t = clock_now();
int f = 0;
double tmin = INFINITY;
for (int i=0; i<NITER; ++i) {
printf("run %i, %f\n", i, clock_now()-t);
t = clock_now();
f += fib(40);
t = clock_now()-t;
printf("%i %f\n", f, t);
if (t < tmin) tmin = t;
t = clock_now();
}
printf("fib,%.6f\n", tmin*1000);
When I compile with clang -O3 (LLVM 5.0 from Xcode 5.0.1), it always prints out zero time, except at the init of the for-loop, i.e. this:
hi
run 0, 0.866536
102334155 0.000000
run 1, 0.000001
204668310 0.000000
run 2, 0.000000
307002465 0.000000
run 3, 0.000000
409336620 0.000000
run 4, 0.000001
511670775 0.000000
fib,0.000000
It seems that it statically precalculates the fib(40) and stores it somewhere. Right? The strange lag at the beginning (0.8 secs) is probably because it loads that cache?
I'm doing this for benchmarking. The C compiler should optimize fib() itself as much as it can. However, I don't want it to precalculate it already at compile time. So basically I want all code optimized as heavily as possible, but not main() (or at least not this specific optimization). Can I do that somehow?
What optimization is it anyway in this specific case? It's somehow strange and quite nice.
I found a solution by marking certain data volatile. Esp, what I did was:
volatile int f = 0;
//...
volatile int FibArg = 40;
f += fib(FibArg);
That way, it forces the compiler to read FibArg when calling the function and it forces it to not assume that it is constant. Thus it must call the function to calculate it.
The volatile int f was not necessary for my compiler at the moment but it might be in the future when the compiler figures out that fib has no side effects and its result nor f is every used.
Note that this is still not the end. A future compiler could have advanced so far that it guesses that 40 is a likely argument for fib. Maybe it builds a database for likely values. And for the most likely values, it builds up a small cache. And when fib is called, it does a fast runtime-check whether it has that value cached. Of course, the runtime-check adds some overhead but maybe the compiler estimates that this overhead is minor for some particular code in relation to the speed gained by the cached.
I'm not sure if a compiler will ever do such optimization but it could. Profile Guided Optimization (PGO) goes already in that direction.