I am writing a program to calculate how long my CPU takes to do one FLOP (floating-point operation). For that I wrote the code below:
before = clock();
y = 4.8;
x = 2.3;
z = 0;
for (i = 0; i < MAX; ++i) {
    z = x*y + z;
}
printf("%1.20f\n", ((double)(clock() - before) / CLOCKS_PER_SEC) / MAX);
The problem is that I am repeating the same operation. Doesn't the compiler optimize this sort of thing away? If so, what do I have to do to get correct results?
I am not using the rand function, so it should not interfere with my result.
This has a loop-carried dependency and not enough independent work to run in parallel, so if anything is executed at all, it would not be FLOPS that you're measuring: you would most likely be measuring the latency of a floating-point addition. The loop-carried dependency chain serializes all of those additions. That chain has some little side-chains with multiplications in them, but they don't depend on anything, so only their throughput matters, and that throughput is going to be better than the latency of an addition on any reasonable processor.
To actually measure FLOPS, there is no single recipe. The optimal conditions depend strongly on the microarchitecture: the number of independent dependency chains you need, the optimal add/mul ratio, whether you should use FMA, it all depends. Typically you have to do something more complicated than what you wrote, and if you're set on using a high-level language, you have to somehow trick it into actually doing anything at all.
For inspiration, see How do I achieve the theoretical maximum of 4 FLOPs per cycle?
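For a taste of what "more complicated" means, here is a rough C sketch (mine, not from the linked answer) that runs several independent multiply-add chains so that throughput rather than addition latency limits the loop; the choice of 8 chains is an assumption, and the right count depends on your CPU:

#include <stdio.h>
#include <time.h>

#define MAX 100000000L

int main(void) {
    /* 8 independent accumulators: each chain depends only on itself,
     * so the FP units can overlap work across chains. */
    double a0 = 1, a1 = 1, a2 = 1, a3 = 1, a4 = 1, a5 = 1, a6 = 1, a7 = 1;
    const double x = 0.5, y = 1.0;   /* values converge to 2.0, no overflow */

    clock_t before = clock();
    for (long i = 0; i < MAX; ++i) {
        a0 = a0 * x + y;  a1 = a1 * x + y;
        a2 = a2 * x + y;  a3 = a3 * x + y;
        a4 = a4 * x + y;  a5 = a5 * x + y;
        a6 = a6 * x + y;  a7 = a7 * x + y;
    }
    double secs = (double)(clock() - before) / CLOCKS_PER_SEC;

    /* Use the results so the compiler cannot throw the loop away. */
    printf("sum = %f\n", a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7);
    printf("%.2f GFLOP/s (16 FLOPs per iteration)\n", 16.0 * MAX / secs / 1e9);
    return 0;
}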
Even if no compiler optimization is going on (the possibilities have already been nicely listed), your variables and result will be in cache after the first loop iteration, and from then on you get far more speed than you would if the program had to fetch new values for each iteration.
So if you want to calculate the time for a single FLOP for a single iteration of this program, you would actually have to give new input for every iteration. Really consider using rand(), just seeded with a known value such as srand(1).
Your calculation should also be different: FLOPs are the number of floating-point computations your program performs, so in your case that is 2*n (where n = MAX), one multiplication and one addition per iteration. To calculate the amount of time per FLOP, divide the time used by the number of FLOPs.
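A self-contained sketch of that suggestion (my own wording of it), assuming the original loop's 2 FLOPs per iteration; note that the seeded rand() calls also land inside the timed region:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAX 10000000L

int main(void) {
    double z = 0.0;
    srand(1);                                  /* known seed, reproducible */

    clock_t before = clock();
    for (long i = 0; i < MAX; ++i) {
        double x = (double)rand() / RAND_MAX;  /* new inputs every iteration */
        double y = (double)rand() / RAND_MAX;
        z = x * y + z;                         /* 1 mul + 1 add = 2 FLOPs */
    }
    double elapsed = (double)(clock() - before) / CLOCKS_PER_SEC;

    printf("z = %f\n", z);                     /* keep the result live */
    printf("time per FLOP: %.20f s\n", elapsed / (2.0 * MAX));
    return 0;
}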
At around the 39-minute mark of "Writing Fast Code I" by Andrei Alexandrescu (link here to youtube),
there is a slide on how to use differential timing. Can someone show me some basic code with this approach? It was only mentioned for a second, but I think it's an interesting idea.
Run the baseline 2n times (t_2a)
vs. the baseline n times plus the contender n times (t_a+b).
Relative improvement = t_2a / (2*t_a+b - t_2a)
Some overhead noise cancels out.
Alexandrescu's slide is rather trivial to pour into code:
auto start = std::chrono::steady_clock::now();
for (int i = 0; i < 2*n; i++)
    baseline();
auto t2a = std::chrono::steady_clock::now() - start;

start = std::chrono::steady_clock::now();
for (int i = 0; i < n; i++)
    baseline();
// *
for (int i = 0; i < n; i++)
    contender();
auto taplusb = std::chrono::steady_clock::now() - start;

// relative speedup; .count() avoids integer division of durations
double r = double(t2a.count()) / (2.0 * taplusb.count() - t2a.count());
* Synchronization point which prevents optimization across the last two loops.
I would be more interested in the mathematical reasoning behind measuring the relative speed-up this way, as opposed to simply tBaseline / tContender as I've been doing forever. He only vaguely hints at "...overhead noise (being) cancelled (out)", but doesn't explain it in detail.
If you keep watching until 41:40 or so, he mentions it again when warning about the pitfall of first run vs. subsequent (allocators warmed up, etc.)
The best solution for that is doing warm-up runs before the first timed region.
I think he's picturing running the 2n-baseline case vs. the n-baseline + n-contender case in separate invocations of the benchmark program.
So instead of doing some warm-up runs before the timed region, he's using the baseline as a controlled warm-up inside the timed region. This might make it possible to just time the whole program (e.g. with perf stat) instead of calling a time function inside the program, depending on how much process startup overhead your OS has versus how long you make your repeat loop.
Microbenchmarking is hard and there are many pitfalls. Notably benchmarking optimized code while still making sure there isn't optimization between iterations of your repeat loop. (Often it's useful to use inline asm "escape" macros to force the compiler to materialize a value in an integer register, and/or to forget about the value of a variable to defeat CSE. Sometimes it's sufficient to just add the final result of each iteration to a sum that you print at the end.)
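One common shape for those "escape" helpers in GNU C (the names are conventional rather than standard, and the exact form varies):

/* Pretend the object pointed to is read and written in a way the compiler
 * cannot see, so it must actually materialize the value. */
static inline void escape(void *p) {
    __asm__ volatile("" : : "g"(p) : "memory");
}

/* Pretend all memory may have been touched, defeating CSE across the call. */
static inline void clobber(void) {
    __asm__ volatile("" : : : "memory");
}

/* Typical use inside a repeat loop:
 *     double result = function_under_test(input);
 *     escape(&result);   // result must really be computed each iteration
 */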
This is the first I've heard of this differential idea. It doesn't sound more useful than normal warm-ups.
If anything, it will make the contender look slightly worse than using the function under test for some warm-up runs before the timed region. Using the same function as in the timed region would warm up branch prediction for it. Or maybe not, because after inlining, the warm-up and main versions will be at different addresses. The same pattern at different addresses may possibly still help a modern TAGE predictor, but IDK.
Or if contender has any lookup tables, those will become hot in cache from the warmup.
In any case, warmups are essential, unless you make the repeat count long enough to dwarf the time it takes for the CPU to switch to max turbo and so on. And to page-fault in all the memory you touch.
If your calculated time/iteration doesn't stay constant with your repeat count, your microbenchmark is broken.
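A minimal sketch of the warm-up idiom; contender() here is a made-up stand-in for the code under test, and the repeat counts are arbitrary:

#include <stdio.h>
#include <time.h>

#define REPEATS 10000000L

/* Dummy stand-in for the function under test. */
static double contender(double x) {
    return x * 1.000000119 + 1e-9;
}

int main(void) {
    volatile double sink = 1.0;   /* volatile keeps the calls from being folded away */

    /* Untimed warm-up: fault in pages, warm caches and branch predictors,
     * and give the CPU time to reach its max turbo frequency. */
    for (int w = 0; w < 100000; ++w)
        sink = contender(sink);

    clock_t before = clock();
    for (long i = 0; i < REPEATS; ++i)
        sink = contender(sink);
    double secs = (double)(clock() - before) / CLOCKS_PER_SEC;

    printf("%.2f ns per call (sink = %f)\n", 1e9 * secs / REPEATS, sink);
    return 0;
}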
Take the rest of his advice with a grain of salt, too. Most of it is useful (e.g. prefer 32-bit integers even for local temporaries, not just for arrays for cache-footprint reasons), but the reasoning is wrong for some of it.
His explanation that an ALU can do 2x 32-bit adds or 1x 64-bit add only applies to SIMD: 4x 32-bit ints in a vector for paddd, or 2x 64-bit ints in a vector for paddq. But x86 scalar add r32, r32 has the same throughput as add r64, r64. I don't think that was true even on Pentium 4, despite P4 having funky double-pumped ALUs with 0.5-cycle latency for add, at least not before Prescott/Nocona, which introduced 64-bit support.
Using 32-bit unsigned integers on x86-64 can stop the compiler from optimizing to pointer increments if it wants to. It has to maintain correctness in case of 32-bit wraparound of a variable before array indexing.
Using 16-bit or 8-bit locals to match the data in an array can sometimes help auto-vectorization, IIRC. GCC/clang sometimes make really braindead code that unpacks to 32-bit and then re-packs down to 8-bit elements when processing an array of int8_t or uint8_t. I forget if I've ever actually worked around that by using narrow locals, though; C's default integer promotions bring most expressions back up to 32-bit.
Also, at https://youtu.be/vrfYLlR8X8k?t=3498, he claims that FP->int is expensive. That's never been true on x86-64: FP math uses SSE/SSE2 which has an instruction that does truncating conversion. FP->int used to be slow in the bad old days of x87 math, where you had to change the FP rounding mode, fistp, then change it back, to get C truncation semantics. But SSE includes cvttsd2si exactly for that common case.
He also says float is no faster than double. That's true for scalar (other than div/sqrt), but if your code can auto-vectorize then you get twice as much work done per instruction and the instructions have the same throughput. (Twice as many elements fit in a SIMD vector.)
How the math works:
It just cancels out the n * baseline time from both parts, effectively computing (2n * baseline) / (2n * contender) = baseline / contender.
It assumes that the times add normally (no overlapping computation): t_2a = 2n * baseline, and 2 * t_a+b = 2n * baseline + 2n * contender. Subtracting cancels the 2n * baseline parts, leaving you with 2n * contender. For example, with n = 100, baseline = 10 ms and contender = 5 ms per call: t_2a = 2000 ms, t_a+b = 1500 ms, so r = 2000 / (2*1500 - 2000) = 2, which is exactly baseline / contender.
The trick isn't in the math, if anything this is more mathematically dangerous because subtracting two larger numbers accumulates error. i.e. if the n*baseline actually takes different amounts of time in the two runs (because you didn't control that perfectly), then it doesn't cancel and contributes error to your estimate.
We had to implement an ASM program for multiplying sparse matrices in the coordinate scheme format (COOS) as well as in the compressed row format (CSR). Now that we have implemented all these algorithms, we want to know how much more performant they are compared to the usual matrix multiplication. We have already implemented code to measure the running time of all these algorithms, but now we have decided that we also want to know how many floating-point operations per second (FLOPS) we can perform.
Any suggestion of how to measure/count this?
Here some background information on the used system:
processor : 0
model name : ARMv7 Processor rev 2 (v7l)
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x3
CPU part : 0xc08
CPU revision : 2
Our first idea was to implement a kind of FPO counter which we increment after each floating-point operation (arithmetic operations as well as comparison and move operations), but that would mean we would have to insert increment operations all over our code, which would also slow down the application...
Does anyone know if there is some kind of hardware counter which counts the number of floating-point operations, or perhaps some kind of performance tool which can be used to monitor our program and measure the number of FPOs?
Any suggestions or pointers would be appreciated.
Here is the evaluation of the FLOPS for a matrix multiplication using the counting approach. We first measured the running time, then inserted counters for each instruction we were interested in, and after that we calculated the number of floating-point operations per second.
It looks like the closest you can get with the performance events supported by the Cortex-A8 is a count of total instructions executed, which isn't very helpful given that "an instruction" performs anything from 0 to (I think) 8 FP operations. Taking a step back, it becomes apparent that trying to measure FLOPS for the algorithm in hardware wouldn't really work anyway - e.g. you could write an implementation using vector ops but not always put real data in all lanes of each vector, in which case the CPU would need to be psychic to know how many of the FP operations it's executing actually count.
Fortunately, given a formal definition of an algorithm, calculating the number of operations involved should be fairly straightforward (although not necessarily easy, depending on the complexity). For instance, running through it in my head, the standard naïve multiplication of an m x n matrix with an n x m matrix comes out to m * m * (n + n - 1) operations (n multiplications and (n - 1) additions per output element). Once on-paper analysis has come up with an appropriately parameterised op-counting formula, you can plumb that into your benchmarking tool to calculate numbers for the data on test.
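As an illustration, a small sketch (the identifiers and numbers are mine) of plumbing that op-counting formula into a measured time to produce a FLOPS figure:

#include <stdio.h>

/* FLOPS for the naive multiplication of an m x n matrix by an n x m matrix:
 * each of the m*m output elements costs n multiplies and n-1 additions. */
static double naive_matmul_flops(long m, long n, double seconds) {
    double ops = (double)m * m * (2.0 * n - 1.0);
    return ops / seconds;
}

int main(void) {
    /* Example sizes and timing, made up purely for illustration. */
    double flops = naive_matmul_flops(512, 512, 0.25);
    printf("%.1f MFLOPS\n", flops / 1e6);
    return 0;
}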
Once you've done all that, you'll probably then start regretting spending all the time to do it, because what you'll have is (arbitrary number) / (execution time) which is little more meaningful than (execution time) alone, and mostly just complicates comparison between cases where (arbitrary number) differs. NEON performance in particular is dominated by pipeline latency and memory bandwidth, and as such the low-level implementation details could easily outweigh any inherent difference the algorithms might have.
Think of it this way: say on some given 100MHz CPU a + a + b + b takes 5 cycles total, while (a + b) * 2 takes 4 cycles total* - the former scores 60 MFLOPS, the latter only 50 MFLOPS. Are you going to say that more FLOPS means better performance, in which case the routine which takes 25% longer to give the same result is somehow "better"? Are you going to say fewer FLOPS means better performance, which is clearly untrue for any reasonable interpretation? Or are you going to conclude that FLOPS is pretty much meaningless for anything other than synthetic benchmarks to compare the theoretical maximum bandwidth of one CPU with another?
* numbers pulled out of thin air for the sake of argument; however they're actually not far off something like Cortex-M4F - a single-precision FPU where both add and multiply are single-cycle, plus one or two for register hazards.
Theoretical peak FLOPS = number of cores x average frequency x floating-point operations per cycle.
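A toy illustration of that formula; the core count, clock speed and per-cycle throughput below are assumptions, not measurements:

#include <stdio.h>

int main(void) {
    double cores = 4.0;
    double ghz = 3.0;               /* average clock frequency in GHz */
    double flops_per_cycle = 8.0;   /* e.g. one 4-wide SIMD add + one 4-wide SIMD mul */

    printf("theoretical peak: %.0f GFLOPS\n", cores * ghz * flops_per_cycle);
    return 0;
}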
I have been thinking and was wondering what the fastest algorithm is to get through every element of a large (let's say more than 10,000 elements) unsorted int array. My first thought was to go through it linearly and check one element at a time; then my mind wandered to recursion, and I wondered whether cutting the array into parallel pieces each time and checking the elements would be fine.
The goal I'm trying to figure out is whether a number in this kind of array is a multiple of a seemingly "randomly" generated int. After that I will try to find whether a subset of the large array will equate to a multiple of this number as well. (But I will get to that part another day!)
What are all of your thoughts? Questions? Comments? Concerns?
You seem to be under the false impression that the bottleneck for running through an array sequentially is the CPU: it isn't, it is your memory bus. Modern platforms are very good at predicting sequential access and do everything to streamline it; you can't do much more than that. Parallelizing will usually not help, since you only have one memory bus, which is the bottleneck; on the contrary, you risk false sharing, so it could even get worse.
If for some reason you are really doing a lot of computation on each element of your array, the picture changes. Then, you can start to try some parallel stuff.
For an unsorted array, linear search is as good as you can do. Cutting the array each time and then searching the pieces would not help you much; instead it may slow down your program, since the function calls need stack maintenance.
The most efficient way to process every element of a contiguous array in a single thread is sequentially. So the simplest solution is the best. Enabling compiler optimisation is likely to have a significant effect on simple iterative code.
However, if you have multiple cores and very large arrays, greater efficiency may be achieved by separating the task into separate threads. As suggested, using a library specifically aimed at parallel processing is likely to perform better and more deterministically than simply using the OS's threading support.
Another possibility is to offload the task to a GPU, but that is hardware specific and requires GPU library support such as CUDA.
All that said, 10,000 elements does not seem like many; how fast do you need it to go, and how long does it currently take? You need to measure this if performance is of specific interest.
If you want to perform some kind of task on every element of the array, then it's not going to be possible to do any better than visiting each element once; if you did manage to somehow perform the action on N/2 elements of an N-sized array, then the only possibility is that you didn't visit half of the elements. The best case scenario is visiting every element of the array no more than once.
You can approach the problem recursively, but it's not going to be any better than a simple linear method. If you use tail recursion (the recursive call is at the end of the function), then the compiler is probably going to turn it into a loop anyway. If it doesn't turn it into a loop, then you have to deal with the additional cost of pushing onto the call stack, and you have the possibility of stack overflows for very large arrays.
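A small C sketch of that comparison (the divisibility test is just an example): the tail-recursive version is normally turned back into a loop by an optimizing compiler, so it ends up equivalent to the iterative one:

#include <stdio.h>

/* Iterative: count elements divisible by k. */
static long count_iter(const int *a, long n, int k) {
    long count = 0;
    for (long i = 0; i < n; i++)
        count += (a[i] % k == 0);
    return count;
}

/* Tail-recursive version: the recursive call is the last thing done, so
 * GCC/Clang at -O2 typically convert it into the same kind of loop. */
static long count_tail(const int *a, long n, int k, long acc) {
    if (n == 0)
        return acc;
    return count_tail(a + 1, n - 1, k, acc + (a[0] % k == 0));
}

int main(void) {
    int data[] = {6, 7, 12, 13, 18, 25, 30};
    long n = sizeof data / sizeof data[0];
    printf("%ld %ld\n", count_iter(data, n, 3), count_tail(data, n, 3, 0));
    return 0;
}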
The cool modern way to do it is with parallel programming. However, don't be fooled by everyone suggesting libraries; even though the run time looks faster than a linear method, each element is still being visited once. Parallelism (see OpenMP, MPI, or GPU programming) cheats by dividing the work into different execution units, like different cores in your processor or different machines on a network. However, it's very possible that the overhead of adding the parallelism will incur a larger cost than the time you'll save by dividing the work, if the problem set isn't large enough.
I do recommend looking into OpenMP; with it, one line of code can automatically divide up a task to different execution units, without you needing to handle any kind of inter-thread communication or anything nasty.
The following program shows a simple way to implement the idea of parallelization for the case you describe - the timing benchmark shows that it doesn't provide any benefit (since the inner loop "doesn't do enough work" to justify the overhead of parallelization).
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>
#include <stdlib.h>

#define N 1000000

int main(void) {
    int ii, jj, kk;
    int *array;
    double t1, t2;
    int threads;

    // create an array of random numbers:
    array = malloc(N * sizeof *array);
    for (ii = 0; ii < N; ii++) {
        array[ii] = rand();
    }

    for (threads = 1; threads < 5; threads++) {
        jj = 0;
        omp_set_num_threads(threads);
        t1 = omp_get_wtime();
        // perform loop 100 times for better timing accuracy
        for (kk = 0; kk < 100; kk++) {
            #pragma omp parallel for reduction(+:jj)
            for (ii = 0; ii < N; ii++) {
                jj += (array[ii] % 6 == 0) ? 1 : 0;
            }
        }
        t2 = omp_get_wtime();
        printf("jj is now %d\n", jj);
        printf("with %d threads, elapsed time = %.3f ms\n", threads, 1000*(t2-t1));
    }
    return 0;
}
Compile this with
gcc -Wall -fopenmp parallel.c -o parallel
and the output is
jj is now 16613400
with 1 threads, elapsed time = 467.238 ms
jj is now 16613400
with 2 threads, elapsed time = 248.232 ms
jj is now 16613400
with 3 threads, elapsed time = 314.938 ms
jj is now 16613400
with 4 threads, elapsed time = 251.708 ms
This shows that the answer is the same regardless of the number of threads used, but the amount of time taken does change a little. Since I am doing this on a six-year-old dual-core machine, you wouldn't expect a speed-up with more than two threads, and indeed you don't see one; but there is a difference between 1 thread and 2.
My point was really to show how easy it is to implement a parallel loop for the task you envisage - but also to show that it's not really worth it (for me, on my hardware).
Whether it helps for your case depends on the amount of work going on inside your innermost loop, and the number of cores available. If you are limited by memory access speed, this doesn't help; but since the modulo operation is relatively slow, it's possible that you gain a small amount of speed from doing this - and more cores, and more complex calculations, will increase the performance gain.
Final point: the omp syntax is relatively straightforward to understand. The only thing that is strange is the reduction(+:jj) clause. This means "create individual copies of jj; when you are done, add them all together."
This is how we make sure the total count of numbers divisible by 6 is kept track of across the different threads.
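Written out by hand, the reduction is roughly equivalent to this standalone sketch (compile with gcc -fopenmp):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 1000000

int main(void) {
    int *array = malloc(N * sizeof *array);
    for (int ii = 0; ii < N; ii++)
        array[ii] = rand();

    int jj = 0;
    #pragma omp parallel
    {
        int local = 0;                       /* each thread's private copy of jj */

        #pragma omp for
        for (int ii = 0; ii < N; ii++)
            local += (array[ii] % 6 == 0) ? 1 : 0;

        #pragma omp atomic
        jj += local;                         /* combine the copies at the end */
    }

    printf("jj is now %d\n", jj);
    free(array);
    return 0;
}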
I am going to analyze and optimize some C code, and therefore I first have to check whether the functions I want to optimize are memory-bound or CPU-bound. In general I know how to do this, but I have some questions about counting floating-point operations and analyzing the size of the data that is used. Look at the following for loop, which I want to analyze. The values of the array are doubles (that means 8 bytes each):
for (int j = 0; j < N; j++) {
    for (int i = 1; i < Nt; i++) {
        matrix[j*Nt+i] = matrix[j*Nt+i-1] * mu + matrix[j*Nt+i] * sigma;
    }
}
1) How many floating-point operations do you count? I thought about 3*(Nt-1)*N... but do I have to count the operations within the index calculations, too (j*Nt+i, which would be 2 more operations per array access)?
2) How much data is transferred? 2*((Nt-1)*N)*8 bytes or 3*((Nt-1)*N)*8 bytes? I mean, every entry of the matrix has to be loaded, and after the calculation the new value is saved to that index of the array (so that is 1 load and 1 store). But that value is then used for the next calculation. Is another load needed for that, or is this value (matrix[j*Nt+i-1]) already available without a load operation?
Thanks a lot!
With this type of code, the direct sort of analysis you are proposing to do can be almost completely misleading. The only meaningful information about the performance of the code is actually measuring how fast it runs in practice (benchmarking).
This is because modern compilers and processors are very clever about optimizing code like this, and it will end up executing in a way which is nothing like your straightforward analysis. The compiler will optimize the code, rearranging the individual operations. The processor will itself try to execute the individual sub-operations in parallel and/or pipelined, so that for example computation is occurring while data is being fetched from memory.
It's useful to think about algorithmic complexity, to distinguish between O(n) and O(n²) and so on, but constant factors (like the 2x vs. 3x you ask about) are completely moot, because in practice they vary depending on lots of details.
I am trying to learn some basic benchmarking. I have a loop in my Java program like,
float a = 6.5f;
int b = 3;
float var = 0;
for (long j = 0; j < 999999999; j++) {
    var = a * b + (a / b);
} // end of for
My processor takes around 0.431635 second to process this. How would I calculate processor speed in terms of Flops(Floating point Operations Per Second) and Iops(Integer Operations Per Second)? Can you provide explanations with some steps?
You have a single loop with 999999999 iterations: let's call it 1e9 (one billion) for simplicity. The integers get promoted to float in the calculations that involve both, so the loop contains 3 floating-point operations: one multiply, one add, and one divide, giving 3e9 in total. This takes 0.432 s, so you're apparently getting about 6.94 GFLOP/s (3e9 / 0.432). Similarly, you are doing 1 integer op (j++) per loop iteration, so you are getting 1e9 / 0.432, or about 2.32 GIOP/s.
However, the calculation a*b+(a/b) is loop-invariant, so it would be pretty surprising if this didn't get optimized away. I don't know much about Java, but any C compiler will evaluate this at compile-time, remove the a and b variables and the loop, and (effectively) replace the whole lot with var=21.667;. This is a very basic optimization, so I'd be surprised if javac didn't do it too.
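For illustration, here is the C analogue of that loop; the exact behaviour is compiler-dependent, but at -O2 GCC and Clang fold the loop-invariant expression to a constant and the billion iterations effectively disappear:

#include <stdio.h>

int main(void) {
    float a = 6.5f;
    int b = 3;
    float var = 0.0f;

    /* a*b + (a/b) does not depend on j, so an optimizing compiler evaluates
     * it once at compile time and collapses the loop. */
    for (long j = 0; j < 999999999L; j++) {
        var = a * b + (a / b);
    }

    printf("var = %f\n", var);   /* about 21.67 */
    return 0;
}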
I have no idea what's going on under the hood in Java, but I'd be suspicious of getting 7 GFLOP/s. Modern Intel CPUs (I'm assuming that's what you've got) are, in principle, capable of two vector arithmetic ops per clock cycle with the right instruction mix (one add and one multiply per cycle), so for a 3 GHz 4-core CPU it's even possible to get 3e9 * 4 * 8 = 96 single-precision GFLOP/s under ideal conditions. The various mul and add instructions have a reciprocal throughput of 1 cycle, but div takes more than ten times as long, so I'd be very suspicious of getting more than about CLK/12 FLOPs (scalar division on a single core) once division is involved. And if the compiler were smart enough to vectorize and/or parallelize the code to get past that, which it would have to do, it would surely be smart enough to optimize away the whole loop.
In summary, I suspect that the loop is being optimized away completely and the 0.432 seconds you're seeing is just overhead. You have not given any indication how you're timing the above loop, so I can't be sure. You can check this out for yourself by replacing the ~1e9 loop iterations with 1e10. If it doesn't take about 10x as long, you're not timing what you think you're timing.
There's a lot more to say about benchmarking and profiling, but I'll leave it at that.
I know this is very late, but I hope it helps someone.
Emmet.