Array Sum Benchmark on GPU - Odd Results?

I am currently doing some benchmark tests using OpenCL on an AMD Radeon HD 7870.
The code that I have written in JOCL (the Java bindings for OpenCL) simply adds two 2D arrays (z = x + y), but it does so many times (z = x + y + y + y + y + ...).
The two arrays I am adding are 500 by 501, and I loop over the number of iterations I want to add them together on the GPU. So first I add them once, then ten times, then one thousand times, and so on.
The maximum number of iterations that I loop to is 100,000,000. Below is what the log file looks like when I run my code (counter is the number of times my program executes in 5 seconds):
Number of Iterations: 1
Counter: 87
FLOPS Rate: 0.0043310947 GFLOPs/s
Number of Iterations: 10
Counter: 88
FLOPS Rate: 0.043691948 GFLOPs/s
Number of Iterations: 100
Counter: 84
FLOPS Rate: 0.41841218 GFLOPs/s
Number of Iterations: 1000
Counter: 71
FLOPS Rate: 3.5104263 GFLOPs/s
Number of Iterations: 10000
Counter: 8
FLOPS Rate: 3.8689642 GFLOPs/s
Number of Iterations: 100000
Counter: 62
FLOPS Rate: 309.70895 GFLOPs/s
Number of Iterations: 1000000
Counter: 17
FLOPS Rate: 832.0814 GFLOPs/s
Number of Iterations: 10000000
Counter: 2
FLOPS Rate: 974.4635 GFLOPs/s
Number of Iterations: 100000000
Counter: 1
FLOPS Rate: 893.7945 GFLOPs/s
Do these numbers make sense? I feel that 0.97 TeraFLOPS is quite high and that I must be calculating the number of FLOPs incorrectly.
Also, I believe that the calculated FLOPS should level out at some point as the number of iterations increases, but that is not evident here. It seems that if I keep increasing the number of iterations, the calculated FLOPS keeps increasing as well, which also leads me to believe that I am doing something wrong.
Just for reference, I am calculating the FLOPS in the following way:
FLOPS = counter * 500 * 501 * iterations / time_elapsed
Any help with this issue will be greatly appreciated.
Thank you
EDIT:
I have now run the same benchmark over a range of iterations (the number of times I add y to x) as well as array sizes, and generated the surface plot that can be seen in this GitHub repository:
https://github.com/ke0m/Senior_Design/blob/master/JOCL/Graphing/GoodGPUPlot.PNG
I have asked others for their opinion of this plot, and they tell me that while the numbers I am calculating are feasible, they are artificially high. They say this is evident from the steep slope in the plot, which does not really make physical sense. One suggested explanation for the steep slope is that the compiler converts the variable that controls the iterations (of type int) to a short, forcing it to stay below roughly 32,000. That would mean I am doing less work on the GPU than I think I am, and therefore calculating an artificially high GFLOPS value.
Can anyone confirm this idea or offer any other ideas as to why the plot looks the way it does?
Thank you again

counter * 500 * 501 * iterations - if this is calculated with integers, the result is likely to be too large for an integer register. If so, convert to floating point before calculating.
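A minimal sketch of the fix in C (the question's code is JOCL/Java, where the same 32-bit int overflow occurs; the values below are illustrative):

    #include <stdio.h>

    int main(void) {
        int counter = 17;          /* runs completed in the 5-second window */
        int iterations = 1000000;  /* additions per run */
        double time_elapsed = 5.0; /* seconds */

        /* Doing this in 32-bit integer arithmetic, e.g.
           int flops = counter * 500 * 501 * iterations;
           would overflow long before 100,000,000 iterations.
           Promote to double before multiplying instead: */
        double flop_count = (double)counter * 500.0 * 501.0 * (double)iterations;
        double gflops = flop_count / time_elapsed / 1e9;

        printf("GFLOPS: %f\n", gflops);
        return 0;
    }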

I wrote a matrix-matrix multiplication kernel that uses a local-memory optimization. On my HD 7870 at stock settings it does roughly 500 billion additions and 500 billion multiplications per second, which makes 1 teraflop. That is quite close to your numbers, if your card is at stock settings too.
Yes, your calculations make sense: the GPU's peak is about 2.5 TFLOPS, and you are doing the work in local memory / register space, which is what you need to get close to the card's peak.
You are doing only additions, so you do just 1 FLOP per iteration (doing no multiplications leaves one pipeline per core empty, I assume, so you get roughly half the peak).
That is 1 FLOP per a = b + c, so your FLOPS values are right.
But when you don't give the GPU a "resonant" total work-item count, such as a multiple of 512 (a multiple of the maximum local work size), or of 256, or of 1280 (the number of cores), it will not compute at full efficiency, and performance will degrade for small arrays.
Also, if you don't launch enough warps in total, the threads will not be able to hide main-memory latency, just as in the 1, 10, and 100 iteration cases. Hiding memory latency needs multiple warps per compute unit so that all the ALUs and address units (all the pipelines, I mean) are occupied most of the time. Occupancy is very important here because there are so few arithmetic operations per memory operation. Decreasing the workgroup size from 256 to 64 can increase occupancy and hence latency hiding.
Trial and error can get you to an optimum peak performance. Otherwise your kernel is bottlenecked by main-memory bandwidth and thread start/stop latencies.
Here is an example:
HD 7870 SGEMM with a 9x16x16 blocking algorithm: 1150 GFLOPS for a square matrix of size 8208.
Additionally, divisions and special functions can be counted as 50 to 200 FLOPs per item, depending on which version is used (for example a software rsqrt() versus a hardware rsqrt() approximation).
Try array sizes that are multiples of 256, a high iteration count such as 1M, and 64 or 128 local items per compute unit. If you multiply at the same time, you can reach a higher FLOPS throughput: add a multiplication of y by 2 or 3 so the multiplication pipelines are used too. This way you may approach a higher FLOPS figure than before:
x = y + z*2.0f + z*3.0f + z*4.0f + z*5.0f ----> 8 FLOPs
or, to defeat the compiler's auto-optimizations,
x = y + z*randomVal1 + z*randomVal2 + z*randomVal3 + z*randomVal4
instead of
x = y + z + z + z + z ----> 4 FLOPs
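As an illustration, here is a minimal OpenCL C kernel along the lines of that suggestion (the kernel and argument names are hypothetical, not taken from the question's code):

    // Each work-item does 8 FLOPs per loop pass (4 multiplications + 4
    // additions) instead of 4 pure additions, so both the add and multiply
    // pipelines get work. A real benchmark should use coefficients the
    // compiler cannot fold into a single constant (see the randomVal
    // variant above).
    __kernel void add_mul_bench(__global const float *y,
                                __global const float *z,
                                __global float *x,
                                const int iterations)
    {
        int gid = get_global_id(0);
        float acc = y[gid];
        float zv  = z[gid];
        for (int i = 0; i < iterations; ++i)
            acc += zv * 2.0f + zv * 3.0f + zv * 4.0f + zv * 5.0f;  // 8 FLOPs
        x[gid] = acc;
    }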
Edit: I don't know whether the HD 7870 uses separate (an extra batch of) ALUs for double-precision (64-bit FP) operations. If it does, you can mix precisions to get about 10% more FLOPS throughput, because the HD 7870 is capable of 64-bit at 1/8 of its 32-bit speed. You can really push your card this way.

Related

Need a FFT for an near infinite set of data points

I need to perform a Fourier transform on a long stream of data. I made a DFT .c file that works fine; the downside, of course, is the speed. It is slow AF.
I am looking for a way to perform the FFT on a long stream of data.
All the FFT libraries require an array of at most 1024 or 2048 data points, some even 4096.
I get the data from an ADC that runs at around 128,000 Hz, and I need to measure between 1 and 10 seconds of data. That means an array of 128,000 to 1,280,000 samples. In my code I check the frequencies from 0 to 2000 Hz. One sin+cos calculation takes around 400 core ticks; the core runs at 480 MHz, so it costs around 1 µs.
This means 2000 frequencies * 128,000 samples * 1 µs ≈ 256 seconds (about 4 min) of analysis per 1 second of data.
And when 10 seconds are used, it would cost about 40 minutes.
Does anyone know a faster way or a FFT solution that supports a near "infinite" data array?
If your calculation involves floating point, avoid double precision unless you need that level of guaranteed floating-point precision.
If your ADC resolution is not that high (say, less than 16 bits), you can consider using fixed-point arithmetic. This can reduce computation time, especially if your machine does not support hardware floating-point calculation. Please refer to the Q number format.
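For example, a minimal Q15 fixed-point multiply in C (Q15 stores values in [-1, 1) as 16-bit integers scaled by 2^15; the helper name is mine, not from any particular library):

    #include <stdint.h>
    #include <stdio.h>

    typedef int16_t q15_t;

    /* Q15 * Q15 gives a Q30 intermediate; shift back down to Q15. */
    static q15_t q15_mul(q15_t a, q15_t b) {
        int32_t prod = (int32_t)a * (int32_t)b;
        return (q15_t)(prod >> 15);
    }

    int main(void) {
        q15_t half    = 0x4000;  /* 0.5 in Q15  */
        q15_t quarter = 0x2000;  /* 0.25 in Q15 */
        printf("0.5 * 0.25 = %f\n", q15_mul(half, quarter) / 32768.0);
        return 0;
    }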
If you are using an ARM-based controller, you may want to check out this:
http://www.keil.com/pack/doc/CMSIS/DSP/html/index.html

Determine FLOPS of our ASM program

We had to implement an ASM program for multiplying sparse matrices in the coordinate scheme format (COOS) as well as in the compressed row format (CSR). Now that we have implemented these algorithms, we want to know how much more performant they are compared with the usual matrix multiplication. We have already implemented code to measure the running time of all these algorithms, but now we have decided that we also want to know how many floating-point operations per second (FLOPS) we can perform.
Any suggestion of how to measure/count this?
Here some background information on the used system:
processor : 0
model name : ARMv7 Processor rev 2 (v7l)
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x3
CPU part : 0xc08
CPU revision : 2
Our first idea was to implement a kind of FPO counter which we increment after each floating-point operation (arithmetic operations as well as comparison and move operations), but that would mean inserting increment operations all over our code, which would also slow down the application ...
Does anyone know whether there is some kind of hardware counter that counts the number of floating-point operations, or perhaps some kind of performance tool that can monitor our program and measure the number of FPOs?
Any suggestions or pointers would be appreciated.
Here is the evaluation of the FLOPs for a matrix multiplication using the counting approach. We first measured the running time, then inserted counters for each instruction we were interested in, and after that we calculated the number of floating-point operations per second.
It looks like the closest you can get with the performance events supported by Cortex-A8 is a count of total instructions executed, which isn't very helpful given that "an instruction" performs anything from 0 to (I think) 8 FP operations. Taking a step back, it becomes apparent that trying to measure FLOPS for the algorithm in hardware wouldn't really work anyway - e.g. you could write an implementation using vector ops but not always put real data in all lanes of each vector, then the CPU needs to be psychic to know how many of the FP operations it's executing actually count.
Fortunately, given a formal definition of an algorithm, calculating the number of operations involved should be fairly straightforward (although not necessarily easy, depending on the complexity). For instance, running through it in my head, the standard naïve multiplication of an m x n matrix with an n x m matrix comes out to m * m * (n + n - 1) operations (n multiplications and (n - 1) additions per output element). Once on-paper analysis has come up with an appropriately parameterised op-counting formula, you can plumb that into your benchmarking tool to calculate numbers for the data on test.
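For instance, a tiny helper along those lines (hypothetical names; the timing value is whatever your existing measurement code reports):

    #include <stdio.h>

    /* FLOPS for a naive m x n by n x m multiply, using the on-paper
       operation count m * m * (2n - 1) derived above. */
    static double matmul_flops(int m, int n, double seconds) {
        double ops = (double)m * (double)m * (2.0 * (double)n - 1.0);
        return ops / seconds;
    }

    int main(void) {
        /* e.g. a 512 x 512 multiply measured at 0.25 s (made-up numbers) */
        printf("%.2f MFLOPS\n", matmul_flops(512, 512, 0.25) / 1e6);
        return 0;
    }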
Once you've done all that, you'll probably then start regretting spending all the time to do it, because what you'll have is (arbitrary number) / (execution time) which is little more meaningful than (execution time) alone, and mostly just complicates comparison between cases where (arbitrary number) differs. NEON performance in particular is dominated by pipeline latency and memory bandwidth, and as such the low-level implementation details could easily outweigh any inherent difference the algorithms might have.
Think of it this way: say on some given 100MHz CPU a + a + b + b takes 5 cycles total, while (a + b) * 2 takes 4 cycles total* - the former scores 60 MFLOPS, the latter only 50 MFLOPS. Are you going to say that more FLOPS means better performance, in which case the routine which takes 25% longer to give the same result is somehow "better"? Are you going to say fewer FLOPS means better performance, which is clearly untrue for any reasonable interpretation? Or are you going to conclude that FLOPS is pretty much meaningless for anything other than synthetic benchmarks to compare the theoretical maximum bandwidth of one CPU with another?
* numbers pulled out of thin air for the sake of argument; however they're actually not far off something like Cortex-M4F - a single-precision FPU where both add and multiply are single-cycle, plus one or two for register hazards.
Number of cores x average frequency x operations per cycle

FFT Resolution in constrained environments

So I am sampling data from a sensor at 10 kilosamples per second. I am going to collect 512 samples continuously from this sensor and then try to do an FFT on them. But here is the problem: I am constrained to a 16-point FFT. So, as I understand it, I divide my 512-sample frame into blocks of 16 and take the FFT of each block individually. Once I have done that, I just merge the results side by side.
My questions:
If my sampling frequency is 10 kilo samples per second, and my FFT size is 16, then my bin size should be 625 Hz, right?
Second, am I correct in merging the FFT outputs as above?
I will be absolutely grateful for a response.
You could also do 2 layers of radix-16 FFTs and bit shuffles, plus 1 layer of radix-2 FFT butterflies, to produce the same result as an FFT of length 512.
If you collect data in 512-sample chunks but are constrained to 16-point FFT, you will have to perform the FFT 32 times for each chunk and average the results (either for each chunk or for the entire recording - your choice).
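A sketch in C of that chunk-and-average scheme (a naive 16-point DFT stands in for whatever constrained 16-point FFT routine you actually have; the names and the test signal are mine):

    #include <math.h>
    #include <stdio.h>

    #define CHUNK 512
    #define FFT_N 16
    #define BINS  (FFT_N / 2 + 1)          /* DC .. Nyquist */
    #define PI_F  3.14159265358979f

    /* Naive 16-point DFT magnitude, standing in for the real 16-point FFT. */
    static void dft16_magnitude(const float in[FFT_N], float mag[BINS]) {
        for (int k = 0; k < BINS; ++k) {
            float re = 0.0f, im = 0.0f;
            for (int n = 0; n < FFT_N; ++n) {
                float w = 2.0f * PI_F * (float)(k * n) / FFT_N;
                re += in[n] * cosf(w);
                im -= in[n] * sinf(w);
            }
            mag[k] = sqrtf(re * re + im * im);
        }
    }

    /* Average the 16-point magnitudes over all 32 sub-frames of one 512-sample chunk. */
    static void chunk_spectrum(const float samples[CHUNK], float avg_mag[BINS]) {
        float mag[BINS];
        for (int b = 0; b < BINS; ++b) avg_mag[b] = 0.0f;
        for (int off = 0; off + FFT_N <= CHUNK; off += FFT_N) {
            dft16_magnitude(&samples[off], mag);
            for (int b = 0; b < BINS; ++b) avg_mag[b] += mag[b];
        }
        for (int b = 0; b < BINS; ++b) avg_mag[b] /= (float)(CHUNK / FFT_N);
    }

    int main(void) {
        float samples[CHUNK], avg[BINS];
        for (int i = 0; i < CHUNK; ++i)    /* 625 Hz test tone at 10 kSa/s */
            samples[i] = sinf(2.0f * PI_F * 625.0f * (float)i / 10000.0f);
        chunk_spectrum(samples, avg);
        for (int b = 0; b < BINS; ++b)
            printf("bin %2d (%6.1f Hz): %f\n", b, b * 10000.0f / FFT_N, avg[b]);
        return 0;
    }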
The sampling rate determines the upper limit of the frequency values you assign to the FFT results, and it doesn't matter whether you are looking at 512 samples or 16 samples at a time. Your top frequency is going to be 1/2 the sample rate = 5 kHz.
The series of frequency results will be (in Hz) ...
5000
2500
1250
625
312.5
...
and so on, depending on how many samples you pass to the FFT.
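To make that explicit: as I read it, the series above is just the sample rate divided by the FFT length (bin spacing = fs / N), which a quick C check reproduces for fs = 10 kSa/s:

    #include <stdio.h>

    int main(void) {
        double fs = 10000.0;                  /* 10 kSa/s */
        for (int n = 2; n <= 32; n *= 2)      /* FFT lengths 2 .. 32 */
            printf("N = %2d  ->  bin spacing %.1f Hz\n", n, fs / n);
        return 0;
    }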
I'm not going to ask why you're restricted to 16-point FFT!
If you are using a 16-point FFT, then the resolution you get will be low: it will only be able to capture frequencies from 0 to 5 kHz (half the 10 kSa/s rate) with only 8 unique bins.
Regarding your question about the bin size, I don't understand why you need it.
I think that to get better results you could also average the sampled points down to fit your 16-point FFT.

(n - Multiplication) vs (n/2 - multiplication + 2 additions) which is better?

I have a C program that has n multiplications (a single multiplication with n iterations), and I found another approach that uses n/2 iterations of (1 multiplication + 2 additions). I know both are O(n) in complexity, but in terms of CPU cycles, which is faster?
Test on your computer. Or, look at the specs for your processor and guess.
The old logic no longer applies: on modern processors an integer multiplication can be very cheap; on some newer Intel processors it takes 3 clock cycles. Additions are 1 cycle on those same processors. However, in a modern pipelined processor, stalls created by data dependencies might cause additions to take longer.
My guess is that N additions + N/2 multiplications is slower than N multiplications if you are doing a fold type operation, and I would guess the reverse for a map type operation. But this is only a guess.
Test if you want the truth.
However: Most algorithms this simple are memory-bound, and both will be the same speed.
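A rough way to test, sticking to plain C and double arithmetic (illustrative only: a real comparison should use your actual loop bodies and data, and you should check what the compiler generates; without -ffast-math the compiler cannot merge the two additions):

    #include <stdio.h>
    #include <time.h>

    #define N 100000000L

    int main(void) {
        volatile double sink = 0.0;          /* keeps the loops from being optimized away */
        double a = 1.0000001, b = 0.9999999, acc;
        clock_t t0, t1, t2;

        t0 = clock();
        acc = 1.0;
        for (long i = 0; i < N; ++i)
            acc *= a;                        /* N multiplications */
        sink = acc;

        t1 = clock();
        acc = 1.0;
        for (long i = 0; i < N / 2; ++i)
            acc = acc * a + b + b;           /* N/2 x (1 multiplication + 2 additions) */
        sink = acc;

        t2 = clock();
        printf("N muls:           %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("N/2 mul + 2 adds: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        (void)sink;
        return 0;
    }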
First of all, follow Dietrich Epp's first advice: measuring is (at least for complex optimization problems) the only way to be sure.
Now, if you want to figure out why one might be faster than the other, we can try. There are two different important performance measures: latency and reciprocal throughput. A short summary of the two:
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NaNs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
For Sandy Bridge, the reciprocal throughput of an add r, r/i (where r = register, i = immediate, m = memory) is 0.33, while the latency is 1.
An imul r, r has a latency of 3 and a reciprocal throughput of 1.
So as you can see, it completely depends on your specific algorithm: if you can just replace one imul with two independent adds, this particular part of your algorithm could get a theoretical speedup of 50% (and in the best case obviously a speedup of ~350%). But on the other hand, if your adds introduce a problematic dependency, one imul could be just as fast as one add.
Also note that we've ignored all the additional complications like memory and cache behavior (things which will generally have a much, MUCH larger influence on the execution time) or intricate stuff like µop fusion and whatnot. In general the only people that should care about this stuff are compiler writers - it's much simpler to just measure the result of their efforts ;)
Anyways if you want a good listing of this stuff see this here (the above description of latency/rec. throughput is also from that particular document).

FLOPS Intel core and testing it with C (innerproduct)

I have some misconceptions about measuring FLOPs. On Intel architectures, is a FLOP one addition and one multiplication together? I read about this somewhere online and could not find any discussion that rejects it. I know that FLOP has a different meaning on different types of CPU.
How do I calculate my theoretical peak FLOPS? I am using an Intel(R) Core(TM)2 Duo CPU E7400 @ 2.80 GHz. What exactly is the relationship between GHz and FLOPS? (Even Wikipedia's entry on FLOPS does not specify how to do this.)
I will be using the following method to measure the actual performance of my computer (in terms of FLOPs): the inner product of two vectors. For two vectors of size N, is the number of FLOPs 2n(n - 1) (if one addition or one multiplication is counted as 1 FLOP)? If not, how should I go about calculating this?
I know there are better ways to do this, but I would like to know whether my proposed calculations are right. I read somewhere about LINPACK as a benchmark, but I would still like to know how it's done.
As for your 2nd question, the theoretical FLOPS calculation isn't too hard. It can be broken down into roughly:
(Number of cores) * (Number of execution units / core) * (cycles / second) * (Execution unit operations / cycle) * (floats-per-register / Execution unit operation)
A Core 2 Duo has 2 cores and 1 execution unit per core. An SSE register is 128 bits wide, and a float is 32 bits wide, so you can store 4 floats per register. I assume the execution unit does 1 SSE operation per cycle. So it should be:
2 * 1 * 2.8 * 1 * 4 = 22.4 GFLOPS
which matches:
http://www.intel.com/support/processors/sb/cs-023143.htm
This number is obviously purely theoretical best-case performance. Real-world performance will most likely not come close to it, for a variety of reasons. It's probably not worth trying to correlate FLOPS directly with actual application runtime; you'd be better off testing the computations used by your application.
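For reference, the arithmetic above expressed as a tiny program (the per-cycle figures are the assumptions stated in this answer, not measured values):

    #include <stdio.h>

    int main(void) {
        double cores         = 2.0;  /* Core 2 Duo */
        double exec_units    = 1.0;  /* SSE execution units per core (assumed) */
        double ghz           = 2.8;  /* clock in GHz, i.e. 1e9 cycles per second */
        double ops_per_cycle = 1.0;  /* SSE operations per cycle (assumed) */
        double floats_per_op = 4.0;  /* 4 x 32-bit floats in a 128-bit register */

        printf("Theoretical peak: %.1f GFLOPS\n",
               cores * exec_units * ghz * ops_per_cycle * floats_per_op);
        return 0;
    }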
This article shows some theory on FLOPS numbers for x86 CPUs. It's only current up to Pentium 4, but perhaps you can extrapolate.
FLOP stands for Floating Point Operation.
It means the same thing on any architecture that supports floating-point operations, and is usually measured as the number of operations that can take place in one second (as in FLOPS: floating-point operations per second).
Here you can find tools to measure your computer's FLOPS.
Intel's data sheets contain GFLOPS numbers and your processor has a claimed 22.4
http://www.intel.com/support/processors/sb/CS-023143.htm
Since your machine is dual-core, that means 11.2 GFLOPS per core at 2.8 GHz. Divide this out and you get 4, so Intel claims that each of their cores can do 4 FLOPS per cycle.
