Performance comparison of IPP 6, 7 and 8 - benchmarking

I own IPP 6, and I see that IPP 8 is now available. Are there any benchmarks comparing IPP 6, 7 and 8 on the newest CPUs? I am particularly interested in 1D basic ops (mul, add, complex), FFT and IIR filtering.

You can run the experiments yourself. IPP is supplied with a performance measurement utility, usually "ps*.exe" in the ipp\tools\perfsys directory. It's hard to say exactly how this looked at the time of IPP 6.x, but it should be similar. The "ps*.exe" executables measure the performance of a specific IPP function in clocks per element (the lower the better, of course) for different CPU optimizations. The basic options for these performance tests: "-?" prints help, "-e" lists all functions within the test, "-T" enables only a specific CPU optimization, and "-r" saves the output to a csv file.
Suppose you want to measure the ippsIIR64f_32s_Sfs function for the AVX, SSE4.1 and SSE3 code paths. You need to run ps_ipps.exe (the 1D-domain performance test) three times:
ps_ipps.exe -fippsIIR64f_32s_Sfs -B -R -TAVX (you'll get csv file with AVX optimization results)
ps_ipps.exe -fippsIIR64f_32s_Sfs -B -R -TSSE41 (SSE4.1 perf. data will be appended to csv)
ps_ipps.exe -fippsIIR64f_32s_Sfs -B -R -TSSE3 (SSE3 performance data will be appended).
Then grep the csv file for the required function/argument combination, e.g.
find "ippsIIR64f,32s,Sfs,32768,6,numBq_DF1" ps_ipps.csv
For example, I get
ippsIIR64f,32s,Sfs,32768,6,numBq_DF1,-,-,0,nLps=2048,1.30,cpMac,512,-
ippsIIR64f,32s,Sfs,32768,6,numBq_DF1,-,-,0,nLps=8,1.56,cpMac,613,-
ippsIIR64f,32s,Sfs,32768,6,numBq_DF1,-,-,0,nLps=4,5.61,cpMac,2.21e+003,-
That means 5.61 clocks per element for SSE3, 1.56 for SSE4.1 and 1.30 for AVX.
Your CPU must support the highest instruction set you want to measure.
As for IPP 7 and 8, you can download "try-and-buy" versions of the Intel products that bundle them (Composer or Parallel Studio) from the Intel site to do the benchmarks.
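If you would rather time a specific call from your own code than grep the perfsys csv output, a minimal sketch along the following lines also works. Assumptions: ipp.h and the IPP libraries are on your build paths, ippsAdd_32f is just an example 1D op, and the buffer length and repeat count are arbitrary (ippInit() may have a different name in very old IPP versions).

#include <stdio.h>
#include <ipp.h>                            /* umbrella header: ipps + ippcore */

int main(void)
{
    const int len = 32768, reps = 1000;
    Ipp32f *a = ippsMalloc_32f(len);        /* aligned buffers from the IPP allocator */
    Ipp32f *b = ippsMalloc_32f(len);
    Ipp32f *c = ippsMalloc_32f(len);
    Ipp64u t0, t1;
    int i;

    ippInit();                              /* dispatch to the best code path for this CPU */
    ippsSet_32f(1.0f, a, len);
    ippsSet_32f(2.0f, b, len);

    t0 = ippGetCpuClocks();
    for (i = 0; i < reps; i++)
        ippsAdd_32f(a, b, c, len);          /* the 1D op being measured */
    t1 = ippGetCpuClocks();

    printf("clocks per element: %.3f\n",
           (double)(t1 - t0) / ((double)reps * len));

    ippsFree(a); ippsFree(b); ippsFree(c);
    return 0;
}

Dividing the elapsed clocks by reps * len gives the same clocks-per-element figure that the ps_ipps.exe csv reports.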

Related

Explanation of the arm cortex a/r/m numbering convention

I've been looking around some web sources, but I cannot find the meaning of the numbers after the processor type in the ARM family. For example Cortex-A53: I know the A refers to the application family, and the 5 might indicate that it contains an MMU (not sure though), but I have no idea what the 3 means... can you please provide an explanation or sources?
For the Cortex-A processors there are three major sub-groups which are worth knowing about:
Cortex-A3x => smaller cores, mostly designed for embedded systems and low-cost mobile.
Cortex-A5x => "LITTLE" cores in the Arm big.LITTLE / DynamIQ heterogeneous compute architecture (so lower peak performance than the "big" cores, but better energy efficiency).
Cortex-A7x => "big" cores in the Arm big.LITTLE / DynamIQ heterogeneous compute architecture (so higher peak performance than the "LITTLE" cores, but lower energy efficiency).
Within each of those groups, a bigger value of "x" means a newer CPU core, and the newer cores nearly always have both better energy efficiency and higher peak performance than the lower-numbered ones in that group.
The specific digits don't decode to anything like "has an MMU" (unless you go back a long time - some of the early ARM7 and ARM9 CPU names did encode features that way).
Cortex-M and Cortex-R don't really have the same tiers - in general, a bigger number means a bigger and faster core, with more recent ISA extensions adding new capabilities.
The only significant banding that exists is the Cortex-R5x series (which is the Armv8-R architecture including 64-bit support, whereas the single-digit R cores are all 32-bit Armv7 cores).

Petsc code has no parallel speed-up on 2990WX platform

When I run my code on an old Intel Xeon platform (X5650 @ 2.67 GHz), the parallel efficiency seems good: an 80%~95% speed-up each time the processor usage is doubled. However, when I run the same code on an AMD 2990WX platform, I cannot get any acceleration with any number of processes.
I am confused about why the new AMD platform shows such poor parallel efficiency, and I cannot tell which settings in my code are wrong.
I have a C code based on the PETSc library that solves a very large sparse linear system; the parallel part of my code is provided by PETSc, which automatically involves MPI (I just distribute the matrix construction tasks to each process and do not add any other communication routines).
Both platforms run CentOS 7, both use MPICH 3 as the MPI library, and both use PETSc 3.11. BLAS on the Xeon platform is provided by MKL, while BLAS on the AMD platform is provided by the BLIS library.
While the program runs on the AMD platform, I use top to check processor activity, and the CPU usage actually differs between run settings:
for 32 processes:
/usr/lib64/mpich/bin/mpiexec -n 32 ./${BIN_DIR}/main
for 64 processes:
/usr/lib64/mpich/bin/mpiexec -n 64 ./${BIN_DIR}/main
on XEON platform:
/public/software/Petsc/bin/petscmpiexec -n 64 -f mac8 ./${BIN_DIR}/main
with mac8 file:
ic1:8
ic2:8
ic3:8
ic4:8
ic5:8
ic6:8
ic7:8
ic8:8
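For context, the overall structure described above (each MPI rank filling only the matrix rows it owns, then a KSP solve) looks roughly like the minimal sketch below. This is an assumed illustration with a toy tridiagonal fill, not the actual code from the question.

#include <petscksp.h>

int main(int argc, char **argv)
{
    Mat      A;
    Vec      x, b;
    KSP      ksp;
    PetscInt n = 1000000, Istart, Iend, i;

    PetscInitialize(&argc, &argv, NULL, NULL);      /* also initializes MPI */

    /* distributed sparse matrix; PETSc decides the row partitioning */
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    MatSetUp(A);

    /* each rank fills only the rows it owns (toy tridiagonal values) */
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (i = Istart; i < Iend; i++) {
        if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
        if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
        MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    VecCreate(PETSC_COMM_WORLD, &b);
    VecSetSizes(b, PETSC_DECIDE, n);
    VecSetFromOptions(b);
    VecDuplicate(b, &x);
    VecSet(b, 1.0);

    /* the solve is where the time goes */
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetFromOptions(ksp);
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
    PetscFinalize();
    return 0;
}

A sparse solve of this kind is typically limited by memory bandwidth rather than by core count, so when adding processes stops helping, the memory system and how the ranks are placed relative to it are usually the first things to examine.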

Fast Vector Gaussian Normal Random Numbers in C on Intel Core Processors (AVX, AES)?

THIS QUESTION IS ABOUT C++ CODE TARGETED FOR AVX/AVX2 INSTRUCTIONS, as shipped in Intel processors since 2013 (and/or AVX-512 since 2015).
How do I generate one million random Gaussian unit normals fast on Intel processors with new instructions sets?
More generic versions of this question have been asked a few times before, e.g., in Generate random numbers following a normal distribution in C/C++. Yes, I know about Box-Muller, summing uniforms, and other techniques. I am tempted to build my own inverse normal distribution, sample (i.e., map) exactly according to expectations (pseudo-normals, then), and then randomly shuffle the order.
But I also know that I am using an Intel Core processor with the recent AVX vector and AES instruction sets. Besides, I need C (not C++ with its std library), and it needs to work on Linux and OS X with gcc.
So, is there a better processor-specific way to generate so many random numbers fast? For such large quantities of random numbers, does the Intel processor hardware even offer useful instructions? Are they an option worth looking into - and if so, is there an existing standard implementation of "rnorm"?
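For reference, a plain-C baseline (not a vectorized answer) is the Marsaglia polar variant of Box-Muller, sketched below; rand() is only a stand-in for whatever fast uniform generator ends up being used.

#include <math.h>
#include <stdlib.h>

/* Fill buf with n standard normals using the Marsaglia polar method.
   rand() is a placeholder uniform source; swap in a faster PRNG. */
static void fill_gaussian(float *buf, size_t n)
{
    size_t i = 0;
    while (i < n) {
        float u = 2.0f * (float)rand() / (float)RAND_MAX - 1.0f;
        float v = 2.0f * (float)rand() / (float)RAND_MAX - 1.0f;
        float s = u * u + v * v;
        float m;
        if (s >= 1.0f || s == 0.0f)
            continue;                       /* reject points outside the unit disc */
        m = sqrtf(-2.0f * logf(s) / s);
        buf[i++] = u * m;
        if (i < n)
            buf[i++] = v * m;
    }
}

On Intel hardware the usual shortcut for bulk generation is a vectorized library generator (for example the Gaussian RNG routines in MKL's VSL), so a scalar sketch like this mainly serves as the correctness and speed baseline to beat.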

Best gcc optimization switches for hyperthreading

Background
I have an EP (embarrassingly parallel) C application running four threads on my laptop, which contains an Intel i5 M 480 running at 2.67 GHz. This CPU has two hyperthreaded cores.
The four threads execute the same code on different subsets of data. The code and data have no problems fitting in a few cache lines (fit entirely in L1 with room to spare). The code contains no divisions, is essentially CPU-bound, uses all available registers and does a few memory accesses (outside L1) to write results on completion of the sequence.
The compiler is mingw64 4.8.1, i.e. fairly recent. The best basic optimization level appears to be -O1, which results in four threads completing faster than two. -O2 and higher run slower (two threads complete faster than four, but slower than with -O1), as does -Os. Each thread on average does 3.37 million sequences per second, which comes out to about 780 clock cycles each. On average every sequence performs 25.5 sub-operations, or one per 30.6 cycles.
So what two hyperthreads do in parallel in 30.6 cycles, one thread would do sequentially in 35-40 cycles, i.e. 17.5-20 cycles each.
Where I am
I think what I need is generated code which isn't so dense/efficient that the two hyperthreads constantly collide over the local CPU's resources.
These switches work fairly well (when compiling module by module)
-O1 -m64 -mthreads -g -Wall -c -fschedule-insns
as do these when compiling one module which #includes all the others
-O1 -m64 -mthreads -fschedule-insns -march=native -g -Wall -c -fwhole-program
There is no discernible performance difference between the two.
Question
Has anyone experimented with this and achieved good results?
You say "I think what I need is generated code which isn't so dense/efficient that the two hyperthreads constantly collide over the local CPU's resources.". That's rather misguided.
Your CPU has a certain amount of resources. Code will be able to use some of the resources, but usually not all. Hyperthreading means you have two threads capable of using the resources, so a higher percentage of these resources will be used.
What you want is to maximise the percentage of resources that are used. Efficient code will use these resources more efficiently in the first place, and adding hyper threading can only help. You won't get that much of a speedup through hyper threading, but that is because you got the speedup already in single threaded code because it was more efficient. If you want bragging rights that hyper threading gave you a big speedup, sure, start with inefficient code. If you want maximum speed, start with efficient code.
Now if your code was limited by latencies, it means it could perform quite a few useless instructions without penalty. With hyper threading, these useless instructions actually cost. So for hyper threading, you want to minimise the number of instructions, especially those that were hidden by latencies and had no visible cost in single threaded code.
You could try locking each thread to a core using processor affinity. I've heard this can give a 15%-50% efficiency improvement with some code. The saving is that when a context switch happens, less changes in the caches, etc.
This will work better on a machine that is running only your app.
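For illustration, on Linux with glibc the pinning looks roughly like the sketch below (error handling omitted, and the helper name is made up); on Windows the analogous call is SetThreadAffinityMask.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one logical CPU (Linux/glibc). */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

Each worker would call pin_to_cpu() with its own index at startup; note that which logical CPU numbers share a physical core depends on the machine's topology, so check that before choosing the pairs.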
It's possible that hyperthreading is counterproductive.
In fact, it is often counterproductive with computationally intensive loads.
I would give these a try:
disable it at the BIOS level and run two threads
try to optimize and use the SSE/AVX vector extensions, possibly even by hand (see the sketch after this answer)
Explanation: HT is useful because hardware threads are scheduled more efficiently than software threads. However, there is overhead in both. Scheduling 2 threads is more lightweight than scheduling 4, and if your code is already "dense", I'd try to go for even "denser" execution, optimizing the execution on 2 pipelines as much as possible.
It's clear that if you optimize less, it scales better, but it will hardly be faster. So if you are looking for more scalability, this answer is not for you... but if you are looking for more speed, give it a try.
As others have already stated, there is no general solution when optimizing; otherwise, such a solution would already be built into compilers.
You could download an OpenCL or CUDA toolkit and implement a version for your graphics card... you may be able to speed it up 100-fold with little effort.
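To give a concrete flavour of the "by hand" option mentioned above, here is a hedged sketch of vectorizing a simple multiply-add loop with SSE intrinsics; the function name, the 16-byte alignment assumption and the requirement that n be a multiple of 4 are illustrative simplifications, not part of the original question.

#include <xmmintrin.h>                      /* SSE intrinsics */

/* y[i] += a * x[i], four floats at a time.
   Assumes n % 4 == 0 and 16-byte aligned pointers. */
static void saxpy_sse(float *y, const float *x, float a, int n)
{
    __m128 va = _mm_set1_ps(a);
    int i;
    for (i = 0; i < n; i += 4) {
        __m128 vx = _mm_load_ps(&x[i]);
        __m128 vy = _mm_load_ps(&y[i]);
        vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));
        _mm_store_ps(&y[i], vy);
    }
}

Processing four floats per instruction is the kind of "denser" execution on two pipelines this answer argues for; on x86-64, SSE is available without extra compiler flags.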

C - the speed limits of desktop CPUs if the program is built using GCC with all optimization flags?

We are planning to port a big part of our digital signal processing routines from hardware-specific chips to a common desktop CPU architecture such as a quad-core. I am trying to estimate the limits of such an architecture for a program built with GCC. I am mostly interested in a high SDRAM-CPU bandwidth [Gb/sec] and in a high number of 32-bit IEEE-754 floating-point multiply-accumulate operations per second.
I have selected a typical representative of the modern desktop CPUs: quad-core, about 10 MB of cache, 3 GHz, 45 nm. Can you please help me find its limits:
1) The highest possible number of multiply-accumulate operations per second, assuming the CPU-specific instructions that GCC supports via input flags are used and all cores are used. The source code itself must not require changes if we decide to port it to a different CPU architecture such as AltiVec on PowerPC - the best option is to use GCC flags like -msse or -maltivec. I also assume the program has to run 4 threads in order to utilize all available cores, right?
2) SDRAM-CPU bandwidth (the upper limit, i.e. independent of the mainboard).
UPDATE: Since GCC 3, GCC can automatically generate SSE/SSE2 scalar code when the target supports those instructions. Automatic vectorization for SSE/SSE2 was added in GCC 4. SSE4.1 introduces the DPPS and DPPD instructions - dot products for array-of-structs data. The new 45 nm Intel processors support SSE4 instructions.
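As a small illustration of the auto-vectorization mentioned in the update, a loop of the shape below is simple enough for GCC's vectorizer (enabled at -O3, or with -O2 -ftree-vectorize, together with -msse2 or -march=native) to turn into packed SSE code; the function name is made up for the example.

/* y[i] += a[i] * b[i]: a multiply-accumulate loop simple enough for GCC's
   auto-vectorizer (it versions the loop with a runtime aliasing check). */
void mac_arrays(float *y, const float *a, const float *b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] += a[i] * b[i];
}

Depending on the GCC version, -ftree-vectorizer-verbose=2 or -fopt-info-vec will report whether the loop was actually vectorized.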
First off, know that it will most likely not be possible for your code to both run as fast as possible on modern vector FPU units and be completely portable across architectures. It is possible to abstract away some aspects of the architectures via macros, etc, but compilers are (at present) capable of generating nearly optimal auto-vectorized code only for very simple programs.
Now, on to your questions: current x86 hardware does not have a multiply-accumulate, but is capable of one vector add and one vector multiply per cycle per core. Assuming that your code achieves full computational density, and you either hand-write vector code or your code is simple enough for the compiler to handle the task, the peak throughput that can be achieved independent of memory access latency is:
number of cores * cycles per second * flops per cycle * vector width
Which in your case sounds like:
4 * 3.2 GHz * 2 vector flops/cycle * 4 floats/vector = 102.4 Gflops
If you are going to write scalar code, divide that by four. If you are going to write vector code in C with some level of portable abstraction, plan to be leaving some performance on the table, but you can certainly go substantially faster than scalar code will allow. 50% of theoretical peak is a conservative guess (I would expect to do better assuming the algorithms are amenable to vectorization, but make sure you have some headroom in your estimates).
edit: notes on DPPS:
DPPS is not a multiply-add, and using it as one is a performance hazard on current architectures. Looking it up in the Intel Optimization Manual, you will find that it has a latency of 11 cycles, and throughput is only one vector result every two cycles. DPPS does up to four multiplies and three adds, so you're getting 2 multiplies per cycle and 1.5 adds, whereas using MULPS and ADDPS would get you 4 of each every cycle.
More generally, horizontal vector operations should be avoided unless absolutely necessary; lay out your data so that your operations stay within vector lanes to the maximum extent possible.
In fairness to Intel, if you can't change your data layout, and DPPS happens to be exactly the operation that you need, then you want to use it. Just be aware that you're limiting yourself to less than 50% of peak FP throughput right off the bat by doing so.
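To make the layout advice concrete, here is a hedged sketch of the structure-of-arrays approach to computing many 4-element dot products using only vertical MULPS/ADDPS (no DPPS); the struct and function names are invented for the example.

#include <xmmintrin.h>

/* Structure-of-arrays layout: component k of vector i lives in c[k][i]. */
typedef struct { float *c[4]; } Vec4SoA;

/* out[i] = dot(a_i, b_i), computed four vectors at a time with vertical
   multiplies and adds only. Assumes n % 4 == 0. */
static void dot4_soa(float *out, const Vec4SoA *a, const Vec4SoA *b, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        __m128 acc = _mm_mul_ps(_mm_loadu_ps(&a->c[0][i]), _mm_loadu_ps(&b->c[0][i]));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(&a->c[1][i]), _mm_loadu_ps(&b->c[1][i])));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(&a->c[2][i]), _mm_loadu_ps(&b->c[2][i])));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(&a->c[3][i]), _mm_loadu_ps(&b->c[3][i])));
        _mm_storeu_ps(&out[i], acc);        /* four dot products per iteration */
    }
}

Each iteration produces four independent results, one per vector lane, instead of one result every two cycles from a horizontal DPPS, which is the point of keeping operations within lanes.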
This may not directly answer your question, but have you considered using the PC's graphics card for parallel floating-point computations? It's getting to the point where GPUs outperform CPUs for some tasks, and the nice thing is that graphics cards are reasonably competitively priced.
I'm short on details, sorry; this is just to give you an idea.
Some points you should consider:
1) Intel's i7 architecture is at the moment your fastest option for 1 or 2 CPUs. Only with 4 or more sockets can AMD's Opterons compete.
2) Intel's compilers generate code that is often significantly faster than code generated by other compilers (when used on AMD CPUs, you have to patch away some CPU checks Intel puts in to keep AMD from looking good).
3) No x86 CPU supports multiply-and-add yet; AMD's next architecture, "Bulldozer", will probably be the first to support it.
4) You get high memory bandwidth on any AMD CPU, but on Intel only with the new i7 architecture (socket 1366 is better than socket 775).
5) Use Intel's highly efficient libraries if possible.
