PETSc code has no parallel speed-up on 2990WX platform (C)

When I run my code on an old Intel Xeon platform (X5650 @ 2.67 GHz), the parallel efficiency seems good: I get an 80%~95% speed-up each time I double the number of processors used. However, when I run the same code on an AMD 2990WX platform, I cannot get any acceleration with any number of threads.
I am confused as to why the new AMD platform shows such poor parallel efficiency, and I can hardly tell which settings in my code are wrong.
I have a C code based on the PETSc library that solves a very large sparse linear system. The parallel part of my code is provided by PETSc, which handles MPI automatically (I just distribute the matrix construction tasks to each process and do not add any other communication routines).
Both computation platforms run CentOS 7, both use MPICH3 as the MPI library, and both use PETSc 3.11. The BLAS on the Xeon platform is provided by MKL, while the BLAS on the AMD platform is provided by the BLIS library.
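For reference, the matrix construction follows the usual PETSc ownership-range pattern. Below is a minimal sketch of that pattern only (error checking omitted; the matrix size N and the tridiagonal entry values are placeholders, not the actual problem):

#include <petscksp.h>

int main(int argc, char **argv)
{
    Mat      A;
    Vec      b, x;
    KSP      ksp;
    PetscInt N = 1000000, Istart, Iend, i;   /* N is a placeholder size */

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* Parallel AIJ matrix; PETSc decides the per-process row distribution. */
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N);
    MatSetFromOptions(A);
    MatSetUp(A);

    /* Each rank fills only the rows it owns; no extra communication is added. */
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (i = Istart; i < Iend; i++) {
        MatSetValue(A, i, i, 2.0, INSERT_VALUES);   /* placeholder values */
        if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
        if (i < N - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, N, &b);
    VecSet(b, 1.0);
    VecDuplicate(b, &x);

    /* Solver and preconditioner are chosen from the command-line options. */
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetFromOptions(ksp);
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp); VecDestroy(&x); VecDestroy(&b); MatDestroy(&A);
    PetscFinalize();
    return 0;
}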
While the program is running on the AMD platform, I use top to check processor activity, and I found that the CPU usage actually differs between run settings:
for 32 processes:
/usr/lib64/mpich/bin/mpiexec -n 32 ./${BIN_DIR}/main
for 64 processes:
/usr/lib64/mpich/bin/mpiexec -n 64 ./${BIN_DIR}/main
on the Xeon platform:
/public/software/Petsc/bin/petscmpiexec -n 64 -f mac8 ./${BIN_DIR}/main
with mac8 file:
ic1:8
ic2:8
ic3:8
ic4:8
ic5:8
ic6:8
ic7:8
ic8:8

Related

Unicode collation NIF running slower than pure Erlang implementation

I'm trying to optimise an existing Unicode collation library (written in Erlang) by rewriting it as a NIF implementation. The prime reason is that collation is a CPU-intensive operation.
Link to implementation: https://github.com/abhi-bit/merger
Unicode collation of 1M rows via Pure Erlang based priority queue:
erlc *.erl; ERL_LIBS="..:$ERL_LIBS" erl -noshell -s perf_couch_skew main 1000000 -s init stop
Queue size: 1000000
12321.649 ms
Unicode collation of 1M rows via NIF based binomial heap:
erlc *.erl; ERL_LIBS="..:$ERL_LIBS" erl -noshell -s perf_merger main 1000000 -s init stop
Queue size: 1000000
15871.965 ms
This is unusual; I was expecting it to be roughly 10X faster.
I turned on eprof/fprof, but they aren't of much use when it comes to NIF modules. Below is what eprof said about the prominent functions:
FUNCTION CALLS % TIME [uS / CALLS]
-------- ----- --- ---- [----------]
merger:new/0 1 0.00 0 [ 0.00]
merger:new/2 1 0.00 0 [ 0.00]
merger:size/1 100002 0.31 19928 [ 0.20]
merger:in/3 100000 3.29 210620 [ 2.11]
erlang:put/2 2000000 6.63 424292 [ 0.21]
merger:out/1 100000 14.35 918834 [ 9.19]
I'm sure the NIF implementation could be made faster, because I have a pure C implementation of Unicode collation based on a binary heap using a dynamic array, and that is much, much faster.
$ make
gcc -I/usr/local/Cellar/icu4c/55.1/include -L/usr/local/Cellar/icu4c/55.1/lib min_heap.c collate_json.c kway_merge.c kway_merge_test.c -o output -licui18n -licuuc -licudata
./output
Merging 1 arrays each of size 1000000
mergeKArrays took 84.626ms
Specific questions I have here:
How much slowdown is expected because of Erlang <-> C communication in a NIF module? In this case, the slowdown is probably 30x or more between the pure C and the NIF implementation.
What tools could be useful to debug NIF-related slowdown (like in this case)? I tried using perf top to see the function calls; the top ones (some hex addresses were showing) were coming from "beam.smp".
What are the possible areas I should look at when optimising a NIF? For example, I've heard that one should keep the data transferred between Erlang and C (and vice versa) minimal; are there more such areas to consider?
The overhead of calling a NIF is tiny. When the Erlang runtime loads a module that loads a NIF, it patches the module's beam code with an emulator instruction to call into the NIF. The instruction itself performs just a small amount of setup prior to calling the C function implementing the NIF. This is not the area that's causing your performance issues.
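To make that setup concrete, a minimal NIF looks something like the sketch below (the module name my_nif and the add/2 function are hypothetical; the point is just how thin the per-call layer is):

#include <erl_nif.h>

/* The emulator patches the calling module's beam code so that my_nif:add/2
   jumps almost directly into this function; only the argument terms are
   unpacked per call. */
static ERL_NIF_TERM add_nif(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[])
{
    int a, b;
    if (!enif_get_int(env, argv[0], &a) || !enif_get_int(env, argv[1], &b))
        return enif_make_badarg(env);
    return enif_make_int(env, a + b);
}

static ErlNifFunc nif_funcs[] = {
    {"add", 2, add_nif}
};

ERL_NIF_INIT(my_nif, nif_funcs, NULL, NULL, NULL, NULL)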
Profiling a NIF is much the same as profiling any other C/C++ code. Judging from your Makefile it appears you're developing this code on OS X. On that platform, assuming you have XCode installed, you can use the Instruments application with the CPU Samples instrument to see where your code is spending most of its time. On Linux, you can use the callgrind tool of valgrind together with an Erlang emulator built with valgrind support to measure your code.
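For the standalone C benchmark shown above, a plain callgrind run is enough (the numeric suffix of the output file is the process id):

valgrind --tool=callgrind ./output
callgrind_annotate callgrind.out.<pid>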
What you'll find if you use these tools on your code is, for example, that perf_merger:main/1 spends most of its time in merger_nif_heap_get, which in turn spends a noticeable amount of time in CollateJSON. That function seems to call convertUTF8toUChar and createStringFromJSON quite a bit. Your NIF also seems to perform a lot of memory allocation. These are the areas you should focus on to speed up your code.

Fast Vector Gaussian Normal Random Numbers in C on Intel Core Processors (AVX, AES)?

THIS QUESTION IS ABOUT C++ CODE TARGETED FOR AVX/AVX2 INSTRUCTIONS, as shipped in Intel processors since 2013 (and/or AVX-512 since 2015).
How do I generate one million random Gaussian unit normals fast on Intel processors with the new instruction sets?
More generic versions of this question were asked a few times before, e.g., in Generate random numbers following a normal distribution in C/C++. Yes, I know about Box-Muller, adding uniforms, and other techniques. I am tempted to build my own inverse normal distribution, sample (i.e., map) exactly according to expectations (pseudo-normals, then), and then randomly rearrange the sort order.
But I also know that I am using an Intel Core processor with the recent AVX vector and AES instruction sets. Besides, I need C (not C++ with its std library), and it needs to work on Linux and OSX with gcc.
So, is there a better processor-specific way to generate so many random numbers fast? For such large quantities of random numbers, does Intel processor hardware even offer useful instructions? Are they an option worth looking into, and if so, is there an existing standard function implementation of "rnorm"?
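For context, the scalar Box-Muller approach mentioned above looks roughly like this (a sketch only; rand() is just a placeholder uniform source, and nothing here is vectorized):

#include <math.h>
#include <stdlib.h>

/* Scalar Box-Muller: returns one standard normal per call.
   Link with -lm; replace rand() with a real uniform generator. */
static double gauss_box_muller(void)
{
    const double two_pi = 6.283185307179586;
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  /* avoid log(0) */
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(two_pi * u2);
}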

Mattson's OpenMP pi examples: no speed-up in the for example

I am trying to run the pi calculation programs included in the Mattson OMP Exercises accompanying the lectures. Of the three versions, pi_spmd_simple, pi_spmd_final, and pi_loop, the first two scale up as mentioned in the lectures, but the third, which uses the for pragma with reduction, is slower with two or more threads than with one. Has anybody else seen similar (mis)behaviour? Is there any explanation?
My tests were run on an Intel E6500 dual-core 2.93 GHz CPU running Knoppix 7.4.2 with gcc 4.9.1. We have observed similar behaviour on a 4-core AMD Phenom processor. I have also observed similar issues in even simpler programs with for loops.
The tutorial is in OpenMP Tutorial
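For reference, the for/reduction variant in question follows this general pattern (a sketch of the pattern, not the exact file from the exercises):

#include <stdio.h>
#include <omp.h>

/* Numerical integration of 4/(1+x^2) over [0,1]; the sum approximates pi.
   Build with: gcc -O3 -fopenmp pi_loop_sketch.c */
int main(void)
{
    const long num_steps = 100000000;
    const double step = 1.0 / (double)num_steps;
    double sum = 0.0, t0;
    long i;

    t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    printf("pi = %.12f, time = %f s\n", step * sum, omp_get_wtime() - t0);
    return 0;
}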

SIMD SSE2 instructions in assembly

I'm currently rewriting a program that used 64-bit words so that it uses 128-bit words, and I am trying to use SIMD SSE2 intrinsics from Intel. My new program, which uses the SIMD intrinsics, is about 60% slower than the original, when I had expected it to be around twice as fast. When I looked at the assembly code for each of them, they were very similar and about the same length. However, the object code (compiled file) was 60% longer.
I also ran callgrind on the two programs, which told me how many instruction reads there were per line. I found that the SIMD version of my program often had fewer instruction reads for the same action than the original version. Ideally that is what should happen, but it doesn't make sense given that the SIMD version takes longer to run.
My question:
Do the SSE2 intrinsics convert into more assembly instructions? Do the SSE2 instructions take longer to run? Or is there some other reason that my new program is so slow?
Additional notes: I am programming in C, on Linux Mint, and compiling with gcc -O3 -march=native.
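As an illustration of the kind of change involved (a hypothetical sketch, not the actual program), the 64-bit to 128-bit rewrite typically looks like this:

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Scalar version: one 64-bit XOR per iteration. */
void xor_words_u64(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] ^= src[i];
}

/* SSE2 version: one 128-bit XOR per iteration (n assumed even). */
void xor_words_sse2(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i += 2) {
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
    }
}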

C - the limits of speed of desktop CPUs if a program is built using GCC with all optimization flags?

We are planning to port a big part of our digital signal processing routines from hardware-specific chips to a common desktop CPU architecture, such as a quad-core. I am trying to estimate the limits of such an architecture for a program built with GCC. I am mostly interested in a high SDRAM-to-CPU bandwidth [Gb/sec] and in a high number of 32-bit IEEE-754 floating-point multiply-accumulate operations per second.
I have selected a typical representative of modern desktop CPUs: quad-core, about 10 MB of cache, 3 GHz, 45 nm. Can you please help me find out its limits:
1) The highest possible number of multiply-accumulate operations per second, assuming the CPU-specific instructions that GCC supports via input flags are used and all cores are used. The source code itself must not require changes if we decide to port it to a different CPU architecture such as AltiVec on PowerPC; the best option is to use GCC flags like -msse or -maltivec. I also assume a program has to have 4 threads in order to utilize all available cores, right?
2) The SDRAM-to-CPU bandwidth (the upper limit, i.e. independent of the mainboard).
UPDATE: Since GCC 3, GCC can automatically generate SSE/SSE2 scalar code when the target supports those instructions. Automatic vectorization for SSE/SSE2 has been available since GCC 4. SSE4.1 introduces the DPPS and DPPD instructions (dot product for array-of-structs data). New 45 nm Intel processors support SSE4 instructions.
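To illustrate the auto-vectorization point, a loop as simple as the following is typically vectorized by GCC at -O3 (or with -ftree-vectorize) when built for an SSE2 target; saxpy is just an illustrative routine name:

/* y[i] = a*x[i] + y[i]; simple enough for GCC's auto-vectorizer.
   Build with, e.g.: gcc -std=c99 -O3 -msse2 -c saxpy.c */
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}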
First off, know that it will most likely not be possible for your code to both run as fast as possible on modern vector FPU units and be completely portable across architectures. It is possible to abstract away some aspects of the architectures via macros, etc, but compilers are (at present) capable of generating nearly optimal auto-vectorized code only for very simple programs.
Now, on to your questions: current x86 hardware does not have a multiply-accumulate, but is capable of one vector add and one vector multiply per cycle per core. Assuming that your code achieves full computational density, and you either hand-write vector code or your code is simple enough for the compiler to handle the task, the peak throughput that can be achieved independent of memory access latency is:
number of cores * cycles per second * flops per cycle * vector width
Which in your case sounds like:
4 * 3.2 GHz * 2 vector flops/cycle * 4 floats/vector = 102.4 Gflops
If you are going to write scalar code, divide that by four. If you are going to write vector code in C with some level of portable abstraction, plan on leaving some performance on the table, but you can certainly go substantially faster than scalar code will allow. 50% of theoretical peak is a conservative guess (I would expect to do better, assuming the algorithms are amenable to vectorization, but make sure you have some headroom in your estimates).
edit: notes on DPPS:
DPPS is not a multiply-add, and using it as one is a performance hazard on current architectures. Looking it up in the Intel Optimization Manual, you will find that it has a latency of 11 cycles, and throughput is only one vector result every two cycles. DPPS does up to four multiplies and three adds, so you're getting 2 multiplies per cycle and 1.5 adds, whereas using MULPS and ADDPS would get you 4 of each every cycle.
More generally, horizontal vector operations should be avoided unless absolutely necessary; lay out your data so that your operations stay within vector lanes to the maximum extent possible.
In fairness to Intel, if you can't change your data layout, and DPPS happens to be exactly the operation that you need, then you want to use it. Just be aware that you're limiting yourself to less than 50% of peak FP throughput right off the bat by doing so.
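To make the lane-layout point concrete, here is a sketch contrasting the two approaches (SSE intrinsics; _mm_dp_ps needs SSE4.1, so build with -msse4.1; the function names are illustrative only):

#include <xmmintrin.h>  /* SSE: _mm_mul_ps, _mm_add_ps */
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* Horizontal (AoS): one dot product per DPPS, result placed in lane 0.
   Convenient, but throughput-limited as described above. */
float dot4_dpps(const float a[4], const float b[4])
{
    return _mm_cvtss_f32(_mm_dp_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), 0xF1));
}

/* Vertical (SoA): accumulate four independent multiply-adds at once with
   MULPS + ADDPS, staying within vector lanes. */
void madd4_soa(float *acc, const float *x, const float *y)
{
    __m128 va = _mm_loadu_ps(acc);
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(x), _mm_loadu_ps(y));
    _mm_storeu_ps(acc, _mm_add_ps(va, prod));
}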
This may not directly answer your question, but have you considered using the PC's graphics cards for parallel floating-point computations? It's getting to the point where GPUs will outperform CPUs for some tasks; and the nice thing is that graphics cards are reasonably competitively priced.
I'm short on details, sorry; this is just to give you an idea.
Some points you should consider:
1) Intel's i7 architecture is at the moment your fastest option for 1 or 2 CPUs. Only with 4 or more sockets can AMD's Opterons compete.
2) Intel's compilers generate code that is often significantly faster than code generated by other compilers (when used on AMD CPUs, you have to patch away some CPU checks Intel puts in to prevent AMD from looking good).
3) No x86 CPU supports multiply-and-add yet; AMD's next architecture, "Bulldozer", will probably be the first to support it.
4) You get high memory bandwidth on any AMD CPU, but on Intel only with the new i7 architecture (socket 1366 is better than 775).
5) Use Intel's highly efficient libraries if possible.
