MKL Performance on Intel Phi

I have a routine that performs a few MKL calls on small matrices (50-100 x 1000 elements) to fit a model, which I then call for different models. In pseudo-code:
double doModelFit(int model, ...) {
    ...
    while( !done ) {
        cblas_dgemm(...);
        cblas_dgemm(...);
        ...
        dgesv(...);
        ...
    }
    return result;
}
int main(int argc, char **argv) {
    ...
    c_start = 1; c_stop = nmodel;
    for(int c=c_start; c<=c_stop; c++) {
        ...
        result = doModelFit(c, ...);
        ...
    }
}
Call the above version 1. Since the models are independent, I can use OpenMP threads to parallelize the model fitting, as follows (version 2):
int main(int argc, char **argv) {
    ...
    int numthreads = omp_get_max_threads();
    int c;
    #pragma omp parallel for private(c)
    for(int t=0; t<numthreads; t++) {
        // assuming nmodel is divisible by numthreads...
        int c_start = t*nmodel/numthreads + 1;
        int c_stop  = (t+1)*nmodel/numthreads;
        for(c=c_start; c<=c_stop; c++) {
            ...
            result = doModelFit(c, ...);
            ...
        }
    }
}
When I run version 1 on the host machine, it takes ~11 seconds and VTune reports poor parallelization with most of the time spent idle. Version 2 on the host machine takes ~5 seconds and VTune reports great parallelization (near 100% of the time is spent with 8 CPUs in use). Now, when I compile the code to run on the Phi card in native mode (with -mmic), versions 1 and 2 both take approximately 30 seconds when run on the command prompt on mic0. When I use VTune to profile it:
Version 1 takes the same roughly 30 seconds, and the hotspot analysis shows that most time is spent in __kmp_wait_sleep and __kmp_static_yield. Out of 7710s CPU time, 5804s are spent in Spin Time.
Version 2 takes forever; I killed it after it had run for a couple of minutes in VTune. The hotspot analysis shows that of 25254s of CPU time, 21585s are spent in [vmlinux].
Can anyone shed some light on what's going on here and why I'm getting such bad performance? I'm using the default for OMP_NUM_THREADS and set KMP_AFFINITY=compact,granularity=fine (as recommended by Intel). I'm new to MKL and OpenMP, so I'm certain I'm making rookie mistakes.
Thanks,
Andrew

The most probable reason for this behavior, given that most of the time is spent in the OS (vmlinux), is over-subscription caused by the nested OpenMP parallel regions inside the MKL implementations of cblas_dgemm() and dgesv(). E.g. see this example.
This explanation is supported and elaborated by Jim Dempsey on the Intel forum.
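As an illustrative sketch (not the poster's actual code), one common way to avoid that over-subscription is to keep MKL single-threaded inside each OpenMP worker, via mkl_set_num_threads() or the MKL_NUM_THREADS environment variable, so that only the outer model loop is threaded:
// Sketch only: doModelFit() and nmodel are assumed from the question's code.
#include <mkl.h>   // mkl_set_num_threads()
extern double doModelFit(int model, ...);
void fitAllModels(int nmodel, double *results) {
    mkl_set_num_threads(1);              // no nested MKL thread team per dgemm/dgesv call
    #pragma omp parallel for schedule(dynamic)
    for (int c = 1; c <= nmodel; c++) {
        results[c - 1] = doModelFit(c);  // each model fit runs single-threaded MKL
    }
}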

What about using the sequential MKL library? If you link MKL with the sequential option, it doesn't spawn OpenMP threads inside MKL itself. I guess you may get better results than you do now.
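For example, with the Intel compiler the convenience flag -mkl=sequential links the sequential MKL (the file names here are illustrative; with other toolchains you list the sequential MKL libraries explicitly on the link line):
icc -mmic -openmp modelfit.c -mkl=sequential -o modelfit.mic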

Related

C extensions using OpenMP on the Visual Studio compiler called using Cython (or ctypes) gradually slow down to a halt

I've got a C function that performs some I/O operations and decoding that I'd like to call from a Python script.
The C function works fine when compiled by the Visual Studio command-line C compiler, and also works fine when called through Cython with multithreading disabled. But when called with OpenMP multithreading, it works great for the first few million loop iterations, then the CPU usage slowly decreases over the next few million, until it finally grinds to a halt: it doesn't fail, but it doesn't continue computing either.
The C file is as follows:
//block_reader.c
#include "block_reader.h" //contains block_data_t, decode_block, get_block_data
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 8

int decode_blocks(block_data_t *block_data_array, int num_blocks, int *values){
    int block;
    #pragma omp parallel for num_threads(NTHREADS)
    for(block=0; block<num_blocks; block++){
        decode_block(block_data_array[block], values);
    }
    return 0;
}

int main(int argc, char *argv[]) {
    int num_blocks = 250000, block_size = 4096;
    block_data_t *block_data_array = get_block_data();
    int *values = (int *)malloc(num_blocks * block_size * sizeof(int));
    int i;
    for(i=0; i<1000; i++){
        printf("experiment #%d\n", i+1);
        decode_blocks(block_data_array, num_blocks, values);
    }
    return 0;
}
When compiled with cl /W3 -openmp block_reader.c block_helper.c zstd.lib on the Visual Studio x64 command line, the main loop gets all the way to experiment #1000, with CPU usage at 90% the entire time. (My machine has 8 logical threads; I have no idea why it's capped at 90% in pure C, and I get the same behavior when I remove num_threads(NTHREADS) from the OpenMP pragma, but I'm not really worried about that.)
However when I wrap it in Cython and loop it in python:
#block_reader_wrapper.pyx
from libc.stdlib cimport malloc
from libc.stdio cimport printf
cimport openmp
cimport block_reader_defns #contains block_data_t
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False) # Deactivate bounds checking.
@cython.wraparound(False) # Deactivate negative indexing.
cpdef tuple read_blocks(block_data_array):
    cdef np.ndarray[np.int32_t, ndim=1] values = np.zeros(size, dtype=np.int32_t)
    cdef int[::1] values_view = values
    decode_blocks(block_data_array, len(block_data_array), num_blocks, &values_view[0])
    return values

cdef extern from "block_reader.h":
    int decode_blocks(char**, b_metadata*, unsigned int, unsigned long long*, long long*, int*)
#setup.block_reader_wrapper.py
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy

ext_modules = [
    Extension(
        "block_reader_wrapper",
        ["block_reader_wrapper.pyx", "block_reader.c", "block_helper.c"],
        libraries=["zstd"],
        library_dirs=["{dir}/vcpkg/installed/x64-windows/lib"],
        include_dirs=['{dir}/vcpkg/installed/x64-windows/include', numpy.get_include()],
        extra_compile_args=['/openmp', '-O2'], # Have tried -O2, -O3 and no optimization
        extra_link_args=['/openmp'], # always gets "LINK : warning LNK4044: unrecognized option '/openmp'; ignored" despite the docs asking for it: https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html
    )
]

setup(
    ext_modules = cythonize(ext_modules,
        gdb_debug=True,
        annotate=True,
    )
)
#experiment.py
from block_reader_wrapper import read_blocks
from block_data_gen import get_block_data

for i in range(1000):
    print("experiment", i+1)
    read_blocks(get_block_data())
I get to experiment #10 with CPU usage at 100% (running a little faster than the 90%-capped pure C), but then between experiments #11 and #16 the CPU usage slowly decreases in increments of one logical thread's worth of resources, until it bottoms out at one logical thread. At that point, despite Task Manager claiming Python is using ~20% of my CPU, the process stops outputting data. Memory usage is always fairly low (~10%).
I figure this must have something to do with Cython's linking of OpenMP, perhaps implicitly limiting the number of payloads that I can pass to its worker threads.
Any insight would be greatly appreciated. I need this to ultimately work on Windows and Ubuntu, which is why I chose OpenMP in the first place.
Edit 1: As per DavidW's suggestion I replaced:
cdef np.ndarray[np.int32_t, ndim=1] values = np.zeros(size, dtype=np.int32_t)
with:
cdef array.array values, values_temp
values_temp = array.array('q', [])
values = array.clone(values_temp, size, zero=True)
Unfortunately this has not fixed the problem.
Edit 2 & 3: After profiling the process once it has "ground to a halt", I see that a significant portion of CPU time is spent waiting, specifically in the functions free_base and malloc_base from the module ucrtbase.dll.
Edit 4: I rewrote the wrapper with ctypes instead of Cython, which uses the same C -> Python API, so maybe it's no surprise that the same problem exists (although it grinds to a halt about twice as quickly with ctypes as with Cython).
VTune Summary:
Elapsed Time: 285419.416s
CPU Time: 22708.709s
Effective Time: 9230.924s
Spin Time: 13477.785s
Overhead Time: 0s
Total Thread Count: 10
Paused Time: 0s
Top Hotspots
Function                   Module                                 CPU Time
free_base                  ucrtbase.dll                           9061.852s
malloc_base                ucrtbase.dll                           8308.887s
NtWaitForSingleObject      ntdll.dll                              1283.721s
func@0x180020020           USER32.dll                              820.759s
func@0x18001c630           tsc_block_reader.cp38-win_amd64.pyd     753.774s
[Others]                   N/A*                                   2479.716s
Effective CPU Utilization Histogram
Simultaneously Utilized Logical CPUs    Elapsed Time    Utilization threshold
0                                       279744.790s     Idle
1                                         5446.285s     Poor
2                                          177.808s     Poor
3                                           40.303s     Poor
4                                           10.229s     Poor
5                                              0s       Poor
6                                              0s       Poor
7                                              0s       Ok
8                                              0s       Ideal
Even though it's saying about 80% of the CPU time is Idle, a loop that should complete in 30 seconds didn't even complete after 2 days, so it's much more than 80% Idle time.
Looks like the majority of the Idle time is spent in ucrtbase.dll

OpenMP for beginners

I just got started with OpenMP; I wrote a little C program to check whether what I have studied is correct. However, I ran into some trouble; here is the main.c code:
#include "stdio.h"
#include "stdlib.h"
#include "omp.h"
#include "time.h"
int main(){
float msec_kernel;
const int N = 1000000;
int i, a[N];
clock_t start = clock(), diff;
#pragma omp parallel for private(i)
for (i = 1; i <= N; i++){
a[i] = 2 * i;
}
diff = clock() - start;
msec_kernel = diff * 1000 / CLOCKS_PER_SEC;
printf("Kernel Time: %e s\n",msec_kernel*1e-03);
printf("a[N] = %d\n",a[N]);
return 0;
}
My goal is to see how long it takes the PC to do this operation using 1 and 2 CPUs; to compile the program I type the following line in the terminal:
gcc -fopenmp main.c -o main
And then I select the number of CPUs like so:
export OMP_NUM_THREADS=N
where N is either 1 or 2. However, I don't get the execution times I expect; my results are:
Kernel Time: 5.000000e-03 s
a[N] = 2000000
and
Kernel Time: 6.000000e-03 s
a[N] = 2000000
corresponding to N=1 and N=2 respectively. As you can see, when I use 2 CPUs it takes slightly more time than using just one! What am I doing wrong? How can I fix this?
First of all, using multiple cores doesn't automatically mean that you're going to get better performance.
OpenMP has to manage the data distribution among your cores, which takes time as well. Especially for very basic operations such as the single multiplication you are doing, a sequential (single-core) program will perform better.
Second, by going through every element of your array only once and not doing anything else, you make no use of cache memory, and certainly not of the shared cache between CPUs.
So you should start reading about general algorithm performance. Making use of multiple cores through shared cache is, in my opinion, the essence.
Today's computers have reached a stage where the CPU is far faster than a memory allocation, read or write. This means that when using multiple cores, you'll only see a benefit if you exploit things like shared cache, because the data distribution, thread initialization and thread management take time as well. To really see a performance speedup (see the link; an essential term in parallel computing) you should program an algorithm with a heavy accent on computation, not on memory; this has to do with locality (another important term).
So if you want to experience a big performance boost from multiple cores, test it on a matrix-matrix multiplication with big matrices, such as 10,000 x 10,000. Plot some graphs of input size (matrix size) versus time and matrix size versus GFLOPS, and compare the multicore version with the sequential one.
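As a rough sketch of the kind of compute-heavy kernel meant here (a naive OpenMP matrix-matrix multiplication; the matrix size n and the timing harness are up to you):
#include <omp.h>
// Naive O(n^3) multiplication of n-by-n row-major matrices: C = A * B.
// The outer loop over rows is parallelized, so each thread writes disjoint rows of C.
void matmul(const double *A, const double *B, double *C, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}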
Also make yourself comfortable with the complexity analysis (Big O notation).
Matrix-matrix-multiplication has a locality of O(n).
Hope this helps :-)
I suggest setting the number of threads within the code itself, either directly on the pragma line (#pragma omp parallel for num_threads(2)) or with the omp_set_num_threads function (omp_set_num_threads(2);).
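Applied to the loop from the question, that would look roughly like this sketch (the helper function name is made up):
#include <omp.h>
// Fill a[0..n-1] using exactly two OpenMP threads.
void fill_array(int *a, int n) {
    omp_set_num_threads(2);      // request two threads for the next parallel region
    #pragma omp parallel for     // equivalently: #pragma omp parallel for num_threads(2)
    for (int i = 0; i < n; i++)
        a[i] = 2 * i;
}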
Further, when doing time/performance analysis it is really important to run the program multiple times and then take the mean of the runtimes. Running the program only once will not give you a meaningful reading of the time used; always run it multiple times in a row, and don't forget to vary the input data as well.
I suggest writing a test.c file, which takes your actual program function within a loop and then calculates the time per execution of the function:
int executiontimes = 20;
clock_t initial_time = clock();
for(int i = 0; i < executiontimes; i++){
    function_multiplication(values);
}
clock_t final_time = clock();
clock_t passed_time = final_time - initial_time;
clock_t time_per_exec = passed_time / executiontimes;
Improve this test harness: add some rand() values, seed them with srand(), and so on. If you have more questions on the subject or about my answer, leave a comment and I'll try to explain further.
The function clock() returns elapsed CPU time, which includes ticks from all cores. Since there is some overhead to using multiple threads, when you sum the execution time of all threads the total CPU time will always be longer than the serial time.
If you want the real time (wall-clock time), use the OpenMP runtime library function omp_get_wtime(), declared in omp.h. It is portable across platforms and should be the preferred way to do wall timing.
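A minimal sketch of timing a loop like the question's with omp_get_wtime() (illustrative only; compile with gcc -fopenmp):
#include <omp.h>
#include <stdio.h>
#define N 1000000
static int a[N];                  // file scope so the big array is not on the stack
int main(void) {
    double t0 = omp_get_wtime();  // wall-clock start
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2 * i;
    double t1 = omp_get_wtime();  // wall-clock end
    printf("Kernel time: %e s, a[N-1] = %d\n", t1 - t0, a[N - 1]);
    return 0;
}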
You can also use the POSIX functions defined in time.h:
struct timespec start, stop;
clock_gettime(CLOCK_REALTIME, &start);
// action
clock_gettime(CLOCK_REALTIME, &stop);
double elapsed_time = (stop.tv_sec - start.tv_sec) +
                      1e-9 * (stop.tv_nsec - start.tv_nsec);

Use all PC resources to make intensive calculations

I'm writing a little C program that generates random numbers.
When I run it for 60 seconds, it generates only 17 million numbers, and in my PC's task manager I see that it uses almost 0% of the resources.
Can someone please give me a piece of code, or a link, that would let me use the full resources of my PC to generate trillions of random numbers in a few seconds? Multi-threaded, maybe?
Note: if you know a simple way (no heavy CUDA SDK) to use the Nvidia GPU rather than the CPU, that would be good too!
EDIT
Here's my code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

char getRandLetter(void);

int main()
{
    int i, num = 0;
    char randhash[12];
    time_t start, stop;
    start = time(NULL);
    while(1) {
        for(i=0; i<12; i++){
            randhash[i] = getRandLetter();
        }
        num++;
        stop = time(NULL);
        double diff = difftime(stop, start);
        if (diff >= 60) {
            printf("60 seconds passed... NUM=%d\n", num);
            break;
        }
    }
    return 0;
}

char getRandLetter(void) {
    static int range = 'Z'-'A'+1;
    return rand()%range + 'A';
}
Note: I have a killer PC with an i7 and a killer GeForce :p so I just need to exploit these resources.
Generating random numbers should not be CPU bound. Functions that generate high-quality random numbers usually call into the system kernel, which gathers entropy from physical sources (network, keyboard and mouse timing, etc.). The best you can do with pure computation is pseudo-random numbers.
Effectively, there are ways to use more of your CPU to generate "random" numbers faster, but those numbers won't be as high quality as ones drawn from physical sources of entropy (and drawing from those sources barely exercises the CPU at all), because CPUs tend to be quite deterministic (as they should be).
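If pseudo-random numbers are acceptable, here is a hedged sketch of one way to keep every core busy: give each OpenMP thread its own generator state, for example with rand_r() (POSIX; on Windows you would substitute another re-entrant generator), so the threads don't serialize on rand()'s shared state:
// Sketch: per-thread PRNG state with OpenMP and rand_r().
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main(void) {
    const int total = 100000000;   // how many letters to draw
    long long count = 0;
    #pragma omp parallel reduction(+:count)
    {
        // Per-thread seed: mix the time with the thread number.
        unsigned int seed = (unsigned int) time(NULL) ^ (unsigned int) (1234567u * omp_get_thread_num());
        #pragma omp for
        for (int i = 0; i < total; i++) {
            char c = 'A' + rand_r(&seed) % 26;  // same range as getRandLetter()
            (void) c;                           // discard; we only measure throughput
            count++;
        }
    }
    printf("generated %lld letters\n", count);
    return 0;
}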
Some additional info on:
http://en.wikipedia.org/wiki/Random_number_generation

Why is my computer not showing a speedup when I use parallel code?

So I realize this question sounds stupid (and yes, I am using a dual core), but I have tried two different libraries (Grand Central Dispatch and OpenMP), and when using clock() to time the code with and without the lines that make it parallel, the speed is the same. (For the record, they were both using their own form of parallel for.) They report being run on different threads, but perhaps they are running on the same core? Is there any way to check? (Both libraries are for C; I'm uncomfortable at lower layers.) This is super weird. Any ideas?
EDIT: Added detail for Grand Central Dispatch in response to OP comment.
While the other answers here are useful in general, the specific answer to your question is that you shouldn't be using clock() to compare the timing. clock() measures CPU time which is added up across the threads. When you split a job between cores, it uses at least as much CPU time (usually a bit more due to threading overhead). Search for clock() on this page, to find "If process is multi-threaded, cpu time consumed by all individual threads of process are added."
It's just that the job is split between threads, so the overall time you have to wait is less. You should be using the wall time (the time on a wall clock). OpenMP provides a routine omp_get_wtime() to do it. Take the following routine as an example:
#include <omp.h>
#include <time.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int i, nthreads;
    clock_t clock_timer;
    double wall_timer;
    for (nthreads = 1; nthreads <= 8; nthreads++) {
        clock_timer = clock();
        wall_timer = omp_get_wtime();
        #pragma omp parallel for private(i) num_threads(nthreads)
        for (i = 0; i < 100000000; i++) cos(i);
        printf("%d threads: time on clock() = %.3f, on wall = %.3f\n", \
            nthreads, \
            (double) (clock() - clock_timer) / CLOCKS_PER_SEC, \
            omp_get_wtime() - wall_timer);
    }
}
The results are:
1 threads: time on clock() = 0.258, on wall = 0.258
2 threads: time on clock() = 0.256, on wall = 0.129
3 threads: time on clock() = 0.255, on wall = 0.086
4 threads: time on clock() = 0.257, on wall = 0.065
5 threads: time on clock() = 0.255, on wall = 0.051
6 threads: time on clock() = 0.257, on wall = 0.044
7 threads: time on clock() = 0.255, on wall = 0.037
8 threads: time on clock() = 0.256, on wall = 0.033
You can see that the clock() time doesn't change much. I get 0.254 without the pragma, so it's a little slower using openMP with one thread than not using openMP at all, but the wall time decreases with each thread.
The improvement won't always be this good due to, for example, parts of your calculation that aren't parallel (see Amdahl's_law) or different threads fighting over the same memory.
EDIT: For Grand Central Dispatch, the GCD reference states that GCD uses gettimeofday for wall time. So, I created a new Cocoa App, and in applicationDidFinishLaunching I put:
struct timeval t1, t2;
dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
for (int iterations = 1; iterations <= 8; iterations++) {
    int stride = 1e8/iterations;
    gettimeofday(&t1, 0);
    dispatch_apply(iterations, queue, ^(size_t i) {
        for (int j = 0; j < stride; j++) cos(j);
    });
    gettimeofday(&t2, 0);
    NSLog(@"%d iterations: on wall = %.3f\n", iterations, \
        t2.tv_sec+t2.tv_usec/1e6-(t1.tv_sec+t1.tv_usec/1e6));
}
and I get the following results on the console:
2010-03-10 17:33:43.022 GCDClock[39741:a0f] 1 iterations: on wall = 0.254
2010-03-10 17:33:43.151 GCDClock[39741:a0f] 2 iterations: on wall = 0.127
2010-03-10 17:33:43.236 GCDClock[39741:a0f] 3 iterations: on wall = 0.085
2010-03-10 17:33:43.301 GCDClock[39741:a0f] 4 iterations: on wall = 0.064
2010-03-10 17:33:43.352 GCDClock[39741:a0f] 5 iterations: on wall = 0.051
2010-03-10 17:33:43.395 GCDClock[39741:a0f] 6 iterations: on wall = 0.043
2010-03-10 17:33:43.433 GCDClock[39741:a0f] 7 iterations: on wall = 0.038
2010-03-10 17:33:43.468 GCDClock[39741:a0f] 8 iterations: on wall = 0.034
which is about the same as I was getting above.
This is a very contrived example. In fact, you need to be sure to keep the optimization at -O0, or else the compiler will realize we don't keep any of the calculations and not do the loop at all. Also, the integer that I'm taking the cos of is different in the two examples, but that doesn't affect the results too much. See the STRIDE on the manpage for dispatch_apply for how to do it properly and for why iterations is broadly comparable to num_threads in this case.
EDIT: I note that Jacob's answer includes
I use the omp_get_thread_num() function within my parallelized loop to print out which core it's working on... This way you can be sure that it's running on both cores.
which is not correct (it has been partly fixed by an edit). Using omp_get_thread_num() is indeed a good way to ensure that your code is multithreaded, but it doesn't show "which core it's working on", just which thread. For example, the following code:
#include <omp.h>
#include <stdio.h>

int main() {
    int i;
    #pragma omp parallel for private(i) num_threads(50)
    for (i = 0; i < 50; i++) printf("%d\n", omp_get_thread_num());
}
prints out that it's using threads 0 to 49, but this doesn't show which core it's working on, since I only have eight cores. By looking at the Activity Monitor (the OP mentioned GCD, so must be on a Mac - go Window/CPU Usage), you can see jobs switching between cores, so core != thread.
Most likely your execution time isn't bound by those loops you parallelized.
My suggestion is that you profile your code to see what is taking most of the time. Most engineers will tell you that you should do this before doing anything drastic to optimize things.
It's hard to guess without any details. Maybe your application isn't even CPU bound. Did you watch CPU load while your code was running? Did it hit 100% on at least one core?
Your question is missing some very crucial details such as what the nature of your application is, what portion of it are you trying to improve, profiling results (if any), etc...
Having said that you should remember several critical points when approaching a performance improvement effort:
Efforts should always concentrate on the code areas which have been proven, by profiling, to be inefficient.
Parallelizing CPU bound code will almost never improve performance (on a single core machine). You will be losing precious time on unnecessary context switches and gaining nothing. You can very easily worsen performance by doing this.
Even if you are parallelizing CPU bound code on a multicore machine, you must remember you never have any guarantee of parallel execution.
Make sure you are not going against these points, because an educated guess (barring any additional details) will say that's exactly what you're doing.
If you are using a lot of memory inside the loop, that might prevent it from being faster. Also, you could look into the pthread library to handle threading manually.
I use the omp_get_thread_num() function within my parallelized loop to print out which core it's working on, if you don't specify num_threads. For example,
printf("Computing bla %d on core %d/%d ...\n",i+1,omp_get_thread_num()+1,omp_get_max_threads());
The above will work for this pragma
#pragma omp parallel for default(none) shared(a,b,c)
This way you can be sure that it's running on both cores since only 2 threads will be created.
Btw, is OpenMP enabled when you're compiling? In Visual Studio you have to enable it in the Property Pages, C++ -> Language and set OpenMP Support to Yes

GCC performance

I am doing parallel programming with MPI on a Beowulf cluster. We wrote a parallel algorithm for simulated annealing, and it works fine; we expect execution about 15 times faster than with the serial code. We also ran the serial C code on different architectures and operating systems so we could have different data sets for performance measurement. We use this Random function in our code, with GCC on both Windows and Ubuntu Linux. We found that execution takes much longer on Linux, and we don't know why. Can someone compile this code on Linux and Windows with GCC and try to explain it to me?
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main (int argc, char** argv){
    double Random();
    int k, NUM_ITERATIONS = 10;
    clock_t start_time = clock();
    NUM_ITERATIONS = atoi(argv[1]);
    // initialize the random generator
    srand(time(NULL));
    for(k=0; k<NUM_ITERATIONS; k++){
        double raa = Random();
    }
    clock_t end_time = clock();
    printf("Time of algorithm execution: %lf seconds\n", ((double) (end_time - start_time)) / CLOCKS_PER_SEC);
    return 0;
}

// generate a random number between 0 and 1
double Random(){
    srand(rand());
    double a = rand();
    return a/RAND_MAX;
}
If I execute it with 100 000 000 as the argument for NUM_ITERATIONS, execution is 20 times slower on Linux than on Windows. This was tested on the same machine, dual-booting Windows and Ubuntu Linux. We need help, as this Random function is the bottleneck for what we want to show with our data.
On Linux with GCC, the call to srand(rand()) within the Random function accounts for more than 98% of the time.
It is not needed for generating random numbers, at least not within the loop; you already call srand() once, and that's enough.
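A minimal corrected sketch of the question's Random() (the seeding stays in main(); the per-call srand(rand()) is simply dropped):
// Generate a pseudo-random number in [0, 1]; srand() is called once in main().
double Random(void) {
    return (double) rand() / RAND_MAX;
}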
I would investigate other random number generators available. Many exist that have been well tested and perform better than the standard library random functions, both in terms of speed of execution and in terms of pseudo-randomness. I have also implemented my own RNG for a graduate class, but I wouldn't use it in production code. Go with something that has been vetted by the community. Random.org is a good resource for testing whatever RNG you select.
