Primary job terminated in HPL benchmarking with openmpi - benchmarking

I am benchmarking my cluster. It has one node with an AMD FX(tm)-8350 processor and one node with an Intel(R) Xeon(R) CPU X3210. I compiled HPL with Make.Linux_PII_CBLAS_gm, using the OpenMPI, CBLAS and OpenBLAS libraries. It compiles fine, and running it with mpirun on the AMD processor works well. But on the Intel processor it shows:
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 0 with PID 0 on node node1 exited on signal 4 (Illegal instruction).
Do I have to compile HPL specifically for the Intel processor? Any suggestion regarding this problem would be very helpful.
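An "Illegal instruction" crash on only one node usually means the binary, or the OpenBLAS it links against, was built with instructions the other CPU cannot execute: the Xeon X3210 is a Core 2 class part, while OpenBLAS built on the FX-8350 will use that chip's newer instruction set. A hedged sketch of one common fix, assuming you build OpenBLAS from source (the install prefix below is just an example), is to rebuild it portably and relink HPL against that build:

# rebuild OpenBLAS so a single binary runs on every node
make clean
make DYNAMIC_ARCH=1                # build kernels for many CPU types, pick one at run time
# or, alternatively, target the oldest CPU in the cluster:
# make TARGET=CORE2
make PREFIX=/opt/openblas-generic install

Then point LAdir/LAlib in Make.Linux_PII_CBLAS_gm at this installation, rebuild HPL, and run it again on both nodes.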

Related

C OpenMP : libgomp: Thread creation failed: Resource temporarily unavailable

I was trying to do a basic project and it seems like I have run out of threads. Does anyone know how I can fix the problem?
Here is the code:
#include <stdio.h>
#include <omp.h>

int main()
{
    omp_set_num_threads(2150); // request 2150 threads
    #pragma omp parallel
    {
        printf("%d\n", omp_get_thread_num());
    }
    return 0;
}
and here is the global compiler setting I have entered under "other compiler options" in Code::Blocks:
-fopenmp
I am getting the error of:
libgomp: Thread creation failed: Resource temporarily unavailable
I have seen similar threads on the site, but I have not found an answer or solution yet.
Specs:
Intel i5 6400
2x8GB ram
Windows 10 64 bit
The problem is
omp_set_num_threads(2150);
The OS imposes a limit on the number of threads a process can create. This limit may be indirect, for example via the stack size. Creating 2150 threads exceeds those limits.
You mention that you have an Intel i5 6400, which is a quad-core chip. Try setting the number of threads to something more reasonable. In your case:
omp_set_num_threads(4);
For numerical processing, performance will likely suffer when using more than 4 threads on a 4-core system.
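A hedged sketch of how you could size the thread count from the hardware instead of hard-coding it (the 4 above is simply this machine's core count):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    // Use one thread per logical processor reported by the OpenMP runtime.
    int nprocs = omp_get_num_procs();
    omp_set_num_threads(nprocs);

    #pragma omp parallel
    {
        printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

Compile it with -fopenmp as before; on the i5 6400 it should start 4 threads.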

Real time Linux: disable local timer interrupts

TL;DR: Using a real-time Linux kernel with NO_HZ_FULL, I need to isolate a process in order to get deterministic results, but /proc/interrupts tells me there are still local timer interrupts (among others). How can I disable them?
Long version :
I want to make sure my program is not being interrupted, so I am trying to use a real-time Linux kernel.
I'm using the real-time version of Arch Linux (linux-rt on AUR) and I modified the kernel configuration to select the following options:
CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ_FULL_ALL=y
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_ALL=y
Then I rebooted my computer into this real-time kernel with the following boot options:
nmi_watchdog=0
rcu_nocbs=1
nohz_full=1
isolcpus=1
I also disabled the following options in the BIOS:
C state
intel speed step
turbo mode
VTx
VTd
hyperthreading
My CPU (i7-6700, 3.40 GHz) has 4 cores (8 logical CPUs with hyperthreading).
I can see CPU0, CPU1, CPU2 and CPU3 in the /proc/interrupts file.
CPU1 is isolated by the isolcpus kernel parameter, and I want to disable the local timer interrupts on this CPU.
I thought a real-time kernel with CONFIG_NO_HZ_FULL and CPU isolation (isolcpus) would be enough to do this, and I tried to check it by running these commands:
cat /proc/interrupts | grep LOC > ~/tmp/log/overload_cpu1
taskset -c 1 ./overload
cat /proc/interrupts | grep LOC >> ~/tmp/log/overload_cpu1
where the overload process is:
overload.c:
int main()
{
for(int i=0;i<100;++i)
for(int j=0;j<100000000;++j);
}
The file overload_cpu1 contains the result:
LOC: 234328 488 12091 11299 Local timer interrupts
LOC: 239072 651 12215 11323 Local timer interrupts
meaning 651 - 488 = 163 interrupts from the local timer, not 0...
For comparison I ran the same experiment but changed the core on which my overload process runs (I keep watching the interrupts on CPU1):
taskset -c 0 : 8 interrupts
taskset -c 1 : 163 interrupts
taskset -c 2 : 7 interrupts
taskset -c 3 : 8 interrupts
One of my questions is: why are there not 0 interrupts? And why is the number of interrupts larger when my process runs on CPU1? (I thought NO_HZ_FULL would prevent interrupts if my process was alone: "The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid sending scheduling-clock interrupts to CPUs with a single runnable task" (https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt).)
Maybe an explanation is that there are other processes running on CPU1.
I checked using the ps command:
CLS CPUID RTPRIO PRI NI CMD PID
TS 1 - 19 0 [cpuhp/1] 18
FF 1 99 139 - [migration/1] 20
TS 1 - 19 0 [rcuc/1] 21
FF 1 1 41 - [ktimersoftd/1] 22
TS 1 - 19 0 [ksoftirqd/1] 23
TS 1 - 19 0 [kworker/1:0] 24
TS 1 - 39 -20 [kworker/1:0H] 25
FF 1 1 41 - [posixcputmr/1] 28
TS 1 - 19 0 [kworker/1:1] 247
TS 1 - 39 -20 [kworker/1:1H] 501
As you can see, there are threads on CPU1.
Is it possible to disable these processes? I guess it must be, because otherwise NO_HZ_FULL would never work, right?
Tasks with class TS don't disturb me, because they don't have priority over SCHED_FIFO and I can set that policy for my program.
The same goes for tasks with class FF and a priority below 99.
However, you can see migration/1, which runs with SCHED_FIFO at priority 99.
Maybe these processes cause interrupts when they run. That would explain the few interrupts when my process is on CPU0, CPU2 or CPU3 (respectively 8, 7 and 8 interrupts), but it also means these processes do not run very often, and so it does not explain why there are so many interrupts when my process runs on CPU1 (163 interrupts).
I also ran the same experiment but with the SCHED_FIFO policy for my overload process, and I got:
taskset -c 0 : 1
taskset -c 1 : 4063
taskset -c 2 : 1
taskset -c 3 : 0
In this configuration there are more interrupts when my process uses the SCHED_FIFO policy on CPU1 and fewer on the other CPUs. Do you know why?
The thing is that a full-tickless CPU (a.k.a. adaptive-ticks, configured with nohz_full=) still receives some ticks.
Most notably the scheduler requires a timer on an isolated full tickless CPU for updating some state every second or so.
This is a documented limitation (as of 2019):
Some process-handling operations still require the occasional
scheduling-clock tick. These operations include calculating CPU
load, maintaining sched average, computing CFS entity vruntime,
computing avenrun, and carrying out load balancing. They are
currently accommodated by scheduling-clock tick every second
or so. On-going work will eliminate the need even for these
infrequent scheduling-clock ticks.
(source: Documentation/timers/NO_HZ.txt, cf. the LWN article (Nearly) full tickless operation in 3.10 from 2013 for some background)
A more accurate method to measure the local timer interrupts (LOC row in /proc/interrupts) is to use perf. For example:
$ perf stat -a -A -e irq_vectors:local_timer_entry ./my_binary
Where my_binary has threads pinned to the isolated CPUs that utilize the CPU non-stop without invoking syscalls, for, say, 2 minutes.
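A minimal sketch of such a test binary, assuming CPU 1 is the isolated CPU (the iteration count is arbitrary; tune it so the run lasts about 2 minutes):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    // Pin this single-threaded process to the isolated CPU 1.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    // Burn CPU without making syscalls, same idea as overload.c above.
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 100000000000UL; ++i)
        ++x;
    return 0;
}

With the pinning done inside the program, taskset is not needed, and perf can attribute the local_timer_entry counts per CPU via -A.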
There are other sources of additional local timer ticks (when there is just 1 runnable task).
For example, the collection of VM stats: by default they are collected every second. Thus, I can decrease my LOC interrupts by setting a higher value, e.g.:
# sysctl vm.stat_interval=60
Another source is the periodic check whether the TSCs of the different CPUs drift apart; you can disable it with the following kernel option:
tsc=reliable
(Only apply this option if you really know that your TSCs don't drift.)
You might find other sources by recording traces with ftrace (while your test binary is running).
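For example, a hedged sketch using the tracefs interface directly (this assumes debugfs is mounted at /sys/kernel/debug, and enabling the whole timer event group is just one possible selection):

cd /sys/kernel/debug/tracing
echo 0 > tracing_on              # stop tracing while configuring
echo > trace                     # clear the trace buffer
echo 1 > events/timer/enable     # enable all timer trace events
echo 1 > tracing_on
# ... run the pinned test binary ...
echo 0 > tracing_on
cat per_cpu/cpu1/trace           # inspect what fired on the isolated CPU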
Since it came up in the comments: Yes, the SMI is fully transparent to the kernel. It doesn't show up as NMI. You can only detect an SMI indirectly.

How CPU allocation is done in Linux ? Thread level or Process level? [closed]

I am trying to understand how the CPU is distributed among different processes with different numbers of threads. I have two programs, Program1 and Program2.
Program1 has 5 threads, whereas Program2 has ONLY the main thread.
SCENARIO -1 :
terminal-1 : ./Program1
terminal-2 : ./Program2
When I run Program1 in one terminal and Program2 in another terminal, the CPU allocation is 50% for Program1 and 50% for Program2. Each thread of Program1 gets 10% (cumulatively 50% for Program1).
This suggests that, no matter how many threads a process has, every process gets an equal share of the CPU, i.e. CPU allocation is done at the process level.
pstree shows
├─bash───P1───5*[{P1}]
├─bash───P2───{P2}
SCENARIO -2 :
terminal-1 : ./Program1 & ./Program2
When I run both Program1 and Program2 in the SAME terminal, the CPU allocation is equal across all threads of Program1 and Program2: each thread of Program1 gets almost 17% (cumulatively Program1 gets 83%) and Program2 also gets 17%. This suggests CPU allocation is done at the thread level.
pstree shows
├─bash─┬─P1───5*[{P1}]
│ └─P2
I am using Ubuntu 12.04.4 LTS, kernel 3.11.0-15-generic. I have also used Ubuntu 14.04.4 with kernel 3.16.x and got similar results.
Can anyone explain how the CPU scheduler of the LINUX KERNEL distinguishes SCENARIO-1 from SCENARIO-2?
I think the CPU scheduler distinguishes the two SCENARIOs somewhere before allocating the CPU.
To understand how the CPU scheduler distinguishes SCENARIO-1 and SCENARIO-2, I downloaded the Linux kernel source code.
However, I have not found where in the source code it distinguishes SCENARIO-1 and SCENARIO-2.
It would be great if anyone could point me to the source code or function where the CPU scheduler distinguishes SCENARIO-1 and SCENARIO-2.
Thanks in advance.
NOTE: Although Ubuntu is based on Debian, surprisingly, on Debian 8 (kernel 3.16.0-4-686-pae) CPU allocation is done at the thread level in both SCENARIOs: each thread of Program1 gets almost 17% (cumulatively Program1 gets 83%) and Program2 also gets 17%.
Here is the code:
Program1(with 5 threads)
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <pthread.h>
// Let us create a global variable to change it in threads
int g = 0;
// The function to be executed by all threads
void *myThreadFun(void *vargp)
{
// Store the value argument passed to this thread
int myid = (int)(intptr_t)vargp;
// Let us create a static variable to observe its changes
static int s = 0;
// Change static and global variables
++s; ++g;
// Print the argument, static and global variables
printf("Thread ID: %d, Static: %d, Global: %d\n", myid, ++s, ++g);
while(1); // Representing CPU Bound Work
}
int main()
{
int i;
pthread_t tid[5];
// Let us create five threads
for (i = 0; i < 5; i++)
pthread_create(&tid[i], NULL, myThreadFun, (void *)(intptr_t)i);
for (i = 0; i < 5; i++)
pthread_join(tid[i],NULL);
return 0;
}
Program2 (with only the main thread)
#include <stdio.h>
#include <stdlib.h>
int main()
{
while(1);// Representing CPU Bound Work
}
To disable all optimizations by gcc, I used the -O0 option while compiling both programs.
gcc -O0 program1.c -o p1 -lpthread
gcc -O0 program2.c -o p2
UPDATE: As per ninjalj's explanation, in Scenario-1 CPU allocation is done at the control-group level, and since I am using two different terminals (i.e. two different sessions), there are two different control groups and each control group gets 50% of the CPU. This is because autogrouping is enabled by default.
As Program2 has ONLY one thread and Program1 has more threads, I want to run both programs in separate terminals (different sessions) and still get more CPU for Program1 (as in Scenario-2, where Program1 gets 83% of the CPU compared to 17% for Program2). Is there any way to make the CPU allocation of Scenario-1 the same as in Scenario-2 on Ubuntu?
It is also surprising to me that, although Ubuntu is based on Debian, Debian and Ubuntu behave differently. On Debian, Program1 gets more CPU in both scenarios.
The Linux kernel does not distinguish between processes and threads when scheduling.
Threads are processes that just happen to share most of their memory. Beyond that, they are treated equally by the scheduler.
You can have 50 processes and 30 threads. That's 80 "things", and the kernel will schedule them without regard to whether they are processes or threads.
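Regarding the autogroup point in the question's update: one way to make Scenario-1 behave like Scenario-2 is to turn session autogrouping off, so that all runnable threads compete individually again. A sketch, assuming the kernel was built with CONFIG_SCHED_AUTOGROUP:

# disable session-based autogrouping at run time
sysctl kernel.sched_autogroup_enabled=0
# or equivalently:
echo 0 > /proc/sys/kernel/sched_autogroup_enabled
# to disable it permanently, boot with the noautogroup kernel parameter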

Regarding CPU utilization

Considering the below piece of C code, I expected the CPU utilization to go up to 100% as the processor would try to complete the job (endless in this case) given to it. On running the executable for 5 mins, I found the CPU to go up to a max. of 48%. I am running Mac OS X 10.5.8; processor: Intel Core 2 Duo; Compiler: GCC 4.1.
int main(void)
{
    int i = 10;
    while (1) {
        i = i * 5;
    }
    return 0;
}
Could someone please explain why the CPU usage does not go up to 100%? Does the OS limit the CPU from reaching 100%?
Please note that if I added a "printf()" inside the loop the CPU hits 88%. I understand that in this case, the processor also has to write to the standard output stream hence the sharp rise in usage.
Has this got something to do with the amount of job assigned to the processor per unit time?
Regards,
Ven.
You have a multi-core processor and you are in a single-threaded scenario, so you will use only one core at full throttle... Why would you expect overall processor usage to reach 100% in such a context?
Run two copies of your program at the same time. These will use both cores of your "Core 2 Duo" CPU and overall CPU usage will go to 100%.
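Equivalently, a single program can saturate both cores by running one busy loop per core. A minimal sketch (the thread count of 2 matches the Core 2 Duo; the file name is made up):

#include <pthread.h>

// Each thread runs an endless busy loop doing the same work as the original snippet.
static void *burn(void *arg)
{
    (void)arg;
    volatile int i = 10;
    while (1) {
        i = i * 5;
    }
    return NULL; // never reached
}

int main(void)
{
    pthread_t t[2];
    for (int n = 0; n < 2; ++n)
        pthread_create(&t[n], NULL, burn, NULL);
    for (int n = 0; n < 2; ++n)
        pthread_join(t[n], NULL);
    return 0;
}

Build it with something like gcc busy.c -o busy -lpthread; with both threads spinning, overall CPU usage should reach 100%.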
Edit
if I added a "printf()" inside the loop the CPU hits 88%.
printf sends some characters to the terminal/screen. Sending, displaying and updating that information is handled by code outside your executable, which is likely to run on another thread. But displaying a few characters does not need 100% of such a thread. That is why you see 100% for core 1 and 76% for core 2, which results in the overall CPU usage of 88% that you observe.

cpu cores vs threads

My MacBook Pro, running Boot Camp, has an Intel i7-640M processor, which has 2 cores. Like the other i7 chips, each core is hyperthreaded, so you can have up to 4 threads. I am using Visual Studio 2010 C/C++ to determine these:
coresAvailable = omp_get_num_procs();
threadsAvailable = omp_get_max_threads();
The "threadsAvailable" comes back with a value of 4, as expected. But "coresAvailable" also is reported as 4.
What am I missing?
omp_get_num_procs returns the number of CPUs the OS reports, and since a hyperthreaded core reports itself as 2 CPUs, a dual-core hyperthreaded chip will report itself as 4 processors.
omp_get_max_threads returns the most threads that will be used in a parallel region of code, so it makes sense that the most threads it will use will be the number of CPUs available.
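A small sketch that prints both values so they can be compared directly (nothing here is specific to Visual Studio; any OpenMP-enabled compiler should do):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    // Logical processors visible to the OpenMP runtime (hyperthreads count as CPUs).
    printf("omp_get_num_procs()   = %d\n", omp_get_num_procs());
    // Default upper bound on the number of threads for the next parallel region.
    printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
    return 0;
}

On the i7-640M described above, both calls should report 4.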
