Running openmp on cluster - c

I have to run an OpenMP program on a cluster with different configurations (such as different numbers of nodes).
The problem I am facing is that whenever I try to run the program with, say, 2 nodes, the same program runs 2 times instead of running in parallel.
My program -
gettimeofday(&t0, NULL);
for (k=0; k<size; k++) {
#pragma omp parallel for shared(A)
for (i=k+1; i<size; i++) {
//parallel code
}
#pragma omp barrier
for (i=k+1; i<size; i++) {
#pragma omp parallel for
//parallel code
}
}
gettimeofday(&t1, NULL);
printf("Did %u calls in %.2g seconds\n", i, t1.tv_sec - t0.tv_sec + 1E-6 * (t1.tv_usec - t0.tv_usec));
It is an LU decomposition program.
When I run it on 2 nodes, I get output like this -
Did 1000 calls in 5.2 seconds
Did 1000 calls in 5.3 seconds
Did 2000 calls in 41 seconds
Did 2000 calls in 41 seconds
As you can see, the program is run two times for each value (1000, 2000, 3000...) instead of running in parallel.
It is my homework program but I am stuck at this point.
I am using SLURM script to run this program on my college computing cluster. This is the standard script provided by the professor.
#!/bin/sh
##SBATCH --partition=general-compute
#SBATCH --time=60:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
##SBATCH --mem=24000
# Memory per node specification is in MB. It is optional.
# The default limit is 3GB per core.
#SBATCH --job-name="lu_openmpnew2nodes"
#SBATCH --output=luopenmpnew1node2task.out
#SBATCH --mail-user=***#***.edu
#SBATCH --mail-type=ALL
##SBATCH --requeue
#Specifies that the job will be requeued after a node failure.
#The default is that the job will not be requeued.
echo "SLURM_JOBID="$SLURM_JOBID
echo "SLURM_JOB_NODELIST"=$SLURM_JOB_NODELIST
echo "SLURM_NNODES"=$SLURM_NNODES
echo "SLURMTMPDIR="$SLURMTMPDIR
cd $SLURM_SUBMIT_DIR
echo "working directory = "$SLURM_SUBMIT_DIR
module list
ulimit -s unlimited
#
echo "Launch luopenmp with srun"
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
for i in {1000..20000..1000}
do
srun ./openmpNew "$i"
done
#
echo "All Done!"

Be careful, you are confusing MPI and OpenMP here.
OpenMP works with threads, i.e. on shared memory, and threads cannot communicate across several nodes of a distributed-memory system (there are techniques to do so, but they do not perform well enough).
What you are doing is starting the same program twice: the script requests two tasks, and srun launches one copy per task. If you were using MPI, this would be fine. But in your case you start two processes, each with a default number of threads. Those two processes are independent of each other.
I would suggest some further studies on the topics of Shared Memory Parallelization programming (like OpenMP) and Distributed Memory Parallelization (like MPI). There's tons of tutorials out there, and I would recommend the book "Introduction to High Performance Computing for Scientists and Engineers," by Hager and Wellein.
To try your program, start on one node, and specify OMP_NUM_THREADS like:
OMP_NUM_THREADS=1 ./openmpNew "$i"
OMP_NUM_THREADS=2 ./openmpNew "$i"
...
Here is an example script for SLURM: link.


perf tool output, magic values

I ran perf with the parameter -x to print in machine readable format. The output is as follows:
1285831153,,instructions,1323535732,100.00
7332248,,branch-misses,1323535732,100.00
1316.587352,,cpu-clock,1316776510,100.00
1568113343,,cycles,1323535732,100.00
The first number is clear, but the values after the descriptions are not clear to me. Is the first one after the description the runtime? Then why is it different? What does the 100.00 at the end of each line mean? It is not documented; I looked it up here: https://perf.wiki.kernel.org/index.php/Tutorial#Machine_readable_output
The -x option of the stat command is implemented in the tools/perf/builtin-stat.c file as the csv_output flag, and the printing is done in the static void printout function (line 1061). The last values in the line probably come from:
print_noise(counter, noise);
print_running(run, ena);
With a single run of the target program (no -r 5 or -r 2 options - https://perf.wiki.kernel.org/index.php/Tutorial#Repeated_measurement) print_noise will not print anything. And print_running prints the "run" argument twice: once as a value and once as a percentage of ena.
static void print_running(u64 run, u64 ena)
{
if (csv_output) {
fprintf(stat_config.output, "%s%" PRIu64 "%s%.2f",
csv_sep,
run,
csv_sep,
ena ? 100.0 * run / ena : 100.0);
} else if (run != ena) {
fprintf(stat_config.output, " (%.2f%%)", 100.0 * run / ena);
}
}
You have run/ena = 1 (100.00%), so these fields carry no useful information for you.
They are used in the case of event multiplexing (try perf stat -d or perf stat -dd; https://perf.wiki.kernel.org/index.php/Tutorial#multiplexing_and_scaling_events), when the user asks perf to measure more events than can be enabled at the same time (for example, 8 hardware events on Intel with only 7 real hardware counting units). Perf (the perf_events subsystem of the kernel) will enable some subset of the events and will switch between subsets several times per second. Then run/ena will be proportional to the share of time during which this event was enabled, and run will probably show the exact amount of time during which the event was counted. In the normal human-readable perf stat output this shows up as a percentage other than 100% on the event line, and the reported event count may be scaled (an inexact estimate) to the full running time of the program.

Multi-GPU programming using CUDA on a NUMA Machine

I am currently porting an algorithm to two GPUs. The hardware has the following setup:
Two CPUs as a NUMA system, so the main memory is split between both NUMA nodes.
Each GPU is physically connected to one of the CPUs. (Each PCIe controller has one GPU)
I created two threads on the host to control the GPUs. The threads are bound each to a NUMA-Node, i.e. each of both threads runs on one CPU socket. How can I determine the number of the GPU such that I can select the directly connected GPU using cudaSetDevice()?
As I mentioned in the comments, this is a type of CPU GPU affinity. Here is a bash script that I hacked together. I believe it will give useful results on RHEL/CentOS 6.x OS. It probably won't work properly on many older or other linux distros. You can run the script like this:
./gpuaffinity > out.txt
You can then read out.txt in your program to determine which logical CPU cores correspond to which GPUs. For example, on a NUMA Sandy Bridge system with two 6-core processors and 4 GPUs, sample output might look like this:
0 03f
1 03f
2 fc0
3 fc0
This system has 4 GPUs, numbered from 0 to 3. Each GPU number is followed by a "core mask". The core mask identifies the cores which are "close" to that particular GPU, expressed as a bitmask written in hexadecimal. So for GPUs 0 and 1, the first 6 logical cores in the system (mask 03f, bits 0-5) are closest. For GPUs 2 and 3, the second 6 logical cores in the system (mask fc0, bits 6-11) are closest.
You can either read the file in your program, or else you can use the logic illustrated in the script to perform the same functions in your program.
You can also invoke the script like this:
./gpuaffinity -v
which will give slightly more verbose output.
Here is the bash script:
#!/bin/bash
#this script will output a listing of each GPU and its CPU core affinity mask
file="/proc/driver/nvidia/gpus/0/information"
if [ ! -e $file ]; then
echo "Unable to locate any GPUs!"
else
gpu_num=0
file="/proc/driver/nvidia/gpus/$gpu_num/information"
if [ "-v" == "$1" ]; then echo "GPU: CPU CORE AFFINITY MASK: PCI:"; fi
while [ -e $file ]
do
line=`grep "Bus Location" $file | { read line; echo $line; }`
pcibdf=${line:14}
pcibd=${line:14:7}
file2="/sys/class/pci_bus/$pcibd/cpuaffinity"
read line2 < $file2
if [ "-v" == "$1" ]; then
echo " $gpu_num $line2 $pcibdf"
else
echo " $gpu_num $line2 "
fi
gpu_num=`expr $gpu_num + 1`
file="/proc/driver/nvidia/gpus/$gpu_num/information"
done
fi
The nvidia-smi tool can report the topology on a NUMA machine.
% nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      PHB     SOC     SOC     0-5
GPU1    PHB      X      SOC     SOC     0-5
GPU2    SOC     SOC      X      PHB     6-11
GPU3    SOC     SOC     PHB      X      6-11
Legend:
X = Self
SOC = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks

Running OpenMP on a single node of a cluster

I am able to do simple for loops in OpenMP on my desktop/laptop of the form (a mild simplification of what I actually have...)
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
// #include other libraries...
int main(void){
.
.
.
// declare and initialize variables.
.
.
.
#pragma omp parallel for collapse(3) shared(tf, p, Fx, Fy, Fz) private(v, i,j,k,t0)
for (i = 0; i < Nx; i++){
for (j = 0; j < Ny; j++){
for (k = 0; k < Nz; k++){
v[0] = Fx[i][j][k];
v[1] = Fy[i][j][k];
v[2] = Fz[i][j][k];
///My_fn changes v and then I put it back into Fx, Fy, Fz
My_fn(v, t0, tf, p);
Fx[i][j][k] = v[0];
Fy[i][j][k] = v[1];
Fz[i][j][k] = v[2];
}
}
}
}
If I want, I can even specify n_threads = 1, 2, 3 or 4 cores on my laptop by adding omp_set_num_threads(n_threads); to the top, and I see the performance I want. However, when using a cluster, I comment that line out.
I have access to a cluster and would like to run the code on a single node since the cluster has nodes with up to 48 cores and my laptop only 4. When I use the cluster, after compiling, I type into the terminal
$export OMP_NUM_THREADS=10
$bsub -n 10 ./a.out
But the program does not run properly: I write the output to a file and see that it took 0 seconds to run, and the values of Fx, Fy and Fz are what they were when I initialized them, so it seems the loop is not run at all.
Edit: This issue was addressed by the people who managed the cluster, and is likely very specific to that cluster, hence I caution people to relate the issue to their specific case.
It looks to me that this question has nothing to do with programming but rather with using the batch system (a.k.a. distributed resource manager) on your cluster. The usual practice is to write a script and, inside the script, set OMP_NUM_THREADS to the number of slots granted. Your batch system appears to be LSF (a wild guess, based on the presence of bsub), so you would most likely want something like this in the script (let's call it job.sh):
#BSUB -n 10
export OMP_NUM_THREADS=$LSB_DJOB_NUMPROC
./a.out
Then submit the script with bsub < job.sh. LSF exports the number of slots granted to the job in the LSB_DJOB_NUMPROC environment variable. By doing the assignment you may submit the same job file with different parameters like: bsub -n 20 < job.sh. You might need to give a hint to the scheduler that you'd like to have all slots on the same node. One can usually do that by specifying -R "span[ptile=n]". There might be other means to do that, e.g. an esub executable that you might need to specify:
#BSUB -a openmp
Please, note that Stack Overflow is not where your administrators store the cluster documentation. You'd better ask them, not us.
I am not sure that I understand correctly what you are up to, but I fear that your idea is that OpenMP would automatically run your application in a distributed way on a cluster.
OpenMP is not made for such a task; it assumes that you run your code in a shared-memory setting. For a distributed setting (processors connected only through a network link) there are other tools, namely MPI. But such a setting is a bit more complicated to set up than just the #pragma annotations that you are used to when using OpenMP.
Hristo is right, but I think you should add
#BSUB -R "span[hosts=1]" # run on a single node
in your .sh file. The ptile option only specifies the number of tasks per node; see e.g.
https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/PlatformLSF
Otherwise, depending on the queue settings of the cluster, which you might get with
bqueues -l
the task would be run on every node that is available to you.
If the node has 24 cores, I use
#PBS -l nodes=1:ppn=24
on my system. On the cluster you use it will probably be something like
#BSUB -l nodes=1:ppn=24

Making sure two processes interleave

In a C program on Linux, I call fork() followed by execve() twice to create two processes running two separate programs. How do I make sure that the execution of the two child processes interleaves?
Thanks
I tried to do the above as an answer below suggested, but it seems the process hangs on reaching sched_setscheduler(). The code is included below. replay1 and replay2 are two programs which simply print "Replay1" and "Replay2" respectively.
# include<stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <sched.h>
void main()
{
int i,pid[5],pidparent,new=0;
char *newargv1[] = {"./replay1",NULL};
char *newargv2[] = {"./replay2",NULL};
char *newenviron[] = {NULL};
struct sched_param mysched;
mysched.sched_priority = 1;
sched_setscheduler(0,SCHED_FIFO, &mysched);
pidparent =getpid();
for(i=0;i<2;i++)
{
if(getpid()==pidparent)
{
pid[i] = fork();
if(pid[i] != 0)
kill(pid[i],SIGSTOP);
if(i==0 && pid[i]==0)
execve(newargv1[0], newargv1, newenviron);
if (i==1 && pid[i]==0)
execve(newargv2[0], newargv2, newenviron);
}
}
for(i=0;i<10;i++)
{
if(new==0)
new=1;
else
new=0;
kill(pid[new],SIGCONT);
sleep(100);
kill(pid[new], SIGSTOP);
}
}
Since you need random interleaving, here's a horrible hack to do it:
1. Immediately after forking, send a SIGSTOP to each application.
2. Set your parent application to have real-time priority with sched_setscheduler. This will allow you to have more fine-grained timers.
3. Send a SIGCONT to one of the child processes.
4. Loop: Wait a random, short time. Send a SIGSTOP to the currently-running application, and a SIGCONT to the other. Repeat.
This will help force execution to interleave. It will also make things quite slow. You may also want to try using sched_setaffinity to assign each process to a different CPU (if you have a dual-core or hyperthreaded CPU) - this will cause them to effectively run simultaneously, modulo wait times for I/O. I/O wait times (which could cause them to wait for the hard disk, at which point they're likely to wake up sequentially and thus not interleave) can be avoided by making sure whatever data they're manipulating is on a ramdisk (on linux, use tmpfs).
If this is too coarse-grained for you, you can use ptrace's PTRACE_SINGLESTEP operation to step one CPU operation at a time, interleaving as you see fit.
As this is for testing purposes, you could place sched_yield(); calls after every line of code in the child processes.
Another potential idea is to have a parent process ptrace() the child processes, and use PTRACE_SINGLESTEP to interleave the two process's execution on an instruction-by-instruction basis.
If you need to synchronize them and they are your own processes, use semaphores. If you do not have access to the source, then there is no way to synchronize them.
If your aim is to do concurrency testing, I know of only two techniques:
Test exact scenarios using synchronization. For example, process 1 opens a connection and executes a query, then process 2 comes in and executes a query, then process1 gets active again and gets the results, etc. You do this with synchronization techniques mentioned by others. However, getting good test scenarios is very difficult. I have rarely used this method in the past.
In random you trust: fire up a high number of test processes that execute a long running test suite. I used this method for both multithreading and multiprocess testing (my case was testing device driver access from multiple processes without blue screening out). Usually you want to make the number of processes and number of iterations of the test suite per process configurable so that you can either do a quick pass or do a longer test before a release (running this kind of test with 10 processes for 10-12 hours was not uncommon for us). A usual run for this sort of testing is measured in hours. You just fire up the processes, let them run for a few hours, and hope that they will catch all the timing windows. The interleaving is usually handled by the OS, so you don't really need to worry about it in the test processes.
Job control is much simpler in Bash than in C. Try this:
#! /bin/bash
stop ()
{
echo "$1 stopping"
kill -SIGSTOP $2
}
cont ()
{
echo "$1 continuing"
kill -SIGCONT $2
}
replay1 ()
{
while sleep 1 ; do echo "replay 1 running" ; done
}
replay2 ()
{
while sleep 1 ; do echo "replay 2 running" ; done
}
replay1 &
P1=$!
stop "replay 1" $P1
replay2 &
P2=$!
stop "replay 2" $P2
trap "kill $P1;kill $P2" EXIT
while sleep 1 ; do
cont "replay 1 " $P1
cont "replay 2" $P2
sleep 3
stop "replay 1 " $P1
stop "replay 2" $P2
done
The two processes are running in parallel:
$ ./interleave.sh
replay 1 stopping
replay 2 stopping
replay 1 continuing
replay 2 continuing
replay 2 running
replay 1 running
replay 1 running
replay 2 running
replay 1 stopping
replay 2 stopping
replay 1 continuing
replay 2 continuing
replay 1 running
replay 2 running
replay 2 running
replay 1 running
replay 2 running
replay 1 running
replay 1 stopping
replay 2 stopping
replay 1 continuing
replay 2 continuing
replay 1 running
replay 2 running
replay 1 running
replay 2 running
replay 1 running
replay 2 running
replay 1 stopping
replay 2 stopping
^C

How can I run this DTrace script to profile my application?

I was searching online for something to help me do assembly line profiling. I searched and found something on http://www.webservertalk.com/message897404.html
There are two parts to this problem: finding all instructions of a particular type (inc, add, shl, etc.) to determine groupings, and then figuring out which are getting executed and summing correctly. The first bit is tricky unless grouping by disassembler is sufficient. For figuring out which instructions are being executed, DTrace is of course your friend here (at least in userland).
The nicest way of doing this would be to instrument only the beginning of each basic block; finding these would be a manual process right now... however, instrumenting each instruction is feasible for small applications. Here's an example:
First, our quite trivial C program under test:
main()
{
int i;
for (i = 0; i < 100; i++)
getpid();
}
Now, our slightly tricky D script:
#pragma D option quiet
pid$target:a.out::entry
/address[probefunc] == 0/
{
address[probefunc]=uregs[R_PC];
}
pid$target:a.out::
/address[probefunc] != 0/
{
@a[probefunc,(uregs[R_PC]-address[probefunc]), uregs[R_PC]]=count();
}
END
{
printa("%s+%#x:\t%d\t%@d\n", @a);
}
main+0x1: 1
main+0x3: 1
main+0x6: 1
main+0x9: 1
main+0xe: 1
main+0x11: 1
main+0x14: 1
main+0x17: 1
main+0x1a: 1
main+0x1c: 1
main+0x23: 101
main+0x27: 101
main+0x29: 100
main+0x2e: 100
main+0x31: 100
main+0x33: 100
main+0x35: 1
main+0x36: 1
main+0x37: 1
From the example given, this is exactly what I need. However, I have no idea what it is doing, how to save the DTrace program, or how to execute it with the code that I want to get the results for. So I opened this question hoping that some people with a good DTrace background could help me understand the code, save it, run it and hopefully get the results shown.
If all you want to do is run this particular DTrace script, simply save it to a .d script file and use a command like the following to run it against your compiled executable:
sudo dtrace -s dtracescript.d -c [Path to executable]
where you replace dtracescript.d with your script file name.
This assumes that you have DTrace as part of your system (I'm running Mac OS X, which has had it since Leopard).
If you're curious about how this works, I wrote a two-part tutorial on using DTrace for MacResearch a while ago, which can be found here and here.
