Running OpenMP on a single node of a cluster

I am able to do simple for loops in OpenMP on my desktop/laptop of the form (a mild simplification of what I actually have...)
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
/* #include other libraries... */

int main(void){
    /* ... declare and initialize variables ... */

    #pragma omp parallel for collapse(3) shared(tf, p, Fx, Fy, Fz) private(v, i, j, k, t0)
    for (i = 0; i < Nx; i++){
        for (j = 0; j < Ny; j++){
            for (k = 0; k < Nz; k++){
                v[0] = Fx[i][j][k];
                v[1] = Fy[i][j][k];
                v[2] = Fz[i][j][k];
                /* My_fn changes v and then I put it back into Fx, Fy, Fz */
                My_fn(v, t0, tf, p);
                Fx[i][j][k] = v[0];
                Fy[i][j][k] = v[1];
                Fz[i][j][k] = v[2];
            }
        }
    }
}
If I want, I can even specify n_threads = 1, 2, 3, or 4 cores to use on my laptop by adding omp_set_num_threads(n_threads); at the top, and I see the performance improvement I expect. However, when using a cluster, I comment that line out.
I have access to a cluster and would like to run the code on a single node, since the cluster has nodes with up to 48 cores while my laptop has only 4. When I use the cluster, after compiling, I type into the terminal
$ export OMP_NUM_THREADS=10
$ bsub -n 10 ./a.out
But the program does not run properly: I write the output to a file and see that it took 0 seconds to run, and the values of Fx, Fy and Fz are what they were when I initialized them, so it seems the loop never runs at all.
Edit: This issue was addressed by the people who manage the cluster and is likely very specific to that cluster, so I caution readers to relate the issue carefully to their own case.

It looks to me like this question has nothing to do with programming but rather with using the batch system (a.k.a. the distributed resource manager) on your cluster. The usual practice is to write a script and, inside the script, set OMP_NUM_THREADS to the number of slots granted. Your batch system appears to be LSF (a wild guess, based on the presence of bsub), so you'd most likely want something like this in the script (let's call it job.sh):
#BSUB -n 10
export OMP_NUM_THREADS=$LSB_DJOB_NUMPROC
./a.out
Then submit the script with bsub < job.sh. LSF exports the number of slots granted to the job in the LSB_DJOB_NUMPROC environment variable. By doing the assignment above, you can submit the same job file with different parameters, e.g. bsub -n 20 < job.sh. You might need to give the scheduler a hint that you'd like to have all slots on the same node. One can usually do that by specifying -R "span[ptile=n]". There might be other means to do that, e.g. an esub executable that you might need to specify:
#BSUB -a openmp
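For example, a complete job.sh along these lines might look roughly like this (a sketch only: the output file option is an assumption, and whether an esub such as openmp exists depends entirely on how your site has configured LSF):
#!/bin/sh
#BSUB -n 10                  # request 10 slots
#BSUB -R "span[ptile=10]"    # hint: place all 10 slots on the same host
#BSUB -a openmp              # site-specific esub, if your cluster provides one
#BSUB -o lsf_output.%J       # assumed output file name; %J is the job ID

export OMP_NUM_THREADS=$LSB_DJOB_NUMPROC
./a.out
As before, submit it with bsub < job.sh.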
Please note that Stack Overflow is not where your administrators store the cluster documentation. You'd better ask them, not us.

I am not sure that I correctly understand what you are up to, but I fear that your idea is that OpenMP will automatically run your application in a distributed way across the cluster.
OpenMP is not made for such a task; it assumes that you run your code in a shared-memory setting. For a distributed setting (processors connected only through a network link) there are other tools, namely MPI. But such a setting is a bit more complicated to set up than the #pragma annotations you are used to with OpenMP.
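To make the distinction concrete, here is a rough sketch of how the two models are typically built and launched (assuming gcc and an MPI implementation are installed; prog.c and prog_mpi.c are placeholder file names):
# OpenMP: one process with several threads, all on a single shared-memory node
gcc -fopenmp prog.c -o prog
OMP_NUM_THREADS=8 ./prog

# MPI: several processes, possibly spread over many nodes; the code itself
# must be written against the MPI API (MPI_Init, MPI_Send, ...)
mpicc prog_mpi.c -o prog_mpi
mpirun -np 8 ./prog_mpi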

Hristo is right, but I think you should add
#BSUB -R "span[hosts=1]" # run on a single node
in your .sh file. The ptile option only specifies the number of tasks per node; see e.g.
https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/PlatformLSF
Otherwise, depending on the queue settings of the cluster, which you can check with
bqueues -l
the tasks could be run on every node that is available to you.

If the node has 24 cores, I use
#PBS -l nodes=1:ppn=24
on my system. On the cluster you use, it will probably be something like
#BSUB -l nodes=1:ppn=24

Related

Compiling PostgreSQL: disable "fno-aggressive-loop-optimizations"?

I'm trying to compile and install PostgreSQL on my system. My operating system is Debian 9 with gcc-4.9. Below is my error:
The database cluster will be initialized with locale en_US.UTF-8. The default database encoding has accordingly been set to UTF8.
creating directory p01/pgsql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers/max_fsm_pages ... 24MB/153600
creating configuration files ... ok
creating template1 database in p01/pgsql/data/base/1 ... ok
initializing pg_authid ... FATAL: wrong number of index expressions
STATEMENT: CREATE TRIGGER pg_sync_pg_database AFTER INSERT OR UPDATE OR DELETE ON
pg_database FOR EACH STATEMENT EXECUTE PROCEDURE flatfile_update_trigger();
child process exited with exit code 1
initdb: removing data directory "p01/pgsql/data"
In another post, a user suggests disabling "fno-aggressive-loop-optimizations". But how can I disable this? Is it a parameter to ./configure when compiling the sources? See the suggestion below:
initdb: initializing pg_authid ... FATAL: wrong number of index expressions
I ran into the same problem after compiling postgresql 8.1.4 with gcc 4.9.3.
The problem seems to be the way postgres represents variable-length arrays:
typedef struct
{
    int32  size;      /* these fields must match ArrayType! */
    int    ndim;
    int    flags;
    Oid    elemtype;
    int    dim1;
    int    lbound1;
    int2   values[1]; /* VARIABLE LENGTH ARRAY */
} int2vector;         /* VARIABLE LENGTH STRUCT */
In some cases, for loops that access 'values', GCC assumes they will do at most one iteration. Loops like the one below (extracted from postgres's source code):
ii->ii_NumIndexAttrs = numKeys;
for (i = 0; i < numKeys; i++)
ii->ii_KeyAttrNumbers[i] = indexStruct->indkey.values[i];
might end up being reduced to something like:
ii->ii_NumIndexAttrs = numKeys;
if (numKeys)
ii->ii_KeyAttrNumbers[0] = indexStruct->indkey.values[0];
as deduced by looking at the assembler generated for it:
.L161:
testl %r12d, %r12d
movl %r12d, 4(%rbx)
jle .L162
movzwl 40(%r13), %eax
movw %ax, 8(%rbx)
.L162:
The problem went away after re-compiling postgres with that optimization disabled by using -fno-aggressive-loop-optimizations.
Thank you for the tips... I was able to solve the problem.
Here's the solution in case someone else has this problem.
To compile the PostgreSQL 9.0.1 sources using GCC 4.9, I used the following configure invocation in the PostgreSQL source directory:
./configure -prefix=/opt/postgres9.0 CFLAGS="-Wno-aggressive-loop-optimizations"
-Wno-aggressive-loop-optimizations disables GCC's aggressive loop optimization, avoiding the error reported in the previous message and on the pgsql-general discussion list:
https://www.postgresql.org/message-id/CAOD%3DoQ-kq3Eg5SOvRYOVxDuqibVWC8R0wEivPsMGcyzZY-nfzA%40mail.gmail.com
I hope that disabling GCC's aggressive loop optimization does not cause any errors of any kind in the DBMS.

Running OpenMP on a cluster

I have to run an OpenMP program on a cluster with different configurations (such as different numbers of nodes).
But the problem I am facing is that whenever I try to run the program with, say, 2 nodes, the same program runs 2 times instead of running in parallel.
My program:
gettimeofday(&t0, NULL);
for (k = 0; k < size; k++) {
    #pragma omp parallel for shared(A)
    for (i = k+1; i < size; i++) {
        //parallel code
    }
    #pragma omp barrier
    for (i = k+1; i < size; i++) {
        #pragma omp parallel for
        //parallel code
    }
}
gettimeofday(&t1, NULL);
printf("Did %u calls in %.2g seconds\n", i, t1.tv_sec - t0.tv_sec + 1E-6 * (t1.tv_usec - t0.tv_usec));
It is an LU decomposition program.
When I run it on 2 nodes, I get output something like this:
Did 1000 calls in 5.2 seconds
Did 1000 calls in 5.3 seconds
Did 2000 calls in 41 seconds
Did 2000 calls in 41 seconds
As you can see, the program is run two times for each value (1000, 2000, 3000, ...) instead of running in parallel.
It is my homework program but I am stuck at this point.
I am using a SLURM script to run this program on my college computing cluster. This is the standard script provided by the professor.
#!/bin/sh
##SBATCH --partition=general-compute
#SBATCH --time=60:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
##SBATCH --mem=24000
# Memory per node specification is in MB. It is optional.
# The default limit is 3GB per core.
#SBATCH --job-name="lu_openmpnew2nodes"
#SBATCH --output=luopenmpnew1node2task.out
#SBATCH --mail-user=***#***.edu
#SBATCH --mail-type=ALL
##SBATCH --requeue
#Specifies that the job will be requeued after a node failure.
#The default is that the job will not be requeued.
echo "SLURM_JOBID="$SLURM_JOBID
echo "SLURM_JOB_NODELIST"=$SLURM_JOB_NODELIST
echo "SLURM_NNODES"=$SLURM_NNODES
echo "SLURMTMPDIR="$SLURMTMPDIR
cd $SLURM_SUBMIT_DIR
echo "working directory = "$SLURM_SUBMIT_DIR
module list
ulimit -s unlimited
#
echo "Launch luopenmp with srun"
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
for i in {1000..20000..1000}
do
srun ./openmpNew "$i"
done
#
echo "All Done!"
Be careful, you are confusing MPI and OpenMP here.
OpenMP works with threads, i.e. in shared memory; threads do not communicate across the nodes of a distributed-memory system (there exist some techniques to do so, but they are not performant enough).
What you are doing is starting the same program on each of the two nodes. If you were using MPI, this would be fine. But in your case you start two processes, each with a default number of threads. Those two processes are independent of each other.
I would suggest some further study on the topics of shared-memory parallelization (like OpenMP) and distributed-memory parallelization (like MPI). There are tons of tutorials out there, and I would recommend the book "Introduction to High Performance Computing for Scientists and Engineers" by Hager and Wellein.
To try your program, start on one node, and specify OMP_NUM_THREADS like:
OMP_NUM_THREADS=1 ./openmpNew "$i"
OMP_NUM_THREADS=2 ./openmpNew "$i"
...
Here is an example script for SLURM: link.
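For reference, a single-node OpenMP job under SLURM is typically submitted as one task with several CPUs per task, roughly like this (a sketch: the time limit, core count, job name and executable name are assumptions to adapt to your cluster):
#!/bin/sh
#SBATCH --nodes=1
#SBATCH --ntasks=1            # one process...
#SBATCH --cpus-per-task=8     # ...with 8 cores for its threads
#SBATCH --time=01:00:00
#SBATCH --job-name=lu_openmp

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
for i in {1000..20000..1000}
do
    srun ./openmpNew "$i"     # srun now launches a single copy per value of i
done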

Core Dump is not working

When I run programs that segfault, I get the error message Segmentation fault: 11. For some reason, I'm not getting the (core dumped) message. I tried running the shell command ulimit -c unlimited, but I still get the same error and it doesn't say core dumped. I'm new to GDB, so I tried it with a simple program:
/* coredump.c */
#include <stdio.h>

int main (void) {
    int *point = NULL;
    *point = 0;
    return 0;
}
But when I compile using:
gcc coredump.c -g -o coredump
And run it, it still says segfault: 11
Is it still creating a core dump somewhere I don't know about? I want to be able to use gdb coredump core.
Look at this link:
How to generate a core dump in Linux when a process gets a segmentation fault?
Options include:
ulimit -c unlimited (default = 0: no core files generated)
the directory for the dump must be writable. By default this is the current directory of the process, but that may be changed by setting /proc/sys/kernel/core_pattern.
in some conditions, the kernel value in /proc/sys/fs/suid_dumpable may prevent the core to be generated.
"man core" for other options
find / -name core -print 2> /dev/null to search your filesystem for core files
I presume you're running Linux, and I presume you're executing the .exe in a directory where you have write permissions.
So my top two guesses would be 1) "ulimit -c unlimited" isn't getting set, or is being overridden, or 2) the core files are being generated, but going "somewhere else".
The above suggestions should help. Please post back what you find!
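As a concrete walkthrough on a typical Linux box (assuming the example program from the question was built as ./coredump), the sequence would look something like this:
ulimit -c unlimited                  # allow core files in this shell
cat /proc/sys/kernel/core_pattern    # see where cores go and how they are named
./coredump                           # crashes with SIGSEGV
ls core*                             # a core file should appear here if core_pattern is a plain name
gdb ./coredump core                  # load the executable together with the core file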
If you're running the program that crashes from the shell, then you should follow the guidelines in Apple's Tech Note TN2124, which I found out about in the answer to SO 2207233.
There are a few key points:
You need to set ulimit -c unlimited in bash (same effect, different command in tcsh).
You need to set the permissions on the /cores directory so that you can create files in it. The default permissions are 1775; you need 1777. The 1 indicates the sticky bit is set.
The core dumps are then created in /cores suffixed with a PID (/cores/core.5312, for example).
If you want programs launched graphically to dump core when they crash, then you need to create /etc/launchd.conf if it does not already exist, and add a line limit core unlimited to the file. Again, see the information in the Tech Note for more details.
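For the shell case, a minimal command sequence might look like this (a sketch; the PID suffix shown is just the example from above):
sudo chmod 1777 /cores        # make /cores world-writable, keeping the sticky bit
ulimit -c unlimited           # in bash; use `limit coredumpsize unlimited` in tcsh
./coredump                    # the crashing program from the question
ls /cores                     # e.g. /cores/core.5312
gdb ./coredump /cores/core.5312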
Watch it; core dumps are huge! Consider this not very complicated or big program:
#include <stdio.h>

int main(void)
{
    int *i = 0;
    int j = 0;

    printf("i = %d, j = %d, i / j = %d\n", *i, j, *i / j);
    return 0;
}
The core dump from this is nearly 360 MB.
Using gcc, if you add the flags:
gcc -g -dH
you should be able to generate a core dump
The -g flag produces debugging information for use with gdb, and the -dH flag produces a core dump when an error occurs.
Sometimes core files are not stored in the current directory and may follow a different naming rule.
sysctl -a | grep kern.core
may give hints about where your core files are stored.

How can I run this DTrace script to profile my application?

I was searching online for something to help me do assembly-level line profiling. I searched and found something at http://www.webservertalk.com/message897404.html
There are two parts to this problem: finding all instructions of a particular type (inc, add, shl, etc.) to determine groupings, and then figuring out which are getting executed and summing correctly. The first bit is tricky unless grouping by disassembler is sufficient. For figuring out which instructions are being executed, DTrace is of course your friend here (at least in userland).
The nicest way of doing this would be to instrument only the beginning of each basic block; finding these would be a manual process right now... however, instrumenting each instruction is feasible for small applications. Here's an example:
First, our quite trivial C program under test:
#include <unistd.h>

int main(void)
{
    int i;
    for (i = 0; i < 100; i++)
        getpid();
    return 0;
}
Now, our slightly tricky D script:
#pragma D option quiet

pid$target:a.out::entry
/address[probefunc] == 0/
{
    address[probefunc] = uregs[R_PC];
}

pid$target:a.out::
/address[probefunc] != 0/
{
    @a[probefunc, (uregs[R_PC] - address[probefunc]), uregs[R_PC]] = count();
}

END
{
    printa("%s+%#x:\t%d\t%#d\n", @a);
}
Running this against the compiled test program gives output like:
main+0x1: 1
main+0x3: 1
main+0x6: 1
main+0x9: 1
main+0xe: 1
main+0x11: 1
main+0x14: 1
main+0x17: 1
main+0x1a: 1
main+0x1c: 1
main+0x23: 101
main+0x27: 101
main+0x29: 100
main+0x2e: 100
main+0x31: 100
main+0x33: 100
main+0x35: 1
main+0x36: 1
main+0x37: 1
From the example given, this is exactly what I need. However, I have no idea what it is doing, how to save the DTrace program, or how to execute it against the code that I want results for. So I opened this question hoping some people with a good DTrace background could help me understand the code, save it, run it, and hopefully get the results shown.
If all you want to do is run this particular DTrace script, simply save it to a .d script file and use a command like the following to run it against your compiled executable:
sudo dtrace -s dtracescript.d -c [Path to executable]
where you replace dtracescript.d with your script file name.
This assumes that you have DTrace as part of your system (I'm running Mac OS X, which has had it since Leopard).
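Concretely, the whole sequence might look something like this (file names are placeholders; dtrace needs root privileges, hence sudo):
# save the D script above as dtracescript.d and the small C test program as test.c
gcc -o a.out test.c                       # the pid$target:a.out:: probes match a binary named a.out
sudo dtrace -s dtracescript.d -c ./a.out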
If you're curious about how this works, I wrote a two-part tutorial on using DTrace for MacResearch a while ago, which can be found here and here.

How many files can I have open at once?

On a typical OS, how many files can I have open at once using standard C disk I/O?
I tried reading a constant that is supposed to tell me, but on 32-bit Windows XP it was a measly 20 or so. It seemed to work fine with over 30 open, though I haven't tested it extensively.
I need at most about 400 files open at once, so if most modern OSes support that, it would be awesome. It doesn't need to work on XP, but it should work on Linux, Windows 7, and recent versions of Windows Server.
The alternative is to write my own mini file system which i want to avoid if possible.
On Linux, this depends on the number of available file descriptors.
You can use ulimit -n to set/show the number of available FDs per shell.
See these instructions on how to check (or change) the total number of available FDs on Linux.
This IBM support article suggests that on Windows the number is 512, and you can change it in the registry (as instructed in the article)
Since open() returns the fd as an int, the size of int also bounds the upper limit (irrelevant in practice, as INT_MAX is a lot).
A process can query the limit using the getrlimit system-call.
#include <stdio.h>
#include <sys/resource.h>

struct rlimit rlim;
getrlimit(RLIMIT_NOFILE, &rlim);
printf("Max number of open files: %ld\n", (long) rlim.rlim_cur - 1);
FYI, as root you first have to modify the 'nofile' entries in /etc/security/limits.conf. For example:
* hard nofile 10240
* soft nofile 10240
(changes in limits.conf typically take effect when the user logs in)
Then, users can use the ulimit -n bash command. I've tested this with up to 10,240 files on Fedora 11.
ulimit -n <max_number_of_files>
Lastly, all this is limited by the kernel limit, given by: (I guess you could echo a value into this to go even higher... at your own risk)
cat /proc/sys/fs/file-max
Also, see http://www.karakas-online.de/forum/viewtopic.php?t=9834
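So, for the roughly 400 files mentioned in the question, a quick sanity check before running the program might look like this (myprog is a placeholder for your own executable):
ulimit -Sn           # current soft limit on open files for this shell
ulimit -Hn           # hard limit (the most a non-root user can raise the soft limit to)
ulimit -n 1024       # raise the soft limit well above 400, up to the hard limit
./myprog             # run the program that needs ~400 files open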
