I apologize ahead of time that I don't quite have the proper jargon to describe my problem, and that I have likely not given enough information.
I've been running my MPI code under gcc 4.4 and OpenMPI/MPICH2 for months now with no issue on a variety of platforms. Recently I upgraded a set of servers and my desktop to Ubuntu 11.04 (now running gcc 4.5) and ran an 8-task job on a node with 8 processors. Typically I see nearly 100% user CPU utilization; now I see only 60% user CPU and over 30% system CPU. This leads to a remarkable slowdown of my code when run in this fashion.
Investigating further, I ran a serial job and noted that the process reported 150+% CPU usage. So my program was multithreading itself over many processors. I verified this explicitly using 'ps -eLF' and looking at the per-processor loads.
This is an incredibly bad and inefficient thing for my MPI code, and I have no idea where it comes from. Nothing has changed other than the move to Ubuntu 11.04 and gcc 4.5. I have verified this across several OpenMPI versions.
I also moved binaries between two binary-compatible machines. If I compile on another machine (Ubuntu 10.10/gcc 4.4) and run there, everything is fine. When I move that binary to the Ubuntu 11.04 machine, the same binary begins threading itself.
It is worth noting that I have explicitly disabled all optimizations (-O0), thinking my default (-O3) could include something I didn't understand in 4.5. I get identical behavior regardless of the optimization level.
Please let me know what further information I can provide to determine the source of this problem.
* ADDITIONAL INFO *
Results of ldd, in response to a request. Simply, it's OpenMPI, libconfig, and ScaLAPACK, along with the standard gcc libraries:
linux-vdso.so.1 => (0x00007ffffd95d000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2bd206a000)
libconfig.so.8 => /usr/lib/libconfig.so.8 (0x00007f2bd1e60000)
libscalapack-openmpi.so.1 => /usr/lib/libscalapack-openmpi.so.1 (0x00007f2bd151c000)
libmpi.so.0 => /usr/lib/libmpi.so.0 (0x00007f2bd126b000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2bd0ed7000)
libblacsCinit-openmpi.so.1 => /usr/lib/libblacsCinit-openmpi.so.1 (0x00007f2bd0cd4000)
libblacs-openmpi.so.1 => /usr/lib/libblacs-openmpi.so.1 (0x00007f2bd0aa4000)
libblas.so.3gf => /usr/lib/libblas.so.3gf (0x00007f2bd022f000)
liblapack.so.3gf => /usr/lib/liblapack.so.3gf (0x00007f2bcf639000)
libmpi_f77.so.0 => /usr/lib/libmpi_f77.so.0 (0x00007f2bcf406000)
libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007f2bcf122000)
libopen-rte.so.0 => /usr/lib/libopen-rte.so.0 (0x00007f2bceed3000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2bcecb5000)
/lib64/ld-linux-x86-64.so.2 (0x00007f2bd22fc000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2bcea9f000)
libopen-pal.so.0 => /usr/lib/libopen-pal.so.0 (0x00007f2bce847000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2bce643000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f2bce43f000)
All the best.
Is it possible that you are running into this feature? http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html
Basically, certain standard library routines have parallel implementations. However, they are only enabled when the _GLIBCXX_PARALLEL macro is defined.
Seeing a 60%/40% split doesn't tell you anything by itself; perhaps the processing is just accounted for differently. The only interesting figure here would be the wall-clock time of the total execution of your code.
Also, I would think that (if so) it is not your binary itself that is parallelized but the MPI libraries. To check that, you would not only have to compile your code on the other machine but also link it statically. Only then can you be sure that you are running exactly the same binary code, in all of its aspects, on the other machine.
Then again, you can't be sure that the MPI library doesn't use C++ under the hood. I remember that it was quite difficult to convince one of the MPI libraries (I don't remember which) not to compile against the C++ interface, even if you were only doing C.
Say I have the OpenCL kernel,
/* Header to make Clang compatible with OpenCL */
/* Test kernel */
__kernel void test(long K, const global float *A, global float *b)
{
    for (long i=0; i<K; i++)
        for (long j=0; j<K; j++)
            b[i] = 1.5f * A[K * i + j];
}
I'm trying to figure out how to compile this to a binary which can be loaded into OpenCL using the clCreateProgramWithBinary command.
I'm on a Mac (Intel GPU), and thus I'm limited to OpenCL 1.2. I've tried a number of different variations on the command,
clang -cc1 -triple spir test.cl -O3 -emit-llvm-bc -o test.bc -cl-std=cl1.2
but the binary always fails when I try to build the program. I'm at my wits' end with this; it's all so confusing and poorly documented.
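For reference, the host-side loading path looks roughly like this (a simplified sketch with placeholder names; it assumes a context and device were already created with clCreateContext/clGetDeviceIDs, and error handling is trimmed):

/* Sketch only: read a binary blob and hand it to clCreateProgramWithBinary. */
#include <stdio.h>
#include <stdlib.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

cl_program load_binary_program(cl_context ctx, cl_device_id dev, const char *path)
{
    FILE *f = fopen(path, "rb");
    fseek(f, 0, SEEK_END);
    size_t len = (size_t)ftell(f);
    fseek(f, 0, SEEK_SET);
    unsigned char *bin = malloc(len);
    fread(bin, 1, len, f);
    fclose(f);

    cl_int binary_status, err;
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &len,
                                                (const unsigned char **)&bin,
                                                &binary_status, &err);
    /* err says whether the binary was accepted at all; the build call
     * below is where it currently fails for me. */
    if (err == CL_SUCCESS)
        err = clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    free(bin);
    return (err == CL_SUCCESS) ? prog : NULL;
}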
The performance of the above test function can, in regular C, be significantly improved by applying the standard LLVM compiler optimization flag -O3. My understanding is that this optimization flag somehow takes advantage of the contiguous memory access pattern of the inner loop to improve performance. I'd be more than happy to listen to anyone who wants to fill in the details on this.
I'm also wondering how I can first convert to SPIR code, and then convert that to a buildable binary. Eventually I would like to find a way to apply the -O3 compiler optimizations to my kernel, even if I have to manually modify the SPIR (as difficult as that will be).
I've also gotten the SPIRV-LLVM-Translator tool working (as far as I can tell), and ran,
./llvm-spirv test.bc -o test.spv
and this binary fails to load at the clCreateProgramWithBinary step; I can't even get to the build step.
Possibly SPIR-V doesn't work with OpenCL 1.2 and I would have to use clCreateProgramWithIL, which unfortunately doesn't exist in OpenCL 1.2. It's difficult to say for sure why it doesn't work.
Please see my previous question here for some more context on this problem.
I don't believe there's any standardised bitcode file format that's available across implementations, at least at the OpenCL 1.x level.
As you're talking specifically about macOS, have you investigated Apple's openclc compiler? This is also what Xcode invokes when you compile a .cl file as part of a target. The compiler is located in /System/Library/Frameworks/OpenCL.framework/Libraries/openclc; it does have comprehensive --help output but that's not a great source for examples on how to use it.
Instead, I recommend you try the OpenCL-in-Xcode tutorial, and inspect the build commands it ends up running:
https://developer.apple.com/library/archive/documentation/Performance/Conceptual/OpenCL_MacProgGuide/XCodeHelloWorld/XCodeHelloWorld.html
You'll find it produces bitcode files (.bc) for 4 "architectures": i386, x86_64, "gpu_64", and "gpu_32". It also auto-generates some C code which loads this code by calling gclBuildProgramBinaryAPPLE().
I don't know if you can untangle it further than that but you certainly can ship bitcode which is GPU-independent using this compiler.
I should point out that OpenCL is deprecated on macOS, so if that's the only platform you're targeting, you really should go for Metal Compute instead. It has much better tooling and will be actively supported for longer. For cross-platform projects it might still make sense to use OpenCL even on macOS, although for shipping kernel binaries instead of source, it's likely you'll have to use platform-specific code for loading those anyway.
I was investigating whether a few memory functions (memcpy, memset, memmove) in glibc-2.25, which come in various versions (sse4, ssse3, avx2, avx512), could give a performance gain for our server programs on Linux (glibc 2.12).
My first attempt was to download a tarball of glibc-2.25 and build/test it following the instructions here: https://sourceware.org/glibc/wiki/Testing/Builds. I manually commented out the kernel version check and everything went well. Then a test program was linked against the newly built glibc following the procedure listed in the section "Compile against glibc build tree" of the glibc wiki, and 'ldd test' showed that it indeed depended on the expected libraries:
# $GLIBC is /data8/home/wentingli/temp/glibc/build
libm.so.6 => /data8/home/wentingli/temp/glibc/build/math/libm.so.6 (0x00007fe42f364000)
libc.so.6 => /data8/home/wentingli/temp/glibc/build/libc.so.6 (0x00007fe42efc4000)
/data8/home/wentingli/temp/glibc/build/elf/ld-linux-x86-64.so.2 => /lib64/ld-linux-x86-64.so.2 (0x00007fe42f787000)
libdl.so.2 => /data8/home/wentingli/temp/glibc/build/dlfcn/libdl.so.2 (0x00007fe42edc0000)
libpthread.so.0 => /data8/home/wentingli/temp/glibc/build/nptl/libpthread.so.0 (0x00007fe42eba2000)
I used gdb to verify which memset/memcpy was actually called, but it always shows that __memset_sse2_unaligned_erms is used, while I was expecting that some more advanced version of the function (avx2, avx512) would be in use.
My questions are:
Does glibc-2.25 select the most suitable version of the memory functions automatically according to CPU/OS/memory address? If not, am I missing some configuration during the glibc build, or is something wrong with my setup?
Are there any other alternatives for porting memory functions from a newer glibc?
Any help or suggestion would be appreciated.
On x86, glibc will automatically select an implementation which is most suitable for the CPU of the system, usually based on guidance from Intel. (Whether this is the best choice for your scenario might not be clear because the performance trade-offs for many of the vector instructions are extremely complex.) This selection will not happen only if you explicitly disable IFUNCs in the toolchain, but __memset_sse2_unaligned_erms isn't the default implementation, so that does not apply here. The ERMS feature is pretty recent, so this choice is not completely unreasonable.
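To illustrate the mechanism (a hedged sketch of IFUNC-based dispatch in general, not glibc's actual resolver code; the names here are made up, and __builtin_cpu_supports needs GCC 4.8 or later):

/* Sketch of IFUNC dispatch: the resolver runs once at symbol-relocation
 * time, and every later call goes straight to the implementation it chose. */
#include <stddef.h>

typedef void *(*memset_fn)(void *, int, size_t);

static void *my_memset_generic(void *s, int c, size_t n)
{
    unsigned char *p = s;
    while (n--)
        *p++ = (unsigned char)c;
    return s;
}

static void *my_memset_avx2ish(void *s, int c, size_t n)
{
    /* Stand-in for a wide-vector implementation. */
    return my_memset_generic(s, c, n);
}

static memset_fn resolve_my_memset(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return my_memset_avx2ish;
    return my_memset_generic;
}

/* Callers see one symbol; the resolver picks the implementation. */
void *my_memset(void *s, int c, size_t n)
    __attribute__((ifunc("resolve_my_memset")));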
Building a new glibc is probably the right approach to test these string functions. Theoretically, you could also use LD_PRELOAD to override the glibc-provided functions, but it is a bit cumbersome to build the string functions outside the glibc build system.
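For completeness, the LD_PRELOAD route would look roughly like this (a sketch only; the file and library names are made up, and interposing functions this low-level has its own pitfalls):

/* mymemcpy.c -- build and use it along these lines:
 *   gcc -O2 -fPIC -shared -o libmymemcpy.so mymemcpy.c -ldl
 *   LD_PRELOAD=./libmymemcpy.so ./your_benchmark
 * The interposed memcpy forwards to the next definition in the lookup
 * order (normally libc's); a custom implementation could go here instead. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

typedef void *(*memcpy_fn)(void *, const void *, size_t);

void *memcpy(void *dst, const void *src, size_t n)
{
    static memcpy_fn real_memcpy;
    if (!real_memcpy)
        real_memcpy = (memcpy_fn)dlsym(RTLD_NEXT, "memcpy");
    return real_memcpy(dst, src, n);
}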
If you want to run a program against a patched glibc without installing the latter, you need to use the testrun.sh script in the glibc build directory (or a similar approach).
I am trying to run a newly compiled binary on an oldish 32-bit Red Hat distribution.
The binary is compiled C (not C++) on a 32-bit CentOS VM running libc 2.12.
Red Hat complains about the libc version: error while loading shared libraries: requires glibc 2.5 or later dynamic linker
Since my program is rather simplistic, it is most likely not using anything new from libc.
Is there a way to reduce the libc version requirement?
An untested possible solution
What is "error while loading shared libraries: requires glibc 2.5 or later dynamic linker"?
The cause of this error is that the dynamic binary (or one of its dependent shared libraries) you want to run only has a .gnu.hash section, but the ld.so on the target machine is too old to recognize .gnu.hash; it only recognizes the old-school .hash section.
This usually happens when the dynamic binary in question is built using a newer version of GCC. The solution is to recompile the code with either the -static compiler command-line option (to create a static binary), or the following option:
-Wl,--hash-style=both
This tells the link editor ld to create both .gnu.hash and .hash sections.
According to the ld documentation here, the old-school .hash section is the default, but the compiler can override it. For example, the GCC (version 4.1.2) on RHEL (Red Hat Enterprise Linux) Server release 5.5 has this line:
$ gcc -dumpspecs
....
*link:
%{!static:--eh-frame-hdr} %{!m32:-m elf_x86_64} %{m32:-m elf_i386} --hash-style=gnu %{shared:-shared} ....
^^^^^^^^^^^^^^^^
...
For more information, see here.
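A quick way to try this out on a throwaway test program and confirm the section layout (sketch; substitute your real sources and flags):

/* hashcheck.c -- trivial test program, illustration only.
 * Build it with the flag from the quoted answer:
 *   gcc hashcheck.c -o hashcheck -Wl,--hash-style=both
 * then check that both section types are present:
 *   readelf -S hashcheck | grep -i hash   (should show .hash and .gnu.hash)
 * Applying the same -Wl,--hash-style=both to your real binary should
 * satisfy the old dynamic linker on the 32-bit Red Hat machine. */
#include <stdio.h>

int main(void)
{
    printf("hash-style test binary\n");
    return 0;
}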
I already had the same problem, trying to compile a little tool (that I wrote) for an old machine for which I had no compiler. I compiled it on an up-to-date machine, and the binary required at least GLIBC 2.14 in order to run.
By making a dump of the binary (with xxd), I found this:
....
5f64 736f 5f68 616e 646c 6500 6d65 6d63 _dso_handle.memc
7079 4040 474c 4942 435f 322e 3134 005f py@@GLIBC_2.14._
....
So I replaced the memcpy calls in my code with calls to a home-made memcpy, and the dependency on glibc 2.14 magically disappeared.
I'm sorry that I can't really explain why it worked, or why it didn't work before the modification.
Hope it helped!
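A rough sketch of the kind of home-made memcpy I mean (naive and byte-by-byte, so only worth using where copy speed doesn't matter; the name is made up):

/* Likely why this trick works: glibc 2.14 introduced the versioned symbol
 * memcpy@GLIBC_2.14, so binaries built against a newer glibc reference that
 * version. A binary that calls its own copy routine instead no longer
 * carries the versioned reference. */
#include <stddef.h>

void *my_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}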
OK then, trying to find some balance between elegance and brute force, I downloaded a VM matching the target kernel version, which fixed the library issues.
The whole thing (download + yum install gcc) took less than 30 minutes.
References: Virtual machines, Kernel Version Mapping Table
I'm running out of good ideas on how to crack this bug. I have 1000 lines of code that crashes every 2 or 3 runs. It is currently a prototype command line application written in C. An issue is that it's proprietary and I cannot give you the source, but I'd be happy to send a debug compiled executable to any brave soul on a Debian Squeeze x86_64 machine.
Here is what I got so far:
When I run it in GDB, it always completes successfully.
When I run it in Valgrind, it always completes successfully.
The issue seems to emanate from a recursive function call that is very basic. In an effort to pinpoint the error in this recursive function, I wrote the same function in a separate application. It always completes successfully.
I built my own gcc 4.7.1 compiler, compiled my code with it and I'm still getting the same behavior.
FTPed my application to another machine to eliminate the risk of HW issues, and I still get the same behavior.
FTPed my source code to another machine to eliminate the risk of a corrupt build environment, and I still get the same behavior.
The application is single-threaded and does no signal handling that might cause race conditions. I memset(,0,) all large objects.
There are no exotic dependencies; the ldd output follows below.
ldd gives me this:
ldd tst
linux-vdso.so.1 => (0x00007fff08bf0000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00007fe8c65cd000)
libm.so.6 => /lib/libm.so.6 (0x00007fe8c634b000)
libc.so.6 => /lib/libc.so.6 (0x00007fe8c5fe8000)
/lib64/ld-linux-x86-64.so.2 (0x00007fe8c67fc000)
Are there any tools out there that could help me?
What would be your next step if you were in my position?
Thanks!
This is what pointed me in the right direction: -Wextra (I already used -Wall).
THANKS!!! This was really driving me crazy.
I suggested in the comments:
to compile with -Wall -Wextra and improve the source code until no warnings are given;
to compile with both -g and -O; this is helpful for inspecting dumped core files with gdb (you may want to set a big enough coredump size limit, e.g. with the ulimit bash builtin);
to show your code to a colleague and explain the issue;
to use ltrace or strace.
Apparently -Wextra was helpful. It would be nice to understand why and how.
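We never saw the exact warning, but here is a made-up example of the kind of bug that -Wall stays silent about while -Wextra flags it (in C, -Wsign-compare is only enabled by -Wextra):

/* Hypothetical illustration: gcc -Wall says nothing, gcc -Wall -Wextra warns
 * about the signed/unsigned comparison in the loop condition. */
#include <stddef.h>
#include <stdio.h>

static void fill(char *buf, size_t len)
{
    int i;
    /* If len == 0, len - 1 wraps around to SIZE_MAX, i is converted to an
     * unsigned type for the comparison, and the loop overruns buf -- the
     * kind of corruption that crashes only on some runs and, for a stack
     * buffer, can go unnoticed by valgrind. */
    for (i = 0; i < len - 1; i++)
        buf[i] = 'x';
}

int main(void)
{
    char buf[16];
    fill(buf, sizeof buf);   /* fine */
    /* fill(buf, 0); */      /* would overrun the buffer */
    printf("%c\n", buf[0]);
    return 0;
}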
BTW, for larger programs, you could even add your own warnings to GCC by extending it with MELT; this may take days and is worthwhile mostly in big projects.
In this case, I think that you have some memory problems (look at the output of valgrind carefully), because GDB and valgrind change the original program by adding some memory-tracking machinery (so your original addresses are changed). You can compile with the -ggdb option, enable core dumps (ulimit -c unlimited), and then try to analyze what's going on. This link may help you:
http://en.wikipedia.org/wiki/Unusual_software_bug
Regards.
My question is a little bit similar to this one, but it is about Tcl extensions.
I am using C on Linux (gcc) and I have a package with three modules A, B, and C. Module A contains functions and also defines (not only declares) global variables. I compile and link module A into a dynamic library (libA.so).
Now, I want B and C to be Tcl extensions. Both use functions and global variables from A, while C also uses functions from B. I have made B and C shared libraries (B.so and C.so), but without using "-Wl,-soname". I made B.so depend on A.so, while C.so has no user dependencies. Although this is strange, both extensions loaded and worked properly. Here is what I have (A=libbiddy.so, B=bddscout.so, C=bddscoutIFIP.so):
meolic@meolic:/usr/lib/bddscout$ ldd *.so
bddscout.so:
linux-gate.so.1 => (0x00177000)
libbiddy.so.1 => /usr/lib/libbiddy.so.1 (0x00eca000)
libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0x00342000)
/lib/ld-linux.so.2 (0x0061f000)
bddscoutIFIP.so:
linux-gate.so.1 => (0x00fc2000)
libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0x00110000)
/lib/ld-linux.so.2 (0x00c75000)
meolic@meolic:/usr/lib/bddscout$ wish
% puts $tcl_patchLevel
8.5.8
% load ./bddscout.so
% load ./bddscoutIFIP.so
% info loaded
{./bddscoutIFIP.so Bddscoutifip} {./bddscout.so Bddscout} {{} Tk}
The problem is that exactly the same package does not work everywhere. On a new computer, extension C.so does not load.
meolic@altair:/usr/lib/bddscout$ ldd *.so
bddscout.so:
linux-gate.so.1 => (0xb76ef000)
libbiddy.so.1 => /usr/lib/libbiddy.so.1 (0xb76c9000)
libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb754d000)
/lib/ld-linux.so.2 (0xb76f0000)
bddscoutIFIP.so:
linux-gate.so.1 => (0xb7780000)
libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb75e8000)
/lib/ld-linux.so.2 (0xb7781000)
meolic@altair:/usr/lib/bddscout$ wish
% puts $tcl_patchLevel
8.5.10
% load ./bddscout.so
% load ./bddscoutIFIP.so
couldn't load file "./bddscoutIFIP.so": ./bddscoutIFIP.so: undefined symbol: biddy_termFalse
The reported undefined symbol is one of the global variables from A. Question 1: is my approach correct, given that it works on some systems? Question 2: why does it not work on the new system?
Tcl's load command uses dlopen() under the covers (on Linux; it's different on other platforms, of course) and it uses it with the RTLD_LOCAL flag, so symbols in the loaded library are not exported to the rest of the application. Because of this, unbound symbols in one dynamically-loaded library will not resolve against another one; this boosts isolation, but forces you to do more work to make things function correctly where you want such a dependency to actually exist.
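You can see the effect outside Tcl with a few lines of dlopen() code (a sketch using the library names from your question; the exact error strings will vary):

/* Roughly mirrors what Tcl's [load] does on Linux. */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* Loads fine; libbiddy.so.1 is pulled in via bddscout.so's NEEDED entry,
     * but with RTLD_LOCAL its symbols stay private to this load. */
    void *a = dlopen("./bddscout.so", RTLD_NOW | RTLD_LOCAL);
    if (!a) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    /* Fails with "undefined symbol: biddy_termFalse": bddscoutIFIP.so has no
     * recorded dependency of its own, and the earlier symbols were never
     * added to the global lookup scope. */
    void *b = dlopen("./bddscoutIFIP.so", RTLD_NOW | RTLD_LOCAL);
    if (!b) fprintf(stderr, "%s\n", dlerror());

    /* With RTLD_GLOBAL on the first dlopen (what Tcl up to 8.5.9 effectively
     * gave you), the second load would succeed even without the dependency. */
    return 0;
}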
Your options are:
If bddscoutIFIP.so depends on libbiddy.so's symbols, tell this to the linker when building the library, and the dynamic linker engine will sort it all out so that the dependency doesn't get loaded multiple times. That is, if a library depends on a symbol in another library, it should explicitly list that library as a dependency.
Arrange for libbiddy.so to export its symbols as a stub table (i.e., a structure of pointers to functions/variables) through Tcl's package API (Tcl_PkgProvide()). Then when bddscoutIFIP.so does Tcl_PkgRequireEx() on the biddy package, it will get a pointer to that stub table and can use the references within it instead of doing direct linking. This is how Tcl's stub mechanism works, and it's awesome and portable and lets you do fairly complex API version management (if necessary). It's a bit more work to set up, though. The Tcler's Wiki goes into quite a lot more depth on this topic.
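A very rough sketch of option 2, with made-up structure and member names (the real Tcl stubs machinery generates these tables for you, but a hand-rolled version shows the idea):

/* Shared header (made-up names): a table of pointers that the providing
 * extension fills in and the consuming extension looks up. */
#include <tcl.h>

typedef struct BiddyStubs {
    void (*initMNG)(void);      /* hypothetical function from libbiddy */
    int  *termFalsePtr;         /* pointer to a global variable from libbiddy */
} BiddyStubs;

/* --- in the providing extension (Bddscout) --- */
extern void Biddy_InitMNG(void);   /* assumed libbiddy exports; types are guesses */
extern int  biddy_termFalse;

static const BiddyStubs biddyStubs = { Biddy_InitMNG, &biddy_termFalse };

int Bddscout_Init(Tcl_Interp *interp)
{
    /* ... create the extension's Tcl commands here ... */
    return Tcl_PkgProvideEx(interp, "bddscout", "1.0", &biddyStubs);
}

/* --- in the consuming extension (Bddscoutifip) --- */
int Bddscoutifip_Init(Tcl_Interp *interp)
{
    const BiddyStubs *stubs = NULL;
    if (Tcl_PkgRequireEx(interp, "bddscout", "1.0", 0,
                         (void *) &stubs) == NULL || stubs == NULL) {
        return TCL_ERROR;
    }
    stubs->initMNG();        /* call through the table; no direct linking */
    *stubs->termFalsePtr = 0;
    return TCL_OK;
}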
If option 1 works for you, go with that; for Linux-specific code that should be just fine as the system dynamic linker isn't desperately dense (unlike the situation on Windows).
[EDIT]: Note that older versions of Tcl (up to 8.5.9) used RTLD_GLOBAL instead. It seems that this change should have been labelled ***POTENTIAL INCOMPATIBILITY*** in the release notes and trailed more widely. Apologies on behalf of the Tcl developers.