I was investigating whether a few memory functions (memcpy, memset, memmove) in glibc-2.25, which ship in several variants (sse4, ssse3, avx2, avx512), could yield a performance gain for our server programs on Linux (glibc 2.12).
My first attempt was to download a tarball of glibc-2.25 and build/test it following the instructions at https://sourceware.org/glibc/wiki/Testing/Builds. I manually commented out the kernel version check, and everything went well. I then linked a test program against the newly built glibc using the procedure in the section "Compile against glibc build tree" of the glibc wiki, and 'ldd test' shows that it indeed depends on the expected libraries:
# $GLIBC is /data8/home/wentingli/temp/glibc/build
libm.so.6 => /data8/home/wentingli/temp/glibc/build/math/libm.so.6 (0x00007fe42f364000)
libc.so.6 => /data8/home/wentingli/temp/glibc/build/libc.so.6 (0x00007fe42efc4000)
/data8/home/wentingli/temp/glibc/build/elf/ld-linux-x86-64.so.2 => /lib64/ld-linux-x86-64.so.2 (0x00007fe42f787000)
libdl.so.2 => /data8/home/wentingli/temp/glibc/build/dlfcn/libdl.so.2 (0x00007fe42edc0000)
libpthread.so.0 => /data8/home/wentingli/temp/glibc/build/nptl/libpthread.so.0 (0x00007fe42eba2000)
I used gdb to verify which memset/memcpy is actually called, but it always shows that __memset_sse2_unaligned_erms is used, while I was expecting some more advanced version of the function (avx2, avx512) to be in use.
My questions are:
Does glibc-2.25 automatically select the most suitable version of the memory functions according to CPU/OS/memory address? If not, am I missing some configuration during the glibc build, or is something wrong with my setup?
Are there any alternatives for porting the memory functions from a newer glibc?
Any help or suggestion would be appreciated.
On x86, glibc automatically selects an implementation which is most suitable for the CPU of the system, usually based on guidance from Intel. (Whether this is the best choice for your scenario might not be clear, because the performance trade-offs of many of the vector instructions are extremely complex.) This selection is skipped only if you explicitly disable IFUNCs in the toolchain, but __memset_sse2_unaligned_erms isn't the default (non-IFUNC) implementation, so that does not apply here. The ERMS feature is fairly recent, so selecting it is not completely unreasonable.
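If you want to rule out the hardware first, you can check which of the relevant CPU feature flags the kernel reports (a quick sketch; the flag names are the standard /proc/cpuinfo ones, with avx512f standing in for the AVX-512 family):

$ grep -o 'avx2\|avx512f\|erms' /proc/cpuinfo | sort -u

If avx2 is absent here, no __memset_avx2_* variant was ever a candidate, and the SSE2/ERMS selection is exactly what you should expect.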
Building a new glibc is probably the right approach to test these string functions. Theoretically, you could also use LD_PRELOAD to override the glibc-provided functions, but it is a bit cumbersome to build the string functions outside the glibc build system.
If you want to run a program against a patched glibc without installing the latter, you need to use the testrun.sh script in the glibc build directory (or a similar approach).
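For example, with $GLIBC standing for the build tree as above, the invocation would look something like this:

$ $GLIBC/testrun.sh ./test

testrun.sh points the freshly built dynamic linker at the libraries in the build tree, so the program runs entirely against the new glibc without it being installed.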
Related
Cast of characters
big-old-app is linked to an old version of glibc, say glibc-2.12. I cannot do anything to change this.
cute-new-addon.o is linked to a newer version, glibc-2.23. This glibc-2.23 is in a nonstandard path (because I don't have sudo powers).
The story
I want to use cute-new-addon.o inside big-old-app. I would normally write a script for big-old-app to execute, which then calls cute-new-addon.o to perform its tricks. From the command line, it would look like:
$ big-old-app script.txt
However, when I do that, big-old-app complains that cute-new-addon.o cannot find glibc-2.23. That's understandable, because I have not specified any standard paths. What if I do:
$ LD_LIBRARY_PATH=/path/to/mylibs:$LD_LIBRARY_PATH big-old-app script.txt
It segfaults! :(
I think this is because big-old-app now references the newer libc.so.6 from /path/to/mylibs. The implementations are no longer what big-old-app is used to, so it segfaults.
The question
Regarding script.txt, I don't think I have the ability to specify the newer libc.so.6 before invoking cute-new-addon.o. big-old-app and cute-new-addon.o are so tightly intertwined that I have no way of knowing when either of them needs its corresponding glibc.
And yes, cute-new-addon.o's rpath points to /path/to/mylibs, and I can confirm via ldd that it looks for all the libraries it needs in /path/to/mylibs.
Can I use LD_PRELOAD to load two different versions of glibc? And let big-old-app and cute-new-addon.o look for what they need as they please?
LD_PRELOAD cannot be used because the glibc dynamic linker (sometimes called ld.so or the program interpreter; the location on disk is platform-specific) is only compatible with libc.so.6 (and the rest of the libraries) from the same glibc build.
You can use an explicit loader invocation of the other glibc, along with library path settings that cause the loader to load the glibc objects from a separate directory, and not the system directories. The glibc wiki has an example of how to do this.
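A sketch of what such an invocation looks like (the paths are assumptions; --library-path is the loader's own option, so nothing leaks into the environment of child processes):

$ /path/to/mylibs/ld-linux-x86-64.so.2 --library-path /path/to/mylibs /path/to/big-old-app script.txt

Note that the whole process, big-old-app included, then runs against the newer glibc. That is normally fine because glibc is backward compatible: an application built against 2.12 should still run against 2.23. The crashes come from mixing objects of two different glibc builds in one process, which this approach avoids.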
Basically, I want to get a list of libraries a binary might load.
The unreliable way I came up with that seems to work (with possible false-positives):
comm -13 <(ldd elf_file | sed 's|\s*\([^ ]*\)\s.*|\1|'| sort -u) <(strings -a elf_file | egrep '^(|.*/)lib[^:/]*\.so(|\.[0-9]+)$' | sort -u)
This is not reliable. But it gives useful information, even if the binary was stripped.
Is there a reliable way to get this information without possible false-positives?
EDIT: More context.
Firefox is transitioning from using gstreamer to using ffmpeg.
I was wondering what versions of libavcodec.so will work.
libxul.so uses dlopen() for many optional features, and the library names are hard-coded. So the above command helps in this case.
I also have a general interest in package management and binary dependencies.
I know you can get direct dependencies with readelf -d, and dependencies of dependencies with ldd. And I was wondering about optional dependencies, hence the question.
ldd tells you the libraries your binary has been linked against. These are not the ones the program could open with dlopen.
The signature for dlopen is
void *dlopen(const char *filename, int flag);
So you could, still unreliably, run strings on the binary, but this could fail if the library name is not a static string and is instead built or read from somewhere during program execution. That last situation means the answer to your question is "no": not reliably. (The name of the library file could be read from the network, from a Unix socket, or even uncompressed on the fly, for example. Anything is possible, although I wouldn't recommend any of these ideas myself...)
edit: also, as John Bollinger mentioned, the library names could be read from a config file.
edit: you could also try substituting the dlopen library call with one of your own (this is done by the Boehm garbage collector with malloc, for example), so it would open the library but also log its name somewhere. But if the program doesn't open a specific library during a given execution, you still won't know about it.
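A minimal sketch of that interposition trick (the file and program names are made up): compile the wrapper below into a shared object and LD_PRELOAD it, so it shadows the real dlopen and logs every name passed in.

/* dlopen_logger.c -- log every dlopen() the target program performs. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

void *dlopen(const char *filename, int flags)
{
    /* RTLD_NEXT finds the real dlopen, which sits behind this
       wrapper in the symbol lookup order. */
    void *(*real_dlopen)(const char *, int) =
        (void *(*)(const char *, int))dlsym(RTLD_NEXT, "dlopen");
    fprintf(stderr, "dlopen(%s)\n", filename ? filename : "(null)");
    return real_dlopen(filename, flags);
}

$ gcc -shared -fPIC dlopen_logger.c -o dlopen_logger.so -ldl
$ LD_PRELOAD=./dlopen_logger.so some-program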
(I am focusing on Linux; I guess that most of my answer fits every POSIX system; but on Mac OS X, dlopen wants .dylib dynamic library files, not .so shared objects.)
A program could even emit some C code into a temporary file /tmp/foo1234.c, fork a compilation of that /tmp/foo1234.c into a shared library /tmp/foo1234.so with some command like gcc -O -shared -fPIC /tmp/foo1234.c -o /tmp/foo1234.so (generated and executed at run time by your program), perhaps remove the /tmp/foo1234.c file since it is not needed any more, and dlopen that /tmp/foo1234.so (and perhaps even remove /tmp/foo1234.so after the dlopen), all of that in the same process. My GCC MELT plugin for gcc does exactly this, and so does Bigloo; the GCCJIT library does something close.
So in general, your quest is impossible and does not even make sense.
Is there a reliable way to get this information without possible false-positives?
No, there is no reliable way to get such information without false positives (you could prove the problem equivalent to the halting problem, or to some other undecidable problem; see also Rice's theorem).
In practice, most dlopen calls happen on plugins provided by some configuration. The plugins might not be named exactly as such in a configuration file (e.g., some Foo program might have a convention that a plugin named bar in some foo.conf configuration file is provided by foo-bar.so).
However, you might find some heuristic approximation. Most programs doing some dlopen have some plugin convention requesting some particular symbol names in the plugin. You could search for shared objects defining these names. Of course you'll get false positives.
For example, the zsh shell accepts plugins called zsh modules. The example module shows that functions named enables_, boot_, features_, etc. are expected in zsh modules. You could use nm -D to find the *.so files providing these symbols (hence finding the plugins likely to be loadable by zsh).
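Something like the following loop (the module directory is an assumption; adjust it to wherever your distribution installs zsh modules):

$ for m in /usr/lib/zsh/*/zsh/*.so; do nm -D "$m" | grep -q ' T boot_' && echo "$m"; done

This lists every shared object exporting a boot_* symbol, i.e. everything that looks like a loadable zsh module.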
(I am not convinced that such an approach is worthwhile; in fact you should usually know which plugins are useful on your system by which applications)
BTW, you could use strace(1) on the execution of some command to understand the syscalls it is making, hence the plugins it is loading. You might also use ltrace(1), or pmap(1) (on some given process), or simply, for a process 1234, use cat /proc/1234/maps to understand its virtual address space, hence the plugins it has already loaded. See proc(5).
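For example, one way to see which shared objects a given run actually opens (openat covers recent libcs; older ones use plain open):

$ strace -f -e trace=open,openat ./foo 2>&1 | grep '\.so'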
Notice that strace, ltrace, pmap exist on Linux, but many POSIX systems have similar programs.
Also, a program could generate some machine code at runtime and execute it (SBCL does that at every REPL interaction!). Your program could also use some JIT techniques (e.g. with libjit, llvm, asmjit, GCCJIT or with hand-written code...) to do likewise. So plugin-like behavior can happen without dlopen (and you might mimic dlopen with mmap calls and some ELF relocation processing).
Addenda:
If you are installing firefox from its packaged version (e.g. the iceweasel package on Debian), its package is likely to handle the dependencies.
I am trying to run a newly compiled binary on an oldish 32-bit Red Hat distribution. The binary is compiled C (not C++) on a 32-bit CentOS VM running libc 2.12.
RedHat complains about libc version: error while loading shared libraries: requires glibc 2.5 or later dynamic linker
Since my program is rather simplistic, it is most likely not using anything new from libc. Is there a way to reduce the libc version requirement?
An untested possible solution
What is "error while loading shared libraries: requires glibc 2.5 or later dynamic linker"?
The cause of this error is that the dynamic binary (or one of its dependent shared libraries) you want to run has only a .gnu.hash section, but the ld.so on the target machine is too old to recognize .gnu.hash; it only recognizes the old-school .hash section.
This usually happens when the dynamic binary in question is built using a newer version of GCC. The solution is to recompile the code with either the -static compiler command-line option (to create a static binary), or the following option:
-Wl,--hash-style=both
This tells the link editor ld to create both .gnu.hash and .hash sections.
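So the full recipe looks something like this (a sketch; readelf confirms the result):

$ gcc -Wl,--hash-style=both main.c -o main
$ readelf -S main | grep -i hash

If both .hash and .gnu.hash appear in the section listing, an old dynamic linker will be able to load the binary.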
According to the ld documentation here, the old-school .hash section is the default, but the compiler can override it. For example, the GCC (version 4.1.2) on RHEL (Red Hat Enterprise Linux) Server release 5.5 has this line:
$ gcc -dumpspecs
....
*link:
%{!static:--eh-frame-hdr} %{!m32:-m elf_x86_64} %{m32:-m elf_i386} --hash-style=gnu %{shared:-shared} ....
^^^^^^^^^^^^^^^^
...
For more information, see here.
I already had the same problem, trying to compile a little tool (I wrote) for an old machine for which I had no compiler. I compiled it on an up-to-date machine, and the binary required at least GLIBC 2.14 in order to run.
By making a dump of the binary (with xxd), I found this:
....
5f64 736f 5f68 616e 646c 6500 6d65 6d63 _dso_handle.memc
7079 4040 474c 4942 435f 322e 3134 005f py@@GLIBC_2.14._
....
So I replaced the memcpy calls in my code with calls to a home-made memcpy, and the dependency on glibc 2.14 magically disappeared.
I'm sorry I can't really explain why it worked, or why it didn't work before the modification. Hope it helps!
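For what it's worth, the likely explanation is that glibc 2.14 introduced a new versioned symbol, memcpy@GLIBC_2.14 (the behavior for overlapping buffers changed), and the linker binds newly built binaries to that version by default; a private memcpy carries no versioned glibc reference at all. A minimal sketch of such a home-made replacement (the name my_memcpy is made up):

#include <stddef.h>

/* Naive byte-by-byte copy: slower than glibc's memcpy, but it pulls in
   no versioned glibc symbol. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

Another commonly used trick on x86-64 is to keep calling memcpy but pin it to the old symbol version with __asm__(".symver memcpy, memcpy@GLIBC_2.2.5"); either way, the GLIBC_2.14 requirement disappears.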
Ok then, trying to find some balance between elegance and brute force, I downloaded a VM matching the target kernel version, hence fixing library issues.
The whole thing (download + yum install gcc) took less than 30 minutes.
References: Virtual machines, Kernel Version Mapping Table
I've been using GCC 4.6.2 on Mac OS X 10.6. I use the -static-libgcc option when I compile, otherwise my binaries look for libgcc on the system and I'm not sure anything over GCC 4.2 is supported on OS X. This works fine, but why do I even need libgcc? I read up on it and the GNU docs say it contains "arithmetic operations that the target processor cannot perform directly." How do I know what these operations are? And why are they so complex that I need to include this library? Why can't GCC just optimize the code directly instead of having to resort to these library functions? I'm a little confused. Any insight into this would be appreciated!
Yes, you do need it ... probably. If you don't need it, then statically linking it is harmless. You can tell whether you need it by using the -t link trace option (I think).
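The trace idea would look something like this (a sketch; -t makes the linker print every input file it processes, so libgcc shows up if the link touches it):

$ gcc main.c -Wl,-t 2>&1 | grep libgcc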
There are various things that you can't do in one instruction (typically things like 64-bit operations on 32-bit architectures). These things can be done, but if they take a non-trivial number of instructions then it's more space-efficient to have them all in one place.
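The classic example is 64-bit integer division on a 32-bit target. A sketch:

/* div64.c -- compiled with gcc -m32 -S div64.c, the generated assembly
   contains a call to __udivdi3 from libgcc rather than a single
   division instruction. */
unsigned long long div64(unsigned long long a, unsigned long long b)
{
    return a / b;
}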
When you disable optimization using -O0 (which is actually the default anyway), GCC pretty much always uses the libgcc routines.
When you enable speed optimization then GCC may choose to insert the instruction sequence directly into the code (if it knows how). You may find that it ends up using none of the libgcc versions - it will certainly use fewer libgcc calls.
When you enable size optimizations then GCC may prefer the function call, or may not - it depends on what the GCC developers think is the best speed/size trade-off in each case. Note that even when you tell it to optimize for speed, the compiler may judge that some functions are unlikely to be used, and optimize those for size - even more so if you use PGO.
Basically, you can think of it in the same way as memcpy or the math-library functions: the compiler will inline functions it judges to be beneficial, and call library functions otherwise. The compiler can "inline" standard functions and libgcc functions without looking at the library definition, of course - it just "knows" what they do.
Whether to use static or dynamic libgcc is an interesting trade-off. On the one hand, a dynamic (shared) library will use less memory across your whole system, and is more likely to be cached, etc. On the other hand, a static libgcc has a lower call overhead.
The most important thing though is compatibility. Obviously the libgcc library has to be present for your program to run, but it also has to be a compatible version. You're ok on a Linux distro with a stable GCC version, but otherwise static linking is safer.
I hope that answers your questions.
I apologize ahead of time that I don't quite have the proper jargon to describe my problem, and that I have likely not given enough information.
I've been running my MPI code under gcc 4.4 and OpenMPI/MPICH2 for months now with no issue on a variety of platforms. Recently I upgraded a set of servers and my desktop to Ubuntu 11.04 (running gcc 4.5 now) and ran an 8-task job on a node with 8 processors. Typically I see nearly 100% user CPU utilization, and now I see only 60% user CPU and over 30% system CPU. This leads to a remarkable slowdown of my code when run in this fashion.
Investigating further, I simply ran a serial job, and noted that the process reported 150+% CPU utilization. So, my program was multithreading itself over many processors. I verified this explicitly using 'ps -eLF' and looking at the per-processor loads.
This is an incredibly bad and inefficient thing for my MPI code, and I have no idea where it comes from. Nothing has changed other than the move to Ubuntu 11.04 and gcc 4.5. I have verified this against varying OpenMPI versions.
I also moved binaries around between two binary-compatible machines. If I compile on another machine (Ubuntu 10.10/gcc 4.4) and run there, everything is fine. Moving the binary to the Ubuntu 11.04 machine, the same binary begins threading itself.
It is worth noting that I have explicitly disabled all optimizations (-O0), thinking my default (-O3) could include something I didn't understand in 4.5. I get identical behavior regardless of the optimization level.
Please let me know what further information I can provide to determine the source of this problem.
* ADDITIONAL INFO *
Results of ldd in response to request. Simply, it's OpenMPI, libconfig, and scalapack, along with standard gcc stuff:
linux-vdso.so.1 => (0x00007ffffd95d000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2bd206a000)
libconfig.so.8 => /usr/lib/libconfig.so.8 (0x00007f2bd1e60000)
libscalapack-openmpi.so.1 => /usr/lib/libscalapack-openmpi.so.1 (0x00007f2bd151c000)
libmpi.so.0 => /usr/lib/libmpi.so.0 (0x00007f2bd126b000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2bd0ed7000)
libblacsCinit-openmpi.so.1 => /usr/lib/libblacsCinit-openmpi.so.1 (0x00007f2bd0cd4000)
libblacs-openmpi.so.1 => /usr/lib/libblacs-openmpi.so.1 (0x00007f2bd0aa4000)
libblas.so.3gf => /usr/lib/libblas.so.3gf (0x00007f2bd022f000)
liblapack.so.3gf => /usr/lib/liblapack.so.3gf (0x00007f2bcf639000)
libmpi_f77.so.0 => /usr/lib/libmpi_f77.so.0 (0x00007f2bcf406000)
libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007f2bcf122000)
libopen-rte.so.0 => /usr/lib/libopen-rte.so.0 (0x00007f2bceed3000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2bcecb5000)
/lib64/ld-linux-x86-64.so.2 (0x00007f2bd22fc000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2bcea9f000)
libopen-pal.so.0 => /usr/lib/libopen-pal.so.0 (0x00007f2bce847000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2bce643000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f2bce43f000)
All the best.
Is it possible that you are running into this feature? http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html
Basically, certain standard library routines have parallel implementations. However, they are only turned on when the _GLIBCXX_PARALLEL macro is defined.
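One quick, indirect way to check (an assumption on my part: parallel mode is built on OpenMP, so binaries using it link against libgomp):

$ ldd ./your_binary | grep gomp

If libgomp doesn't show up, libstdc++'s parallel mode is probably not what is spawning the extra threads.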
Seeing 60%/40% doesn't tell you anything by itself; perhaps the processing is just accounted for differently. The only interesting figure here would be a comparison of the wall-clock time of the total execution of your code.
Also, I would think that (if so) it is not your binary itself that is parallelized but the MPI libraries. To check that, you would not only have to compile your code on the other machine but also link it statically. Only then can you be sure that you run exactly the same binary code, in all of its aspects, on the other machine.
Then, also, you can't be sure that the MPI library doesn't use C++ under the hood. I remember that one of the MPI libs (I don't remember which) was quite difficult to convince not to compile against the C++ interface, even if you were only doing C.