AMD HIP executable with undefined symbols during runtime - linker

I want to use the AMD HIP framework for my self-written GPU kernels. I do that by using a third-party library, which takes responsibility of taking the code and compiling it with HIP (and additional backends if desired). The technical setup looks as follows:
The kernel code is compiled into a static helper library with AMD HIP linking and toolchain enabled (CMake: set_target_properties(${target_name} PROPERTIES LINKER_LANGUAGE HIP))
This helper library is then linked into the core part of our own library, which is a shared library
This core part is then linked into the final shared library which is shipped at the end
Therefore, we have 3 different libraries in the build process that are linked together.
The build process exits without any errors s.t. that during compile- and link-time there are no errors. However, when I now want to use this library, I get the following error during runtime: undefined symbol: __hip_fatbin.
Because the code used to not even link correctly, I added these two flags to CMake which made it build successfully (as suggested by others on GitHub): -fgpu-rdc --hip-link. However, the library still does not run because of this undefined symbol error during execution. Inspecting the created libraries with nm -gD shows U in front of __hip_fatbin which makes me wonder why it is like that. Shouldn't that be somehow defined when linking with the HIP toolchain?
So my question is if anybody experienced the same issue yet when trying to use AMD HIP across multiple libraries that are linked against each other. Might this be an issue with gcc and HIP's clang? Or is there any chance for me to get further details which makes me understand what to do now. Thank you!

Related

Why plugin dynamic library can't find the symbol in the application?

I have an application, which has been compiled several libraries into it through static link.
And this application will load a plugin through dlopen when it runs.
But it seems that the plugin can't resolve the symbol in the application, which I can find them through "nm".
So what can I do? Recompile the libraries into shared mode, and link them to the plugin?
You have to use the gcc flag -rdynamic when linking your application, which exports the symbols of the application for dynamic linkage with shared libraries.
From the gcc documentation:
Pass the flag -export-dynamic to the ELF linker, on targets that support it. This instructs the linker to add all symbols, not only used ones, to the dynamic symbol table. This option is needed for some uses of dlopen or to allow obtaining backtraces from within a program.
This should eliminate your problem.
The usual suggestion to add -rdynamic is too heavyweight in practice as it causes linker to export all functions in executable. This will slow down program startup (due to increased time for relocation processing) and, more importantly, will eventually make the interface between your application and plugins too wide so it would be hard to maintain in future (e.g. you won't be able to remove any function from your application in fear that it may be used by some unknown external plugin). Normally you should strive to expose a minimal and well-defined API for plugin authors.
I thus recommend to provide explicit exports file via -Wl,--dynamic-list when linking (see example usage in Clang sources).

Why specify the target architecture to the linker?

I've been working on using the Meson build system for an embedded project. Since I'm working on an embedded platform, I've written a custom linker script and also an invocation for the linker. I didn't have any problems until I tried to link in newlib to my project, when I started to have link issues. Just before I got it working, the last error was undefined reference to main which I knew was clearly in the project.
Out of happenstance, I tried adding -mcpu=cortex-m4 to my linker invocation (I am using gcc to link, I am told this is quite typical instead of directly calling ld). It worked! Now, my only question is "why"?
Perhaps I am missing something about how the linking process actually works, but considering I am just producing an ELF file, I didn't think it would be important to specify the CPU architecture to the linker. Is this a newlib thing, or has gcc just been doing magic behind the scenes for me that I haven't seen before?
For reference, here's my project (it's not complete)
In general, you should always link via the compiler driver (link forms of the gcc command), not via direct invocation of ld. If you're developing for bare metal on a particular exact target, it's possible to determine the set of linker arguments you need and use ld directly, but there's a lot that the compiler driver takes care of for you, and it's usually better to let it. (If you don't have a single fixed target, there are unlimited combinations of possibilities and no way you can reproduce all present and future ones someone may care about.)
You can still pass whatever options you like to the linker, e.g. custom linker scripts, via -Wl,... option forms.
As for why the specific target architecture ISA level could matter to linking, linking is not a dumb process of just sticking together binary chunks. Linking can involve patching up (relocations) or even generating (thunks for distant jump targets, etc.) code, in which case the linker may need to care what particular ISA level/variant it's targeting.
Such linker options ensure that the appropriate standard library and start-up code is linked when these are defaulted rather then explicitly specified or overridden.
The one ARM toolchain supports a variety of ARM architecture variants and options; they may be big or little-endian, have various instruction sets - ARM, Thumb. Thumb-2, ARM64 etc, and various extensions such a SIMD or DSP units. The linker requires the architecture information to select the correct library to link for both performance and binary compatibility.

Prevent linking of mallocr.o file within libc.a

This is for my company, so I'm leery of being too specific, but I'll try.
I am attempting to add support for some existing ANSI C code to our platform. I am using GCC 4.7.2 as well as the GNU linker. We use part of newlib, but also some other C libraries, specifically libc.a. The end goal of this is to get an EXE or ELF image (this is for a PowerPC architecture micro) to put into the micro's RAM. This is being done on Windows XP. I am simply using a batch file, not a build environment or toolchain.
One of my build errors is a multiple definition problem of malloc/free functions. The cmd window spits out the error that there are definitions of these in both malloc.o and mallocr.o. Both of these are within libc.a. I've been told the "r" in mallocr.o is for reentrancy. I've also been told our platform does not support reentrancy.
I'm trying to resolve this error by preventing the linking of mallocr.o from within libc.a. This is the part where I am lost, I don't know how to do this. Google hasn't turned up anything helpful, and I haven't found a question on this site yet that answers my problem. I don't know if this is even possible.
There is really no specific code snippet to include in this question. Below is the error from the cmd window. I've *'d out company specific things I am not comfortable sharing.
c:\***\platform\2_2_0_r2013-2_x86-32\tools\gcc_4_7_2\ppc\bin\..\powerpc-eabi\lib\libc.a(mallocr.o): In function `free':
mallocr.c:(.text+0x19c): multiple definition of `free'
c:\***\platform\2_2_0_r2013-2_x86-32\tools\gcc_4_7_2\ppc\bin\..\powerpc-eabi\lib\libc.a(malloc.o):malloc.c:(.text+0x28): first defined here
c:\***\platform\2_2_0_r2013-2_x86-32\tools\gcc_4_7_2\ppc\bin\..\powerpc-eabi\lib\libc.a(mallocr.o): In function `malloc':
mallocr.c:(.text+0x468): multiple definition of `malloc'
c:\***\platform\2_2_0_r2013-2_x86-32\tools\gcc_4_7_2\ppc\bin\..\powerpc-eabi\lib\libc.a(malloc.o):malloc.c:(.text+0x0): first defined here

What all are the various linkers connected to libraries in gcc

I once tried to execute a program and it showed certain errors like "undefined reference to sqrt" in linux. Then I started to go through many blogs and stuffs like that and I understood that there are two processes before execution and that is compilation and linking. It would be very helpful if anyone could help me regarding various linker flags connected to their corresponding libraries.
I understood that there are two processes before execution and that is compilation and linking.
Compilation and linking are build time actions. Their result is a binary that can be immediately executed. I.e. binaries installed on your machine already went through that steps.
When a program binary is loaded, it might be necessary to dynamically load and link further libraries. This is performed by the so called dynamic linker, which is part of the operating system (some programs include their own dynamic linker though).
I once tried to execute a program and it showed certain errors like "undefined reference to sqrt"
messages like this usually show up upon program compilation and linking, not execution. In the case of a dynamically linked program the most likely error message related to linking problems would be a message about a required library that can not be located and/or loaded.
Anyway, the sqrt function is part of the standard C math library libm. To link a library with GCC you use the -l flag, which takes the library name, omitting the leading lib… part. So in your case -lm

Linking several dependent libraries into my "bare metal" C application

I am developing a bare metal C applications on an ST ARM-Cortex-M3. I have also developed libraries that are usable across all these applications.
I used to use Keil ARM-MDK, but want to move over to GNU-GCC. I thus downloaded the latest version of GCC and started recompiling the code.
Although similar questions to this one have been answered, it does not solve my problem ans therefore I am posting my question.
I have a problem with the following:
Lib_Flash has a function Read_Flash(). Lib_AppCfg links in Lib_Flash as it uses Read_Flash().
My application (App) links in both Lib_Flash and Lib_AppCfg. App also uses Read_Flash() for some specific FLASH checks.
In Keil MDK-ARM it worked fine.
With GCC, when functions using Lib_AppCfg are built, I get errors stating that Read_Flash() is an "undefined reference".
I am not sure where the problem lies. Is it in the linking of the Lib_Appcfg is built or is the problem when I link App?
Please advise. If you need additional information, please let me know.
The GNU linker by default searches the libraries once in the order listed on the command line. So if a library later in the list has a reference to symbol defined in an earlier library or object file, then it cannot be resolved.
The simple solution is to use library grouping; this causes the linker to repeatedly search a list of libraries until no further synbols can be resolved. If you are invoking the linker (ld) separately, then the linker options are:
--start-group _Flash _AppCfg --end-group
or the alternative form
-( _Flash _AppCfg -)
See the GNU linker manual for details. If driving the linker indirectly through gcc you pass linker options via the -Wl option, something like:
-Wl,-(,_Flash,_AppCfg,-)
I think.
It sounds to me like you have got an ordering problem in your libraries. Some linkers will rescan all the libraries on the command line till all references are resolved (or can't be resolved). Other linkers work sequentially along the link line.
In particular, this means that if library A defines a symbol SYM_A and library B which comes after library A references this symbol, it won't be resolved on the 2nd type of linker, and your link will fail.
To get round this, you can do one or more of the following
Reorder the libraries
Replicate libraries on the link line where
necessary
Refactor your libraries so there aren't mutual
dependencies between them (that is A references symbol SYMB, which
is defined in B, but B references SYMA)

Resources