Duplicate symbols in Microsoft C library - c

I'm writing a linker for Windows PE format object files, and I've got to the stage where it can link together object files produced by the Microsoft compiler, but when I try to link with libcmt.lib I get a lot of duplicate symbols.
For example, cosl is defined by three different objects in the library. All three refer to definitions in different places, and all three look genuine, e.g. they point to text segments named .text$mn and have storage class IMAGE_SYM_CLASS_EXTERNAL.
Is it the case that these are alternate versions and the linker is supposed to pick one based on some criterion, or am I misunderstanding something about the semantics of the PE library format?

As referenced in the comments, the OP is not processing the COMDAT section properly.
http://download.microsoft.com/download/e/b/a/eba1050f-a31d-436b-9281-92cdfeae4b45/pecoff.doc
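To make the COMDAT point concrete, here is a minimal sketch of how a linker might decide what to do when a second definition of a symbol shows up. The numeric constants are the ones given in the PE/COFF specification; the struct, the result enum and the function name are hypothetical and only serve the illustration. Comparing checksums for EXACT_MATCH is a simplification, and ASSOCIATIVE sections (which follow another section) are not handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values from the PE/COFF specification. */
#define IMAGE_SCN_LNK_COMDAT 0x00001000u

enum {
    IMAGE_COMDAT_SELECT_NODUPLICATES = 1,
    IMAGE_COMDAT_SELECT_ANY          = 2,
    IMAGE_COMDAT_SELECT_SAME_SIZE    = 3,
    IMAGE_COMDAT_SELECT_EXACT_MATCH  = 4,
    IMAGE_COMDAT_SELECT_ASSOCIATIVE  = 5,
    IMAGE_COMDAT_SELECT_LARGEST      = 6
};

/* Hypothetical, simplified view of a section that defines a symbol. */
struct section_def {
    uint32_t characteristics; /* Characteristics from the section header */
    uint8_t  selection;       /* Selection byte from the section's aux symbol record */
    uint32_t size;            /* size of the section's raw data */
    uint32_t checksum;        /* CheckSum from the aux symbol record */
};

/* Hypothetical result codes for this sketch. */
enum dup_action { KEEP_EXISTING, TAKE_INCOMING, DUPLICATE_ERROR };

enum dup_action resolve_duplicate(const struct section_def *existing,
                                  const struct section_def *incoming)
{
    assert(existing != NULL && incoming != NULL);

    /* If either definition is not in a COMDAT section, it is a real clash. */
    if (!(existing->characteristics & IMAGE_SCN_LNK_COMDAT) ||
        !(incoming->characteristics & IMAGE_SCN_LNK_COMDAT))
        return DUPLICATE_ERROR;

    switch (incoming->selection) {
    case IMAGE_COMDAT_SELECT_ANY:
        return KEEP_EXISTING; /* any copy will do, so keep the first */
    case IMAGE_COMDAT_SELECT_SAME_SIZE:
        return existing->size == incoming->size ? KEEP_EXISTING
                                                : DUPLICATE_ERROR;
    case IMAGE_COMDAT_SELECT_EXACT_MATCH:
        /* Simplification: compare size and checksum, not the raw bytes. */
        return (existing->size == incoming->size &&
                existing->checksum == incoming->checksum) ? KEEP_EXISTING
                                                          : DUPLICATE_ERROR;
    case IMAGE_COMDAT_SELECT_LARGEST:
        return incoming->size > existing->size ? TAKE_INCOMING
                                               : KEEP_EXISTING;
    case IMAGE_COMDAT_SELECT_NODUPLICATES:
    default:
        /* ASSOCIATIVE and unknown values are rejected in this sketch. */
        return DUPLICATE_ERROR;
    }
}
```

With something like this in the symbol table's insert path, a second copy of a symbol in a COMDAT section marked SELECT_ANY would be dropped quietly instead of being reported as a duplicate.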

Related

Why aren't unreferenced symbols optimized out by default?

I realize there are ways to remove unreferenced symbols from the final binary by passing flags to the compiler and linker, but why doesn't this happen by default (static linking)?
Because there are some traditional practices that depend on unreferenced variables staying in the binary.
In particular, it has been common to declare a global string containing special sequences that are replaced by the version control system, e.g. something like this:
static char sccsid[] = "@(#)ls.c 8.1 (Berkeley) 6/11/93";
Standard(*) linking semantics for static libraries are that exactly those object files from an archive (static library) which are needed to resolve undefined symbols get pulled into the link, as if they were object files listed on the link command line. So, as long as you factor your libraries into independent translation units (and thus object files) well, unreferenced symbols "are optimized out" already, by never being pulled in to begin with.
If you want finer-grained optimizing-out, you need to leave the object files in a form where this is possible. Traditionally, object files contain a single text section for all code and a single data section for all data, and these are already flattened in a way that individual functions or data objects can't be subsequently removed. Modern tooling optionally supports using a separate section per function or data object, which the linker can then use for fine-grained dropping of unreferenced sections via --gc-sections. Arguably this should be default nowadays, but it does break certain custom linking setups using explicit placement of code or objects into sections without explicit referencing, which is probably the reason why it's still not default.
(*) Here "standard" is outside the scope of the C language standard, and is a matter of how the Unix-derived C language tooling has always worked and been specified (roughly equivalently) in various places like SysV, ELF, etc.
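As a rough illustration of the section-per-function approach, the sketch below shows a source file with one function that is meant to be called and one that is not. The file name, function names and build commands are assumptions for the example; the flags themselves (-ffunction-sections, -fdata-sections, --gc-sections) are the usual GCC/Clang and GNU ld spellings.

```c
/* gc_demo.c (hypothetical file name)
 *
 * One possible build with GCC or Clang and GNU ld:
 *   cc -ffunction-sections -fdata-sections -c gc_demo.c
 *   cc -Wl,--gc-sections main.o gc_demo.o -o demo
 *
 * Without -ffunction-sections both functions share one .text section,
 * so the linker has to keep or drop them together.
 */

/* Assumed to be called from main.o, so its section stays in the output. */
int used_helper(int x)
{
    return x * 2;
}

/* Assumed to be called from nowhere. Because it sits in a section of its
 * own, --gc-sections is free to discard it. */
int unused_helper(int x)
{
    return x + 1000;
}
```

Running nm on the final binary with and without --gc-sections is an easy way to see whether unused_helper survived.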

GNU ld: how do I detect multiply-defined symbols?

I'm aggregating two very similar sets of source code into a single library archive. There are maybe 5 or 6 functions which are defined with identical signatures in the two code sets, but with slightly different implementation. I need to find these functions, so that I can either change their names (if I need both of them), or to remove one of them.
I thought that ld would do the hard work for me, by reporting that the functions were multiply-defined, but it's not doing it. I've currently got a 2-stage link procedure:
1 - an incremental link of the two sets of source files, to produce an archive file. If I already know which functions are multiply defined, I can use nm to confirm that the symbol appears twice in the archive.
2 - a final link of this archive file with the object that calls the library code. 'ld' doesn't complain during this step, and presumably is just linking the first matching object that it finds in the archive, without reporting that a second object could also be used.
Any idea how I can get ld to scan the entire archive, and report the functions which are multiply-defined? Thanks.
Attempt a link of all the component .o files (rather than .a files), and you will get the multiply-defined messages.

What is the difference between a .o file and a .lib file?

What is the difference between a .o file and a .lib file?
Conceptually, a compilation unit (the unit of code in a source file/object file) is either linked entirely or not at all. While some implementations, with significant levels of cooperation between the compiler and linker, are able to remove unused code from object files at link time, it doesn't change the issue that including 2 compilation units with conflicting symbol names in a program is an error.
As a practical example, suppose your library has two functions foo and bar and they're in an object file together. If I want to use bar, but my program already has an external symbol named foo, I'm stuck with an error. Regardless of whether or how the implementation might be able to resolve this problem for me, the code is still incorrect.
On the other hand, if I have a library file containing two separate object files, one with foo and the other with bar, only the one containing bar will get pulled into my program.
When writing libraries, you should avoid including multiple functions in the same object file unless it's essential that they be used together. Doing so will bloat up applications which link your library (statically) and increase the likelihood of symbol conflicts. Personally I prefer erring on the side of separate files when there's a doubt - it's even useful to put foo_create and foo_free in separate files if the latter is nontrivial so that short one-off programs that don't need to call foo_free can avoid pulling in the code for deep freeing (and possibly even avoid pulling in the implementation of free itself).
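The "only the member that is needed gets pulled in" behaviour can be sketched roughly as follows. All type and function names are hypothetical, and the model is deliberately small: a real linker repeats the search, because each member it pulls in can introduce new undefined symbols.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEFS 4

/* Hypothetical model of one object file inside an archive. */
struct member {
    const char *name;
    const char *defines[MAX_DEFS]; /* unused slots are NULL */
    bool pulled;                   /* set once the member joins the link */
};

static bool member_defines(const struct member *m, const char *sym)
{
    for (size_t i = 0; i < MAX_DEFS && m->defines[i] != NULL; i++) {
        if (strcmp(m->defines[i], sym) == 0)
            return true;
    }
    return false;
}

/* Pull in the first member that defines sym. Returns that member, or NULL
 * if no member of the archive defines the symbol. */
struct member *pull_for_symbol(struct member *archive, size_t count,
                               const char *sym)
{
    for (size_t i = 0; i < count; i++) {
        if (member_defines(&archive[i], sym)) {
            archive[i].pulled = true;
            return &archive[i];
        }
    }
    return NULL;
}
```

With foo.o and bar.o as separate members, asking for bar marks only bar.o as pulled; if both functions lived in one member, foo would come along whether it was wanted or not.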
A .LIB file is a collection of .OBJ files concatenated together with an index. There should be no difference in how the linker treats either.
Quoted from here:
What is the difference between .LIB and .OBJ files? (Visual Studio C++)
They are actually quite different, especially with older linkers.
The .o (or .obj) files are object files; they contain the code generated by the compiler. It is still in an intermediate format: for example, most references are still unresolved. Usually there is a one-to-one mapping between the source file and the object file.
The .a (or .lib) files are archives, also known as libraries, and are a set of object files.
All operating systems have tools that allow you to add/remove/list object files to library files.
Another difference, especially with older linkers, is how the files are dealt with when linking them. Some linkers will place the complete object file into the final binary, regardless of what is actually being used, while they will only extract the useful information out of library files.
Nowadays most linkers are smart enough to remove all stuff that is not being used.

Object code, linking time in C language

When compiling, a C compiler produces object code before link time.
I wonder if object code is already in binary form?
If so, what happens next at link time?
Wikipedia says,
In computer science, an object file is an organised collection of named objects, and typically these objects are sequences of computer instructions in a machine code format, which may be directly executed by a computer's CPU. Object files are typically produced by a compiler as a result of processing a source code file. Object files contain compact code, and are often called "binaries".
A linker is typically used to generate an executable or library by amalgamating parts of object files together. Object files for embedded systems typically contain nothing but machine code but generally, object files also contain data for use by the code at runtime: relocation information, stack unwinding information, comments, program symbols (names of variables and functions) for linking and/or debugging purposes, and other debugging information.
Another great site has much more detailed info and a useful diagram, here:
Object files as produced by the C compiler essentially contain binary code with holes in each place where an address should go that is yet unknown (addresses of function from other files -- including libraries -- called, addresses of variables from other files that are accessed in this one, ...).
It also contains a table indexed by symbol names ("x" or "_x" for variable x, "f" or "_f" for function f). For each such symbol, there is a status code ("defined here", "not defined here but used", ...) and the addresses of holes in the binary code that need to be filled with each address when it becomes known.
If you are using Unix (or gcc on Windows), you can print the latter table with the command "nm file.o".
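A small file to try this on might look like the sketch below. The names are made up for the example; the symbol types in the comments are what nm typically prints for an object file built with something like "cc -c file.c", and some platforms (macOS, for instance) add a leading underscore to each name.

```c
#include <stdio.h>

/* Initialized global defined in this file: usually listed with type D. */
int greeting_count = 1;

/* Function defined in this file: usually listed with type T. */
int greet(void)
{
    /* puts is defined in the C library, not here. The object file only
     * records it as undefined (type U) and leaves a hole at the call
     * site for the linker to fill in. */
    return puts("hello");
}
```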
Yes, object code is usually in binary form. Just try opening it in your favorite text editor.
You can learn what linkers do here or here.

Distribute binary library on OSX

I'm planning to release some compiled code that shall be linked by client applications on MacOSX.
The distribution is some kind of code library and a set of header files defining the public interface for the library. The code is internally C++ but its public interface (i.e. what's being shown in the headers) is completely C.
These are my requirements, or at least what I hope I can accomplish:
I want my library to be as agnostic as possible about which version of OSX and GCC the user is running. Having separate libraries for 64 bit and 32 bit is okay though.
I want my library to be loadable from languages that support loading C libraries, such as Python or similar.
I want my library's internal symbols to be isolated from the code it's being linked into. I don't want to have duplicate symbol errors because we happen to name an internal function in the same way. My C++ code is properly namespaced so this may not be as big of an issue though, but some of the libraries I depend on are C and can be an issue (see next point).
I want my library dependencies to be safe. My library depends on some libraries such as libpng, boost and stl and I don't want issues because some users don't necessarily have all of them installed or get problems because they have been compiled with other flags or have different versions than I have.
On Windows I use a DLL with an export library and link all my dependencies statically into the DLL. It fulfills all the criteria above, and if I can get the same result on OSX it would be great; however, I've heard that dynamic libraries tend not to isolate symbols on the Mac in the same way.
Is there some kind of best practice for this on OSX?
A normal OS X .dylib pretty much satisfies your requirements, with the note that you will want to have an exports file that the linker uses to determine exactly which symbols are exported (to prevent leaking your internal symbols).
In order to make your own library dependencies safe, you will probably need to either include those libraries with yours or link them statically into your library.
edit: To answer your follow-up question of how to apply an exports file to a link command, the man page for ld has the following to say:
-exported_symbols_list filename
The specified filename contains a list of global symbol names that will remain as global symbols in the output file. All other global symbols will be treated as if they were marked as __private_extern__ (aka visibility=hidden) and will not be global in the output file. The symbol names listed in filename must be one per line. Leading and trailing white space are not part of the symbol name. Lines starting with # are ignored, as are lines with only white space. Some wildcards (similar to shell file matching) are supported. The * matches zero or more characters. The ? matches one character. [abc] matches one character which must be an 'a', 'b', or 'c'. [a-z] matches any single lower case letter from 'a' to 'z'.
So, if your library had only two functions that you wanted to be public, let's call them foo and bar, and they were C functions (so the symbol names aren't mangled), your exports file (let's call it myLibrary.exports) would contain these two lines:
_foo
_bar
and maybe some comments, etc. When you do the final link step to build the library, you would pass the -exported_symbols_list myLibrary.exports flag to the linker. This has the additional benefit that the link will fail if you don't provide one of the exported symbols; this can catch a lot of "oops, I forgot to include that file in the build" mistakes.
You don't need to use the command-line tools to do all this, of course. In the build settings for a dynamic library in Xcode, you will find Exported Symbols File (undefined by default); set it to the path to your exports file there and it will be passed to the linker.
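The man page excerpt mentions visibility=hidden, which points at a source-level alternative to an exports file: compile with -fvisibility=hidden and mark only the public functions as visible. This is a different technique from the exports list described above, not a replacement for that advice. The sketch assumes GCC or Clang; the file name, macro name and helper function are hypothetical, and foo and bar stand in for the two public functions from the example.

```c
/* mylib.c (hypothetical file name)
 *
 * One possible build on OS X:
 *   cc -dynamiclib -fvisibility=hidden mylib.c -o libmylib.dylib
 */

/* Hypothetical convenience macro around the GCC/Clang attribute. */
#define MYLIB_EXPORT __attribute__((visibility("default")))

/* Internal helper. With -fvisibility=hidden it is not exported from the
 * dylib, so it cannot clash with a client's function of the same name. */
int internal_scale(int x)
{
    return x * 3;
}

/* Public entry points, explicitly marked as exported. */
MYLIB_EXPORT int foo(int x)
{
    return internal_scale(x) + 1;
}

MYLIB_EXPORT int bar(void)
{
    return foo(1);
}
```

Either way, checking the result with "nm -g libmylib.dylib" should show only the symbols you intended to publish.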
The key term you need is 'framework'. You need to create a 'universal' framework that is self-contained. ('Universal' is Apple-ese for 'compile several times and package into one library'.) It's not as straightforward as on Windows in terms of encapsulation, but the necessary linker options are there.

Resources