C - Linkage process misunderstanding

C - Linkage process misunderstanding - c

Assume I have header file with a function declaration:
test.h:
int func(int a);
main.c:
#include "test.h"
int main {
return func(5);
}
test.c (without include to the test.h):
int func(int x) {
return x*x;
}
I understand why both files compile, but I thought since test.c doesn't have an include to the header, the linker won't be able to recognize that this is the implementation, but it did.
So, why did it?
Are there any "rules" when I should do include to header files?

When linkage takes place, header files are long gone by then. The linker works on so-called object files. An object file is compiled from each translation unit, i.e. in our case the C-files. Symbols that are not defined within the given object file will be resolved by the linker, which looks at all the other object files and tries to resolve the symbol.
In our case, test.c is compiled into test.o, and defines a single symbol: func. main.c is compiled into main.o, which defines the main symbol, and refers to an external symbol func. Then test.o and main.o is fed into the linker which (starting from main) will resolve func from test.o.

Header files is only a pre-processor thing. The term you're looking for is translation unit which is a fully pre-processed source file with all headers included. It's this translation unit that the compiler sees and uses as input (actually its a little more complicated than this, but lets keep it simple) to create the object files for the linker to use.
The linker knows nothing of "header files". Instead it examines the object files, which is in a special format that contains all information needed, like tables of exported symbols and other tables of undefined but referenced symbols, and then the linker uses this information from all object files and all libraries to construct the final executable program.
So in the object file generated from the main.c source file, there is a table saying "the symbol func is used but not defined here", and in the object file for the test.c source file there is a table saying "the symbol func is defined here". When the linker looks through the object files, it can match the usage of func in one object file to the definition in the other object file.

If you want to understand what the linker is doing, look at its inputs.
Compile main.o and test.o first. Then examine them with nm or objdump: you'll see that main.o has an undefined symbol func, and test.o has a defined symbol of the same name.
The linker never sees your code, your headers, or anything except those intermediate object files. Everything it needs is in there, and the only thing it matches is the symbol name and type.
Note that in C, there isn't even any information in the symbol about the function's number of arguments, or their type, or the return value. If you change test.c to declare func taking two arguments, the program will still link and run - at least, it will start running but may crash. If it survives, one of the arguments will be uninitialized. This mismatch (between func's declaration and definition) is why it's recommended to include the header in test.c, so the compiler can catch your mistake before the linker does something stupid.

Header files tell a C source file about shared information. These can be:
type definitions, such as structures;
prototypes of functions. The prototype tells what the return type is and what the parameters and their type are. This helps the compiler to check you are using return and parameters of the function correctly. Without a prototype, the compiler will assume the return type is int and the function can have any number and type of parameters;
symbolic constants, created by #defines and macros;
the names and types of global variables.
Including the header file in your compilation units (C source files) helps you to share this information with your compilation units.
The compiler will compile the unit, whether using include files or not, and is left with a number of symbols (variables and functions) that are not in the current compilation unit. In the object file it notes these. Now, when the inker collects all object files to create the executable, it will search in these object files and libraries for the symbols not resolved. If any of these symbols cannot be found, then no executable is created.
So no, the compiler doesn't need header files.

The linker only looks at the name, func. main.orequires it. test.oprovides it. So it works!
But...
Implementation files should always include their own header!
I always include the header file in the .c file to ensure that the declaration and definition have matching signatures.
Let's say you change the definition and forget to change the header. If you haven't included test.h from test.c the compiler won't be able to see that the declaration and definitions differ. But when you run your program you will get undefined behaviour.

After so many answers there is not so much to add, and, even if it could seem a little bit off topic, you may want to know about decorations.
The decorations, or mangling, applied to symbols name are used as a simple way to transfer checks from source to object.
MS in C modules, for __stdcall functions, simply adds an # sign after regular name followed by the total number of bytes used:
int func(char *) _func#4
int func(double) _func#8
For C++ the mechanism is even stronger, they add to the symbol name some characters representing the type of variable returned from a function, the class, the namespace and which parameters it takes (see).
The result is a 'fully qualified' function name. In this way it very difficult that a wrong symbol is associated to a function.

the linker doesn't really care about headers...
you only need headers to tell the compiler that main.c can use func() because WILL be there
the linker just takes all the symbols and puts them into one executable file.
NOTE: this is a simplistic view

Related

weird static function behaviour

i am looking at static functions, i know they have a scope limited to the file they are declared in. For clarification i use code blocks with GCC. If i declare the function in a ..c file and include it in my main.c file, the function can be accessed (but the compiler will complain if a non static function is defined, as i have multiple definitions). However if i have a c file with some static function and another non-static function, then the static function is not accessible and the non-static function is accessible. This to me is a rather weird. I know that the #include directive copies the contents of the to-be-included file in to the file where the include directive is declared. But then why can i access the non-static function without including the .c file in the main.c file?
Any suggestions on where i could read up on this topic? I am thinking it has something to do with linking, but i may be wrong.

When a function is declared without static, it has external linkage.1 This means that, when the program is linked2, the same name in different translation units3 can be made to refer to the same thing.
When you define a function in one source file, say int square(int x) { return x*x; }, compiling that source file produces an object file that contains information saying “The function foo is defined here.” When you declare a function in another source file and use that function, such as int square(int x); … b = square(a);, compiling that source file produces an object file that contains information saying “The function foo is used here.” When the linker or program loader processes these object files, it modifies the place where foo is used to refer to the place where foo is defined.
If you use a function without declaring it in that translation unit, your compiler might provide a default declaration for the function, as if it had been declared to return int with no information about its parameters. This behavior is a legacy of old versions of C. Quite likely your compiler issues a warning such as “implicit declaration of function” for this. You should not ignore this warning. When it appears, find the code causing the warning and fix it—either correct the typing of a misspelled function name or insert a declaration for the function.
When using GCC or Clang, you should use the switch -Werror to elevate warnings to errors, so that the compiler will not compile programs that cause warning messages. While this may at first make it more of a nuisance to get your programs compiled, it causes you to fix errors early and to learn to program correctly.
Footnotes
1 With static, an identifier for a function has internal linkage. The rules for identifiers for objects (which you may think of as variables) are more complicated, and they can also have no linkage.
2 In practice, each compilation produces an object module. Linking combines object modules to make an executable program or, in complicated builds, new object modules.
3 A translation unit is the source file being compiled including all the files it includes with #include statements.

So I found a decent answer to your question in a different stackoverflow question, but before I get to that I want to explain why including .c files is bad practice, and shouldn't be done:
As you pointed out, #include copies the content of the file. This means that you can't include the .c file more than once, since you would be creating >1 instance of the same function (of the same name).
If you try to create a universal .c file, say, math.c, you can't create any other .c files that use any of its functions, since you can only include it once, and thats reserved for your main.c file. This severely limits the usefulness of every function in your .c files
instead, you should only ever include header files (.h). You declare your functions in your .h file, define in your .c file.
Whatever functions you declare in your .h file, you can then use in any other files where said .h file is included, without causing compiler errors
I get that this is confusing, but here's examples from another stackoverflow thread
here's a further better document explaining how header files (.h) work
Why can you access non-static function despite not including the .c file?
this should not be possible generally in c if you're not explicitly including said .c file. If I were to do that I would get an "implicit declaration" function, because the compiler doesn't recognize the functions, as they've not been declared previously.
relevant stack overflow answer
long story short, declaring a function static in a .c file means that its scope is limited only to said .c file, you can't call it anywhere else

C - Scope of C Functions

I apologize if this is a beginner's question, but after working in C for a bit, I finally would like a bit of clarification on exactly what kind of files/functions are available to a function.
I understand we can explicitly include other files with the #include macro like so:
#include bar.c
But what about files that are in the same directory? Let's say we have the following file structure:
src-
|
a.c
b.c
And let's say a.c has the function "foo()" and b.c has "bar()"
Would I be able to call bar() within a.c even without explicitly including b.c in the header file?
And what about in sub-directories such as the following:
src-
|
a.c
|
someFolder-
|
b.c
Would I still be able to access bar() in a.c?
Basically, how exactly is the scope of functions (without including them in headers) defined?

You are confusing scope with linkage.
The term scope descibes if an identifier is visible to the compiler in a given context. The broadest scope known to the C language is file scope which means that the identifier is visible from the point it is declared to the end of the translation unit after preprocessing. You can make any identifier visible/known by declaring it, but making an identifier visible does not mean that refering to the identifier refers to the thing it refers to in another file.
The term linkage describes how two identifiers refer to the same thing. There are three levels of linkage: with no linkage, each identifier with the name refers to a different thing. With internal linkage, each identifier within the same translation unit refers to the same thing. With external linkage, each identifier in the whol program refers to the same thing. For two identifiers in two translation units to refer to the same thing, you need to declare both of them to have external linkage.
This is independent of how the declaration is obtained. There is no difference in writing int foo() in a common header file to writing the same line in both source files. In either case, all identifiers refer to the same foo() as functions have external linkage unless explicitly declared to have internal linkage with the static type specifier.
It is also independent of how your source code is laid out. As long as all translation units are linked into the same binary, you can call all functions with external linkage in each translation unit.

Depends on whether you are linking these into one binary or not.
If linking - then yes, functions can be called with implicit declaration.
Common pattern in this case is making a header(s) file with all declarations, including *.c files is usually not preferred, since #include is just textual inclusion of file contents

#include (a preprocessor directive, not a macro, BTW) has a configurable list of directories it searches.
If you use the #include "something.h" form (as compared to the angle bracket form for including system headers), the first directory in the list is the directory of the including file.
Although #include does a simple textual inclusion so it is technically possible to include any text file, c files are almost never included.
In C, you usually compile c files separately into object files and link them together, and what you do include is header files which consist of declarations, which represent interfaces of the other object files which your current compilation unit otherwise wouldn't see.
Check out http://www.cs.bu.edu/teaching/c/separate-compilation/ or another article on that topic.

The forward slash "/" should work in this case.
If you can include foo.h without problem, this means the directory (say, C:\DIRFOO) is predefined in the "include" directive of the compiler. So, another file (say bar.h) in a sub directory (say, C:\DIRFOO\DIRBAR) can be included via #include <DIRBAR/bar.h>.

What's the benefit for a C source file include its own header file

I understand that if a source file need to reference functions from other file then it needs to include its header file, but I don't understand why the source file include its own header file. Content in header file is simply being copied and pasted into the source file as function declarations in per-processing time. For source file who include its own header file, such "declaration" doesn't seem necessary to me, in fact, project still compile and link no problem after remove the header from it's source file, so what's the reason for source file include its own header?

The main benefit is having the compiler verify consistency of your header and its implementation. You do it because it is convenient, not because it is required. It may definitely be possible to get the project to compile and run correctly without such inclusion, but it complicates maintenance of your project in the long run.
If your file does not include its own header, you can accidentally get in a situation when forward declaration of a function does not match the definition of the function - perhaps because you added or removed a parameter, and forgot to update the header. When this happens, the code relying on the function with mismatch would still compile, but the call would result in undefined behavior. It is much better to have the compiler catch this error, which happens automatically when your source file includes its own header.

Practical example - assume the following files in a project:
/* foo.h */
#ifndef FOO_H
#define FOO_H
double foo( int x );
#endif
/* foo.c */
int foo( int x )
{
...
}
/* main.c */
#include "foo.h"
int main( void )
{
double x = foo( 1 );
...
}
Note that the declaration infoo.h does not match the definition in foo.c; the return types are different. main.c calls the foo function assuming it returns a double, according to the declaration in foo.h.
foo.c and main.c are compiled separately from each other. Since main.c calls foo as declared in foo.h, it compiles successfully. Since foo.c does not include foo.h, the compiler is not aware of the type mismatch between the declaration and definition, so it compiles successfully as well.
When you link the two object files together, the machine code for the function call won't match up with what the machine code for function definition expects. The function call expects a double value to be returned, but the function definition returns an int. This is a problem, especially if the two types aren't the same size. Best case scenario is you get a garbage result.
By including foo.h in foo.c, the compiler can catch this mismatch before you run your program.
And, as is pointed out in an earlier answer, if foo.h defines any types or constants used by foo.c, then you definitely need to include it.

The header file tells people what the source file can do.
So the source file for the header file needs to know its obligations. That is why it is included.

Yours seems a borderline case, but an include file can be viewed as a sort of contract between that source file and any other source files that may require those functions.
By writing the "contract" in a header file, you can ensure that the other source files will know how to invoke those functions, or, rather, you will be sure that the compiler will insert the correct code and check its validity at compile time.
But what if you then (even inadvertently) changed the function prototype in the corresponding source file?
By including in that file the same header as everyone else, you will be warned at compile time should a change inadvertently "break" the contract.
Update (from #tmlen's comment): even if in this case it does not, an include file may also use declarations and pragmas such as #defines, typedef, enum, struct and inline as well as compiler macros, that would make no sense writing more than once (actually that would be dangerous to write in two different places, lest the copies get out of sync with each other with disastrous results). Some of those (e.g. a structure padding pragma) could become bugs difficult to track down.

It is useful because functions can be declared before they are defined.
So it happens that you have the declaration, followed by a call\invocation, followed by the implementation.
You don't have to, but you can.
The header file contains the declarations. You're free to invoke anytime as long as the prototype matches. And as long as the compiler finds an implementation before finishing compilation.

Function definition and extern declaration are different ; but GCC does not even warn and passes the compilation

I was reviewing a particular code snippet in which function is declared as
int fn_xyz()
but while referencing the function in another .c file, it is defined as :
extern void fn_xyz()
while fn_xyz is called, there is no check for the return values ; GCC-4.7.0 never warned on the above mismatch ; is this expected ?

Each source file (technically, each translation unit) is compiled completely independently of the others. So the compiler never knows that you've declared the same symbol in multiple places. At link time, all type information has been removed, so the linker can't complain either.
This is precisely why you should declare functions in a header that all source files then include. That way, type mismatches will trigger a compiler warning/error.

Since the linking stage happens after compilation (and the compiler doesn't know or care where you link to; like linking to a shared libary), it makes sense that such a test would not be expected of a compiler.

Function prototype in header file doesn't match definition, how to catch this?

(I found this question which is similar but not a duplicate:
How to check validity of header file in C programming language )
I have a function implementation, and a non-matching prototype (same name, different types) which is in a header file. The header file is included by a C file that uses the function, but is not included in the file that defines the function.
Here is a minimal test case :
header.h:
void foo(int bar);
File1.c:
#include "header.h"
int main (int argc, char * argv[])
{
int x = 1;
foo(x);
return 0;
}
File 2.c:
#include <stdio.h>
typedef struct {
int x;
int y;
} t_struct;
void foo (t_struct *p_bar)
{
printf("%x %x\n", p_bar->x, p_bar->y);
}
I can compile this with VS 2010 with no errors or warnings, but unsurprisingly it segfaults when I run it.
The compiler is fine with it (this I understand)
The linker did not catch it (this I was slightly surprised by)
The static analysis tool (Coverity) did not catch it (this I was very surprised by).
How can I catch these kinds of errors?
[Edit: I realise if I #include "header.h" in file2.c as well, the compiler will complain. But I have an enormous code base and it is not always possible or appropriate to guarantee that all headers where a function is prototyped are included in the implementation files.]

Have the same header file included in both file1.c and file2.c. This will pretty much prevent a conflicting prototype.
Otherwise, such a mistake cannot be detected by the compiler because the source code of the function is not visible to the compiler when it compiles file1.c. Rather, it can only trust the signature that has been given.
At least theoretically, the linker could be able to detect such a mismatch if additional metadata is stored in the object files, but I am not aware if this is practically possible.

-Werror-implicit-function-declaration, -Wmissing-prototypes or equivalent on one of your supported compilers. then it will either error or complain if the declaration does not precede the definition of a global.
Compiling the programs in some form of strict C99 mode should also generate these messages. GCC, ICC, and Clang all support this feature (not sure about MS's C compiler and its current status, as VS 2005 or 2008 was the latest I've used for C).

You may use the Frama-C static analysis platform available at http://frama-c.com.
On your examples you would get:
$ frama-c 1.c 2.c
[kernel] preprocessing with "gcc -C -E -I. 1.c"
[kernel] preprocessing with "gcc -C -E -I. 2.c"
[kernel] user error: Incompatible declaration for foo:
different type constructors: int vs. t_struct *
First declaration was at header.h:1
Current declaration is at 2.c:8
[kernel] Frama-C aborted: invalid user input.
Hope this helps!

Looks like this is not possible with C compiler because of its way how function names are mapped into symbolic object names (directly, without considering actual signature).
But this is possible with C++ because it uses name mangling that depends on function signature. So in C++ void foo(int) and void foo(t_struct*) will have different names on linkage stage and linker will raise error about it.
Of course, that will not be easy to switch a huge C codebase to C++ in turn. But you can use some relatively simple workaround - e.g. add single .cpp file into your project and include all C files into it (actually generate it with some script).
Taking your example and VS2010 I added TestCpp.cpp to project:
#include "stdafx.h"
namespace xxx
{
#include "File1.c"
#include "File2.c"
}
Result is linker error LNK2019:
TestCpp.obj : error LNK2019: unresolved external symbol "void __cdecl xxx::foo(int)" (?foo#xxx##YAXH#Z) referenced in function "int __cdecl xxx::main(int,char * * const)" (?main#xxx##YAHHQAPAD#Z)
W:\TestProjects\GenericTest\Debug\GenericTest.exe : fatal error LNK1120: 1 unresolved externals
Of course, this will not be so easy for huge codebase, there can be other problems leading to compilation errors that cannot be fixed without changing codebase. You can partially mitigate it by protecting .cpp file contents with conditional #ifdef and use only for periodical checks rather than for regular builds.

Every (non-static) function defined in every foo.c file should have a prototype in the corresponding foo.h file, and foo.c should have #include "foo.h". (main is the only exception.) foo.h should not contain prototypes for any functions not defined in foo.c.
Every function should prototyped exactly once.
You can have .h files with no corresponding .c files if they don't contain any prototypes. The only .c file without a corresponding .h file should be the one containing main.
You already know this, and your problem is that you have a huge code base where this rule has not been followed.
So how do you get from here to there? Here's how I'd probably do it.
Step 1 (requires a single pass over your code base):
For each file foo.c, create a file foo.h if it doesn't already exist. Add "#include "foo.h" near the top of foo.c. If you have a convention for where .h and .c files should live (either in the same directory or in parallel include and src directories, follow it; if not, try to introduce such a convention).
For each function in foo.c, copy its prototype to foo.h if it's not already there. Use copy-and-paste to ensure that everything stays consistent. (Parameter names are optional in prototypes and mandatory in definitions; I suggest keeping the names in both places.)
Do a full build and fix any problems that show up.
This won't catch all your problems. You could still have multiple prototypes for some functions. But you'll have caught any cases where two headers have inconsistent prototypes for the same function and both headers are included in the same translation unit.
Once everything builds cleanly, you should have a system that's at least as correct as what you started with.
Step 2:
For each file foo.h, delete any prototypes for functions that aren't defined in foo.c.
Do a full build and fix any problems that show up. If bar.c calls a function that's defined in foo.c, then bar.c needs a #include "foo.h".
For both of these steps, the "fix any problems that show up" phase is likely to be long and tedious.
If you can't afford to do all this at once, you can probably do a lot of it incrementally. Start with one or a few .c files, clean up their .h files, and remove any extra prototypes declared elsewhere.
Any time you find a case where a call uses an incorrect prototype, try to figure out the circumstances in which that call is executed, and how it causes your application to misbehave. Create a bug report and add a test to your regression test suite (you have one, right?). You can demonstrate to management that the test now passes because of all the work you've done; you really weren't just messing around.
Automated tools that can parse C are likely to be useful. Ira Baxter has some suggestions. ctags may also be useful. Depending on how your code is formatted, you can probably throw together some tools that don't require a full C parser. For example, you might use grep, sed, or perl to extract a list of function definitions from a foo.c file, then manually edit the list to remove false positives.

Its obvious ("I have a huge code base") you cannot do this by hand.
What you need is an automated tool that can read your source files as the compiler sees them, collect all function prototypes and definitions, and verify that all definitions/prototypes match. I doubt you'll find such a tool lying around.
Of course, this match much check the signature, and this requires something like the compiler's front end to compare the signatures.
Consider
typedef int T;
void foo(T x);
in one compilation unit, and
typedef float T;
void foo(T x);
in another. You can't just compare the signature "lines" for equality; you need something that can resolve the types when checking.
GCCXML may be able to help, if you are using a GCC dialect of C; it extracts top-level declarations from source files as XML chunks. I don't know if it will resolve typedefs, though. You obviously have to build (considerable) support to collect the definitions in a central place (a database) and compare them. Comparing XML documents for equivalents is at least reasonably straightforward, and pretty easy if they are formatted in a regular way. This is likely your easiest bet.
If that doesn't work, you need something that has a full C front end that you can customize. GCC is famously available, and famously hard to customize. Clang is available, and might be pressed into service for this, but AFAIK only works with GCC dialects.
Our DMS Software Reengineering Toolkit has C front ends (with full preprocessing capability) for many dialects of C (GCC, MS, GreenHills, ...) and builds symbol tables with complete type information. Using DMS you might be able (depending on the real scale of your application) to simply process all the compilation units, and build just the symbol tables for each compilation unit. Checking that symbol table entries "match" (are compatible according to compiler rules including using equivalent typedefs) is built-into the C front ends; all one needs to do is orchestrate the reading, and calling the match logic for all symbol table entries at global scope across the various compilation units.
Whether you do this with GCC/Clang/DMS, it is a fair amount of work to cobble together a custom tool. So you have decide how critical you need for fewer suprises is, compared to the energy to build such a custom tool.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

C - Linkage process misunderstanding - c

the linker doesn't really care about headers... you only need headers to tell the compiler that main.c can use func() because WILL be there the linker just takes all the symbols and puts them into one executable file. NOTE: this is a simplistic view

Related

weird static function behaviour

C - Scope of C Functions

What's the benefit for a C source file include its own header file

Function definition and extern declaration are different ; but GCC does not even warn and passes the compilation

Function prototype in header file doesn't match definition, how to catch this?

Categories

Resources