untangling .h dependencies - c

What do you do when you have a set of .h files that has fallen victim to the classic 'gordian knot' situation, where to #include one .h means you end up including almost the entire lot? Prevention is clearly the best medicine, but what do you do when this has happened before the vendor (!) has shipped the library?
Here's an extension to the question, and this is probably the more pertinent question -- should you even attempt to disentangle the dependencies in the first place?;

I've done this on a C++ code base that was already split into many libraries (which was a good start).
I had to workout (or guess) which library was the most depended upon, which depended upon nothing else in the code base. I then processed each library in turn.
I looked at each module (*.cpp files) in turn and made sure that its own header was #included first and commented out the rest, then I commented out all the #includes in that header file and then re-compiled just that module to let the compiler tell me what was needed. I would un-comment the first header that seemed to be needed, and reviewed that one, recursing as necessary. It was interesting to see how many headers ended up not being needed.
Where only the name is needed (because you have a pointer or reference) use class name; or struct name;, which is called forward declaration and avoid #including the header file.
The compiler is very helpful in telling you what the dependencies are when you comment out #includes (you need to recompile with ALL the compilers you have to maintain portability).
Sometimes I had to move modules between libraries so that no pairs or groups of libraries were mutually dependant.

As you have the opportunity, you should refactor the code to reduce includes that are too large, however that assumes you can achieve some sort of package cohesion. If you disentangle things just to discover that every user of the code has to include all the elements anyway, the end result is the same.
Another option is to use #defines to configure sections on and off. Regardless, for an existing code base the solution is to move toward package cohesion.
Read: http://ivanov.files.wordpress.com/2007/02/sedpackages.pdf and research issues related to package cohesion.

I've untangled that knot a few times, and it generally helps a lot when maintaining a system to reduce the .h dependencies as much as possible. There are decent tools for generating dependency trees ( I was using Klocwork at the time ).
The downside I found was with conditional compilation. Someone might remove a header file because they think we don't need it, but it turns out that we only don't need it because VxWorks has some screwed up headers... on Solaris (or any reasonable Posix system) you do need it.

There is a balance to be struck between an enormous number of finely organized headers and a single header that includes everything. Consider the Standard C library; there are some biggish headers like <stdio.h>, which declares a lot of functions, but they are all related to I/O. There are other headers that are more of a miscellany - notably <stdlib.h>.
The Goddard Space Flight Center guidelines for C are worth hunting down.
The basic rule is that each header should declare the facilities provided by a suitable (usually small) set of source files. The facilities and header should be self-contained. That is, if someone needs the code in header "something.h", then that should be the only header that must be added to the compilation. If there are facilities needed by "something.h" that are not declared in the header, then it must include the relevant headers. That can mean that headers end up including <stddef.h> because one of the functions uses size_t, for example.
As #quamrana points out, you can use forward declarations for structures (not classes, since the question is tagged C and not C++) when appropriate - which primarily means when the interface takes pointers and does not need to know the size of the structures or any of the members.

Related

Is it right to simply include all header files?

Remembering the names of system header files is a pain...
Is there a way to include all existing header files at once?
Why doesn't anyone do that?
Including unneeded header files is a very bad practice. The issue of slowing down compilation might or might not matter; the bigger issue is that it hides dependencies. The set of header files you include in a source file should is the documentation of what functionality the module depends upon, and unlike external documentation or comments, it is automatically checked for completeness by the compiler (failing to include needed header files will result in an error). Ensuring the absence of unwanted dependencies not only improves portability; it also helps you track down unneeded and potentially dangerous interactions, for instance cases where a module which should be purely computational or purely data structure management is accessing the filesystem.
These principles apply whether the headers are standard system headers or headers for modules within your own program or third-party libraries.
Your source code files are preprocessed before the compiler looks at them, and the #include statement is one of the directives that the preprocessor uses. When being preprocessed, #include statements are replaced with the entire contents of the file being included. The result of including all of the system files would be very large source files that the compiler then needs to work through, which will cost a lot of time during compilation.
No one includes all the header files. There are too many, and a few of them are mutually exclusive with other files (like ncurses.h and curses.h).
It really is not that bad when writing a program even from scratch. A few are quite easy to remember: stdio.h for any FILE stuff; ctype.h for any character classification, alloc.h for any use of malloc(), etc.
If you don't remember one:
leave the #include out
compile
examine first few error messages for indication of a missing header file, such as some type not declared, or calling a function with assumed parameter types
figure out which function call is the cause
look at the man page (or whatever documentation your compiler has) for that function
notice the #include shown by the documentation and add it
repeat until all errors fixed
It is quite a bit easier for adding to an existing code base. You could go hundreds or thousands of working hours and never have to add a #include.
No it is a terrible idea and will massively increase your compile times and possible make your exe a lot larger by including massive amounts of unused code.
I know what you're talking about, but I need to double-check the function prototypes for the functions I'm using (for ones I don't use daily, anyway) -- I'll just copy and paste the #includes straight out of the manpage for the associated functions. I'm already looking at the manpage (it's a simple K in vim(1)), so it doesn't feel like an extra burden.
You can create a "master" header, where you put all your includes into. Then in everything else include it! Beware of conflicting definitions and circular references... So.... Master1.h, master2.h, ...
Not advocating it. Just saying.

Good C header style

My C headers usually resemble the following style to avoid multiple inclusion:
#ifndef <FILENAME>_H
#define <FILENAME>_H
// define public data structures / prototypes, macros etc.
#endif /* !<FILENAME>_H */
However, in his Notes on Programming in C, Rob Pike makes the following argument about header files:
There's a little dance involving #ifdef's that can prevent a file being read twice, but it's usually done wrong in practice - the #ifdef's are in the file itself, not the file that includes it. The result is often thousands of needless lines of code passing through the lexical analyzer, which is (in good compilers) the most expensive phase.
On the one hand, Pike is the only programmer I actually admire. On the other hand, putting several #ifdefs in multiple source files instead of putting one #ifdef in a single header file feels needlessly awkward.
What is the best way to handle the problem of multiple inclusion?
In my opinion, use the method that requires less of your time (which likely means putting the #ifdefs in the header files). I don't really mind if the compiler has to work harder if my resulting code is cleaner. If, perhaps, you are working on a multi-million line code base that you constantly have to fully rebuild, maybe the extra savings is worth it. But in most cases, I suspect that the extra cost is not usually noticeable.
Keep doing what you do - It's clear, less bug-prone, and well known by compiler writers, so not as inefficient as it maybe was a decade or two ago.
You could use the non-standard #pragma once - If you search, there's probably at least a bookshelf's worth of include guards vs pragma once discussion, so I'm not going to recommend one over the other.
Pike wrote some more about it in https://talks.golang.org/2012/splash.article:
In 1984, a compilation of ps.c, the source to the Unix ps command, was
observed to #include <sys/stat.h> 37 times by the time all the
preprocessing had been done. Even though the contents are discarded 36
times while doing so, most C implementations would open the file, read
it, and scan it all 37 times. Without great cleverness, in fact, that
behavior is required by the potentially complex macro semantics of the
C preprocessor.
Compilers have become quite clever since: https://gcc.gnu.org/onlinedocs/cppinternals/Guard-Macros.html, so this is less of an issue now.
The construction of a single C++ binary at Google can open and read
hundreds of individual header files tens of thousands of times. In
2007, build engineers at Google instrumented the compilation of a
major Google binary. The file contained about two thousand files that,
if simply concatenated together, totaled 4.2 megabytes. By the time
the #includes had been expanded, over 8 gigabytes were being delivered
to the input of the compiler, a blow-up of 2000 bytes for every C++
source byte.
As another data point, in 2003 Google's build system was moved from a
single Makefile to a per-directory design with better-managed, more
explicit dependencies. A typical binary shrank about 40% in file size,
just from having more accurate dependencies recorded. Even so, the
properties of C++ (or C for that matter) make it impractical to verify
those dependencies automatically, and today we still do not have an
accurate understanding of the dependency requirements of large Google
C++ binaries.
The point about binary sizes is still relevant. Compilers (linkers) are quite conservative regarding stripping unused symbols. How to remove unused C/C++ symbols with GCC and ld?
In Plan 9, header files were forbidden from containing further
#include clauses; all #includes were required to be in the top-level C file. This required some discipline, of course—the programmer was
required to list the necessary dependencies exactly once, in the
correct order—but documentation helped and in practice it worked very
well.
This is a possible solution. Another possiblity is to have a tool that manages the includes for you, for example MakeDeps.
There is also unity builds, sometimes called SCU, single compilation unit builds. There are tools to help manage that, like https://github.com/sakra/cotire
Using a build system that optimizes for the speed of incremental compilation can be advantageous too. I am talking about Google's Bazel and similar. It does not protect you from a change in a header file that is included in a large number of other files, though.
Finally, there is a proposal for C++ modules in the works, great stuff https://groups.google.com/a/isocpp.org/forum/#!forum/modules. See also What exactly are C++ modules?
The way you're currently doing it is the common way. Pike's method cuts a bit on compilation time, but with modern compilers probably not very much (when Pike wrote his notes, compilers weren't optimizer-bound), it clutters modules and its bug-prone.
You could still cut on multi-inclusion by not including headers from headers, but instead documenting them with "include <foodefs.h> before including this header."
I recommend you put them in the source-file itself. No need to complain about some thousand needless parsed lines of code with actual PCs.
Additionally - it is far more work and source if you check every single header in every source-file that includes the header.
And you would have to handle your header-files different from default- and other third-party-headers.
He may have had an argument the time he was writing this. Nowadays decent compilers are clever enough to handle this well.
I agree with your approach - as others have commented, its clearer, self-documenting, and lower maintenance.
My theory on why Rob Pike might have suggested his approach: He's talking about C, not C++.
In C++, if you have a lot of classes and you are declaring each one in its own header file, then you'll have a lot of header files. C doesn't really provide this kind of fine-grained structure (I don't recall seeing many single-struct C header files), and .h/.c file-pairs tend to be larger and contain something like a module or a subsystem. So, fewer header files. In that scenario Rob Pike's approach might work. But I don't see it as suitable for non-trivial C++ programs.

Any good reason to #include source (*.c *.cpp) files?

i've been working for some time with an opensource library ("fast artificial neural network"). I'm using it's source in my static library. When i compile it however, i get hundreds of linker warnings which are probably caused by the fact that the library includes it's *.c files in other *.c files (as i'm only including some headers i need and i did not touch the code of the lib itself).
My question: Is there a good reason why the developers of the library used this approach, which is strongly discouraged? (Or at least i've been told all my life that this is bad and from my own experience i believe it IS bad). Or is it just bad design and there is no gain in this approach?
I'm aware of this related question but it does not answer my question. I'm looking for reasons that might justify this.
A bonus question: Is there a way how to fix this without touching the library code too much? I have a lot of work of my own and don't want to create more ;)
As far as I see (grep '#include .*\.c'), they only do this in doublefann.c, fixedfann.c, and floatfann.c, and each time include the reason:
/* Easy way to allow for build of multiple binaries */
This exact use of the preprocessor for simple copy-pasting is indeed the only valid use of including implementation (*.c) files, and relatively rare. (If you want to include some code for another reason, just give it a different name, like *.h or *.inc.) An alternative is to specify configuration in macros given to the compiler (e.g. -DFANN_DOUBLE, -DFANN_FIXED, or -DFANN_FLOAT), but they didn't use this method. (Each approach has drawbacks, so I'm not saying they're necessarily wrong, I'd have to look at that project in depth to determine that.)
They provide makefiles and MSVS projects which should already not link doublefann.o (from doublefann.c) with either fann.o (from fann.c) or fixedfann.o (from fixedfann.c) and so on, and either their files are screwed up or something similar has gone wrong.
Did you try to create a project from scratch (or use your existing project) and add all the files to it? If you did, what is happening is each implementation file is being compiled independently and the resulting object files contain conflicting definitions. This is the standard way to deal with implementation files and many tools assume it. The only possible solution is to fix the project settings to not link these together. (Okay, you could drastically change their source too, but that's not really a solution.)
While you're at it, if you continue without using their project settings, you can likely skip compiling fann.c, et. al. and possibly just removing those from the project is enough – then they won't be compiled and linked. You'll want to choose exactly one of double-/fixed-/floatfann to use, otherwise you'll get the same link errors. (I haven't looked at their instructions, but would not be surprised to see this summary explained a bit more in-depth there.)
Including C/C++ code leads to all the code being stuck together in one translation unit. With a good compiler, this can lead to a massive speed boost (as stuff can be inlined and function calls optimized away).
If actual code is going to be included like this, though, it should have static in most of its declarations, or it will cause the warnings you're seeing.
If you ever declare a single global variable or function in that .c file, it cannot be included in two places which both compile to the same binary, or the two definitions will collide. If it is included in even one place, it cannot also be compiled on its own while still being linked into the same binary as its user.
If the file is only included in one place, why not just make it a discrete compilation unit (and use its globals via extern declarations)? Why bother having it included at all?
If your C files declare no global variables or functions, they are header files and should be named as such.
Therefore, by exhaustive search, I can say that the only time you would ever potentially want to include C files is if the same C code is used in building multiple different binaries. And even there, you're increasing your compile time for no real gain.
This is assuming that functions which should be inlined are marked inline and that you have a decent compiler and linker.
I don't know of a quick way to fix this.
I don't know that library, but as you describe it, it is either bad practice or your understanding of how to use it is not good enough.
A C project that wants to be included by others should always provide well structured .h files for others and then the compiled library for linking. If it wants to include function definitions in header files it should either mark them as static (old fashioned) or as inline (possible since C99).
I haven't looked at the code, but it's possible that the .c or .cpp files being included actually contain code that works in a header. For example, a template or an inline function. If that is the case, then the warnings would be spurious.
I'm doing this at the moment at home because I'm a relative newcomer to C++ on Linux and don't want to get bogged down in difficulties with the linker. But I wouldn't recommend it for proper work.
(I also once had to include a header.dat into a C++ program, because Rational Rose didn't allow headers to be part of the issued software and we needed that particular source file on the running system (for arcane reasons).)

What is a C header file? [duplicate]

This question already has answers here:
Closed 13 years ago.
Possible Duplicates:
[C] Header per source file.
In C++ why have header files and cpp files?
C++ - What should go into an .h file?
Is the only reason header files exist in C is so a developer can quickly see what functions are available, and what arguments they can take? Or is it something to do with the compiler?
Why has no other language used this method? Is it just me, or does it seem that having 2 sets of function definitions will only lead to more maintenance and more room for errors? Or is knowing about header files just something every C developer must know?
Header files are needed to declare functions and variables that are available. You might not have access to the definitions (=the .c files) at all; C supports binary-only distribution of code in libraries.
The compiler needs the information in the header files to know what functions, structures, etc are available and how to use them.
All languages needs this kind of information, although they retrieve the information in different ways. For example, a Java compiler does this by scanning either the class-file or the java source code to retrieve the information.
The drawback with the Java-way is that the compiler potentially needs to hold a much more of information in its memory to be able to do this. This is no big deal today, but in the seventies, when the C language was created, it was simply not possible to keep that much information in memory.
The main reason headers exist is to share declarations among multiple source files.
Say you have the function float *f(int a, int b) defined in the file a.c and reused in b.c and d.c. To allow the compiler to properly check arguments and return values you either put the function prototype in an header file and include it in the .c source files or you repeat the prototype in each source file.
Same goes for typedef etc.
While you could, in theory, repeat the same declaration in each source file, it would become a real nightmare to properly manage it.
Some language uses the same approach. I remember the TurboPascal units being not very different. You would put use ... at the beginning to signal that you were going to require functions that were defined elsewhere. I can't remember if that was passed into Delphi as well.
Know what is in a library at your disposal.
Split the program into bite-size chunks for the compiler. Compiling a megabyte of C files simultaneously will take more resources than most modern hardware can offer.
Reduce compiler load. Why should it know in screen display procedures about deep database engine? Let it learn only of functions it needs now.
Separate private and public data. This use isn't frequent but you may implement in C what C++ uses private fields for: each .c file includes two .h files, one with declarations of private stuff, the other with whatever others may require from the file. Less chance of a namespace conflict, safer due to hermetization.
Alternate configs. Makefile decides which header to use, and the same code may service two different platforms given two different header files.
probably more.

C project structure - header-per-module vs. one big header

I've worked with a number of C projects during my programming career and the header file structures usually fall into one of these two patterns:
One header file containing all function prototypes
One .h file for each .c file, containing prototypes for the functions defined in that module only.
The advantages of option 2 are obvious to me - it makes it cheaper to share the module between multiple projects and makes dependencies between modules easier to see.
But what are the advantages of option 1? It must have some advantages otherwise it would not be so popular.
This question would apply to C++ as well as C, but I have never seen #1 in a C++ project.
Placement of #defines, structs etc. also varies but for this question I would like to focus on function prototypes.
I think the prime motivation for #1 is ... laziness. People think it's either too hard to manage the dependencies that splitting things into separate files can make more obvious, and/or think it's somehow "overkill" to have separate files for everything.
It can also, of course, often be a case of "historical reasons", where the program or project grew from something small, and no-one took the time to refactor the header files.
Option 1 allows for having all the definitions in one place so that you have to include/search just one file instead of having to include/search many files. This advantage is more obvious if your system is shipped as a library to a third party - they don't care much about your library structure, they just want to be able to use it.
Another reason for using a different .h for every .c is compile time. If there is just one .h (or if there are more of them but you are including them all in every .c file), every time you make a change in the .h file, you will have to recompile every .c file. This, in a large project, can represent a valuable amount of time being lost, which can also break your workflow.
1 is just unnecessary. I can't see a good reason to do it, and plenty to avoid it.
Three rules for following #2 and have no problems:
start EVERY header file with a
#ifndef _HEADER_Namefile
#define _HEADER_Namefile_
end the file with
#endif
That will allow you to include the same header file multiple times on the same module (innadvertely may happen) without causing any fuss.
you can't have definitions on your header files... and that's something everybody thinks he/she knows, about function prototypes, but almost ever ignores for global variables.
If you want a global variable, which by definition should be visible outside it's defining C module, use the extern keyword:
extern unsigned long G_BEER_COUNTER;
which instructs the compiler that the G_BEER_COUNTER symbol is actually an unsigned long (so, works like a declaration), that on some other module will have it's proper definition/initialization. (This also allows the linker to keep the resolved/unresolved symbol table.) The actual definition (same statement without extern) goes in the module .c file.
only on proven absolute necessity do you include other headers within a header file. include statements should only be visible on .c files (the modules). That allows you to better interpret the dependecies, and find/resolve issues.
I would recommend a hybrid approach: making a separate header for each component of the program which could conceivably be used independently, then making a project header that includes all of them. That way, each source file only needs to include one header (no need to go updating all your source files if you refactor components), but you keep a logical organization to your declarations and make it easy to reuse your code.
There is also I believe a 3rd option: each .c has its own .h, but there is also one .h which includes all other .h files. This brings the best of both worlds at the expense of keeping a .h up to date, though that could done automatically.
With this option, internally you use the individual .h files, but a 3rd party can just include the all-encompassing .h file.
When you have a very large project with hundreds/thousands of small header files, dependency checking and compilation can significantly slow down as lots of small files must be opened and read. This issue can be often solved by using precompiled headers.
In C++ you would definitely want one header file per class and use pre-compiled headers as mentioned above.
One header file for an entire project is unworkable unless the project is extremely small - like a school assignment
That depends on how much functionality is in one header/source file. If you need to include 10 files just to, say, sort something, it's bad.
For example, if I want to use STL vectors I just include and I don't care what internals are necessary for vector to be used. GCC's includes 8 other headers -- allocator, algobase, construct, uninitialized, vector and bvector. It would be painful to include all those 8 just to use vector, would you agree?
BUT library internal headers should be as sparse as possible. Compilers are happier if they don't include unnecessary stuff.

Resources