Best Practice in Module (file) Size for C

I have read that it is best to aim to keep functions to no more than roughly a screenful of lines.
Is there a similar guideline for module (file) sizes?
I have read several C programming style guidelines but cannot find any reference to recommended module sizes (only function sizes).
I apologise if this is akin to asking how long a piece of string is, but I would be very interested to see whether there is any agreement among experts on this.

I would recommend using a separate .h and .c file for each struct and its associated functions, and, if possible, not having more than 1000 lines per file.

I have been taught that module size isn't the issue, but rather code readability. That is why the "screen full of lines" rule for functions works best, along with lines no more than around 80-100 characters long, no more than 2 levels of nesting (for loop-if/then-for loop-if/then...), etc. As long as your code is organized, I don't see any real limit to the size of a module, provided the principle of cohesion is practiced when constructing it. That is the real standard: it allows the user of your code to include, as far as possible, only what he or she needs to get the job done and not much else.

If you mean file name length, that is covered by the ANSI C standard; see FILENAME_MAX in stdio.h. Note that most implementations support file name lengths greater than the standard specifies.
Don't make the file itself very large. On average, a file might run from 1000 to 2000 lines, but it depends on how you have written the functions.

Keep your source code manageable. Separate cohesive units into modules, keep each module in a separate '.c' file, and give each '.c' file an accompanying '.h' file. Following this system, and depending on the complexity of your project, you may have '.c' files ranging from a few lines to approximately 1000 lines. Those numbers are reasonable and are easy on your compiler and platform.

Related

Counting the number of functions and data structures in a C codebase

Is there a way to take a C file (or a directory/project) and count the number of functions + data structures? This is similar to counting the LOC but instead is focused on counting the number of "conceptual units" the program handles as a way to measure its complexity.
It sounds like you need a way to survey your source code. Doxygen is an excellent tool for summarizing just about every aspect of a C project (and many other languages). It is open source and easily downloaded, and its list of features is extensive.
In a Linux environment, you'll also want to look at a tool like objdump, which will show you a lot of information about the compiled output.
There are pages that explain some of its complicated output, such as this.
But perhaps one of the simplest is objdump -T.

C theory/general practice related to "splitting" the system into x number of source files

I have a quick question about programming in C. I am writing a simple application, as the title suggests, but I find myself defining rather large functions in separate source files, which makes maintenance and debugging much easier. My question is: is there a standard number of lines in a C source file before you should "split" it up into multiple files, or is it very dependent on the system/functions in question?
Say, for example, I have 20 source files with 1 function in each. The functions are somewhat related but they all do different things (e.g. they all manipulate the same struct in some way). Should you in theory have these 20 files, or 1 larger file with 20 functions, keeping all modification of that struct in the same file?
My feeling is that the more "split" the code is, the better/easier the coding becomes, but then again I'm quite new to C.
Any input will be appreciated.
Cheers,
Chris.
It makes sense to put code related to the same conceptual area together. If you have functions which work on matrices, for example, it would seem sensible to have a file called matrices.c containing the matrix functions. A function called render would obviously not belong there.
Yet if the number of matrix functions were to grow huge, it would start to feel wrong to shove them all into a single file. In such a situation I would look for sub-categories and create separate files for each, e.g. 2d_matrix.c, 3d_matrix.c, etc.
As for the number of functions you place in a file before you re-categorize it, that is up to personal choice and sometimes to the development rules of the team you work for.
The same consideration sometimes applies to the size of a function. One team I have worked for would not allow code which is over two screens high, feeling that such code should be broken up into a number of smaller functions which would make the code more readable.
In short, structure your code in a way that makes sense: keep related code together and be sensible about the sizes of functions and the number of functions in a file (both too few and too many can hurt).
The larger a function gets, the easier it is to accidentally break it.
The more code you shove into one file, the more likely it is that other people will be a little sloppy and shove more, possibly unrelated, code into the same file.
Splitting up a file is not function/system dependent; that is entirely up to the programmer. I have seen 1000-1500 or even more lines of code in a single C file. Keeping twenty functions in the same file makes sense if they are not very different from each other. However, if you split the functions among several files, make sure that you write the Makefile properly when compiling them. The claim that "the more split, the easier coding becomes" is debatable.
I liked alk's answer in the closed duplicate: if you follow an object-oriented style in C, i.e. use structures and operations on them, the files separate quite naturally in the same way as they would in C++. Operations on the same data type, together forming a "poor man's class", go together.
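For illustration, here is a minimal sketch of that "one struct plus its operations per module" layout. The matrix names are just an example echoing the matrices.c suggestion above, not taken from any particular project.

/* matrix.h - public interface for the matrix "poor man's class" */
#ifndef MATRIX_H
#define MATRIX_H

#include <stddef.h>

typedef struct {
    size_t rows;
    size_t cols;
    double *data;       /* rows * cols elements, row-major */
} matrix;

matrix *matrix_create(size_t rows, size_t cols);
void    matrix_destroy(matrix *m);
void    matrix_set(matrix *m, size_t r, size_t c, double value);
double  matrix_get(const matrix *m, size_t r, size_t c);

#endif /* MATRIX_H */

/* matrix.c - the operations declared in matrix.h live together here */
#include <stdlib.h>
#include "matrix.h"

matrix *matrix_create(size_t rows, size_t cols)
{
    matrix *m = malloc(sizeof *m);
    if (!m)
        return NULL;
    m->rows = rows;
    m->cols = cols;
    m->data = calloc(rows * cols, sizeof *m->data);
    if (!m->data) {
        free(m);
        return NULL;
    }
    return m;
}

void matrix_destroy(matrix *m)
{
    if (m) {
        free(m->data);
        free(m);
    }
}

void matrix_set(matrix *m, size_t r, size_t c, double value)
{
    m->data[r * m->cols + c] = value;
}

double matrix_get(const matrix *m, size_t r, size_t c)
{
    return m->data[r * m->cols + c];
}

A render() function, or anything else that does not operate on matrix, would go in its own .c/.h pair.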

Is saving a binary file a standard? Is it limited to only 1 type?

When should a programmer use .bin files? (practical examples).
Is it popular (or accepted) to save different data types in one file?
When iterating over the data in a file (that has several data types), the program must know the exact length of every data type, and I find that limiting.
If you mean general-purpose application data, text files are often preferred because they provide transparency to the user, and might also make it easier to (for instance) move the data to a different application and avoid lock-in.
Binary files are mostly used for performance and compactness; encoding things as text has non-trivial overhead in both of these departments (today, perhaps mostly in size), which is sometimes prohibitive.
Binary files are used whenever compactness or speed of reading/writing are required.
Those two requirements are closely related in the obvious way that reading and writing small files is fast, but there's one other important reason that binary I/O can be fast: when the records have fixed length, that makes random access to records in the file much easier and faster.
As an example, suppose you want to do a binary search within the records of a file (they'd have to be sorted, of course), without loading the entire file to memory (maybe because the file is so large that it doesn't fit in RAM). That can be done efficiently only when you know how to compute the offset of the "midpoint" between two records, without having to parse arbitrarily large parts of a file just to find out where a record starts or ends.
(As noted in the comments, random access can be achieved with text files as well; it's just usually harder to implement and slower.)
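To make the fixed-length-record idea concrete, here is a rough sketch, not taken from the answer above, of a binary search over sorted records performed directly in the file. The record layout is hypothetical.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical fixed-length record; the file is assumed to contain
 * these structs back to back, sorted by id. */
typedef struct {
    uint32_t id;
    char     name[28];
} record_t;

/* Returns 1 and fills *out if a record with the given id is found. */
int find_record(FILE *fp, uint32_t id, record_t *out)
{
    /* Determine the number of records from the file size. */
    fseek(fp, 0, SEEK_END);
    long nrecords = ftell(fp) / (long)sizeof(record_t);
    long lo = 0, hi = nrecords - 1;

    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        /* Fixed record length makes the byte offset of record `mid` trivial. */
        fseek(fp, mid * (long)sizeof(record_t), SEEK_SET);
        if (fread(out, sizeof(record_t), 1, fp) != 1)
            return 0;
        if (out->id == id)
            return 1;
        if (out->id < id)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return 0;
}

The file must be opened in binary mode ("rb"), and only one record is read per probe, so even a file far too large for RAM is searched with a handful of reads (for files beyond the range of long, fseeko/ftello would be needed).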
I think when embedded developers see a ".bin" file, it's generally a flattened version of an ELF or the like, intended for programming as firmware for a processor. For instance, putting the Linux kernel into flash (depending on your bootloader).
As a general practice of whether or not to use binary files, you see it done for many reasons. Text requires parsing, and that can be a great deal of overhead. If it's intended to be usable by the user though, binary is a poor format, and text really shines.
Where binary is best is for performance. You can do things like map the file into memory and take advantage of its structure to speed up access. Sometimes you'll have two binary files, one with data and one with metadata, that can be used to help with searching through gobs of data. For example, Git does this: it defines an index format, a pack format, and an object format that all work together to store the history of your project in a readily accessible but compact way.
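As a sketch of the "map it into memory" idea (POSIX-specific; the file name data.bin and the record_t layout are made up for the example), a file of fixed-size records can then be treated as an ordinary array:

#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
    uint32_t id;
    double   value;
} record_t;        /* the on-disk layout must match this struct, padding included */

int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file read-only; pages are faulted in on demand. */
    const record_t *records = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (records == MAP_FAILED) { perror("mmap"); return 1; }

    size_t count = st.st_size / sizeof(record_t);
    for (size_t i = 0; i < count; i++)
        printf("%" PRIu32 " -> %f\n", records[i].id, records[i].value);

    munmap((void *)records, st.st_size);
    close(fd);
    return 0;
}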

Why do file formats have magic numbers?

For example, Portable Executable has several, including the famous "MZ" at the beginning, as well as the "PE\0\0" at the start of the PE header. The Rar file format has the "Rar!" header at the beginning, and several others have similar "magic values" in the file.
What purpose do such magic values serve?
Because users change the file extension, or other programs steal the file extension, it allows the application to cancel processing of a file in an unknown format instead of trying its best and then failing anyway.
The concept of magic numbers goes back to Unix and pre-dates the use of file extensions.
The original idea of the shell was that all 'executables' would look the same: it didn't matter how the file had been created or what program should be used to evaluate it. The shell would look at the contents of the file and determine how to run it. Microsoft came along and chose a different approach, and the era of file extensions was born. Then, to make things 'nicer' for users, Microsoft chose to 'hide' these extensions, and the era of trojan files, which look like they are of one type but really have a different extension and are processed by a different program, was born.
If two applications store data differently, but are constructed such that a file for one might possibly also be a valid (but meaningless) file for the other, very bad things can happen. A program may think it has successfully loaded the file (unaware that the data is meaningless) and then write back a file which to it would be semantically identical, but which would no longer be meaningfully readable by the application that wrote it (or anything else for that matter).
Using magic numbers doesn't entirely prevent this, but it can help at least somewhat.
BTW, trying to guess about the format of data is often very dangerous. For example, suppose one has a list of what are probably dates in the format nn-nn-nn. If one doesn't know what format the dates are in, there may be enough information to pretty well guess the format (e.g. if one of the records is 12-31-99, then absent information to the contrary, the dates are probably mm-dd-yy) but if all dates are within the first 12 days of a month, the data could easily be misinterpreted. Suppose, though, the data were preceded by something saying "MM-DD-YY". Then the risks of misinterpretation could be reduced.
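As a sketch of how an application can bail out early on the wrong format (the four-byte "MYF1" magic here is invented for the example, not a real format):

#include <stdio.h>
#include <string.h>

/* Returns 1 if the file starts with the expected magic bytes, 0 otherwise. */
int has_expected_magic(const char *path)
{
    static const unsigned char magic[4] = { 'M', 'Y', 'F', '1' };
    unsigned char header[sizeof magic];

    FILE *fp = fopen(path, "rb");
    if (!fp)
        return 0;

    size_t n = fread(header, 1, sizeof header, fp);
    fclose(fp);

    /* Refuse to process the file rather than "trying its best and then failing anyway". */
    return n == sizeof header && memcmp(header, magic, sizeof magic) == 0;
}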
To quickly identify the type of the file, or the positions within it.
Your question should not be "why do file formats have magic numbers", but rather "what are the advantages of file formats having magic numbers"!
Suggestions:
Programs that undelete files by reading disk free space may recognize file types
Your UNIX knows whether an executable file is to be interpreted (shebang) or is a binary
When you lose extensions, programs like file can detect what your files are
Designers of file formats consider it safer when applications can easily verify that they are reading a file in the expected format.
Since the format has a header anyway, it costs little to put a magic number at the start of it.

Is there a widespread C library for reading name/value pairs from a file?

My program reads a text file containing various lines of text as a settings file. Some of the lines can get very long. Currently the buffer size is 4096 chars. It is possible that some lines could exceed this, whether through maliciousness or due to various factors operating within the program.
The current routines were rather tedious to write, and now I want to expand the possible contents of the file, which will require more of this tedious, repetitive code. (This is for a settings-type file consisting of name/value pairs and the occasional section header. Some numerical values need to be read as strings because they use multiple precision.)
The main thing I want is to read a line of arbitrary length without buffer overflow. I've just discovered that getline can do this for me, but is there, for heaven's sake, a library that will just do the whole lot of this tediousness for me?
edit:
I don't wish to be forced to place an = sign between the names and values; a blank space should suffice as a separator.
By widespread, I mean the library should be available in the standard packages of the popular Linux distributions.
I'm aware of libconfig but it seems complete overkill for my requirements.
Look into libini; it sounds about right. It is quite old and not exactly undergoing frantic development, but if it already works for your problem, that should be fine.
A more up-to-date library, with a bunch of other benefits, is glib; it has a key-value-parser API.
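For reference, a minimal sketch of GLib's GKeyFile interface. Note that it expects INI-style key=value lines grouped under [sections], which may not fit the whitespace-separator requirement above; the settings.ini file and the key names here are made up.

#include <glib.h>
#include <stdio.h>

int main(void)
{
    GError *error = NULL;
    GKeyFile *kf = g_key_file_new();

    if (!g_key_file_load_from_file(kf, "settings.ini", G_KEY_FILE_NONE, &error)) {
        fprintf(stderr, "load failed: %s\n", error->message);
        g_error_free(error);
        g_key_file_free(kf);
        return 1;
    }

    /* Assumes settings.ini contains something like:
     *   [network]
     *   hostname=example.org
     */
    gchar *host = g_key_file_get_string(kf, "network", "hostname", &error);
    if (host) {
        printf("hostname = %s\n", host);
        g_free(host);
    } else {
        g_clear_error(&error);
    }

    g_key_file_free(kf);
    return 0;
}

Build against GLib with pkg-config, e.g. cc config.c $(pkg-config --cflags --libs glib-2.0).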
My suggestion is, DIY, since it's quite easy.
Read each line
count chars until your separator and after your separator
allocate buffers
and read name value pairs with sscanf
like:
sscanf(line, "%[^:]: %[^\n]", key, value);
You will be safe since you counted chars before sscanf.
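Along those lines, here is a rough sketch (function and variable names are illustrative only) using POSIX getline for arbitrary-length lines, with whitespace rather than '=' as the separator:

#define _POSIX_C_SOURCE 200809L   /* for getline() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print each name/value pair in the stream; getline grows the buffer as needed. */
static void parse_settings(FILE *fp)
{
    char *line = NULL;
    size_t cap = 0;
    ssize_t len;

    while ((len = getline(&line, &cap, fp)) != -1) {
        line[strcspn(line, "\r\n")] = '\0';        /* strip the newline */

        char *name = line + strspn(line, " \t");   /* skip leading blanks */
        if (*name == '\0' || *name == '#')
            continue;                              /* blank line or comment */

        size_t name_len = strcspn(name, " \t");    /* name ends at first blank */
        if (name[name_len] == '\0')
            continue;                              /* no value on this line */
        name[name_len] = '\0';

        char *value = name + name_len + 1;
        value += strspn(value, " \t");             /* value is the rest of the line */

        printf("name=%s value=%s\n", name, value);
    }
    free(line);
}

getline is in POSIX.1-2008, so it is available in the standard packages of the popular Linux distributions; since it allocates the buffer itself, the 4096-character limit and the overflow worry both go away.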
I contributed an updated fork of libini at CCAN. It also contains a very useful dictionary implementation as well as some simple hashing algorithms. Rusty put it in the repo, so I guess I did a reasonably good job of bringing it up to date and fixing the few minor bugs.
The latest version of the library can be found if you poke through this tree, it contains basic token support as well as basic transaction support (useful for re-reading configuration files and reverting if there's a parsing error). It also contains a much more updated set of unit tests.
I don't actively maintain the fork any more, as the original author of libini became active again, however the module is maintained in CCAN.
