How does the compiler differentiate identically-named items - c

In the following example:
int main(void) {
    int a=7;
    {
        int a=8;
    }
}
The generated assembly would be something like this (from Compiler Explorer) without optimizations:
main:
    pushq %rbp
    movq %rsp, %rbp
    movl $7, -4(%rbp) // outer scope: int a=7
    movl $8, -8(%rbp) // inner scope: int a=8
    movl $0, %eax
    popq %rbp
    ret
How does the compiler know where the variable is if there are duplicately-named variables? That is, when in the inner scope, the memory address is at %rbp-8 and when in the outer scope the address is at %rbp-4.

There are many ways to implement the local scoping rule. Here is a simple example:
The compiler can keep a list of nested scopes, each with its own list of symbol definitions.
This list initially has a single element for the global scope.
When it parses a function definition, it adds a new scope element in front of the scope list for the function argument names, and adds each argument name with the corresponding information to the identifier list of this scope element.
For each new block, it adds a new scope element in front of the scope list. A for statement also introduces a new scope for declarations in its first clause.
Upon leaving the scope (at the end of the block), it pops the scope element from the scope list.
When it parses a declaration or a definition, if the corresponding symbol is already in the current scope's list, it is a local redefinition, which is forbidden (except for extern forward declarations). Otherwise the symbol is added to the current scope's list.
When it encounters a symbol in an expression, it looks it up in the current scope's list of symbols, then in each successive enclosing scope in the scope list, until it finds it. If the symbol cannot be found, it is undefined, which is an error according to the latest C Standard. Otherwise the symbol information is used for further parsing and code generation.
The above steps are performed for type and object names; a separate list of symbols is maintained for struct, union and enum tags.
Preprocessing is performed before all of this occurs, in a separate phase of program translation.

The C programming language has some specification, like n1570 (or newer ones). That specification defines in §6.2.1 the scope of an identifier.
So any C compiler should follow that specification.
How a C compiler implements that specification takes a good book to explain. I recommend the Dragon book.
Some simple or complex C compilers are open source. Look inside the source code of TinyCC, nwcc, Clang, or GCC to understand how they implement that specification (they have symbol tables, but details are specific to each compiler).
How does the compiler know where the variable is if there are duplicately-named variables?
It manages symbol tables, and updates them when parsing blocks. Usually, a compiler builds some abstract syntax tree of the compiled source code, and leaves in that tree representing variables refer to some symbol table. The GCC compiler documents its Generic Tree and GIMPLE data structures and provides dump options to output them. You could also compile your foo.c as gcc -S -O -fverbose-asm foo.c and look into the emitted assembler code foo.s.
Finally, your example can be considered poor programming style. Some coding guidelines (like MISRA-C or GNU coding standards) disallow or discourage it. Your code review process should catch such code (in my opinion, your example is quite unreadable code).
My feeling is that single-letter variables should have a very small scope: a dozen lines at most.
I suggest looking (for inspiration) inside the C code of existing free software projects (like GNU bash or GNU make). Care has been taken to choose understandable names.
Take advantage of modern source code editors like GNU emacs or vim. You can configure them to type long identifiers with a few keyboard presses (they have auto-completion, and some input libraries like GNU readline provide that too). Since you (or your colleagues) will spend much more time reading source code than typing it, such an effort (naming your variables and identifiers well) is worth your valuable time.
If you use GCC as your compiler, invoke it as gcc -Wall -Wextra -g to get a lot of warnings and debug information. You could also use static source code analysis tools like Frama-C or the Clang static analyzer.
For real life software projects (for example GTK), you'll have a document specifying coding conventions, and you could write some GCC plugin checking most of them. See also the DECODER project.
For some parts of your software project, you may use C code generators like SWIG or GNU bison. In some cases, you would have your own C code generator. Then be sure to generate long C identifiers to reduce the possibility of name clashes.
Some code obfuscation tools rename most C identifiers. If you ship C source code without comments and with most identifiers generated like _0TwK4TkhEG, the resulting C code can be compiled at your client site and would practically stay unreadable. Technically, you could write a code obfuscator transforming readable C code into cryptic C code.

Related

How to access assembly language symbols without a leading _ from C on 6502 (cc65)

I'm writing some code in 6502 assembly language using cc65.
Because I'm living in 2022 and not 1979 and have access to a development machine that is a million times more powerful than the target platform, I'm writing unit tests for the assembly language code in C.
Obviously the calling conventions for C and assembly language are different, so I have a bunch of wrapper functions that accept C-style arguments and then call the assembly language functions.
But after calling an assembly language function, I want to check the state of various globals that are defined in assembly language, but I can't because C expects all identifiers to start with an underscore '_' and the identifiers in my assembly language modules don't.
I could just export every symbol twice, once with a '_' prefix and once without, but it seems so clunky and I just wonder if there's an easier way? Is there a #pragma or something that I can use to tell C to use the symbol name exactly as-is, without adding an underscore?
I've looked in the cc65 docs and found nothing, but it seems like a pretty common need, and I'm wondering what other people do.
It is likely that the cc65 compiler only supports access to symbols with the ABI-specified decoration, i.e. those beginning with an underscore _.
To access other symbols, they therefore must either be renamed to follow the decoration or a decorated alias must be created.
_foo EQU foo
For functions, it is also worth considering writing wrapper functions. This may make the code easier to debug, as debuggers tend to get confused when two symbols refer to the same address.

How to retrieve the real definition of a type, variable, macro etc. from the C language headers?

Nowadays, C language compiler environments are quite complicated. I often encounter problems on determining the actual definition of a type, variable, function, or macro as defined in some header file as it is activated by the current compiler options.
The included files have conditional definitions, conditional inclusions, etc. depending on the compiler options selecting which language "standard" to use during a specific compilation. So, it is quite difficult to retrieve the actual definition of structure (for example) conditionally defined deep in some header. I need a method to display or pinpoint its actual definition.
For example, take the definition of struct tm which is supposed to be defined in time.h. However, you are not going to find it there in the GNU Project C Compiler.
I can always refer to documents (ISO/IEC 9899 Standard or GCC Online Documentation), but there may be some cases where the definition will change depending on which Standard or non-standard compiler environment I select. So, the question is:
How can I list the real definition of a function prototype, macro, variable, or type as it is being processed by the current activation of the compiler subject to the selected compile-time options?
Some examples:
Find the value of macro EOF in stdio.h.
Find the definition of "type" FILE in stdio.h.
Find the definition of assert macro in assert.h.
Find the definition of struct timespec in time.h.
What is the meaning of __restrict in the prototype definition of fopen in stdio.h?
How can I list the real definition of a function prototype, macro, variable, or type as it is being processed by the current activation of the compiler subject to the selected compile-time options?
From within the C language itself, you can't. The C language doesn't have reflection; it can't inspect itself.
Going to a type definition, navigating and browsing through a source code tree: these are jobs for an IDE that is integrated with the C programming language, not part of the programming language itself. There is a vast number of IDEs available that integrate with C, there are language servers, and there are C code indexers like ctags and GNU global. Configure the indexing tool or IDE with the same options and macros you provide your compiler with, and the tool will help you through the code. There are also build systems integrated with IDEs, so that the compiler invoked by the build system automatically uses the same command line arguments as the indexer (for example with the help of compile_commands.json in the case of cmake).
For example, take the definition of struct tm
For example, install eclipse with the C and C++ plugins. Create a new C/C++ project, create a some.c file there and type in it #include <time.h> followed by struct tm;. Save the file, let the eclipse indexer index the project (should be instantaneous) or click on Project->C/C++ Index->Rebuild. Then put the cursor on the tm string and press F3 -> voilà, on my pc the cursor goes into the /usr/include/bits/types/struct_tm.h file.
But, my question was related to command line compilers like gcc
How to retrieve the real definition of a type, variable, macro etc. from the C language headers?
gcc is a compiler - it does not support such a feature.

How do preprocessor directives work in C?

I am going through the book Let Us C by Yashwant Kanetkar, which states:
When we compile a program, before the source code passes to the compiler, it is examined by the C preprocessor for any macro definition. When it sees the #define directive, it goes through the entire program in search of macro templates; wherever it finds one, it replaces the macro template with the appropriate macro expansion. Only after this procedure has been completed, is the program handled over to the compiler.
My question is: before the program is passed to the compiler, how is the preprocessor program able to read the TOKENS corresponding to the macro templates? Is the preprocessor program also able to divide the program into TOKENS?
That description is confusing (so I won't recommend that book; read K&R's The C Programming Language instead). The preprocessor does not go through the entire program: when it reaches a directive, it has only processed the input before it. Only past preprocessed input matters for the behavior of the preprocessor (in other words, the preprocessor is a single-pass mechanism).
Read the Wikipedia page on the C preprocessor, then the documentation of GNU cpp and other documentation on preprocessors, and the Wikibooks chapter on C Programming/Preprocessor.
In current C compilers (for performance reasons) the preprocessor is no longer a separate program, it is part of the compiler itself. For recent GCC look into libcpp/ (its preprocessor library, internal to the compiler).
If using the GCC compiler, you can get the preprocessed form of your source code file csource.c by running gcc -C -E csource.c > csource.i then looking inside the generated preprocessed form csource.i (e.g. with a pager or an editor).
(I strongly recommend doing that once in a while; you'll learn a lot; and yes, you could be surprised by the amount of code pulled by a usual #include <stdio.h> directive)
I believe your book explains this wrongly. The preprocessor handles every preprocessing directive. When it encounters a #define, it stores the definition of that symbol in some preprocessor symbol table. When, after that #define, it encounters an occurrence of that preprocessor symbol, it does the appropriate substitution.
In book K & R The C Programming Language.
Page No: 88
C provides certain language facilities by means of a preprocessor, which is conceptually a separate first step in compilation.
In the book Compilers: Principles, Techniques, and Tools by Aho, Lam, Sethi and Ullman
Page No. 3
The task of collecting the source program is sometimes entrusted to a separate program, called a preprocessor. The preprocessor may also expand shorthands, called macros, into source language statements. The modified source program is then fed to compiler.
In GCC GNU Documentation
The C preprocessor is a macro processor that is used automatically by the C compiler to transform your program before actual compilation.
And read this too.
So, from these three authoritative sources, one can say that the preprocessor is a separate program run by the compiler. So the statement in Let Us C by Yashwant P Kanetkar, that the preprocessor is a program that runs before the compiler (as its name suggests), is not wrong, and the expanded code can be seen in file.i.
Now let's come to your question,
In book K & R The C Programming Language.
Page No: 89
Substitution are made only for tokens and do not take place within quoted strings.
and, as Basile said in his answer:
In current C compilers (for performance reasons) the preprocessor is no longer a separate program, it is part of the compiler itself.
Compiling is a long process that passes through several phases. The preprocessor actually comes after the program is converted into tokens; but, as the sources say, it is a step before compilation, meaning that it is done before any kind of intermediate code generation. And yes, breaking the program into tokens is the first step of the compiler, before any intermediate code generation.

gcc - OS-independent function labels

void foo(){
...
}
Compiling this to assembly, it seems that gcc on linux will create label foo as an entry point but label _foo on OSX.
We can, of course, do an OS-specific selection whenever we need a label, but this is cumbersome.
Is there any way to suppress this so that the labels on both systems are the same (preferably one that is also Windows-compatible)?
No. It's part of the name mangling specifications of the platform.
You can't change that. You're still writing assembly. Don't expect it to be portable in any way, that's what C was invented for.
The early C compilers decorated the name of the functions with an _ to avoid name clashing when linking against the already developed and huge assembly libraries of the times.
Credits for this information go to this excellent old answer.
Today this is not needed anymore but the tradition is still sticking around, mostly for backward compatibility, even though some systems are getting rid of it.
This is not an OS issue: OSes are completely orthogonal to programming languages. Name decoration is not something defined by the OS ABI; it is a matter for the compiler/linker designers, though standards have been created to reduce the incompatibilities and an ABI may suggest their use.
In order to fully understand how you can mitigate your problem, it is worth noting that while the OS APIs are language agnostic, a C program rarely invokes them directly; more likely it uses the C run-time.
The C run-time is usually statically linked and it expects names to be decorated according to the scheme of the compiler used to create it.
So if you need to use the C run-time you have to stick with the same name decoration as your system components are using.
This last point rules out the -fno-leading-underscore option as it will generate a linker error on the relevant platforms.
It is better to work on the assembly files, since there you have the freedom to define and import names exactly as typed. Furthermore, the amount of assembly code is usually limited.
If you are using NASM [1] there is a nice trick you can use: it's called macro indirection, and it allows you to prepend a symbol, defined on the command line, to a name.
Consider:
BITS 32
mov eax, %[p]data
_data db 0
data db 0
If you compile this file twice, the first time as nasm -Dp=_ ... and the second as nasm -Dp= ..., by inspecting the immediate value in the generated opcode for mov eax, %[p]data you can check that in the first case it has been translated as mov eax, _data and in the second as mov eax, data.
Assuming you access external symbols by declaring them as EXTERN symn (precise syntax is irrelevant here), you can define a macro PEXTERN that works like the directive EXTERN but imports the symbol with or without a leading underscore based on the value of the macro p (you can change this name) and defines an alias for it, so that the name you use in the code is the same regardless.
BITS 32
%macro PEXTERN 1
EXTERN %[p]%1
%ifnidn %1, %[p]%1
%define %1 %[p]%1
%endif
%endmacro
PEXTERN foo
PEXTERN bar
mov eax, foo
call bar
Running nasm -Dp= -e ... and nasm -Dp=_ -e ... produces the listings
nasm -Dp= -e ...        nasm -Dp=_ -e ...
extern foo              extern _foo
extern bar              extern _bar
mov eax, foo            mov eax, _foo
call bar                call _bar
You'll need to update the build scripts/Makefiles. Off the top of my head, you can use two methods:
1. Detect the OS type and define the symbol p accordingly. With Makefiles this may be easier.
2. Try compiling a test program. Write a minimal C program that imports/exports a function and a minimal assembly file that exports/imports that function. Define the symbol as _ and try to assemble + compile (redirecting everything into /dev/null). If it fails, redefine the symbol as empty.
Note that besides names, individual OSes may need specific assembly flags, so a universal build script may be more involved, but not necessarily unmanageable.
You'll end up needing something like Cygwin for Windows.
[1] If not, check if you can port the idea to your assembler.

How to access C preprocessor constants in assembly?

If I define a constant in my C .h file:
#define constant 1
How do I access it in my assembly .s file?
If you use the GNU toolchain, gcc will by default run the preprocessor on files with the .S extension (uppercase 'S'). So you can use all cpp features in your assembly file.
There are some caveats:
There might be differences in the way the assembler and the preprocessor tokenize the input.
If you #include header files, they should only contain preprocessor directives, not C stuff like function prototypes.
You shouldn't use # comments, as they would be interpreted by the preprocessor.
Example:
File definitions.h
#define REGPARM 1
File asm.S
#include "definitions.h"
.text
.globl relocate
.align 16
.type relocate,#function
relocate:
#if !REGPARM
movl 4(%esp),%eax
#endif
subl %ecx,%ecx
...
Even if you don't use gcc, you might be able to use the same approach, as long as the syntax of your assembler is reasonably compatible with the C preprocessor (see caveats above). Most C compilers have an option to only preprocess the input file (e.g. -E in gcc) or you might have the preprocessor as a separate executable. You can probably include this preprocessing prior to assembly in your build tool.
You can't, unless a specific development chain allows it. But in 20 years or so of embedded programming I never saw one.
Usually, the only way for assembly and C to communicate is the linker, i.e. labels defined in C/C++ are accessible from within assembly (and vice versa).
When I had to share definitions between C/C++ and asm, I usually did it with a custom code generator.
Since high-level data are rarely exchanged with assembly, a few defines and maybe some external references are usually enough, and thus the code generator is really easy to make.
You can use for instance perl or awk to parse a very simple list of common constants and produce a pair of files, one with #defines and the other with the equivalent EQU directives.

Resources