It stands to reason that, for executable code to be called a function, it should conform to the function calling convention of the platform it's running on.
However, _start() does not; for example, in this reference implementation there is no return address on the stack:
.section .text
.global _start
_start:
# Set up end of the stack frame linked list.
movq $0, %rbp
pushq %rbp # rip=0
pushq %rbp # rbp=0
movq %rsp, %rbp
# We need those in a moment when we call main.
pushq %rsi
pushq %rdi
# Prepare signals, memory allocation, stdio and such.
call initialize_standard_library
# Run the global constructors.
call _init
# Restore argc and argv.
popq %rdi
popq %rsi
# Run main
call main
# Terminate the process with the exit code.
movl %eax, %edi
call exit
.size _start, . - _start
Yet it's called a function in a myriad of sources. A number of questions and answers on StackOverflow also refer to it as a function.
Is a function simply a group of instructions identified by the address of the entry point, or must it conform to the calling convention? The C standard does not seem to define the concept of a function, and neither do the GCC and Clang docs. What is the authoritative source that defines this concept?
As for the lack of a return making a piece of code not a function: even a function written in C does not have to have a return instruction in it:
int call_fn(int(*fn)()) {
    return fn();
}
This function, with proper optimizations, compiles down to a single jmp instruction: https://godbolt.org/z/nxT9qTvaf
call_fn(int (*)()): # @call_fn(int (*)())
    jmp rdi # TAILCALL
In general, I don't think the C or the C++ standard would define anything about stuff written in assembly. A common calling convention helps for making direct calls into functions written in other languages, but you can still call functions using other calling conventions using a trampoline.
It stands to reason that, for executable code to be called a function, it should conform to the function calling convention of the platform it's running on.
"Function" is the primary idea here; "calling convention" is subsidiary to that. As such, I think a more supportable claim would be that for every function, there is a convention for calling it.
Interoperability considerations lead to standardization of calling conventions, but there is no One True calling convention, not even on a per-platform basis. Even subject to the influence of interoperability, there are platforms that support multiple standard calling conventions. In any case the existence of standard calling conventions does not necessarily relegate code with other conventions for entry and exit to non-function-hood.
Is a function simply a group of instructions identified by the address to the entry point, or must it conform to the calling convention?
This is a question of the definition of "function". There is room for variation on this, and in practice, different definitions apply in different contexts. For example, the question refers to the C language specification, but this speaks to the meaning of "function" in the context of C source code, not assembly or machine code.
In practice, in various languages and contexts, there are
functions with identifiers and functions without;
functions that return a value and functions that don't;
functions with a single entry point and functions with multiple entry points;
functions with a single exit point and functions with multiple exit points;
functions that always return to the caller, functions that usually return, functions that occasionally return, and functions that never return;
a wide variety of patterns for how functions receive data to operate on, how they return data to their caller (if they do so), and what invariants they do and do not ensure
other dimensions of variation, too
Thus, no, I do not accept in any universal sense that a piece of code needs to conform to a particular calling convention to be called a "function", and I also do not accept "a group of instructions identified by the address to the entry point" as a satisfactory universal definition.
Is _start() a function?
A _start() function such as is provided by GCC / Glibc satisfies some relevant definitions of the term. I have no problem with calling it a "function".
There seems to be this idea going around in the newer programming models that all running code is inside functions; but in the beginning this was not so, and if we look at the old languages we can observe this.
Drawing from lisp:
(format t "Hello, World!")
This is hello world in Common Lisp, and it is not a function in any normal sense. For comparison, here it is as a function:
(defun hello ()
  (format t "Hello, World!"))
(hello)
And from near the other root of all programming languages; here is Fortran (source):
      PROGRAM FUNDEM
C     Declarations for main program
      REAL A,B,C
      REAL AV, AVSQ1, AVSQ2
      REAL AVRAGE
C     Enter the data
      DATA A,B,C/5.0,2.0,3.0/
C     Calculate the average of the numbers
      AV = AVRAGE(A,B,C)
      AVSQ1 = AVRAGE(A,B,C) **2
      AVSQ2 = AVRAGE(A**2,B**2,C**2)
      PRINT *,'Statistical Analysis'
      PRINT *,'The average of the numbers is:',AV
      PRINT *,'The average squared of the numbers: ',AVSQ1
      PRINT *,'The average of the squares is: ', AVSQ2
      END
      REAL FUNCTION AVRAGE(X,Y,Z)
      REAL X,Y,Z,SUM
      SUM = X + Y + Z
      AVRAGE = SUM /3.0
      RETURN
      END
Yup, that's top-level statements and a function definition. Fortran has three things: the PROGRAM, SUBROUTINEs, and FUNCTIONs.
And again, we can do the same kind of example in QuickBasic:
CALL Hello
Sub Hello()
Print "Hello, World"
End Sub
QuickBasic was kind of funny; you never even tried to name the entry point and whatever .OBJ file was first in the build script was where the entry point was.
There's a general recurring theme here. In all of these, the top level isn't very function-like. The compiler would add stuff to the beginning of the entry point for you so that runtime initialization worked correctly.
Now what happened in C? C took a different path. The initialization routines were written in their own file that calls main() and the compiler just compiles main() as it would any other function and has no capacity for emitting code that runs at top level. Thus, the entry point (traditionally called _start but doesn't have to be) is not and cannot be written in C.
Don't get me wrong here, if you were to compile any of these on a Unix platform today and look at the resulting .o files you would see the modern compilers emit a main() function with the top level code in it. This is because of the preeminence of the C runtime and not because of any need for it to be a function. Had the other languages carried around in their runtimes the definitions of the system calls like they used to, this would not need to be.
Thus the process entry point is not a function.
We can take this argument one step further. Suppose (and I have seen news articles reference a thing kind of like this) we had a full native Java compiler that emitted .o files and linked against .so files providing the Java runtime; we could then ask: is _start a class method? The answer isn't no. The answer is that the question makes no sense, because you can't get a valid Java reference to the symbol. The same silly thing happens in C; we just need to pick a different platform. On DOS FAR model, _start is exported as PROC NEAR but void _start() expects a PROC FAR. The emitted link-time fixup is of the wrong size, and trying to take the address of _start results in undefined behavior.
You are mixing fields. You can't apply "text specification" to oranges.
the C standard does not seem to define the concept of a function
C is a language. In the C language, the text like the following:
void func();
is a function declaration of a function func.
Is _start() a function?
The text you posted is not in the C language. There are no function declarations or definitions in it.
As you stated, the term function is not defined in the C standard. I would assume that the English language understanding of the term "function" applies here, as to any other word in the C standard.
I see in Merriam-Webster that a "function" is a computer subroutine, where a subroutine is a sequence of computer instructions for performing a specified task that can be used repeatedly.
Clearly, _start is a function - it is a sequence of instructions to be executed repeatedly, it is executed on a computer, and it also operates on variables in the form of registers.
The text you posted represents the function _start in the form of a text using assembly language. It is not possible to represent the function _start in the C programming language.
(It is also not possible to express oranges, yet they exist in the real world. My point is, you can take any other word in the C standard, like, I don't know, "international", and ask "Are oranges international?". Applying C standard and "language-lawyer" tag to abstract contexts is not going to give you answers. Bottom line is that the C standard is a specification - it tells what happens when, it is not a dictionary.)
Is a function simply a group of instructions identified by the address to the entry point, or must it conform to the calling convention?
See Merriam-Webster function.
What is the authoritative source that defines this concept?
I googled and "There is no official agency that makes rules for English language".
The C standard is created by http://www.open-std.org/jtc1/sc22/wg14/ .
I'm programming for Windows in assembly in NASM, and I found this in the code:
extern _ExitProcess@4
;Rest of code...
; ...
call _ExitProcess@4
What does the @4 mean in the declaration and call of a winapi library function?
The winapi uses the __stdcall calling convention. The caller pushes all the arguments on the stack from right to left; the callee pops them again to clean up the stack, typically with a RET n instruction.
It is the antipode of the __cdecl calling convention, the common default in C and C++ code, where the caller cleans up the stack, typically with an ADD ESP,n instruction after the CALL. The advantage of __stdcall is that it generates more compact code: just one cleanup instruction in the called function instead of one after every call to the function. But it has one big disadvantage: it is dangerous.
The danger lurks in the code that calls the function having been compiled with an outdated declaration of the function. This is typical when the function was changed by adding an argument, for example. This ends very poorly: beyond the function trying to use an argument that is not available, the new function pops too many bytes off the stack. This imbalances the stack, causing not just the callee to fail but the caller as well. Extremely hard to diagnose.
So they did something about that: they decorated the name of the function. First with a leading underscore (_), as is done for __cdecl functions. And then with an appended @n, where the value of n is the operand of the RET instruction at the end of the function. Or in other words, the number of bytes taken by the arguments on the stack.
This provides a linker diagnostic when there's a mismatch: a change in a foo(int) function to foo(int, int), for example, generates the name _foo@8. The calling code not yet recompiled will look for a _foo@4 function. The linker fails because it cannot find that symbol. Disaster avoided.
The name decoration scheme for C is documented at Format of a C Decorated Name. A decorated name containing a @ character is used for the __stdcall calling convention:
__stdcall: Leading underscore (_) and a trailing at sign (@) followed by a number representing the number of bytes in the parameter list
Tools like Dependency Walker are capable of displaying both decorated and undecorated names.
Unofficial documentation can be found here: Name Decoration
It's a name decoration specifying the total size of the function's arguments:
The name is followed by the at sign (@) followed by the number of bytes (in decimal) in the argument list.
(source)
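The decoration rule described in these answers can be sketched in a few lines of C. This is only an illustration: stdcall_decorate and stdcall_names_match are hypothetical helper names, not part of any toolchain, and the comparison merely mimics what the linker effectively does when it looks up a symbol by its decorated name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the 32-bit __stdcall decorated name
 * "_<name>@<arg_bytes>" into buf. Returns the number of characters
 * written, or -1 if the buffer is too small. */
int stdcall_decorate(char *buf, size_t buflen, const char *name,
                     unsigned arg_bytes)
{
    assert(buf != NULL && name != NULL);
    int n = snprintf(buf, buflen, "_%s@%u", name, arg_bytes);
    if (n < 0 || (size_t)n >= buflen)
        return -1;
    return n;
}

/* Hypothetical helper: a symbol reference resolves only if the
 * decorated names are identical, byte count included. */
int stdcall_names_match(const char *wanted, const char *provided)
{
    return strcmp(wanted, provided) == 0;
}
```

With these helpers, a caller compiled against foo(int) asks for _foo@4 while a library rebuilt with foo(int, int) provides _foo@8, so the two names do not match and the mismatch surfaces at link time instead of as a corrupted stack at run time.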
Recently, when working on a project, I had a need to measure the size of a C function in order to be able to copy it somewhere else, but was not able to find any "clean" solutions (ultimately, I just wanted to have a label inserted at the end of the function that I could reference).
Having written the LLVM backend for this architecture (while it may look like ARM, it isn't) and knowing that it emitted assembly code for that architecture, I opted for the following hack (I think the comment explains it quite well):
/***************************************************************************
* if ENABLE_SDRAM_CALLGATE is enabled, this function should NEVER be called
* from C code as it will corrupt the stack pointer, since it returns before
* its epilog. this is done because clang does not provide a way to get the
* size of the function so we insert a label with inline asm to measure the
* function. in addition to that, it should not call any non-forceinlined
* functions to avoid generating a PC relative branch (which would fail if
* the function has been copied)
**************************************************************************/
void sdram_init_late(sdram_param_t* P) {
    /* ... */
#ifdef ENABLE_SDRAM_CALLGATE
    asm(
        "b lr\n"
        ".globl sdram_init_late_END\n"
        "sdram_init_late_END:"
    );
#endif
}
It worked as desired but required some assembler glue code in order to call it and is a pretty dirty hack that only worked because I could assume several things about the code generation process.
I've also considered other ways of doing this which would work better if LLVM was emitting machine code (since this approach would break once I added an MC emitter to my LLVM backend). The approach I considered involved taking the function and searching for the terminator instruction (which would either be a b lr instruction or a variation of pop ..., lr) but that could also introduce additional complications (though it seemed better than my original solution).
Can anyone suggest a cleaner way of getting the size of a C function without having to resort to incredibly ugly and unreliable hacks such as the ones outlined above?
I think you're right that there aren't any truly portable ways to do this. Compilers are allowed to re-order functions, so taking the address of the next function in source order isn't safe (but does work in some cases).
If you can parse the object file (maybe with libbfd), you might be able to get function sizes from that.
clang's asm output has this metadata (the .size assembler directive after every function), but I'm not sure whether it ends up in the object file.
int foo(int a) { return a * a * 2; }
## clang-3.8 -O3 for amd64:
## some debug-info lines manually removed
.globl foo
foo:
.Lfunc_begin0:
.cfi_startproc
imul edi, edi
lea eax, [rdi + rdi]
ret
.Lfunc_end0:
.size foo, .Lfunc_end0-foo ####### This line
Compiling this to a .o with clang-3.8 -O3 -Wall -Wextra func-size.c -c, I can then do:
$ readelf --symbols func-size.o
Symbol table '.symtab' contains 4 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FILE LOCAL DEFAULT ABS func-size.c
2: 0000000000000000 0 SECTION LOCAL DEFAULT 2
3: 0000000000000000 7 FUNC GLOBAL DEFAULT 2 foo ### This line
The three instructions total 7 bytes, which matches up with the size output here. It doesn't include the padding to align the entry point, or the next function: the .align directives are outside the two labels that are subtracted to calculate the .size.
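To make the byte count concrete, the three instructions can be written out as machine code. The encodings below are what I would expect an x86-64 assembler to produce for this listing; they are not copied from the original output, so treat them as an assumption and check with objdump -d. The names foo_code, foo_code_size and foo_code_byte are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed x86-64 encodings of the three instructions of foo. */
static const unsigned char foo_code[] = {
    0x0F, 0xAF, 0xFF,   /* imul edi, edi        (3 bytes) */
    0x8D, 0x04, 0x3F,   /* lea  eax, [rdi+rdi]  (3 bytes) */
    0xC3                /* ret                  (1 byte)  */
};

/* Number of bytes in the function body; this is the quantity the
 * Size column of the symbol table reports for foo. */
size_t foo_code_size(void)
{
    return sizeof foo_code;
}

/* Bounds-checked access to one byte of the body. */
unsigned char foo_code_byte(size_t index)
{
    assert(index < sizeof foo_code);
    return foo_code[index];
}
```

3 + 3 + 1 gives the 7 bytes reported by readelf.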
This probably doesn't work well for stripped executables. Even their global functions won't be present any more in the symbol table of the executable. So you might need a two-step build process:
compile your "normal" code
get sizes of functions you care about into a table, using readelf | some text processing > sizes.c
compile sizes.c
link everything together
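As a rough idea of what the generated sizes.c from the steps above could look like, here is a sketch. The struct layout and the names func_size, func_sizes and lookup_func_size are my own invention, and the single entry uses the size of 7 that readelf printed for foo; a real table would be produced by whatever text processing you run over the readelf output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row per function whose size was extracted from the object file. */
struct func_size {
    const char *name;
    size_t size;        /* bytes, as reported in the Size column */
};

/* In the two-step build this table would be generated, not hand-written. */
static const struct func_size func_sizes[] = {
    { "foo", 7 },
};

/* Returns the recorded size of the named function, or 0 if the
 * function is not in the table. */
size_t lookup_func_size(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof func_sizes / sizeof func_sizes[0]; i++) {
        if (strcmp(func_sizes[i].name, name) == 0)
            return func_sizes[i].size;
    }
    return 0;
}
```

The code that copies the function would then ask lookup_func_size("foo") for the number of bytes to copy, and treat a result of 0 as "size unknown, do not copy".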
Caveat
A really clever compiler could compile multiple similar functions to share a common implementation. So one of the functions jumps into the middle of the other function body. If you're lucky, all the functions are grouped together, with the "size" of each measuring from its entry point all the way to the end of the blocks of code it uses. (But that overlap would make the total sizes add up to more than the size of the file.)
Current compilers don't do this, but you can prevent it by putting the function in a separate compilation unit, and not using whole-program link-time optimization.
A compiler could decide to put a conditionally-executed block of code before the function entry point, so the branch can use a shorter encoding for a small displacement. This makes that block look like a static "helper" function, which probably wouldn't be included in the "size" calculation for the function. Current compilers never do this, either, though.
Another idea, which I'm not confident is safe:
Put an asm volatile with just a label definition at the end of your function, and then assume the function size is at most that + 32 bytes or something. So when you copy the function, you allocate a buffer 32B larger than your "calculated" size. Hopefully there's only a "ret" insn beyond the label, but actually it probably goes before the function epilogue which pops all the call-preserved registers it used.
I don't think the optimizer can duplicate an asm volatile statement, so it would force the compiler to jump to a common epilogue instead of duplicating the epilogue like it might sometimes for early-out conditions.
But I'm not sure there's an upper bound on how much could end up after the asm volatile.
I am reading Secure Programming Cookbook for C and C++ by John Viega. There is a code snippet that I need some help understanding:
asm(".long 0xCEFAEDFE \n"
"crc32_stored: \n"
".long 0xFFFFFFFF \n"
".long 0xCEFAEDFE \n"
);
int main(){
//crc32_stored used here as a variable
}
What do these lines exactly mean: "crc32_stored:\n", ".long 0xFFFFFFFF \n"? Is this a variable definition and initialization?
Trying to compile the code from the book I got the following error:
error: ‘crc32_stored’ undeclared (first use in this function)
crc32_stored: is simply a label, which in assembler is just an alias for a memory address. Since the label itself does not take up any space in the object code the address represented by crc32_stored is the address of .long 0xFFFFFFFF which assembles to four FF-bytes. In the object code the label will show up as a symbol, which means pretty much the same thing (just an alias for an address).
In C, a variable is (in a way) yet another way to express the same thing: A name that refers to a certain address in memory, but it has additional type information, i.e. int or long. You can create a variable in C with int crc32_stored = 0xFFFFFFFF; which (minus the type information) is equivalent to assembly crc32_stored: .long 0xFFFFFFFF, but that will create a different alias to yet another address.
You can tell the C compiler to not reserve a new address for the name "crc32_stored" but to create only the alias part and then to couple it with the address of a symbol with the same name. That is done with a declaration using the "extern" storage-class specifier, as in extern int crc32_stored. By this you "promise" to later link against another object file that will have this symbol.
Obviously you have to take care yourself that the C type information matches the intention of the assembly code (i.e. there are 4 bytes at the given address that should be interpreted as a signed 32-bit integer).
Addendum:
Without the extra declaration the symbol is not visible from C code, because the assembler parts are processed separately. The symbols can not be exported to C code automatically because the type information is missing. (An assembly label does not even include information about whether it points to data or code.)
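A minimal sketch of the extern approach, assuming GCC or Clang with the GNU assembler on an ELF target such as Linux. Two things differ from the book's snippet and are my own choices: the words are placed in .data through .pushsection/.popsection so that they are aligned and sit outside the code, and the declaration uses uint32_t rather than int because a CRC is normally treated as unsigned. read_stored_crc is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* File-scope assembly: marker word, the labelled word, marker word.
 * Assumption: ELF target; .pushsection/.popsection are GNU as directives. */
__asm__(
    ".pushsection .data\n"
    ".balign 4\n"
    ".long 0xCEFAEDFE\n"
    "crc32_stored:\n"
    ".long 0xFFFFFFFF\n"
    ".long 0xCEFAEDFE\n"
    ".popsection\n"
);

/* Declaration only: the C compiler reserves no storage here, it just
 * learns the name and the type to use for the symbol defined above. */
extern uint32_t crc32_stored;

/* Reads the 32-bit word that the label refers to. */
uint32_t read_stored_crc(void)
{
    assert(sizeof crc32_stored == 4);   /* must match the .long above */
    return crc32_stored;
}
```

Without the extern line the compiler reports the same "undeclared" error as in the question, because it never looks inside the asm string for names.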
So I made a very simple C program to study how C works on the inside. It has just 1 line in the main() excluding return 0:
system("cls");
If I use ollydebugger to analyze this program, it will show something like this (text after the semicolons is comments generated by ollydebugger):
MOV DWORD PTR SS:[ESP],test_1.004030EC ; ||ASCII "cls"
CALL <JMP.&msvcrt.system> ; |\system
Can someone explain what this means, and if I want to change the "cls" called in the system() to another command, where is the "cls" stored? And how do I modify it?
You are using a 32-bit Windows system, with its corresponding ABI (the assumptions used when functions are called).
MOV DWORD PTR SS:[ESP],test_1.004030EC
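This is, in effect, equivalent to a push 4030ech instruction: it simply stores the address of the string cls at the top of the stack (the stack slot was reserved earlier, so ESP itself is not adjusted here).
This is the way parameters are passed to functions, and it tells us that the string cls is at address 4030ech.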
CALL <JMP.&msvcrt.system> ; |\system
This is the call to the system function from the CRT.
The JMP in the name is due to how linking works by default with Visual Studio compilers and linkers.
So those two lines are simply passing the address of the string to the system function.
If you want to modify it, you need to check whether it is in a writable section (I think it is not) by checking the PE sections; your debugger may have a tool for that. Or you could just try the following anyway:
Inspect the memory at 4030ech, you will see the string, try editing it (this is debugger dependent).
Note: I use the TASM notation for hex numbers, i.e. 123h means 0x123 in C notation.
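For readers more used to C notation, the conversion from the h-suffixed form can be sketched as below. parse_h_suffixed_hex is a hypothetical helper written for this note; it relies only on the standard strtoul, which stops at the first character that is not a hexadecimal digit.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: converts text such as "4030ech" to its numeric
 * value. Returns 0 if the text is not hex digits followed by a single
 * trailing 'h' or 'H'. */
unsigned long parse_h_suffixed_hex(const char *text)
{
    assert(text != NULL);
    char *end = NULL;
    unsigned long value = strtoul(text, &end, 16);
    /* end must point at the suffix, and the suffix must be the last character */
    if (end == text || (*end != 'h' && *end != 'H') || end[1] != '\0')
        return 0;
    return value;
}
```

So the address written as 4030ech above is the value a C programmer would write as 0x4030EC.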