Does assembly code, when used (linked) in a C project, ignore the const keyword declared before C datatypes and functions? And can it modify the contents of the datatypes?
Does assembly code, when used (linked) in a C project, ignore the const
keyword declared before C datatypes and functions?
Yes, the const keyword is utterly ignored by assembly code.
And can it modify the contents of the datatypes??
If the compiler was able to place a const location in a read-only segment, then assembly code trying to modify it will cause a segmentation fault. Otherwise, unpredictable behavior will result, because the compiler may have optimized parts of the code, but not others, under the assumption that const locations were not modified.
And can it modify the contents of the datatypes??
Maybe, maybe not. If the original object was declared const then the compiler might emit it into a read-only data segment, which would be loaded into a read-only memory page at runtime. Writing to that page, even from assembly, would trigger a runtime exception (access violation or segmentation fault).
You won't receive a compile-time error, but at runtime your program might crash or behave erratically.
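A minimal sketch of the distinction drawn above, in plain C rather than assembly: const on a pointer is only a promise made by that view of the object. If the underlying object was not itself declared const, writing through a cast pointer is well defined; if it was declared const, the same write is undefined behavior and may fault exactly as described. The function names here are hypothetical and chosen only for illustration.

```c
/* Hypothetical helper: writes through a pointer-to-const by casting the
   qualifier away. This is only defined behavior when the pointed-to
   object was NOT itself declared const. */
void overwrite_through_const_view(const int *view, int new_value) {
    *(int *)view = new_value;
}

/* The object 'value' is an ordinary (non-const) int, so the write above
   is legal and the new value is observed. */
int demo_modify_non_const_object(void) {
    int value = 1;
    overwrite_through_const_view(&value, 2);
    return value;
}
```

Calling overwrite_through_const_view on an object defined as `const int` would compile just the same, which is why the failure only shows up at run time, if at all.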
The compiler uses the datatypes you declared in C to decide how to store the information in memory and how to optimize access to it. Everything is written in binary at the end of the day (int, long, char, etc.), so there are no datatypes once you get to the low-level code.
The question is not very precise. What does "ignores" mean?
Assembly language does not have the same concept of const as C language does. So, assembly cannot ignore it. It simply has no idea about it.
Yet the assembly code generated by C compiler for a C program might easily be affected by the placement of const keywords in your C program.
In other words, assembly code can be affected by const keywords. But once that assembly code is built, the const keyword is no longer necessary.
To say that assembler can modify something declared as const is not exactly correct either. If you declare a variable as const, in some cases the compiler might be smart enough to eliminate that variable entirely, replacing it with immediate value of that variable. This means that that const variable might disappear from the final code entirely, leaving nothing for the assembly code to "modify".
GCC places global variables marked as const in a separate section, called .rodata.
The .rodata is also used for storing string constants.
Since contents of .rodata section will not be modified,
they can be placed in Flash. The linker script has to be modified to accommodate this.
#include <stdio.h>
const int a = 10 ;
int main ( void ) {
return a ;
}
.section .rodata
.align 4
a:
.long 10
gcc with -O0:
movl a(%rip), %eax // variable constant
addq $32, %rsp
popq %rbp
gcc with -O3:
movl $10, %eax // numeric constant
addq $40, %rsp
ret
It stands to reason that, for executable code to be called a function, it should conform to the function calling convention of the platform it's running on.
However, _start() does not; for example in this reference implementation there is no return address on the stack:
.section .text
.global _start
_start:
# Set up end of the stack frame linked list.
movq $0, %rbp
pushq %rbp # rip=0
pushq %rbp # rbp=0
movq %rsp, %rbp
# We need those in a moment when we call main.
pushq %rsi
pushq %rdi
# Prepare signals, memory allocation, stdio and such.
call initialize_standard_library
# Run the global constructors.
call _init
# Restore argc and argv.
popq %rdi
popq %rsi
# Run main
call main
# Terminate the process with the exit code.
movl %eax, %edi
call exit
.size _start, . - _start
Yet it's called a function in a myriad of sources. A number of questions and answers on StackOverflow also refer to it as a function.
Is a function simply a group of instructions identified by the address to the entry point, or must it conform to the calling convention? The C standard does not seem to define the concept of a function, neither do the gcc and clang docs. What is the authoritative source that defines this concept?
As for the lack of a return making a piece of code not a function: even a function written in C does not have to have a return instruction in it:
int call_fn(int(*fn)()) {
return fn();
}
This function, with proper optimizations compiles down to a single jmp instruction: https://godbolt.org/z/nxT9qTvaf
call_fn(int (*)()): # #call_fn(int (*)())
jmp rdi # TAILCALL
In general, I don't think the C or the C++ standard would define anything about stuff written in assembly. A common calling convention helps for making direct calls into functions written in other languages, but you can still call functions using other calling conventions using a trampoline.
It stands to reason that, for executable code to be called a function, it should conform to the function calling convention of the platform it's running on.
"Function" is the primary idea here; "calling convention" is subsidiary to that. As such, I think a more supportable claim would be that for every function, there is a convention for calling it.
Interoperability considerations lead to standardization of calling conventions, but there is no One True calling convention, not even on a per-platform basis. Even subject to the influence of interoperability, there are platforms that support multiple standard calling conventions. In any case the existence of standard calling conventions does not necessarily relegate code with other conventions for entry and exit to non-function-hood.
Is a function simply a group of instructions identified by the address to the entry point, or must it conform to the calling convention?
This is a question of the definition of "function". There is room for variation on this, and in practice, different definitions apply in different contexts. For example, the question refers to the C language specification, but this speaks to the meaning of "function" in the context of C source code, not assembly or machine code.
In practice, in various languages and contexts, there are
functions with identifiers and functions without;
functions that return a value and functions that don't;
functions with a single entry point and functions with multiple entry points;
functions with a single exit point and functions with multiple exit points;
functions that always return to the caller, functions that usually return, functions that occasionally return, and functions that never return;
a wide variety of patterns for how functions receive data to operate on, how they return data to their caller (if they do so), and what invariants they do and do not ensure
other dimensions of variation, too
Thus, no, I do not accept in any universal sense that a piece of code needs to conform to a particular calling convention to be called a "function", and I also do not accept "a group of instructions identified by the address to the entry point" as a satisfactory universal definition.
Is _start() a function?
A _start() function such as is provided by GCC / Glibc satisfies some relevant definitions of the term. I have no problem with calling it a "function".
There seems to be this idea going around in the newer programming models that all running code is inside functions; but in the beginning this was not so, and if we look at the old languages we can observe this.
Drawing from lisp:
(format t "Hello, World!")
This is hello world in common lisp, and is not a function in any normal sense. For comparison, here is it as a function:
(defun hello ()
(format t "Hello, World!"))
(hello)
And from near the other root of all programming languages; here is Fortran (source):
PROGRAM FUNDEM
C Declarations for main program
REAL A,B,C
REAL AV, AVSQ1, AVSQ2
REAL AVRAGE
C Enter the data
DATA A,B,C/5.0,2.0,3.0/
C Calculate the average of the numbers
AV = AVRAGE(A,B,C)
AVSQ1 = AVRAGE(A,B,C) **2
AVSQ2 = AVRAGE(A**2,B**2,C**2)
PRINT *,'Statistical Analysis'
PRINT *,'The average of the numbers is:',AV
PRINT *,'The average squared of the numbers: ',AVSQ1
PRINT *,'The average of the squares is: ', AVSQ2
END
REAL FUNCTION AVRAGE(X,Y,Z)
REAL X,Y,Z,SUM
SUM = X + Y + Z
AVRAGE = SUM /3.0
RETURN
END
Yup that's top level statements and a function definition. Fortran has three things, the PROGRAM, SUBROUTINEs, and FUNCTIONs.
And again, we can do the same kind of example in QuickBasic:
CALL Hello
Sub Hello()
Print "Hello, World"
End Sub
QuickBasic was kind of funny; you never even tried to name the entry point and whatever .OBJ file was first in the build script was where the entry point was.
There's a general recurring theme here. In all of these, the top level isn't very function-like. The compiler would add stuff to the beginning of the entry point for you so that runtime initialization worked correctly.
Now what happened in C? C took a different path. The initialization routines were written in their own file that calls main() and the compiler just compiles main() as it would any other function and has no capacity for emitting code that runs at top level. Thus, the entry point (traditionally called _start but doesn't have to be) is not and cannot be written in C.
Don't get me wrong here, if you were to compile any of these on a Unix platform today and look at the resulting .o files you would see the modern compilers emit a main() function with the top level code in it. This is because of the preeminence of the C runtime and not because of any need for it to be a function. Had the other languages carried around in their runtimes the definitions of the system calls like they used to, this would not need to be.
Thus we have it: the process entry point is not a function.
We can take this argument one step farther; suppose (and I have seen news articles reference a thing kind of like this) we had a full native Java compiler that emitted .o files and linked against .so files providing the Java runtime; we could then ask Is _start a class method? The answer isn't no. The answer is the question makes no sense because you can't get a valid Java reference to the symbol. The same silly thing happens in C, we just need to pick a different platform. On DOS FAR model, _start is exported as PROC NEAR but void _start() expects a PROC FAR. The emitted link-time fixup is of the wrong size and trying to take the address of _start results in undefined behavior.
You are mixing fields. You can't apply "text specification" to oranges.
the C standard does not seem to define the concept of a function
C is a language. In the C language, the text like the following:
void func();
is a function declaration of a function func.
Is _start() a function?
The text you posted is not in C language. There are no functions declarations and definitions in it.
As you stated, the term function is not defined in the C standard. I would assume that the English language understanding of the term "function" applies here, as to any other word in the C standard.
I see in Merriam-Webster that a "function" is a computer subroutine, where a subroutine is a sequence of computer instructions for performing a specified task that can be used repeatedly.
Clearly, _start is a function - it is a sequence of instructions to be executed repeatedly, it is executed on a computer, and it also operates on variables in the form of registers.
The text you posted represents the function _start in the form of a text using assembly language. It is not possible to represent the function _start in the C programming language.
(It is also not possible to express oranges, yet they exist in the real world. My point is, you can take any other word in the C standard, like, I don't know, "international", and ask "Are oranges international?". Applying C standard and "language-lawyer" tag to abstract contexts is not going to give you answers. Bottom line is that the C standard is a specification - it tells what happens when, it is not a dictionary.)
Is a function simply a group of instructions identified by the address to the entry point, or must it conform to the calling convention?
See Merriam-Webster function.
What is the authoritative source that defines this concept?
I googled and "There is no official agency that makes rules for English language".
The C standard is created by http://www.open-std.org/jtc1/sc22/wg14/ .
I was reading tutorials regarding inline assembly within C, and they tried a simple variable assignment with
int a=10, b;
asm ("movl %1, %%eax;"
"movl %%eax, %0;"
:"=r"(b) /* output */
:"r"(a) /* input */
:"%eax" /* clobbered register */
);
which made sense to me (move input into eax then move eax to output). But when I removed the movl %%eax, %0 line (which is supposed to move the proper value to the output), the variable b was still assigned the proper value from the inline assembly.
My main question is how does the output 'know' to read from this %eax register?
An inline-assembly statement is not a function call.
The "return in EAX" thing is for functions; it's part of the calling convention that lets compilers make code that can interact with other code even when they're compiled separately. A calling convention is defined as part of an ABI doc.
As well as defining how to return (e.g. small non-FP objects in EAX, floating point in XMM0 or ST0), they also define where callers put args, and which registers you can use without saving/restoring (call-clobbered) and which you can't (call-preserved). See https://en.wikipedia.org/wiki/Calling_convention in general, and https://www.agner.org/optimize/calling_conventions.pdf for more about x86 calling conventions.
This rigid set of rules doesn't apply to inline asm because it doesn't have to: the compiler necessarily sees the asm statement as part of the surrounding C code. Forcing a fixed convention on it would defeat the whole point of inline. Instead, in GNU C inline asm you write operands / constraints that describe the asm to the compiler, effectively creating a custom calling convention for each asm statement. (With parts of that convention left up to the compiler's choice for "=r" outputs. Use "=a" if you want to force it to pick AL/AX/EAX/RAX.)
If you want to write asm that returns in EAX without having to tell the compiler about it, write a stand-alone function. (e.g. in a .s file, or an asm("") statement as the body of an __attribute__((naked)) C function. Either way you have to write the ret yourself and get args via the calling convention, too.)
Falling off the end of a non-void function after running an asm statement that leaves a value in EAX may appear to work with optimization disabled, but it's totally unsafe and will break as soon as you enable optimization and the compiler inlines it.
My main question is how does the output 'know' to read from this %eax register?
It probably just happened to pick EAX for the "=r" output when you compiled with optimization disabled. EAX is always GCC's first choice for evaluating expressions. Look at the compiler-generated asm output (gcc -S -fverbose-asm) to see what asm it generated around your asm, and which register it substituted into your asm template. You probably have mov %eax, %eax ; mov %eax, %eax.
Using mov as the first or last instruction of an asm template almost always means you're doing it wrong and should have used better constraints to tell the compiler where to put or where to find your data.
e.g. asm("" : "=r"(b) : "0"(a)) will make the compiler put the input into the same register as it's expecting the output operand. So that copies a value. (And forces the compiler to materialize it in a register, and forget anything it knows about the current value, defeating constant-propagation and value range optimizations, as well as stopping the compiler from optimizing away that temporary entirely.)
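The constraint trick above can be wrapped in a function to see it in action. This is a small sketch that assumes a compiler supporting GNU C extended asm (GCC or clang); the function name copy_through_register is made up for the example. The template is empty, so no instruction is emitted by the asm itself: the "0" matching constraint alone makes the compiler place the input in the register it will later read the output from.

```c
/* Hypothetical helper: copies 'a' into 'b' using only operand
   constraints. Requires GNU C extended asm (GCC or clang). */
int copy_through_register(int a) {
    int b;
    /* "=r"(b): output in a register of the compiler's choice.
       "0"(a):  input must live in the same location as operand 0. */
    __asm__("" : "=r"(b) : "0"(a));
    return b;
}
```

Because the compiler picks the register, this works on any target the compiler supports, not just x86.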
Why does issuing empty asm commands swap variables? describes that happening by chance, same as your case with the compiler picking the same reg for input and output "r" operands. It also illustrates using asm comments inside the asm template to print out what the compiler chose for any %0 or %1 operands you don't otherwise reference explicitly.
See also segmentation fault(core dumped) error while using inline assembly for more about the basics of using input and output constraints.
Also related: What happens to registers when you manipulate them using asm code in C++? for another example and writeup of how compilers handle register in GNU C inline asm statements.
Simply coding a C file like below:
int main()
{
int a = 999;
return 0;
}
After compiling and linking with gcc, it generates an executable file (e.g. .exe, .out).
But when I open (NOT run) the executable file with an editor, I cannot find the value of variable 'a', the number 999, which in hex is 0x3E7.
My question is:
Does the value 999 exist in the executable file?
If not, where is the value stored? How does the executable get the value at run time?
P.S.: I have a little knowledge of memory sections like .data, .bss, .text, etc., and of assembly language. Even so, I cannot find it using ollydbg.
There is no reason for the compiler to put the value 999 anywhere since it is not used anywhere. The program has the same observable behaviour regardless of whether 999 is somewhere in memory or not.
ISO/IEC 9899:TC2 - 5.1.2.3 Program execution:
In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).
One of the important functions of optimization is removing unused stuff. The behaviour of the program doesn't depend at all on the 999, and a isn't volatile, so the assignment isn't part of any visible side-effect, and the program is exactly equivalent to int main(){return 0;}
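To make the role of volatile concrete, here is a short sketch with two hypothetical functions. In the first, the store of 999 has no observable effect, so an optimizing compiler is free to drop it and the constant may not appear in the executable at all. In the second, the object is volatile, so the store and the later load count as side effects that the compiler has to keep. Whether a given compiler actually removes the first store depends on the optimization level.

```c
/* The local is never observed, so the compiler may remove it entirely. */
int discarded_store(void) {
    int a = 999;
    (void)a;          /* silences "unused variable" warnings only */
    return 0;
}

/* Accesses to a volatile object are side effects: the store of 999 and
   the read back must both be performed. */
int kept_store(void) {
    volatile int a = 999;
    return a - 999;
}
```

Both functions return 0, which is the point: the observable behaviour is identical, and only the volatile version obliges the compiler to keep 999 around.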
It's a lot easier to look at compiler output in asm form.
On the Godbolt compiler explorer, you can set it up so you can see asm output for gcc -O0 and gcc -O1 at the same time, in different panes. https://godbolt.org/z/wNHEnN.
mov DWORD PTR [rbp-4], 999 is there at -O0, along with stack-frame setup fluff. So the -O0 output will have 999 as a dword immediate operand to a mov instruction. The variable is local, so you won't find a symbol-table entry for it in the .data or .rdata sections (like you would with a global with a static initializer).
See also How to remove "noise" from GCC/clang assembly output? for more about looking at compiler output.
first, if somebody knows a function of the Standard C Library, that prints
a string without looking for a binary zero, but requires the number of characters to draw, please tell me!
Otherwise, I have this problem:
void printStringWithLength(char *str_ptr, int n_chars){
asm("mov 4, %rax");//Function number (write)
asm("mov 1, %rbx");//File descriptor (stdout)
asm("mov $str_ptr, %rcx");
asm("mov $n_chars, %rdx");
asm("int 0x80");
return;
}
GCC tells the following error to the "int" instruction:
"Error: operand size mismatch for 'int'"
Can somebody tell me the issue?
There are a number of issues with your code. Let me go over them step by step.
First of all, the int $0x80 system call interface is for 32 bit code only. You should not use it in 64 bit code as it only accepts 32 bit arguments. In 64 bit code, use the syscall interface. The system calls are similar but some numbers are different.
Second, in AT&T assembly syntax, immediates must be prefixed with a dollar sign. So it's mov $4, %rax, not mov 4, %rax. The latter would attempt to move the content of address 4 to rax which is clearly not what you want.
Third, you can't just refer to the names of automatic variables in inline assembly. You have to tell the compiler what variables you want to use using extended assembly if you need any. For example, in your code, you could do:
asm volatile("mov $1, %%eax; mov $1, %%edi; mov %0, %%rsi; mov %1, %%edx; syscall"
:: "r"(str_ptr), "r"(n_chars) : "rdi", "rsi", "rdx", "rax", "rcx", "r11", "memory");
Fourth, gcc is an optimizing compiler. By default it assumes that inline assembly statements are like pure functions, that the outputs are a pure function of the explicit inputs. If the output(s) are unused, the asm statement can be optimized away, or hoisted out of loops if run with the same inputs.
But a system call like write has a side-effect you need the compiler to keep, so it's not pure. You need the asm statement to run the same number of times and in the same order as the C abstract machine would. asm volatile will make this happen. (An asm statement with no outputs is implicitly volatile, but it's good practice to make it explicit when the side effect is the main purpose of the asm statement. Plus, we do want to use an output operand to tell the compiler that RAX is modified, as well as being an input, which we couldn't do with a clobber.)
You do always need to accurately describe your asm's inputs, outputs, and clobbers to the compiler using Extended inline assembly syntax. Otherwise you'll step on the compiler's toes (it assumes registers are unchanged unless they're outputs or clobbers). (Related: How can I indicate that the memory *pointed* to by an inline ASM argument may be used? shows that a pointer input operand alone does not imply that the pointed-to memory is also an input. Use a dummy "m" input or a "memory" clobber to force all reachable memory to be in sync.)
You should simplify your code by not writing your own mov instructions to put data into registers but rather letting the compiler do this. For example, your assembly becomes:
ssize_t retval;
asm volatile ("syscall" // note only 1 instruction in the template
: "=a"(retval) // RAX gets the return value
: "a"(SYS_write), "D"(STDOUT_FILENO), "S"(str_ptr), "d"(n_chars)
: "memory", "rcx", "r11" // syscall destroys RCX and R11
);
where SYS_write is defined in <sys/syscall.h> and STDOUT_FILENO in <unistd.h>. I am not going to explain all the details of extended inline assembly to you. Using inline assembly in general is usually a bad idea. Read the documentation if you are interested. (https://stackoverflow.com/tags/inline-assembly/info)
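For completeness, the single-instruction template can be placed in a full function. The sketch below is an illustration under stated assumptions, not a recommended implementation: the name raw_write is hypothetical, the asm path is only compiled on Linux x86-64, and every other platform falls back to the ordinary POSIX write wrapper. Note that the raw system call reports failure as a negative errno value in RAX rather than returning -1 and setting errno.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#if defined(__linux__) && defined(__x86_64__)
#include <sys/syscall.h>   /* SYS_write */

/* Hypothetical wrapper: issues the write system call directly.
   Returns the byte count, or a negative errno value on failure. */
ssize_t raw_write(int fd, const void *buf, size_t len) {
    ssize_t retval;
    __asm__ __volatile__("syscall"
        : "=a"(retval)                        /* RAX: return value      */
        : "a"((long)SYS_write), "D"((long)fd), /* RAX: number, RDI: fd   */
          "S"(buf), "d"(len)                  /* RSI: buffer, RDX: size */
        : "memory", "rcx", "r11");            /* syscall clobbers these */
    return retval;
}
#else
/* Fallback for other platforms: use the libc wrapper instead. */
ssize_t raw_write(int fd, const void *buf, size_t len) {
    return write(fd, buf, len);
}
#endif
```

A call such as raw_write(STDOUT_FILENO, str_ptr, (size_t)n_chars) would then replace the body of printStringWithLength.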
Fifth, you should avoid using inline assembly when you can. For example, to do system calls, use the syscall function from unistd.h:
syscall(SYS_write, STDOUT_FILENO, str_ptr, (size_t)n_chars);
This does the right thing. But it doesn't inline into your code, so use wrapper macros from MUSL for example if you want to really inline a syscall instead of calling a libc function.
Sixth, always check if the system call you want to call is already available in the C standard library. In this case, it is, so you should just write
write(STDOUT_FILENO, str_ptr, n_chars);
and avoid all of this altogether.
Seventh, if you prefer to use stdio, use fwrite instead:
fwrite(str_ptr, 1, n_chars, stdout);
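As a sketch of the stdio route, which also answers the original request for a Standard C Library function that takes a character count: fwrite writes exactly the requested number of bytes, while printf with the "%.*s" conversion prints at most that many characters but still stops early at a zero byte. The two wrapper names are hypothetical.

```c
#include <stdio.h>

/* Writes exactly n_chars bytes, embedded zero bytes included.
   Returns the number of bytes written. */
size_t print_with_fwrite(const char *str_ptr, size_t n_chars) {
    return fwrite(str_ptr, 1, n_chars, stdout);
}

/* Prints at most n_chars characters; stops earlier if a zero byte is
   found. Returns the number of characters printed. */
int print_with_printf(const char *str_ptr, int n_chars) {
    return printf("%.*s", n_chars, str_ptr);
}
```

For a buffer that is not zero-terminated but contains no zero bytes within the first n_chars bytes, both behave the same.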
There are so many things wrong with your code (and so little reason to use inline asm for it) that it's not worth trying to actually correct all of them. Instead, use the write(2) system call the normal way, via the POSIX function / libc wrapper as documented in the man page, or use ISO C <stdio.h> fwrite(3).
#include <unistd.h>
static inline
void printStringWithLength(const char *str_ptr, int n_chars){
write(1, str_ptr, n_chars);
// TODO: check error return value
}
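One way the TODO above could be filled in is sketched below. It assumes a POSIX system; write_all is a hypothetical helper name. write may transfer fewer bytes than requested or be interrupted by a signal, so the loop retries until everything is written or a real error occurs.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: writes exactly n bytes to fd, retrying on short
   writes and on EINTR. Returns 0 on success, -1 on error (errno set). */
int write_all(int fd, const char *buf, size_t n) {
    while (n > 0) {
        ssize_t written = write(fd, buf, n);
        if (written < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing: retry */
            return -1;             /* genuine error, errno says which   */
        }
        buf += written;            /* skip what was already written     */
        n -= (size_t)written;
    }
    return 0;
}
```

printStringWithLength could then return the result of write_all(1, str_ptr, (size_t)n_chars) so that callers can react to failures.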
Why your code doesn't assemble:
In AT&T syntax, immediates always need a $ decorator. Your code will assemble if you use asm("int $0x80").
The assembler is complaining about 0x80, which is a memory reference to the absolute address 0x80. There is no form of int that takes the interrupt vector as anything other than an immediate. I'm not sure exactly why it complains about the size, since memory references don't have an implied size in AT&T syntax.
That will get it to assemble, at which point you'll get linker errors:
In function `printStringWithLength':
5 : <source>:5: undefined reference to `str_ptr'
6 : <source>:6: undefined reference to `n_chars'
collect2: error: ld returned 1 exit status
(from the Godbolt compiler explorer)
mov $str_ptr, %rcx
means to mov-immediate the address of the symbol str_ptr into %rcx. In AT&T syntax, you don't have to declare external symbols before using them, so unknown names are assumed to be global / static labels. If you had a global variable called str_ptr, that instruction would reference its address (which is a link-time constant, so can be used as an immediate).
As others have said, this is completely the wrong way to go about things with GNU C inline asm. See the inline-assembly tag wiki for more links to guides.
Also, you're using the wrong ABI. int $0x80 is the x86 32-bit system call ABI, so it doesn't work with 64-bit pointers. What are the calling conventions for UNIX & Linux system calls on x86-64
See also the x86 tag wiki.
In C, let's say you have a variable called variable_name. Let's say it's located at 0xaaaaaaaa, and at that memory address, you have the integer 123. So in other words, variable_name contains 123.
I'm looking for clarification around the phrasing "variable_name is located at 0xaaaaaaaa". How does the compiler recognize that the string "variable_name" is associated with that particular memory address? Is the string "variable_name" stored somewhere in memory? Does the compiler just substitute variable_name for 0xaaaaaaaa whenever it sees it, and if so, wouldn't it have to use memory in order to make that substitution?
Variable names don't exist anymore after the compiler runs (barring special cases like exported globals in shared libraries or debug symbols). The entire act of compilation is intended to take those symbolic names and algorithms represented by your source code and turn them into native machine instructions. So yes, if you have a global variable_name, and compiler and linker decide to put it at 0xaaaaaaaa, then wherever it is used in the code, it will just be accessed via that address.
So to answer your literal questions:
How does the compiler recognize that the string "variable_name" is associated with that particular memory address?
The toolchain (compiler & linker) work together to assign a memory location for the variable. It's the compiler's job to keep track of all the references, and linker puts in the right addresses later.
Is the string "variable_name" stored somewhere in memory?
Only while the compiler is running.
Does the compiler just substitute variable_name for 0xaaaaaaaa whenever it sees it, and if so, wouldn't it have to use memory in order to make that substitution?
Yes, that's pretty much what happens, except it's a two-stage job with the linker. And yes, it uses memory, but it's the compiler's memory, not anything at runtime for your program.
An example might help you understand. Let's try out this program:
int x = 12;
int main(void)
{
return x;
}
Pretty straightforward, right? OK. Let's take this program, and compile it and look at the disassembly:
$ cc -Wall -Werror -Wextra -O3 example.c -o example
$ otool -tV example
example:
(__TEXT,__text) section
_main:
0000000100000f60 pushq %rbp
0000000100000f61 movq %rsp,%rbp
0000000100000f64 movl 0x00000096(%rip),%eax
0000000100000f6a popq %rbp
0000000100000f6b ret
See that movl line? It's grabbing the global variable (in an instruction-pointer relative way, in this case). No more mention of x.
Now let's make it a bit more complicated and add a local variable:
int x = 12;
int main(void)
{
volatile int y = 4;
return x + y;
}
The disassembly for this program is:
(__TEXT,__text) section
_main:
0000000100000f60 pushq %rbp
0000000100000f61 movq %rsp,%rbp
0000000100000f64 movl $0x00000004,0xfc(%rbp)
0000000100000f6b movl 0x0000008f(%rip),%eax
0000000100000f71 addl 0xfc(%rbp),%eax
0000000100000f74 popq %rbp
0000000100000f75 ret
Now there are two movl instructions and an addl instruction. You can see that the first movl is initializing y, which it's decided will be on the stack (base pointer - 4). Then the next movl gets the global x into a register eax, and the addl adds y to that value. But as you can see, the literal x and y strings don't exist anymore. They were conveniences for you, the programmer, but the computer certainly doesn't care about them at execution time.
A C compiler first creates a symbol table, which stores the relationship between the variable name and where it's located in memory. When compiling, it uses this table to replace all instances of the variable with a specific memory location, as others have stated. You can find a lot more on it on the Wikipedia page.
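A toy model of such a symbol table is sketched below. It is purely illustrative: real compilers and linkers use far richer structures, and the struct, the second entry and the function name are all invented for the example. The point is that the name-to-address mapping lives in the toolchain's own memory while it runs, not in your program.

```c
#include <stddef.h>
#include <string.h>

/* One entry of a toy symbol table: a source-level name and the address
   the toolchain decided to give it. */
struct symbol {
    const char *name;
    unsigned long address;
};

/* Illustrative contents; the addresses are made up. */
static const struct symbol symbol_table[] = {
    { "variable_name",  0xaaaaaaaaUL },
    { "other_variable", 0xaaaaaaaeUL },
};

/* Returns the address recorded for 'name', or 0 if it is unknown. */
unsigned long lookup_symbol(const char *name) {
    size_t count = sizeof symbol_table / sizeof symbol_table[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(symbol_table[i].name, name) == 0)
            return symbol_table[i].address;
    }
    return 0;
}
```

Every time the toy "compiler" meets variable_name it would call lookup_symbol and emit 0xaaaaaaaa into the generated code; the string itself never reaches the output.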
All variables are substituted by the compiler. First they are substituted with references and later the linker places addresses instead of references.
In other words, the variable names are no longer available once the compiler has finished running.
This is what's called an implementation detail. While what you describe is the case in all compilers I've ever used, it's not required to be the case. A C compiler could put every variable in a hashtable and look them up at runtime (or something like that) and in fact early JavaScript interpreters did exactly that (now, they do Just-In-Time compilation that results in something much more raw.)
Specifically for common compilers like VC++, GCC, and LLVM: the compiler will generally assign a variable to a location in memory. Variables of global or static scope get a fixed address that doesn't change while the program is running, while variables within a function get a stack address, that is, an address relative to the current stack pointer, which changes every time a function is called. (This is an oversimplification.) Stack addresses become invalid as soon as the function returns, but have the benefit of having effectively zero overhead to use.
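The difference between the two storage kinds can be observed from C itself. In this small sketch (function names are hypothetical), the static variable lives at one fixed location for the whole run, so its value survives between calls, while the automatic variable gets a fresh stack slot, and a fresh initialization, on every call.

```c
/* 'calls' has static storage: one fixed location for the whole program,
   initialized once, so the count accumulates across calls. */
int next_static(void) {
    static int calls = 0;
    return ++calls;
}

/* 'calls' has automatic storage: a new stack slot on each call, so the
   function always starts again from 0. */
int next_automatic(void) {
    int calls = 0;
    return ++calls;
}
```

Calling next_static twice yields 1 and then 2, whereas next_automatic yields 1 both times.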
Once a variable has an address assigned to it, there is no further need for the name of the variable, so it is discarded. Depending on the kind of name, the name may be discarded at preprocess time (for macro names), compile time (for static and local variables/functions), and link time (for global variables/functions.) If a symbol is exported (made visible to other programs so they can access it), the name will usually remain somewhere in a "symbol table" which does take up a trivial amount of memory and disk space.
Does the compiler just substitute variable_name for 0xaaaaaaaa whenever it sees it
Yes.
and if so, wouldn't it have to use memory in order to make that substitution?
Yes. But that memory belongs to the compiler, and it is released once your code has been compiled, so why would you care about it?