I have written the following code, can you explain me what does the assembly tell here.
typedef struct
{
int abcd[5];
} hh;
void main()
{
printf("%d", ((hh*)0)+1);
}
Assembly:
.file "aa.c"
.section ".rodata"
.align 8
.LLC0:
.asciz "%d\n"
.section ".text"
.align 4
.global main
.type main, #function
.proc 020
main:
save %sp, -112, %sp
sethi %hi(.LLC0), %g1
or %g1, %lo(.LLC0), %o0
mov 20, %o1
call printf, 0
nop
return %i7+8
nop
.size main, .-main
.ident "GCC: (GNU) 4.2.1"
Oh wow, SPARC assembly language, I haven't seen that in years.
I guess we go line by line? I'm going to skip some of the uninteresting boilerplate.
.section ".rodata"
.align 8
.LLC0:
.asciz "%d\n"
This is the string constant you used in printf (so obvious, I know!) The important things to notice are that it's in the .rodata section (sections are divisions of the eventual executable image; this one is for "read-only data" and will in fact be immutable at runtime) and that it's been given the label .LLC0. Labels that begin with a dot are private to the object file. Later, the compiler will refer to that label when it wants to load the address of the string constant.
.section ".text"
.align 4
.global main
.type main, #function
.proc 020
main:
.text is the section for actual machine code. This is the boilerplate header for defining the global function named main, which at the assembly level is no different from any other function (in C -- not necessarily so in C++). I don't remember what .proc 020 does.
save %sp, -112, %sp
Save the previous register window and adjust the stack pointer downward. If you don't know what a register window is, you need to read the architecture manual: http://sparc.org/wp-content/uploads/2014/01/v8.pdf.gz. (V8 is the last 32-bit iteration of SPARC, V9 is the first 64-bit one. This appears to be 32-bit code.)
sethi %hi(.LLC0), %g1
or %g1, %lo(.LLC0), %o0
This two-instruction sequence has the net effect of loading the address .LLC0 (that's your string constant) into register %o0, which is the first outgoing argument register. (The arguments to this function are in the incoming argument registers.)
mov 20, %o1
Load the immediate constant 100 into %o1, the second outgoing argument register. This is the value computed by ((foo *)0)+1. It's 20 because your struct foo is 20 bytes long (five 4-byte ints) and you asked for the second one within the array starting at address zero.
Incidentally, computing an offset from a pointer is only well-defined in C when there is actually a sufficiently large array at the address of the base pointer; ((foo *)0) is a null pointer, so there isn't an array there, so the expression ((foo *)0)+1 technically has undefined behavior. GCC 4.2.1, targeting hosted SPARC, happens to have interpreted it as "pretend there is an arbitrarily large array of foos at address zero and compute the expected offset for array member 1", but other (especially newer) compilers may do something completely different.
call printf, 0
nop
Call printf. I don't remember what the zero is for. The call instruction has a delay slot (again, read the architecture manual) which is filled in with a do-nothing instruction, nop.
return %i7+8
nop
Jump to the address in register %i7 plus eight. This has the effect of returning from the current function.
return also has a delay slot, which is filled in with another nop. There is supposed to be a restore instruction in this delay slot, matching the save at the top of the function, so that main's caller gets its register window back. I don't know why it's not there. Discussion in the comments talks about main possibly not needing to pop the register window, and/or your having declared main as void main() (which is not guaranteed to work with any C implementation, unless its documentation specifically says so, and is always bad style) ... but pushing and not popping the register window is such a troublesome thing to do on a SPARC that I don't find either explanation convincing. I might even call it a compiler bug.
The assembly calls printf, passing your text buffer and the number 20 on the stack (which is what you asked for in a roundabout way).
Related
I'm writing a simple program that converts brainfuck code into x86_64 assembly. Part of that involves creating a large zero-initialized array at the beginning of the program. Thus, each compiled program starts with the following assembly code:
.data
ARR:
.space 32430
.text
.globl _start
.type _start, #function
_start:
... #code as compiled from the brainfuck program
...
From there the compiled program is supposed to be able to access any part of that array, but it should segfault if it tries to access memory before or after it.
Because the array is followed directly by a .text section, which by my understanding is read only, and because it is the first section of the program, I expected that my desired behavior would follow naturally. Unfortunately, this is not the case: compiled programs are able to access non-zero initialized data to the left of (that is, at lower addresses than) the beginning of the array.
Why is this the case and is there anything I can include in the assembly code that would prevent it?
This is, of course, highly system-dependent, but since your observations suit a typical Linux/GNU system, I'll refer to such a system.
what I assume is that the linker isn't putting my segments where I think it is.
True, the linker puts the segments not in the order they appear in your code snippet, but rather .text first, .data second. We can see this e. g. with
> objdump -h ARR
ARR: file format elf32-i386
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000042 08048074 08048074 00000074 2**2
CONTENTS, ALLOC, LOAD, READONLY, CODE
1 .data 00007eae 080490b8 080490b8 000000b8 2**2
CONTENTS, ALLOC, LOAD, DATA
compiled programs are able to access non-zero initialized data to the left of (that is, at lower addresses than) the beginning of the array.
Why is this the case …
As we also see in the above example, the .data section is linked at memory address 080490b8. Although memory pages have the length PAGE_SIZE (here getconf PAGE_SIZE yields 4096, i. e. 100016) and start at multiples of that size, the data starts at an address offset equal to the file offset 000000b8 (where the data is stored in the disk file), because the file pages containing the .data section are mapped into memory as copy-on-write pages. The non-zero initialized data below the .data section is just what happens to be in the first file page at bytes 0 to b716, including .text.
… is there anything I can include in the assembly code that would prevent it?
I'd prefer a solution that places my segments such that a bad array access causes a segfault.
As Margaret Bloom and Ped7g hinted at, you could allocate additional data below ARR and create an inaccessible guard page. This can be achieved with minimal effort by aligning ARR to the next page address. The example program below implements this and allows to test it by accepting an index argument (optionally negative) with which the ARR data is accessed; if within bounds, it should exit with status 0, otherwise segfault. Note: This method works only if the .text section does not end at a page boundary, because if it does, the .align 4096 is without effect; but since the assembly code is created with a converter program, that program should be able to check this and add a few extra .text bytes if needed.
.data
.align 4096
ARR:
.space 30000 # we'll actually get 32768
.text
.globl _start
.type _start, #function
_start:
mov (%esp),%ebx # argc
cmp $1,%ebx
jbe 9f
mov $0,%ax
mov $1,%ebx # sign 1
mov 8(%esp),%esi # argv[1]
0: movb (%esi),%cl # convert argument string to integer
jcxz 1f
sub $'0',%cl
js 2f
mov $10,%dx
mul %dx
add %cx,%ax
jmp 3f
2: neg %ebx # change sign
3: add $1,%esi
jmp 0b
1: mul %ebx # multiply with sign 1 or -1
movzx ARR(%eax),%ebx# load ARR[atoi(argv[1])]
9: mov $1,%eax
int $128 # _exit(ebx);
I have this short hello world program:
#include <stdio.h>
static const char* msg = "Hello world";
int main(){
printf("%s\n", msg);
return 0;
}
I compiled it into the following assembly code with gcc:
.file "hello_world.c"
.section .rodata
.LC0:
.string "Hello world"
.data
.align 4
.type msg, #object
.size msg, 4
msg:
.long .LC0
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
andl $-16, %esp
subl $16, %esp
movl msg, %eax
movl %eax, (%esp)
call puts
movl $0, %eax
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4"
.section .note.GNU-stack,"",#progbits
My question is: are all parts of this code essential if I were to write this program in assembly (instead of writing it in C and then compiling to assembly)? I understand the assembly instructions but there are certain pieces I don't understand. For instance, I don't know what .cfi* is, and I'm wondering if I would need to include this to write this program in assembly.
The absolute bare minimum that will work on the platform that this appears to be, is
.globl main
main:
pushl $.LC0
call puts
addl $4, %esp
xorl %eax, %eax
ret
.LC0:
.string "Hello world"
But this breaks a number of ABI requirements. The minimum for an ABI-compliant program is
.globl main
.type main, #function
main:
subl $24, %esp
pushl $.LC0
call puts
xorl %eax, %eax
addl $28, %esp
ret
.size main, .-main
.section .rodata
.LC0:
.string "Hello world"
Everything else in your object file is either the compiler not optimizing the code down as tightly as possible, or optional annotations to be written to the object file.
The .cfi_* directives, in particular, are optional annotations. They are necessary if and only if the function might be on the call stack when a C++ exception is thrown, but they are useful in any program from which you might want to extract a stack trace. If you are going to write nontrivial code by hand in assembly language, it will probably be worth learning how to write them. Unfortunately, they are very poorly documented; I am not currently finding anything that I think is worth linking to.
The line
.section .note.GNU-stack,"",#progbits
is also important to know about if you are writing assembly language by hand; it is another optional annotation, but a valuable one, because what it means is "nothing in this object file requires the stack to be executable." If all the object files in a program have this annotation, the kernel won't make the stack executable, which improves security a little bit.
(To indicate that you do need the stack to be executable, you put "x" instead of "". GCC may do this if you use its "nested function" extension. (Don't do that.))
It is probably worth mentioning that in the "AT&T" assembly syntax used (by default) by GCC and GNU binutils, there are three kinds of lines: A line
with a single token on it, ending in a colon, is a label. (I don't remember the rules for what characters can appear in labels.) A line whose first token begins with a dot, and does not end in a colon, is some kind of directive to the assembler. Anything else is an assembly instruction.
related: How to remove "noise" from GCC/clang assembly output? The .cfi directives are not directly useful to you, and the program would work without them. (It's stack-unwind info needed for exception handling and backtraces, so -fomit-frame-pointer can be enabled by default. And yes, gcc emits this even for C.)
As far as the number of asm source lines needed to produce a value Hello World program, obviously we want to use libc functions to do more work for us.
#Zwol's answer has the shortest implementation of your original C code.
Here's what you could do by hand, if you don't care about the exit status of your program, just that it prints your string.
# Hand-optimized asm, not compiler output
.globl main # necessary for the linker to see this symbol
main:
# main gets two args: argv and argc, so we know we can modify 8 bytes above our return address.
movl $.LC0, 4(%esp) # replace our first arg with the string
jmp puts # tail-call puts.
# you would normally put the string in .rodata, not leave it in .text where the linker will mix it with other functions.
.section .rodata
.LC0:
.asciz "Hello world" # asciz zero-terminates
The equivalent C (you just asked for the shortest Hello World, not one that had identical semantics):
int main(int argc, char **argv) {
return puts("Hello world");
}
Its exit status is implementation-defined but it definitely prints. puts(3) returns "a non-negative number", which could be outside the 0..255 range, so we can't say anything about the program's exit status being 0 / non-zero in Linux (where the process's exit status is the low 8 bits of the integer passed to the exit_group() system call (in this case by the CRT startup code that called main()).
Using JMP to implement the tail-call is a standard practice, and commonly used when a function doesn't need to do anything after another function returns. puts() will eventually return to the function that called main(), just like if puts() had returned to main() and then main() had returned. main()'s caller still has to deal with the args it put on the stack for main(), because they're still there (but modified, and we're allowed to do that).
gcc and clang don't generate code that modifies arg-passing space on the stack. It is perfectly safe and ABI-compliant, though: functions "own" their args on the stack, even if they were const. If you call a function, you can't assume that the args you put on the stack are still there. To make another call with the same or similar args, you need to store them all again.
Also note that this calls puts() with the same stack alignment that we had on entry to main(), so again we're ABI-compliant in preserving the 16B alignment required by modern version of the x86-32 aka i386 System V ABI (used by Linux).
.string zero-terminates strings, same as .asciz, but I had to look it up to check. I'd recommend just using .ascii or .asciz to make sure you're clear on whether your data has a terminating byte or not. (You don't need one if you use it with explicit-length functions like write())
In the x86-64 System V ABI (and Windows), args are passed in registers. This makes tail-call optimization a lot easier, because you can rearrange args or pass more args (as long as you don't run out of registers). This makes compilers willing to do it in practice. (Because as I said, they currently don't like to generate code that modifies the incoming arg space on the stack, even though the ABI is clear that they're allowed to, and compiler generated functions do assume that callees clobber their stack args.)
clang or gcc -O3 will do this optimization for x86-64, as you can see on the Godbolt compiler explorer:
#include <stdio.h>
int main() { return puts("Hello World"); }
# clang -O3 output
main: # #main
movl $.L.str, %edi
jmp puts # TAILCALL
# Godbolt strips out comment-only lines and directives; there's actually a .section .rodata before this
.L.str:
.asciz "Hello World"
Static data addresses always fit in the low 31 bits of address-space, and executable don't need position-independent code, otherwise the mov would be lea .LC0(%rip), %rdi. (You'll get this from gcc if it was configured with --enable-default-pie to make position-independent executables.)
How to load address of function or label into register in GNU Assembler
Hello World using 32-bit x86 Linux int 0x80 system calls directly, no libc
See Hello, world in assembly language with Linux system calls? My answer there was originally written for SO Docs, then moved here as a place to put it when SO Docs closed down. It didn't really belong here so I moved it to another question.
related: A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux. The smallest binary file you can run that just makes an exit() system call. That is about minimizing the binary size, not the source size or even just the number of instructions that actually run.
For example: In the following code, how and where is the number '10' used for the comparison stored?
#include<stdio.h>
#include<conio.h>
int main()
{
int x = 5;
if (x > 10)
printf("X is greater than 10");
else if (x < 10)
printf("X is lesser than 10");
else
printf("x = 10");
getch();
return 0;
}
Pardon me for not giving enough details. Instead of initializing 'x' directly with '5', if we scan and get it from the user we know how memory is allocated for 'x'. But how memory is allocated for the literal number '10' which is not stored in any variable?
In your particular code, x is initialized to 5 and is never changed. An optimizing compiler is able to constant fold and propagate that information. So it probably would generate the equivalent of
int main() {
printf("X is lesser than 10");
getch();
return 0;
}
notice that the compiler would also have done dead code elimination.
So both constants 5 and 10 would have disappeared.
BTW, <conio.h> and getch are not in standard C99 or C11. My Linux system don't have them.
In general (and depending upon the target processor's instruction set and the ABI) small constants are often embedded in some single machine code instruction (as an immediate operand), as Kilian answered. Some large constants (e.g. floating point numbers, literal strings, most const global or static arrays and aggregates) might get inserted and compiled as read only data in the code segment (then the constant inside machine register-load instructions would be an address or some offset relative to PC for PIC); see also this. Some architectures (e.g. SPARC, RISC-V, ARM, and other RISC) are able to load a wide constant in a register by two consecutive instructions (loading the constant in two parts), and this impacts the relocation format for the linker (e.g. in object files and executables, often in ELF).
I suggest to ask your compiler to emit assembler code, and have a glance at that assembler code. If using GCC (e.g. on Linux, or with Cygwin or MinGW) try to compile with gcc -Wall -O -fverbose-asm -S ; on my Debian/Linux system if I replace getch by getchar in your code I am getting:
.section .rodata.str1.1,"aMS",#progbits,1
.LC0:
.string "X is lesser than 10"
.text
.globl main
.type main, #function
main:
.LFB11:
.cfi_startproc
subq $8, %rsp #,
.cfi_def_cfa_offset 16
movl $.LC0, %edi #,
movl $0, %eax #,
call printf #
movq stdin(%rip), %rdi # stdin,
call _IO_getc #
movl $0, %eax #,
addq $8, %rsp #,
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE11:
.size main, .-main
.ident "GCC: (Debian 4.9.2-10) 4.9.2"
.section .note.GNU-stack,"",#progbits
If you are using a 64 bits Windows system, your architecture is very likely to be x86-64. There are tons of documentation describing the ISA (see answers to this) and the x86 calling conventions (and also the Linux x86-64 ABI; you'll find the equivalent document for Windows).
BTW, you should not really care how such constants are implemented. The semantics of your code should not change, whatever the compiler choose to do for implementing them. So leave the optimizations (and such low level choices) to the compiler (i.e. your implementation of C).
The constant 10 is probably stored as an immediate constant in the opcode stream. Issuing a CMP AX,10, with the constant included in the opcode, is usually both smaller and faster than a CMP AX, [BX], where the comparison value must be loaded from memory.
If the constant is too large to fit into the opcode, the alternative is to store it in memory like a static variable, but if the instruction set allows embedded constants, a good compiler should use it - after all, that addressing mode was presumably added because it has advantages over the others.
static int func_name (const uint8_t * address)
{
int result;
asm ("movl $1f, %0; movzbl %1, %0; 1:"
: "=&a" (result) : "m" (*address));
return result;
}
I have gone through inline assembly references over internet.
But i am unable to figure out what this code is doing, eg. what is $1f ?
And what does "m" means? Isn't the normal inline convention to use "=r" and "r" ?
The code is functionally identical to return *address but not absolutely equivalent to this wrt. to the generated binary / object file.
In ELF, the usage of the forward reference (i.e. the mov $1f, ... to retrieve the address of the assembly local label) results in the creation of what's called a relocation. A relocation is an instruction to the linker (either at executable creation or later to the dynamic linker at executable/library loading) to insert a value only known at link/load time. In the object code, this looks like:
Disassembly of section .text:
0000000000000000 :
0: b8 00 00 00 00 mov $0x0,%eax
5: 0f b6 07 movzbl (%rdi),%eax
8: c3 retq
Notice the value (at offset 1 into the .text section) is zero here even though that's actually not correct - it depends on where in the running code the function will end up. Only the (dynamic) linker can ultimately know this, and the information that this piece of memory needs to be updated as it is loaded is actually placed into the object file:
$ readelf -a xqf.o
ELF Header:
[ ... ]
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .text PROGBITS 0000000000000000 00000040
0000000000000009 0000000000000000 AX 0 0 16
[ 2] .rela.text RELA 0000000000000000 000004e0
0000000000000018 0000000000000018 10 1 8
[ ... ]
Relocation section '.rela.text' at offset 0x4e0 contains 1 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000000001 00020000000a R_X86_64_32 0000000000000000 .text + 8
[ ... ]
This ELF section entry says:
look at offset 1 into the .text section
there's a 32bit value that will be zero-extended to 64-bit (R_X86_64_32). This may have been intended for use in 32-bit code, but in a 64-bit non-PIE executable that's still the most efficient way to put an address into a register; smaller than lea 1f(%rip), %0 for a R_X86_64_PC32 RIP-relative relocation. And yes a RIP-relative LEA into a 32-bit register is legal, and saves a byte of machine code if you don't care about truncating the address.
the value you (as the linker) need to put there is that of .text + 8 (which will have to be computed at link / load time)
This entry is created thanks to the mov $1f, %0 instruction. If you leave that out (or just write return *address), it won't be there.
I've forced code generation for the above by removing the static qualifier; without doing so, a simple compile actually creates no code at all (static code gets eliminated if not used, and, a lot of the time, inlined if used).
Due to the fact that the function is static, as said, it'll normally be inlined at the call site by the compiler. The information where it's used therefore usually gets lost, as does the ability of a debugger to instrument it. But the trick shown here can recover this (indirectly), because there will be one relocation entry created per use of the function. In addition to that, methods like this can be used to establish instrumentation points within the binary; insert well-known/strictly-defined but functionally-meaningless small assembly statements at locations recoverable through the object file format, and then let e.g. the debugger / tracing utilities replace them with "more useful" things when needed.
$1f is the address of the 1 label. The f specifies to look for the first label named 1 in the forward direction. "m" is an input operand that is in memory. "=&a" is an output operand that uses the eax register. a specifies the register to use, = makes it an output operand, and & guarantees that other operands will not share the same register.
Here, %0 will be replaced with the first operand (the eax register) and %1 by the second operand (The address pointed to by address).
All these and more are explained in the GCC documentation on Inline assembly and asm contraints.
This piece of code (apart from being non-compilable due to two typos) is hardly useful.
This is what it turns into (use the -S switch):
_func_name:
movl 4(%esp), %edx ; edx = the "address" parameter
movl $1f, %eax ; eax = the address of the "1" label
movzbl (%edx), %eax; eax = byte from address in edx, IOW, "*address"
1:
ret
So the entire body of the function can be replaced with just
return *address;
This is a code snippet from the PintOS project.
The function here is used by the OS kernel to read a byte at address from the user address space. That is done by movzbl %1, %0 where 0% is result and 1% is address. But before that, the kernel has to move the address of $1f(which is the address of the instruction right after movzbl %1, %0) to the eax register. This move seems useless because some context information is missing. The kernel does that for the page fault interrupt handler to use it. Because address could be an invalid one offered by the user, and it might cause a page fault. When that happened, the interrupt handler would take over, set eip equal to eax(which is the memory address of $1f), and also set eax to -1 to indicate that the read failed. After that, the kernel was able to return from the handler to $1f and move on. Without saving the address of $1f, the handler would have no idea where it should return to, and could only go back to movzbl %1, %0 again and again.
The macro expansion of __read_mostly :
#define __read_mostly __attribute__((__section__(".data..read_mostly"))
This one is from cache.h
__init:
#define __init __section(.init.text) __cold notrace
from init.h
__exit:
#define __exit __section(.exit.text) __exitused __cold notrace
After searching through net i have not found any good explanation of
what is happening there.
Additonal question : I have heard about various "linker magic"
employed in kernel development. Any information
regarding this will be wonderful.
I have some ideas about these macros about what they do. Like __init supposed to indicate that the function code can be removed after initialization. __read_mostly is for indicating that the data is seldom written and by this it minimizes cache misses. But i have not idea about How they do it. I mean they are gcc extensions. So in theory they can be demonstrated by small userland c code.
UPDATE 1:
I have tried to test the __section__ with arbitrary section name. the test code :
#include <stdio.h>
#define __read_mostly __attribute__((__section__("MY_DATA")))
struct ro {
char a;
int b;
char * c;
};
struct ro my_ro __read_mostly = {
.a = 'a',
.b = 3,
.c = NULL,
};
int main(int argc, char **argv) {
printf("hello");
printf("my ro %c %d %p \n", my_ro.a, my_ro.b, my_ro.c);
return 0;
}
Now with __read_mostly the generated assembly code :
.file "ro.c"
.globl my_ro
.section MY_DATA,"aw",#progbits
.align 16
.type my_ro, #object
.size my_ro, 16
my_ro:
.byte 97
.zero 3
.long 3
.quad 0
.section .rodata
.LC0:
.string "hello"
.LC1:
.string "my ro %c %d %p \n"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %rbx
subq $24, %rsp
movl %edi, -20(%rbp)
movq %rsi, -32(%rbp)
movl $.LC0, %eax
movq %rax, %rdi
movl $0, %eax
.cfi_offset 3, -24
call printf
movq my_ro+8(%rip), %rcx
movl my_ro+4(%rip), %edx
movzbl my_ro(%rip), %eax
movsbl %al, %ebx
movl $.LC1, %eax
movl %ebx, %esi
movq %rax, %rdi
movl $0, %eax
call printf
movl $0, %eax
addq $24, %rsp
popq %rbx
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (GNU) 4.4.6 20110731 (Red Hat 4.4.6-3)"
.section .note.GNU-stack,"",#progbits
Now without the __read_mostly macro the assembly code remains more or less the same.
this is the diff
--- rm.S 2012-07-17 16:17:05.795771270 +0600
+++ rw.S 2012-07-17 16:19:08.633895693 +0600
## -1,6 +1,6 ##
.file "ro.c"
.globl my_ro
- .section MY_DATA,"aw",#progbits
+ .data
.align 16
.type my_ro, #object
.size my_ro, 16
So essentially only the a subsection is created, nothing fancy.
Even the objdump disassmbly does not show any difference.
So my final conclusion about them, its the linker's job do something for data section marked with a special name. I think linux kernel uses some kind of custom linker script do achieve these things.
One of the thing about __read_mostly, data which were put there can be grouped and managed in a way so that cache misses can be reduced.
Someone at lkml submitted a patch to remove __read_mostly. Which spawned a fascinated discussion on the merits and demerits of __read_mostly.
here is the link : https://lkml.org/lkml/2007/12/13/477
I will post further update on __init and __exit.
UPDATE 2
These macros __init , __exit and __read_mostly put the contents of data(in case of __read_mostly) and text(in cases of __init and __exit) are put into custom named sections. These sections are utilized by the linker. Now as linker is not used as its default behaviour for various reasons, A linker script is employed to achieve the purposes of these macros.
A background may be found how a custom linker script can be used to eliminate dead code(code which is linked to by linker but never executed). This issue is of very high importance in embedded scenarios. This document discusses how a linker script can be fine tuned to remove dead code : elinux.org/images/2/2d/ELC2010-gc-sections_Denys_Vlasenko.pdf
In case kernel the initial linker script can be found include/asm-generic/vmlinux.lds.h. This is not the final script. This is kind of starting point, the linker script is further modified for different platforms.
A quick look at this file the portions of interest can immediately found:
#define READ_MOSTLY_DATA(align) \
. = ALIGN(align); \
*(.data..read_mostly) \
. = ALIGN(align);
It seems this section is using the ".data..readmostly" section.
Also you can find __init and __exit section related linker commands :
#define INIT_TEXT \
*(.init.text) \
DEV_DISCARD(init.text) \
CPU_DISCARD(init.text) \
MEM_DISCARD(init.text)
#define EXIT_TEXT \
*(.exit.text) \
DEV_DISCARD(exit.text) \
CPU_DISCARD(exit.text) \
MEM_DISCARD(exit.text)
Linking seems pretty complex thing to do :)
GCC attributes are a general mechanism to give instructions to the compiler that are outside the specification of the language itself.
The common facility that the macros you list is the use of the __section__ attribute which is described as:
The section attribute specifies that a function lives in a particular section. For example, the declaration:
extern void foobar (void) __attribute__ ((section ("bar")));
puts the function foobar in the bar section.
So what does it mean to put something in a section? An object file is divided into sections: .text for executable machine code, .data for read-write data, .rodata for read-only data, .bss for data initialised to zero, etc. The names and purposes of these sections is a matter of platform convention, and some special sections can only be accessed from C using the __attribute__ ((section)) syntax.
In your example you can guess that .data..read_mostly is a subsection of .data for data that will be mostly read; .init.text is a text (machine code) section that will be run when the program is initialised, etc.
On Linux, deciding what to do with the various sections is the job of the kernel; when userspace requests to exec a program, it will read the program image section-by-section and process them appropriately: .data sections get mapped as read-write pages, .rodata as read-only, .text as execute-only, etc. Presumably .init.text will be executed before the program starts; that could either be done by the kernel or by userspace code placed at the program's entry point (I'm guessing the latter).
If you want to see the effect of these attributes, a good test is to run gcc with the -S option to output assembler code, which will contain the section directives. You could then run the assembler with and without the section directives and use objdump or even hex dump the resulting object file to see how it differs.
As far as I know, these macros are used exclusively by the kernel. In theory, they could apply to user-space, but I don't believe this is the case. They all group similar variable and code together for different effects.
init/exit
A lot of code is needed to setup the kernel; this happens before any user space is running at all. Ie, before the init task runs. In many cases, this code is never used again. So it would be a waste to consume un-swappable RAM after boot. The familiar kernel message Freeing init memory is a result of the init section. Some drivers maybe configured as modules. In these cases, they exit. However, if they are compiled into the kernel, the don't necessarily exit (they may shutdown). This is another section to group this type of code/data.
cold/hot
Each cache line has a fixed sized. You can maximize a cache by putting the same type of data/function in it. The idea is that often used code can go side by side. If the cache is four instructions, the end of one hot routine should merge with the beginning of the next hot routine. Similarly, it is good to keep seldom used code together, as we hope it never goes in the cache.
read_mostly
The idea here is similar to hot; the difference with data we can update the values. When this is done, the entire cache line becomes dirty and must be re-written to main RAM. This is needed for multi-CPU consistency and when that cache line goes stale. If nothing has changed in the difference between the CPU cache version and main memory, then nothing needs to happen on an eviction. This optimizes the RAM bus so that other important things can happen.
These items are strictly for the kernel. Similar tricks could (are?) be implemented for user space. That would depend on the loader in use; which is often different depending on the libc in use.