On Windows data can be loaded from DLLs, but it requires indirection through a pointer in the import address table. As a result, the compiler must know if an object that is being accessed is being imported from a DLL by using the __declspec(dllimport) type specifier.
This is unfortunate because it means that a header for a Windows library designed to be used as either a static library or a dynamic library needs to know which version of the library the program is linking to. This requirement does not apply to functions, which are transparently emulated for DLLs with a stub function that calls the real function, whose address is stored in the import address table.
On Linux the dynamic linker (ld.so) copies the values of all linked data objects from a shared object into a private mapped region for each process. This doesn't require indirection because the address of the private mapped region is local to the module, so its address is decided when the program is linked (and in the case of position independent executables, relative addressing is used).
Why doesn't Windows do the same? Is there a situation where a DLL might be loaded more than once, and thus require multiple copies of linked data? Even if that were the case, it wouldn't apply to read-only data.
It seems that the MSVCRT handles this issue by defining the _DLL macro when targeting the dynamic C runtime library (with the /MD or /MDd flag), then using that in all standard headers to conditionally declare all exported symbols with __declspec(dllimport). I suppose you could reuse this macro if you only supported statically linking when using the static C runtime and dynamically linking when using the dynamic C runtime.
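For illustration, here is a minimal sketch of the kind of header switch this implies. The names MYLIB_SHARED, MYLIB_BUILDING, MYLIB_API, mylib_counter and mylib_next are hypothetical and not taken from any real library. The assumption is that the user defines MYLIB_SHARED when linking against the DLL and leaves it undefined for the static library.

```c
/* Hypothetical library header logic: every name starting with MYLIB_ or
   mylib_ is invented for this sketch. */
#if defined(_WIN32) && defined(MYLIB_SHARED)
#  ifdef MYLIB_BUILDING
#    define MYLIB_API __declspec(dllexport)   /* compiling the DLL itself */
#  else
#    define MYLIB_API __declspec(dllimport)   /* compiling a client of the DLL */
#  endif
#else
#  define MYLIB_API                           /* static library, or not Windows */
#endif

/* What the public header would declare. */
extern MYLIB_API int mylib_counter;

/* In this single-file sketch the library's definition lives here too.
   It is skipped in the dllimport configuration, where the DLL owns it. */
#if !defined(MYLIB_SHARED) || defined(MYLIB_BUILDING)
int mylib_counter = 42;
#endif

/* Client-side access: direct in the static case, through the IAT when
   MYLIB_API expands to __declspec(dllimport). */
int mylib_next(void)
{
    return ++mylib_counter;
}
```

The cost is the one described above: the client has to pick the right macro for the library flavour it links against.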
References:
LNK4217 - Russ Keldorph's WebLog (emphasis mine)
__declspec(dllimport) can be used on both code and data, and its semantics are subtly different between the two. When applied to a routine call, it is purely a performance optimization. For data, it is required for correctness.
[...]
Importing data
If you export a data item from a DLL, you must declare it with __declspec(dllimport) in the code that accesses it. In this case, instead of generating a direct load from memory, the compiler generates a load through a pointer, resulting in one additional indirection. Unlike calls, where the linker will fix up the code correctly whether the routine was declared __declspec(dllimport) or not, accessing imported data requires __declspec(dllimport). If omitted, the code will wind up accessing the IAT entry instead of the data in the DLL, probably resulting in unexpected behavior.
Importing into an Application Using __declspec(dllimport)
Using __declspec(dllimport) is optional on function declarations, but the compiler produces more efficient code if you use this keyword. However, you must use __declspec(dllimport) for the importing executable to access the DLL's public data symbols and objects.
Importing Data Using __declspec(dllimport)
When you mark the data as __declspec(dllimport), the compiler automatically generates the indirection code for you.
Importing Using DEF Files (interesting historical notes about accessing the IAT directly)
How do I share data in my DLL with an application or with other DLLs?
By default, each process using a DLL has its own instance of all the DLL's global and static variables.
Linker Tools Warning LNK4217
What happens when you get dllimport wrong? (seems to be unaware of data semantics)
How do I export data from a DLL?
CRT Library Features (documents the _DLL macro)
Linux and Windows use different strategies for accessing data stored in dynamic libraries.
On Linux, an undefined reference to an object is resolved to a library at link time. The linker finds the size of the object and reserves space for it in the .bss or the .rdata segment of the executable. When executed, the dynamic linker (ld.so) resolves the symbol to a dynamic library (again), and copies the object from the dynamic library to the process's memory.
On Windows, an undefined reference to an object is resolved to an import library at link time, and no space is reserved for it. When the module is executed, the dynamic linker resolves the symbol to a dynamic library, and creates a copy on write memory map in the process, backed by a shared data segment in the dynamic library.
The advantage of a copy on write memory map is that if the linked data is unchanged, then it can be shared with other processes. In practice this is a trifling benefit which greatly increases complexity, both for the toolchain and programs using dynamic libraries. For objects which are actually written this is always less efficient.
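To make the two access patterns concrete, here is a small model in plain C. It is only an analogy: dll_object, imp_dll_object and exe_copy are hypothetical stand-ins for the data inside the DLL, the import address table slot, and the space an ELF executable reserves for its private copy.

```c
#include <stddef.h>

static unsigned char dll_object[8];                  /* stands in for data living in the DLL */
static unsigned char *imp_dll_object = dll_object;   /* stands in for the IAT slot (__imp__ symbol) */
static unsigned char exe_copy[8];                    /* stands in for space reserved in the executable */

/* Linux-style access: a single store to an address known at link time. */
void touch_direct(size_t i)
{
    exe_copy[i] = 1;
}

/* Windows-style access: load the pointer first, then store through it. */
void touch_indirect(size_t i)
{
    imp_dll_object[i] = 1;
}
```

The extra load in touch_indirect is the indirection that __declspec(dllimport) asks the compiler to generate.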
I suspect, although I have no evidence, that this decision was made for a particular and now outdated use case. Perhaps it was common practice to use large (for the time) read only objects in dynamic libraries on 16-bit Windows (in official Microsoft programs or otherwise). Either way, I doubt anyone at Microsoft has the expertise and time to change it now.
In order to investigate the issue I created a program which writes to an object from a dynamic library. It writes one byte per page (4096 bytes) in the object, then writes the entire object, then retries the initial one byte per page write. If the object is reserved for the process before main is called, the first and third loops should take approximately the same time, and the second loop should take longer than both. If the object is a copy on write map to a dynamic library, the first loop should take at least as long as the second, and the third should take less time than both.
The results are consistent with my hypothesis, and analyzing the disassembly confirms that Linux accesses the dynamic library data at a link time address, relative to the program counter. Surprisingly, Windows not only accesses the data indirectly, it also reloads the pointer to the data and its length from the import address table on every loop iteration, even with optimizations enabled. This was tested with Visual Studio 2010 on Windows XP, so maybe things have changed, although I doubt they have.
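If the per-iteration reloads matter, one way to avoid depending on the optimizer is to read the pointer and the length into locals before the loop. This is a minimal sketch with a hypothetical helper name, dirty_pages. Why the compiler reloads them is a guess on my part: a store through an unsigned char pointer is allowed to alias any object, which may be what forces the reloads, but nothing here measures that.

```c
#include <stddef.h>

/* Writes one zero byte per page of the object and returns the number of
   pages touched. The length is read exactly once, so stores through
   `data` cannot force it to be re-read inside the loop. */
size_t dirty_pages(volatile unsigned char *data,
                   const unsigned int *len_ptr,
                   unsigned int page_size)
{
    volatile unsigned char *base = data;   /* pointer hoisted into a local */
    unsigned int len = *len_ptr;           /* length hoisted into a local */
    size_t touched = 0;
    unsigned int i;

    for (i = 0; i < len; i += page_size) {
        base[i] = 0;
        touched++;
    }
    return touched;
}
```

With the test program below, the call would look like dirty_pages(libdat_dat, &libdat_dat_len, PAGESIZE).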
Here are the results for Linux:
$ dd bs=1M count=16 if=/dev/urandom of=libdat.dat
$ xxd -i libdat.dat libdat.c
$ gcc -O3 -g -shared -fPIC libdat.c -o libdat.so
$ gcc -O3 -g -no-pie -L. -ldat dat.c -o dat
$ LD_LIBRARY_PATH=. ./dat
local = 0x1601060
libdat_dat = 0x601040
libdat_dat_len = 0x601020
dirty= 461us write= 12184us retry= 456us
$ nm dat
[...]
0000000000601040 B libdat_dat
0000000000601020 B libdat_dat_len
0000000001601060 B local
[...]
$ objdump -d -j.text dat
[...]
400693: 8b 35 87 09 20 00 mov 0x200987(%rip),%esi # 601020 <libdat_dat_len>
[...]
4006a3: 31 c0 xor %eax,%eax # zero loop counter
4006a5: 48 8d 15 94 09 20 00 lea 0x200994(%rip),%rdx # 601040 <libdat_dat>
4006ac: 0f 1f 40 00 nopl 0x0(%rax) # align loop for efficiency
4006b0: 89 c1 mov %eax,%ecx # store data offset in ecx
4006b2: 05 00 10 00 00 add $0x1000,%eax # add PAGESIZE to data offset
4006b7: c6 04 0a 00 movb $0x0,(%rdx,%rcx,1) # write a zero byte to data
4006bb: 39 f0 cmp %esi,%eax # test loop condition
4006bd: 72 f1 jb 4006b0 <main+0x30> # continue loop if data is left
[...]
Here are the results for Windows:
$ cl /Ox /Zi /LD libdat.c /link /EXPORT:libdat_dat /EXPORT:libdat_dat_len
[...]
$ cl /Ox /Zi dat.c libdat.lib
[...]
$ dat.exe # note low resolution timer means retry is too small to measure
local = 0041EEA0
libdat_dat = 1000E000
libdat_dat_len = 1100E000
dirty= 20312us write= 3125us retry= 0us
$ dumpbin /symbols dat.exe
[...]
9000 .data
1000 .idata
5000 .rdata
1000 .reloc
17000 .text
[...]
$ dumpbin /disasm dat.exe
[...]
004010BA: 33 C0 xor eax,eax # zero loop counter
[...]
004010C0: 8B 15 8C 63 42 00 mov edx,dword ptr [__imp__libdat_dat] # store data pointer in edx
004010C6: C6 04 02 00 mov byte ptr [edx+eax],0 # write a zero byte to data
004010CA: 8B 0D 88 63 42 00 mov ecx,dword ptr [__imp__libdat_dat_len] # store data length in ecx
004010D0: 05 00 10 00 00 add eax,1000h # add PAGESIZE to data offset
004010D5: 3B 01 cmp eax,dword ptr [ecx] # test loop condition
004010D7: 72 E7 jb 004010C0 # continue loop if data is left
[...]
Here is the source code used for both tests:
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
typedef FILETIME time_l;
time_l time_get(void) {
FILETIME ret; GetSystemTimeAsFileTime(&ret); return ret;
}
long long int time_diff(time_l const *c1, time_l const *c2) {
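/* FILETIME counts 100 ns ticks: join the two 32-bit halves, then divide by 10 for microseconds */
ULONGLONG a = ((ULONGLONG)c1->dwHighDateTime << 32) | c1->dwLowDateTime;
ULONGLONG b = ((ULONGLONG)c2->dwHighDateTime << 32) | c2->dwLowDateTime;
return (long long int)(b - a) / 10;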
}
#else
#include <unistd.h>
#include <time.h>
#include <stdlib.h>
typedef struct timespec time_l;
time_l time_get(void) {
time_l ret; clock_gettime(CLOCK_MONOTONIC, &ret); return ret;
}
long long int time_diff(time_l const *c1, time_l const *c2) {
return 1LL*c2->tv_nsec/1000-c1->tv_nsec/1000+c2->tv_sec*1000000-c1->tv_sec*1000000;
}
#endif
#ifndef PAGESIZE
#define PAGESIZE 4096
#endif
#ifdef _WIN32
#define DLLIMPORT __declspec(dllimport)
#else
#define DLLIMPORT
#endif
extern DLLIMPORT unsigned char volatile libdat_dat[];
extern DLLIMPORT unsigned int libdat_dat_len;
unsigned int local[4096];
int main(void) {
unsigned int i;
time_l t1, t2, t3, t4;
long long int d1, d2, d3;
t1 = time_get();
for(i=0; i < libdat_dat_len; i+=PAGESIZE) {
libdat_dat[i] = 0;
}
t2 = time_get();
for(i=0; i < libdat_dat_len; i++) {
libdat_dat[i] = 0xFF;
}
t3 = time_get();
for(i=0; i < libdat_dat_len; i+=PAGESIZE) {
libdat_dat[i] = 0;
}
t4 = time_get();
d1 = time_diff(&t1, &t2);
d2 = time_diff(&t2, &t3);
d3 = time_diff(&t3, &t4);
printf("%-15s= %18p\n%-15s= %18p\n%-15s= %18p\n", "local", (void *)local, "libdat_dat", (void *)libdat_dat, "libdat_dat_len", (void *)&libdat_dat_len);
printf("dirty=%9lldus write=%9lldus retry=%9lldus\n", d1, d2, d3);
return 0;
}
I sincerely hope someone else benefits from my research. Thanks for reading!
If in C I write:
int num;
Before I assign anything to num, is the value of num indeterminate?
Static variables (file scope and function static) are initialized to zero:
int x; // zero
int y = 0; // also zero
void foo() {
static int x; // also zero
}
Non-static variables (local variables) are indeterminate. Reading them prior to assigning a value results in undefined behavior.
void foo() {
int x;
printf("%d", x); // the compiler is free to crash here
}
In practice, they tend to just have some nonsensical value in there initially - some compilers may even put in specific, fixed values to make it obvious when looking in a debugger - but strictly speaking, the compiler is free to do anything from crashing to summoning demons through your nasal passages.
As for why it's undefined behavior instead of simply "undefined/arbitrary value", there are a number of CPU architectures that have additional flag bits in their representation for various types. A modern example would be the Itanium, which has a "Not a Thing" bit in its registers; of course, the C standard drafters were considering some older architectures.
Attempting to work with a value with these flag bits set can result in a CPU exception in an operation that really shouldn't fail (eg, integer addition, or assigning to another variable). And if you go and leave a variable uninitialized, the compiler might pick up some random garbage with these flag bits set - meaning touching that uninitialized variable may be deadly.
0 if static or global, indeterminate if storage class is auto
C has always been very specific about the initial values of objects. If global or static, they will be zeroed. If auto, the value is indeterminate.
This was the case in pre-C89 compilers and was so specified by K&R and in DMR's original C report.
This was the case in C89, see section 6.5.7 Initialization.
If an object that has automatic
storage duration is not initialized
explicitly, its value is
indeterminate. If an object that has
static storage duration is not
initialized explicitly, it is
initialized implicitly as if every
member that has arithmetic type were
assigned 0 and every member that has
pointer type were assigned a null
pointer constant.
This was the case in C99, see section 6.7.8 Initialization.
If an object that has automatic
storage duration is not initialized
explicitly, its value is
indeterminate. If an object that has
static storage duration is not
initialized explicitly, then: — if it
has pointer type, it is initialized to
a null pointer; — if it has arithmetic
type, it is initialized to (positive
or unsigned) zero; — if it is an
aggregate, every member is initialized
(recursively) according to these
rules; — if it is a union, the first
named member is initialized
(recursively) according to these
rules.
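The recursive rule is easy to check on an aggregate. The following is a minimal sketch; struct record and its members are hypothetical names chosen for the example. Neither object has an initializer, and both have static storage duration.

```c
#include <stddef.h>

struct record {
    int id;
    double weight;
    const char *name;
    int history[4];
};

static struct record file_scope_record;   /* static storage duration, no initializer */

/* Returns 1 if every member of both uninitialized static objects reads as
   zero (arithmetic members) or null (pointer member), 0 otherwise. */
int statics_are_zeroed(void)
{
    static struct record function_static_record;   /* also static storage duration */
    const struct record *all[2] = { &file_scope_record, &function_static_record };
    size_t r, i;

    for (r = 0; r < 2; r++) {
        if (all[r]->id != 0 || all[r]->weight != 0.0 || all[r]->name != NULL)
            return 0;
        for (i = 0; i < 4; i++)
            if (all[r]->history[i] != 0)
                return 0;
    }
    return 1;
}
```

On a conforming implementation statics_are_zeroed() returns 1. The same check on an automatic struct record would read indeterminate values and is deliberately left out.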
As to what exactly indeterminate means, I'm not sure for C89, C99 says:
3.17.2 indeterminate value: either an unspecified value or a trap representation
But regardless of what standards say, in real life, each stack page actually does start off as zero, but when your program looks at any auto storage class values, it sees whatever was left behind by your own program when it last used those stack addresses. If you allocate a lot of auto arrays you will see them eventually start neatly with zeroes.
You might wonder, why is it this way? A different SO answer deals with that question, see: https://stackoverflow.com/a/2091505/140740
It depends on the storage duration of the variable. A variable with static storage duration is always implicitly initialized with zero.
As for automatic (local) variables, an uninitialized variable has an indeterminate value. An indeterminate value means, among other things, that whatever "value" you might "see" in that variable is not only unpredictable, it is not even guaranteed to be stable. For example, in practice (i.e. ignoring the UB for a second) this code
int num;
int a = num;
int b = num;
does not guarantee that variables a and b will receive identical values. Interestingly, this is not some pedantic theoretical concept, this readily happens in practice as consequence of optimization.
So in general, the popular answer that "it is initialized with whatever garbage was in memory" is not even remotely correct. An uninitialized variable's behavior is different from that of a variable initialized with garbage.
Ubuntu 15.10, Kernel 4.2.0, x86-64, GCC 5.2.1 example
Enough standards, let's look at an implementation :-)
Local variable
Standards: undefined behavior.
Implementation: the program allocates stack space, and never moves anything to that address, so whatever was there previously is used.
#include <stdio.h>
int main() {
int i;
printf("%d\n", i);
}
compile with:
gcc -O0 -std=c99 a.c
outputs:
0
and decompiles with:
objdump -dr a.out
to:
0000000000400536 <main>:
400536: 55 push %rbp
400537: 48 89 e5 mov %rsp,%rbp
40053a: 48 83 ec 10 sub $0x10,%rsp
40053e: 8b 45 fc mov -0x4(%rbp),%eax
400541: 89 c6 mov %eax,%esi
400543: bf e4 05 40 00 mov $0x4005e4,%edi
400548: b8 00 00 00 00 mov $0x0,%eax
40054d: e8 be fe ff ff callq 400410 <printf@plt>
400552: b8 00 00 00 00 mov $0x0,%eax
400557: c9 leaveq
400558: c3 retq
From our knowledge of x86-64 calling conventions:
%rdi is the first printf argument, thus the string "%d\n" at address 0x4005e4
%rsi is the second printf argument, thus i.
It comes from -0x4(%rbp), which is the first 4-byte local variable.
At this point, rbp is in the first page of the stack, which has been allocated by the kernel, so to understand that value we would need to look into the kernel code and find out what it sets that memory to.
TODO does the kernel set that memory to something before reusing it for other processes when a process dies? If not, the new process would be able to read the memory of other finished programs, leaking data. See: Are uninitialized values ever a security risk?
We can then also play with our own stack modifications and write fun things like:
#include <assert.h>
int f() {
int i = 13;
return i;
}
int g() {
int i;
return i;
}
int main() {
f();
assert(g() == 13);
}
Note that GCC 11 seems to produce different assembly output, and the above code stops "working". It is undefined behavior after all: Why does -O3 in gcc seem to initialize my local variable to 0, while -O0 does not?
Local variable in -O3
Implementation analysis at: What does <value optimized out> mean in gdb?
Global variables
Standards: 0
Implementation: .bss section.
#include <stdio.h>
int i;
int main() {
printf("%d\n", i);
}
gcc -O0 -std=c99 a.c
compiles to:
0000000000400536 <main>:
400536: 55 push %rbp
400537: 48 89 e5 mov %rsp,%rbp
40053a: 8b 05 04 0b 20 00 mov 0x200b04(%rip),%eax # 601044 <i>
400540: 89 c6 mov %eax,%esi
400542: bf e4 05 40 00 mov $0x4005e4,%edi
400547: b8 00 00 00 00 mov $0x0,%eax
40054c: e8 bf fe ff ff callq 400410 <printf@plt>
400551: b8 00 00 00 00 mov $0x0,%eax
400556: 5d pop %rbp
400557: c3 retq
400558: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
40055f: 00
# 601044 <i> says that i is at address 0x601044 and:
readelf -SW a.out
contains:
[25] .bss NOBITS 0000000000601040 001040 000008 00 WA 0 0 4
which says 0x601044 is right in the middle of the .bss section, which starts at 0x601040 and is 8 bytes long.
The ELF standard then guarantees that the section named .bss is completely filled with zeros:
.bss This section holds uninitialized data that contribute to the
program’s memory image. By definition, the system initializes the
data with zeros when the program begins to run. The section occupies
no file space, as indicated by the section type, SHT_NOBITS.
Furthermore, the type SHT_NOBITS is efficient and occupies no space on the executable file:
sh_size This member gives the section’s size in bytes. Unless the section
type is SHT_NOBITS, the section occupies sh_size
bytes in the file. A section of type SHT_NOBITS may have a non-zero
size, but it occupies no space in the file.
Then it is up to the Linux kernel to zero out that memory region when loading the program into memory when it gets started.
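A quick way to see both properties at once is a large object with no initializer. This is a minimal sketch; big_buffer and count_nonzero_bytes are hypothetical names, and the assumption that the array lands in .bss holds for typical GCC output but is not required by the C standard.

```c
#include <stddef.h>

enum { BIG_SIZE = 1 << 20 };   /* 1 MiB */

/* No initializer: static storage duration, so C requires it to read as
   zero; typical toolchains place it in .bss rather than in the file. */
static unsigned char big_buffer[BIG_SIZE];

/* Counts bytes that are not zero; expected to return 0 at startup. */
size_t count_nonzero_bytes(void)
{
    size_t n = 0;
    size_t i;

    for (i = 0; i < BIG_SIZE; i++)
        if (big_buffer[i] != 0)
            n++;
    return n;
}
```

Comparing the executable's size on disk with the sh_size that readelf -SW reports for .bss should show that the megabyte is accounted for in the section header, not in the file.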
That depends. If that definition is global (outside any function) then num will be initialized to zero. If it's local (inside a function) then its value is indeterminate. In theory, even attempting to read the value has undefined behavior -- C allows for the possibility of bits that don't contribute to the value, but have to be set in specific ways for you to even get defined results from reading the variable.
The basic answer is, yes it is undefined.
If you are seeing odd behavior because of this, it may depend on where the variable is declared. If it is declared within a function, on the stack, then the contents will more than likely be different every time the function gets called. If it has static or module scope, it is initialized to zero and will not change on its own.
Because computers have finite storage capacity, automatic variables will typically be held in storage elements (whether registers or RAM) that have previously been used for some other arbitrary purpose. If such a variable is used before a value has been assigned to it, that storage may hold whatever it held previously, and so the contents of the variable will be unpredictable.
As an additional wrinkle, many compilers may keep variables in registers which are larger than the associated types. Although a compiler would be required to ensure that any value which is written to a variable and read back will be truncated and/or sign-extended to its proper size, many compilers will perform such truncation when variables are written and expect that it will have been performed before the variable is read. On such compilers, something like:
uint16_t hey(uint32_t x, uint32_t mode)
{ uint16_t q;
if (mode==1) q=2;
if (mode==3) q=4;
return q; }
uint32_t wow(uint32_t mode) {
return hey(1234567, mode);
}
might very well result in wow() storing the values 1234567 and mode into
registers 0 and 1, respectively, and calling hey(). Since x isn't needed
within "hey", and since functions are supposed to put their return value
into register 0, the compiler may allocate register 0 to q. If mode is 1 or
3, register 0 will be loaded with 2 or 4, respectively, but if it is some
other value, the function may return whatever was in register 0 (i.e. the
value 1234567) even though that value is not within the range of uint16_t.
To avoid requiring compilers to do extra work to ensure that uninitialized
variables never seem to hold values outside their domain, and avoid needing
to specify indeterminate behaviors in excessive detail, the Standard says
that use of uninitialized automatic variables is Undefined Behavior. In
some cases, the consequences of this may be even more surprising than a
value being outside the range of its type. For example, given:
void moo(int mode)
{
if (mode < 5)
launch_nukes();
hey(0, mode);
}
a compiler could infer that because invoking moo() with a mode which is
greater than 3 will inevitably lead to the program invoking Undefined
Behavior, the compiler may omit any code which would only be relevant
if mode is 4 or greater, such as the code which would normally prevent
the launch of nukes in such cases. Note that neither the Standard, nor
modern compiler philosophy, would care about the fact that the return value
from "hey" is ignored--the act of trying to return it gives a compiler
unlimited license to generate arbitrary code.
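The usual remedy is to give q a defined value on every path. A minimal sketch follows; the name hey_defined and the fallback value 0 are assumptions for illustration, since the original does not say what hey() should return for other modes.

```c
#include <stdint.h>

/* Same mapping as hey(), but q is initialized, so every call returns a
   value inside the range of uint16_t and no path reads an indeterminate
   object. The fallback of 0 is an arbitrary choice for this sketch. */
uint16_t hey_defined(uint32_t x, uint32_t mode)
{
    uint16_t q = 0;

    (void)x;   /* unused, kept only to mirror the original signature */
    if (mode == 1) q = 2;
    if (mode == 3) q = 4;
    return q;
}
```

With this version a caller such as moo() no longer gives the compiler any undefined path to reason from, so the mode < 5 test has to be kept as written.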
If the storage class is static or global, then during loading the BSS initialises the variable or memory location (ML) to 0, unless the variable is explicitly assigned some initial value. In the case of local uninitialized variables, the memory location may hold a trap representation. So if any of your registers containing important info is overwritten by the compiler, the program may crash.
But some compilers may have a mechanism to avoid such a problem.
I was working with the NEC V850 series when I realised there is a trap representation, with bit patterns that represent undefined values for data types except char. When I took an uninitialized char, I got a zero default value due to the trap representation. This might be useful for anyone using the NEC V850ES.
As far as I have seen, it mostly depends on the compiler, but in most cases the value is assumed to be 0 by compilers.
I got a garbage value with VC++, while TC gave the value 0.
I printed it like this:
int i;
printf("%d", i);
I did an experiment to see what kind of assembly language would be generated if I tried to get the same function compiled in twice. I did the following:
I created two simple test files and their corresponding headers. Let's call them a.c/a.h, and b.c/b.h. Here are the contents of those files:
a.h:
#ifndef __A_H__
#define __A_H__
int a( void );
#endif
b.h:
#ifndef __B_H__
#define __B_H__
int b( void );
#endif
a.c:
#include "a.h"
int a( void )
{
return 1;
}
b.c:
#include "b.h"
#include "a.h"
int b( void )
{
return 1 + a();
}
I then created a static archive for a:
gcc -c a.c -o a.o
ar -rsc a.a a.o
and the same for b, including the static archive for a this time:
gcc -c b.c -o b.o
ar -rsc b.a a.a b.o
At this point, I disassemble the static archive for b to verify that it has assembly code for both functions a() and b(). It does.
Now, I define one last file:
main.c:
#include <stdio.h>
#include "a.h"
#include "b.h"
int main( void )
{
printf( "%d %d\n", a(), b() );
return 0;
}
and I compile it thusly:
gcc main.c a.a b.a -o main
This works fine. When I disassemble it, I see the following definitions for a and b in the code:
0000000000400561 <a>:
400561: 55 push %rbp
400562: 48 89 e5 mov %rsp,%rbp
400565: b8 01 00 00 00 mov $0x1,%eax
40056a: 5d pop %rbp
40056b: c3 retq

000000000040056c <b>:
40056c: 55 push %rbp
40056d: 48 89 e5 mov %rsp,%rbp
400570: e8 ec ff ff ff callq 400561 <a>
400575: 83 c0 01 add $0x1,%eax
400578: 5d pop %rbp
400579: c3 retq
40057a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
As you can see, b clearly calls a rather than inlining it. However, there is only one definition of a in the code, with no duplicates.
It seems that gcc has either:
Detected the duplicate object code and removed the duplicates
--or--
the b archive was used first, and it included the reference to int a(), so the a archive was ignored.
My question is: is this behavior circumstantial to my test or is it standard, and can I expect the same behavior from other compilers? Obviously duplicate code is one problem, however there could be duplicate global references as well. Is it safe/good practice to build a large application that has multiple dependency paths to the same static archive? Are there less obvious situations than just duplicate symbol names where issues can arise when doing this?
Asking this because I've been playing with this idea for a project I'm on, and want to make the right choices.
My question is: is this behavior circumstantial to my test or is it standard, and can I expect the same behavior from other compilers?
As far as the compiler itself is concerned, there is no issue: you have one definition for each function among your sources.
As far as ar is concerned, you also have no issue: neither of the archives you built contains any duplicate symbols.
Different linkers may exhibit different behaviors, however. It is conceivable that some would reject linking archives that contain duplicate external symbols. Typical UNIX linkers will handle the situation you present, but they may vary in some details, such as whether a duplicate copy of function a() is included in the binary.
Obviously duplicate code is one problem, however there could be duplicate global references as well. Is it safe/good practice to build a large application that has multiple dependency paths to the same static archive?
"Multiple paths to the same static archive" does not seem to be a good characterization of the situation you present. In neither case do you provide the same archive more than once. Rather, in the b case you provide different archives with duplicate members. Linkers generally do not have problems with specifying the same archive multiple times in the same link command. Under some circumstances it may even be necessary to do so; it should not present a problem.
Providing distinct archives with duplicate members probably will not present a problem, except possibly for bloating your code with duplicate function implementations. This is a bit less certain, but I doubt it would present a problem in practice.
Whether that's good practice is a matter of opinion, but I'm inclined to think not. It's also not clear to me what gain you see in such an approach. On the other hand, I won't be sharpening any stakes or preparing any kindling if you decide to go ahead anyway.
I'm looking for a way to find the names of the variables accessed by a given instruction (that performs a memory access).
Using debugging symbols and, for example, addr2line or objdump it's easy to convert instruction addresses into source code files + line numbers, but unfortunately often a single source code line contains more than one variable so this method does not have sufficiently fine granularity.
I've found that objdump is able to convert instruction addresses to global variables. But I haven't yet found a way to do this for local variables. For example, in the example below, I'd like to know that the instruction at address 0x4004c4 is accessing the local variable "local_hello" and that the instruction at address 0x4004c9 is accessing the local variable "local_hello2".
Hello.c:
int global_hello = 4;
int main(){
int local_hello = 3;
int local_hello2 = 0;
local_hello2 = global_hello + local_hello;
return local_hello2;
}
Using "objdump -S hello":
local_hello2 = global_hello + local_hello;
4004be: 8b 15 cc 03 20 00 mov 0x2003cc(%rip),%edx # 600890 <global_hello>
4004c4: 8b 45 fc mov -0x4(%rbp),%eax
4004c7: 01 d0 add %edx,%eax
4004c9: 89 45 f8 mov %eax,-0x8(%rbp)
This might work for simple programs with no or only moderate optimization levels, but it will become difficult with compiler optimization.
You might want to look into gdb sources to learn about the efforts to connect variables to optimized compiler output.
What's your objective, after all?
I wonder if it's possible for a linux process to call code located in the memory of another process?
Let's say we have a function f() in process A and we want process B to call it. What I thought about is using mmap with the MAP_SHARED and PROT_EXEC flags to map the memory containing the function code and pass the pointer to B, assuming that f() will not call any other function from A's binary. Will it ever work? If yes, then how do I determine the size of f() in memory?
=== EDIT ===
I know, that shared libraries will do exactly that, but I wonder if it's possible to dynamically share code between processes.
Yes, you can do that, but the first process must have first created the shared memory via mmap and either a memory-mapped file, or a shared area created with shm_open.
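As a rough illustration of the memory-mapped file variant, the sketch below maps one temporary file twice with MAP_SHARED and checks that a write through one mapping is visible through the other. Both mappings live in a single process here purely to keep the example self-contained; shared_mapping_roundtrip is a hypothetical name, and using tmpfile() as the backing file is an assumption of this sketch, not part of the answer above.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Maps the same temporary file twice with MAP_SHARED, writes msg through
   one mapping and reads it back through the other.
   Returns 0 if the bytes match, -1 on any failure. */
int shared_mapping_roundtrip(const char *msg)
{
    const size_t len = 4096;
    FILE *f;
    int fd, rc = -1;
    char *writer, *reader;

    if (strlen(msg) >= len)
        return -1;
    f = tmpfile();                       /* backing file for both mappings */
    if (f == NULL)
        return -1;
    fd = fileno(f);
    if (ftruncate(fd, (off_t)len) != 0) {
        fclose(f);
        return -1;
    }

    writer = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    reader = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (writer != MAP_FAILED && reader != MAP_FAILED) {
        strcpy(writer, msg);             /* store through the first mapping */
        rc = (strcmp(reader, msg) == 0) ? 0 : -1;
    }

    if (writer != MAP_FAILED) munmap(writer, len);
    if (reader != MAP_FAILED) munmap(reader, len);
    fclose(f);
    return rc;
}
```

Between two real processes the second mapping would be created in the other process from the same file or shm_open name, and it would generally land at a different virtual address.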
If you are sharing compiled code then that's what shared libraries were created for. You can link against them in the ordinary way and the sharing will happen automatically, or you can load them manually using dlopen (e.g. for a plugin).
Update:
As the code has been generated by a compiler then you will have relocations to worry about. The compiler does not produce code that will Just Work anywhere. It expects that the .data section is in a certain place, and that the .bss section has been zeroed. The GOT will need to be populated. Any static constructors will have to be called.
In short, what you want is probably dlopen. This system allows you to open a shared library like it was a file, and then extract function pointers by name. Each program that dlopens the library will share the code sections, thus saving memory, but each will have its own copy of the data section, so they do not interfere with each other.
Beware that you need to compile your library code with -fPIC or else you won't get any code sharing either (actually, the linkers and dynamic loaders for many architectures probably don't support libraries that aren't PIC anyway).
The standard approach is to put the code of f() in a shared library libfoo.so. Then you could either link to that library (e.g. by building program A with gcc -Wall a.c -lfoo -o a.bin), or load it dynamically (e.g. in program B) using dlopen(3) then retrieving the address of f using dlsym.
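The dynamic-loading path can be sketched as follows. The helper name call_from_library is hypothetical, and the usage note assumes a glibc system where the math library is installed as libm.so.6. On older glibc versions the program also has to be linked with -ldl.

```c
#include <dlfcn.h>
#include <stddef.h>

typedef double (*unary_fn)(double);

/* Opens the shared library `lib`, looks up `symbol` as a function taking
   and returning double, calls it with x and stores the result in *out.
   Returns 0 on success, -1 if the library or the symbol is not found. */
int call_from_library(const char *lib, const char *symbol, double x, double *out)
{
    void *handle;
    void *sym;
    unary_fn fn;

    handle = dlopen(lib, RTLD_NOW);
    if (handle == NULL)
        return -1;

    dlerror();                           /* clear any stale error state */
    sym = dlsym(handle, symbol);
    if (dlerror() != NULL || sym == NULL) {
        dlclose(handle);
        return -1;
    }

    *(void **)(&fn) = sym;               /* object-to-function pointer conversion, as the dlsym man page suggests */
    *out = fn(x);
    dlclose(handle);
    return 0;
}
```

For example, call_from_library("libm.so.6", "cos", 0.0, &r) should store the cosine of 0 in r. Replacing the library name with libfoo.so and the symbol with f gives the case described above, provided f has a matching signature.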
When you compile a shared library you want to :
compile each source file foo1.c with gcc -Wall -fPIC -c foo1.c -o foo1.pic.o into position independent code, and likewise for foo2.c into foo2.pic.o
link all of them into libfoo.so with gcc -Wall -shared foo*.pic.o -o libfoo.so ; notice that you can link additional shared libraries into libfoo.so (e.g. by appending -lm to the linking command)
See also the Program Library Howto.
You could play insane tricks by mmap-ing some other /proc/1234/mem but that is not reasonable at all. Use shared libraries.
PS. you can dlopen a big lot (hundreds of thousands) of shared objects lib*.so files; you may want to dlclose them (but practically you don't have to).
It would be possible to do so, but that's exactly what shared libraries are for.
Also, beware that you need to check that the address of the shared memory is the same for both processes, otherwise any references that are "absolute" (that is, a pointer to something in the shared code) will be wrong. And as with shared libraries, the bitness of the code will have to be the same, and as with all shared memory, you need to make sure that you don't "mess up" the other process if you modify any of the shared memory.
Determining the size of a function ranges from "hard" to "nearly impossible", depending on the actual code generated, and the level of information you have available. Debug symbols will have the size of a function, but beware that I have seen compilers generate code where two functions share the same "return" piece of code (that is, the compiler generates a jump to another function that has the same bit of code to return the result, because it saves a few bytes of code, and there was already going to be a jump anyway [e.g. there is an if/else that the compiler has to jump around]).
not directly
that's what shared libraries are for
relocations
Oh no! Anyways...
Here's the insane, unreasonable, not-good, purely academic demonstration of this capability. It was fun for me, I hope it's fun for you.
Overview
Program A will use shm_open to create a shared memory object, and mmap to map it to its memory space. Then it will copy some code from a function defined in A to the shared memory. Then program B will open up the shared memory, execute the function, and just for kicks, make a very simple modification to the code. Then A will execute the code to demonstrate the change took effect.
Again, this is no recommendation for how to solve a problem, it's an academic demonstration.
// A.c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
int foo(int y) {
int x = 14;
return x + y;
}
int main(int argc, char *argv[]) {
const size_t mem_size = 0x1000;
// create shared memory objects
int shared_fd = shm_open("foobar2", O_RDWR | O_CREAT, 0777);
ftruncate(shared_fd, mem_size);
void *shared_mem =
mmap(NULL, mem_size, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_SHARED, shared_fd, 0);
// copy function to shared memory
const size_t fn_size = 24;
memcpy(shared_mem, &foo, fn_size);
// wait
getc(stdin);
// execute the shared function
int(*shared_foo)(int) = shared_mem;
printf("shared_foo(3) = %d\n", shared_foo(3));
// clean up
shm_unlink("foobar2");
}
Note the use of PROT_READ | PROT_WRITE | PROT_EXEC in the call to mmap. This program is compiled with
gcc A.c -lrt -o A
The constant fn_size was determined by looking at the output of objdump -dj .text A
...
000000000000088a <foo>:
88a: 55 push %rbp
88b: 48 89 e5 mov %rsp,%rbp
88e: 89 7d ec mov %edi,-0x14(%rbp)
891: c7 45 fc 0e 00 00 00 movl $0xe,-0x4(%rbp)
898: 8b 55 fc mov -0x4(%rbp),%edx
89b: 8b 45 ec mov -0x14(%rbp),%eax
89e: 01 d0 add %edx,%eax
8a0: 5d pop %rbp
8a1: c3 retq
...
That's 24 bytes (0x88a through 0x8a1 inclusive). I guess I could put anything larger than that and it would do the same thing. Anything shorter and I'll probably get an exception from the processor. Also, note that the value of x from foo (14, which is 0e 00 00 00 in little-endian) is located at foo + 10. This will be the constant x_offset in program B.
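If you'd rather not count bytes by hand, the offset can be double-checked with a small search. This is only a sketch: find_imm32 and foo_bytes are names I made up, and the byte array is transcribed from the objdump listing above, so it only matches that particular build. A first match could in principle be a coincidence in longer functions, so treat the result as a hint.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the offset of the first little-endian 32-bit occurrence of value
   in code[0..len), or -1 if it does not occur. */
long find_imm32(const unsigned char *code, size_t len, uint32_t value)
{
    /* Build the pattern byte by byte so the search does not depend on the
       endianness of the machine running it. */
    const unsigned char pattern[4] = {
        (unsigned char)(value & 0xff),
        (unsigned char)((value >> 8) & 0xff),
        (unsigned char)((value >> 16) & 0xff),
        (unsigned char)((value >> 24) & 0xff),
    };
    for (size_t i = 0; i + sizeof pattern <= len; i++) {
        if (memcmp(code + i, pattern, sizeof pattern) == 0)
            return (long)i;
    }
    return -1;
}

/* The bytes of foo, copied from the objdump listing (0x88a..0x8a1). */
const unsigned char foo_bytes[] = {
    0x55,                                     /* push %rbp            */
    0x48, 0x89, 0xe5,                         /* mov  %rsp,%rbp       */
    0x89, 0x7d, 0xec,                         /* mov  %edi,-0x14(%rbp)*/
    0xc7, 0x45, 0xfc, 0x0e, 0x00, 0x00, 0x00, /* movl $0xe,-0x4(%rbp) */
    0x8b, 0x55, 0xfc,                         /* mov  -0x4(%rbp),%edx */
    0x8b, 0x45, 0xec,                         /* mov  -0x14(%rbp),%eax*/
    0x01, 0xd0,                               /* add  %edx,%eax       */
    0x5d,                                     /* pop  %rbp            */
    0xc3,                                     /* retq                 */
};
```

Here sizeof foo_bytes gives the function size and find_imm32(foo_bytes, sizeof foo_bytes, 14) gives the offset of the constant.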
// B.c
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
const int x_offset = 10;
int main(int argc, char *argv[]) {
// create shared memory objects
int shared_fd = shm_open("foobar2", O_RDWR | O_CREAT, 0777);
void *shared_mem = mmap(NULL, 0x1000, PROT_EXEC | PROT_WRITE, MAP_SHARED, shared_fd, 0);
int (*shared_foo)(int) = shared_mem;
int z = shared_foo(13);
printf("result: %d\n", z);
int *x_p = (int*)((char*)shared_mem + x_offset);
*x_p = 100;
shm_unlink("foobar2"); // same name that was opened above; A's existing mapping stays valid
}
Anyways first I run A, then I run B. The output of B is:
result: 27
Then I go back to A and push enter, then I get:
shared_foo(3) = 103
Good enough for me.
/dev/shm/foobar2
To completely eliminate the mystique of all this, after running A you can do something like
xxd /dev/shm/foobar2 | vim -
Then, edit that constant 0e 00 00 00 just like before, then save the file with the 'ol
:w !xxd -r > /dev/shm/foobar2
and push enter in A and see similar results as above.
I want a simple C method to be able to run hex bytecode on a Linux 64 bit machine. Here's the C program that I have:
char code[] = "\x48\x31\xc0";
#include <stdio.h>
int main(int argc, char **argv)
{
int (*func) ();
func = (int (*)()) code;
(int)(*func)();
printf("%s\n","DONE");
}
The code that I am trying to run ("\x48\x31\xc0") I obtained by writing this simple assembly program (it's not supposed to really do anything)
.text
.globl _start
_start:
xorq %rax, %rax
and then compiling and objdump-ing it to obtain the bytecode.
However, when I run my C program I get a segmentation fault. Any ideas?
Machine code has to be in an executable page. Your char code[] is in the read+write data section, without exec permission, so the code cannot be executed from there.
Here is a simple example of allocating an executable page with mmap:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
int (*sum) (int, int) = NULL;
// allocate executable buffer
sum = mmap (0, sizeof(code), PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// copy code to buffer
memcpy (sum, code, sizeof(code));
// doesn't actually flush cache on x86, but ensure memcpy isn't
// optimized away as a dead store.
__builtin___clear_cache ((char*)sum, (char*)sum + sizeof(code)); // GNU C; range covers the copied bytes
// run code
int a = 2;
int b = 3;
int c = sum (a, b);
printf ("%d + %d = %d\n", a, b, c);
}
See another answer on this question for details about __builtin___clear_cache.
Until recent Linux kernel versions (sometime before 5.4), you could simply compile with gcc -z execstack - that would make all pages executable, including read-only data (.rodata), and read-write data (.data) where char code[] = "..." goes.
Now -z execstack only applies to the actual stack, so it currently works only for non-const local arrays. i.e. move char code[] = ... into main.
See Linux default behavior against `.data` section for the kernel change, and Unexpected exec permission from mmap when assembly files included in the project for the old behaviour: enabling Linux's READ_IMPLIES_EXEC process for that program. (In Linux 5.4, that Q&A shows you'd only get READ_IMPLIES_EXEC for a missing PT_GNU_STACK, like a really old binary; modern GCC -z execstack would set PT_GNU_STACK = RWX metadata in the executable, which Linux 5.4 would handle as making only the stack itself executable. At some point before that, PT_GNU_STACK = RWX did result in READ_IMPLIES_EXEC.)
The other option is to make system calls at runtime to copy into an executable page, or change permissions on the page it's in. That's still more complicated than using a local array to get GCC to copy code into executable stack memory.
(I don't know if there's an easy way to enable READ_IMPLIES_EXEC under modern kernels. Having no GNU-stack attribute at all in an ELF binary does that for 32-bit code, but not 64-bit.)
Yet another option is __attribute__((section(".text"))) const char code[] = ...;
Working example: https://godbolt.org/z/draGeh.
If you need the array to be writeable, e.g. for shellcode that inserts some zeros into strings, you could maybe link with ld -N. But probably best to use -z execstack and a local array.
Two problems in the question:
exec permission on the page, because you used an array that will go in the noexec read+write .data section.
your machine code doesn't end with a ret instruction so even if it did run, execution would fall into whatever was next in memory instead of returning.
And BTW, the REX prefix is totally redundant. "\x31\xc0" xor eax,eax has exactly the same effect as xor rax,rax.
You need the page containing the machine code to have execute permission. x86-64 page tables have a bit that controls execute permission separately from read permission, unlike legacy 386 page tables.
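One way to see which permission a given address actually has is to look it up in /proc/self/maps. The following is a rough Linux-only sketch; addr_is_executable is a hypothetical helper name, and it assumes each line of the maps file fits in the buffer.

```c
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if addr lies in a mapping whose permissions include 'x',
   0 if it lies in a mapping without 'x', and -1 if no mapping contains it
   or /proc/self/maps cannot be read. */
int addr_is_executable(const void *addr)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    if (!maps)
        return -1;

    uintptr_t target = (uintptr_t)addr;
    char line[512];
    int result = -1;

    while (fgets(line, sizeof line, maps)) {
        unsigned long start, end;
        char perms[5];
        /* Each line starts with "start-end perms", e.g. "55e0...-55e1... r-xp". */
        if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
            continue;
        if (target >= start && target < end) {
            result = (perms[2] == 'x');
            break;
        }
    }
    fclose(maps);
    return result;
}
```

Calling it on the address of a function should report 1, while a stack or .data array is expected to report 0 on a default build, which is why jumping to such an array segfaults.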
The easiest way to get static arrays to be in read+exec memory was to compile with gcc -z execstack. (Used to make the stack and other sections executable, now only the stack).
Until recently (2018 or 2019), the standard toolchain (binutils ld) would put section .rodata into the same ELF segment as .text, so they'd both have read+exec permission. Thus using const char code[] = "..."; was sufficient for executing manually-specified bytes as data, without execstack.
But on my Arch Linux system with GNU ld (GNU Binutils) 2.31.1, that's no longer the case. readelf -a shows that the .rodata section went into an ELF segment with .eh_frame_hdr and .eh_frame, and it only has Read permission. .text goes in a segment with Read + Exec, and .data goes in a segment with Read + Write (along with the .got and .got.plt). (What's the difference of section and segment in ELF file format)
I assume this change is to make ROP and Spectre attacks harder by not having read-only data in executable pages where sequences of useful bytes could be used as "gadgets" that end with the bytes for a ret or jmp reg instruction.
// TODO: use char code[] = {...} inside main, with -z execstack, for current Linux
// Broken on recent Linux, used to work without execstack.
#include <stdio.h>
// can be non-const if you use gcc -z execstack. static is also optional
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3"; // xor eax,eax ; ret
// the compiler will append a 0 byte to terminate the C string,
// but that's fine. It's after the ret.
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// run code
int c = sum (2, 3);
return ret0();
}
On older Linux systems: gcc -O3 shellcode.c && ./a.out (Works because of const on global/static arrays)
On Linux before 5.5 (or so) gcc -O3 -z execstack shellcode.c && ./a.out (works because of -zexecstack regardless of where your machine code is stored). Fun fact: gcc allows -zexecstack with no space, but clang only accepts clang -z execstack.
These also work on Windows, where read-only data goes in .rdata instead of .rodata.
The compiler-generated main looks like this (from objdump -drwC -Mintel). You can run it inside gdb and set breakpoints on code and ret0_code
(I actually used gcc -no-pie -O3 -zexecstack shellcode.c, hence the addresses near 401000.)
0000000000401020 <main>:
401020: 48 83 ec 08 sub rsp,0x8 # stack aligned by 16 before a call
401024: be 03 00 00 00 mov esi,0x3
401029: bf 02 00 00 00 mov edi,0x2 # 2 args
40102e: e8 d5 0f 00 00 call 402008 <code> # note the target address in the next page
401033: 48 83 c4 08 add rsp,0x8
401037: e9 c8 0f 00 00 jmp 402004 <ret0_code> # optimized tailcall
Or use system calls to modify page permissions
Instead of compiling with gcc -zexecstack, you can instead use mmap(PROT_EXEC) to allocate new executable pages, or mprotect(PROT_EXEC) to change existing pages to executable. (Including pages holding static data.) You also typically want at least PROT_READ and sometimes PROT_WRITE, of course.
Using mprotect on a static array means you're still executing the code from a known location, maybe making it easier to set a breakpoint on it.
On Windows you can use VirtualAlloc or VirtualProtect.
Telling the compiler that data is executed as code
Normally compilers like GCC assume that data and code are separate. This is like type-based strict aliasing, but even using char* doesn't make it well-defined to store into a buffer and then call that buffer as a function pointer.
In GNU C, you also need to use __builtin___clear_cache(buf, buf + len) after writing machine code bytes to a buffer, because the optimizer doesn't treat dereferencing a function pointer as reading bytes from that address. Dead-store elimination can remove the stores of machine code bytes into a buffer, if the compiler proves that the store isn't read as data by anything. https://codegolf.stackexchange.com/questions/160100/the-repetitive-byte-counter/160236#160236 and https://godbolt.org/g/pGXn3B has an example where gcc really does do this optimization, because gcc "knows about" malloc.
(And on non-x86 architectures where I-cache isn't coherent with D-cache, it actually will do any necessary cache syncing. On x86 it's purely a compile-time optimization blocker and doesn't expand to any instructions itself.)
Re: the weird name with three underscores: It's the usual __builtin_name pattern, but name is __clear_cache.
My edit on @AntoineMathys's answer added this.
In practice GCC/clang don't "know about" mmap(MAP_ANONYMOUS) the way they know about malloc. So in practice the optimizer will assume that the memcpy into the buffer might be read as data by the non-inline function call through the function pointer, even without __builtin___clear_cache(). (Unless you declared the function type as __attribute__((const)).)
On x86, where I-cache is coherent with data caches, having the stores happen in asm before the call is sufficient for correctness. On other ISAs, __builtin___clear_cache() will actually emit special instructions as well as ensuring the right compile-time ordering.
It's good practice to include it when copying code into a buffer because it doesn't cost performance, and stops hypothetical future compilers from breaking your code. (e.g. if they do understand that mmap(MAP_ANONYMOUS) gives newly-allocated anonymous memory that nothing else has a pointer to, just like malloc.)
With current GCC, I was able to provoke GCC into really doing an optimization we don't want by using __attribute__((const)) to tell the optimizer sum() is a pure function (that only reads its args, not global memory). GCC then knows sum() can't read the result of the memcpy as data.
With another memcpy into the same buffer after the call, GCC does dead-store elimination into just the 2nd store after the call. This results in no store before the first call so it executes the 00 00 add [rax], al bytes, segfaulting.
// demo of a problem on x86 when not using __builtin___clear_cache
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
__attribute__((const)) int (*sum) (int, int) = NULL;
// copy code to executable buffer
sum = mmap (0,sizeof(code),PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON,-1,0);
memcpy (sum, code, sizeof(code));
//__builtin___clear_cache(sum, sum + sizeof(code));
int c = sum (2, 3);
//printf ("%d + %d = %d\n", a, b, c);
memcpy(sum, (char[]){0x31, 0xc0, 0xc3, 0}, 4); // xor-zero eax, ret, padding for a dword store
//__builtin___clear_cache(sum, sum + 4);
return sum(2,3);
}
Compiled on the Godbolt compiler explorer with GCC9.2 -O3
main:
push rbx
xor r9d, r9d
mov r8d, -1
mov ecx, 34
mov edx, 7
mov esi, 4
xor edi, edi
sub rsp, 16
call mmap
mov esi, 3
mov edi, 2
mov rbx, rax
call rax # call before store
mov DWORD PTR [rbx], 12828721 # 0xC3C031 = xor-zero eax, ret
add rsp, 16
pop rbx
ret # no 2nd call, CSEd away because const and same args
Passing different args would have gotten another call reg, but even with __builtin___clear_cache the two sum(2,3) calls can CSE. __attribute__((const)) doesn't respect changes to the machine code of a function. Don't do it. It's safe if you're going to JIT the function once and then call many times, though.
Uncommenting the first __clear_cache results in
mov DWORD PTR [rax], -1019804531 # lea; ret
call rax
mov DWORD PTR [rbx], 12828721 # xor-zero; ret
... still CSE and use the RAX return value
The first store is there because of __clear_cache and the sum(2,3) call. (Removing the first sum(2,3) call does let dead-store elimination happen across the __clear_cache.)
The second store is there because the side-effect on the buffer returned by mmap is assumed to be important, and that's the final value main leaves.
Godbolt's ./a.out option to run the program still seems to always fail (exit status of 255); maybe it sandboxes JITing? It works on my desktop with __clear_cache and crashes without.
mprotect on a page holding existing C variables.
You can also give a single existing page read+write+exec permission. This is an alternative to compiling with -z execstack
You don't need __clear_cache on a page holding read-only C variables because there's no store to optimize away. You would still need it for initializing a local buffer (on the stack). Otherwise GCC will optimize away the initializer for this private buffer that a non-inline function call definitely doesn't have a pointer to. (Escape analysis). It doesn't consider the possibility that the buffer might hold the machine code for the function unless you tell it that via __builtin___clear_cache.
#include <stdio.h>
#include <sys/mman.h>
#include <stdint.h>
// can be non-const if you want, we're using mprotect
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3";
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// hard-coding x86's 4k page size for simplicity.
// also assume that `code` doesn't span a page boundary and that ret0_code is in the same page.
uintptr_t page = (uintptr_t)code & -4096ULL; // round down: -4096 is ~4095, which clears the low 12 bits
mprotect((void*)page, 4096, PROT_READ|PROT_EXEC|PROT_WRITE); // +write in case the page holds any writeable C vars that would crash later code.
// run code
int c = sum (2, 3);
return ret0();
}
I used PROT_READ|PROT_EXEC|PROT_WRITE in this example so it works regardless of where your variable is. If it was a local on the stack and you left out PROT_WRITE, call would fail after making the stack read only when it tried to push a return address.
Also, PROT_WRITE lets you test shellcode that self-modifies, e.g. to edit zeros into its own machine code, or other bytes it was avoiding.
$ gcc -O3 shellcode.c # without -z execstack
$ ./a.out
$ echo $?
0
$ strace ./a.out
...
mprotect(0x55605aa3f000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC) = 0
exit_group(0) = ?
+++ exited with 0 +++
If I comment out the mprotect, it does segfault with recent versions of GNU Binutils ld which no longer put read-only constant data into the same ELF segment as the .text section.
If I did something like ret0_code[2] = 0xc3;, I would need __builtin___clear_cache(ret0_code+2, ret0_code+3) after that to make sure the store wasn't optimized away, but if I don't modify the static arrays then it's not needed after mprotect. It is needed after mmap+memcpy or manual stores, because we want to execute bytes that have been written in C (with memcpy).
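The mprotect example above hard-codes a 4k page and assumes the arrays don't cross a page boundary. A sketch that lifts both assumptions could look like this; page_floor, page_span and make_executable are names I'm introducing for illustration, and the page size is assumed to be a power of two.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Start of the page containing addr; page_size must be a power of two. */
uintptr_t page_floor(uintptr_t addr, uintptr_t page_size)
{
    return addr & ~(page_size - 1);
}

/* Number of bytes, in whole pages, needed to cover [addr, addr + len).
   len must be non-zero. */
size_t page_span(uintptr_t addr, size_t len, uintptr_t page_size)
{
    uintptr_t first = page_floor(addr, page_size);
    uintptr_t last = page_floor(addr + len - 1, page_size);
    return (size_t)(last - first + page_size);
}

/* Make every page touched by the object read+write+exec.
   Returns 0 on success, -1 on failure (same convention as mprotect). */
int make_executable(const void *obj, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0 || len == 0)
        return -1;
    uintptr_t addr = (uintptr_t)obj;
    uintptr_t start = page_floor(addr, (uintptr_t)ps);
    return mprotect((void *)start, page_span(addr, len, (uintptr_t)ps),
                    PROT_READ | PROT_WRITE | PROT_EXEC);
}
```

In the example above, make_executable(code, sizeof code) and make_executable(ret0_code, sizeof ret0_code) would replace the single hard-coded mprotect call.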
You need to include the assembly in-line via a special compiler directive so that it'll properly end up in a code segment. See this guide, for example: http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
Your machine code may be all right, but your CPU objects.
Modern CPUs and operating systems protect memory with per-page permissions. In normal operation, the operating system loads a new program's code into an executable program-text segment and puts its variables and stack in data segments that are marked non-executable, so the CPU refuses to run code from them. Your code is in code[], in a data segment. Thus the segfault.
This will take some effort.
Your code variable is stored in the .data section of your executable:
$ readelf -p .data exploit
String dump of section '.data':
[ 10] H1À
H1À is the value of your variable.
The .data section is not executable:
$ readelf -S exploit
There are 30 section headers, starting at offset 0x1150:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[...]
[24] .data PROGBITS 0000000000601010 00001010
0000000000000014 0000000000000000 WA 0 0 8
All 64-bit processors I'm familiar with support non-executable pages natively in the pagetables. Most newer 32-bit processors (the ones that support PAE) provide enough extra space in their pagetables for the operating system to emulate hardware non-executable pages. You'll need to run either an ancient OS or an ancient processor to get a .data section marked executable.
Because these are just flags in the executable, you ought to be able to set the X flag through some other mechanism, but I don't know how to do so. And your OS might not even let you have pages that are both writable and executable.
You may need to set the page executable before you may call it.
On MS-Windows, see the VirtualProtect function.
URL: http://msdn.microsoft.com/en-us/library/windows/desktop/aa366898%28v=vs.85%29.aspx
Sorry, I couldn't follow above examples which are complicated.
So, I created an elegant solution for executing hex code from C.
Basically, you could use asm and .word keywords to place your instructions in hex format.
See below example:
asm volatile(".rept 1024\n"
CNOP
".endr\n");
where CNOP is defined as below:
#define CNOP ".word 0x00010001 \n"
The c.nop instruction was not supported by my current assembler. So I defined CNOP as the hex encoding of c.nop, written with syntax the assembler does accept, and used it inside asm.
.rept <NUM> ... .endr repeats the enclosed instructions NUM times.
This solution is working and verified.