Linux: is it possible to share code between processes?


I wonder if it's possible for a Linux process to call code located in the memory of another process?
Let's say we have a function f() in process A and we want process B to call it. What I thought about is using mmap with the MAP_SHARED and PROT_EXEC flags to map the memory containing the function code and pass the pointer to B, assuming that f() does not call any other function from A's binary. Will it ever work? If so, how do I determine the size of f() in memory?
=== EDIT ===
I know that shared libraries do exactly that, but I wonder if it's possible to dynamically share code between processes.

Yes, you can do that, but the first process must have first created the shared memory via mmap and either a memory-mapped file, or a shared area created with shm_open.
If you are sharing compiled code then that's what shared libraries were created for. You can link against them in the ordinary way and the sharing will happen automatically, or you can load them manually using dlopen (e.g. for a plugin).
Update:
Since the code has been generated by a compiler, you will have relocations to worry about. The compiler does not produce code that will Just Work anywhere. It expects that the .data section is in a certain place, and that the .bss section has been zeroed. The GOT will need to be populated. Any static constructors will have to be called.
In short, what you want is probably dlopen. This system allows you to open a shared library like it was a file, and then extract function pointers by name. Each program that dlopens the library will share the code sections, thus saving memory, but each will have its own copy of the data section, so they do not interfere with each other.
Beware that you need to compile your library code with -fPIC or else you won't get any code sharing either (actually, the linkers and dynamic loaders for many architectures probably don't support libraries that aren't PIC anyway).
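To make the dlopen route concrete, here is a minimal sketch in C. The helper name call_unary_from_library is made up for illustration; it only uses dlopen, dlsym and dlclose from <dlfcn.h>, and it assumes the symbol you ask for really has the signature double(double).

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: load `libname`, look up `symbol`, call it as
   double f(double) and store the result in *out.
   Returns 0 on success, -1 if the library or the symbol is missing. */
int call_unary_from_library(const char *libname, const char *symbol,
                            double x, double *out)
{
    void *handle = dlopen(libname, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL)
        return -1;

    void *addr = dlsym(handle, symbol);
    if (addr == NULL) {
        dlclose(handle);
        return -1;
    }

    /* Converting void* to a function pointer through an object pointer,
       as the POSIX dlsym documentation suggests. */
    double (*fn)(double);
    *(void **)&fn = addr;

    *out = fn(x);          /* call while the library is still loaded */
    dlclose(handle);
    return 0;
}
```

For example, call_unary_from_library("libm.so.6", "cos", 0.0, &r) should leave 1.0 in r on a typical glibc system (the library name is an assumption about your system). On older glibc versions you have to link with -ldl.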

The standard approach is to put the code of f() in a shared library libfoo.so. Then you could either link to that library (e.g. by building program A with gcc -Wall a.c -lfoo -o a.bin), or load it dynamically (e.g. in program B) using dlopen(3) then retrieving the address of f using dlsym.
When you compile a shared library you want to:
compile each source file foo1.c with gcc -Wall -fPIC -c foo1.c -o foo1.pic.o into position independent code, and likewise for foo2.c into foo2.pic.o
link all of them into libfoo.so with gcc -Wall -shared foo*.pic.o -o libfoo.so ; notice that you can link additional shared libraries into libfoo.so (e.g. by appending -lm to the linking command)
See also the Program Library Howto.
You could play insane tricks by mmap-ing some other /proc/1234/mem but that is not reasonable at all. Use shared libraries.
PS. you can dlopen a big lot (hundreds of thousands) of shared objects lib*.so files; you may want to dlclose them (but practically you don't have to).

It would be possible to do so, but that's exactly what shared libraries are for.
Also, beware that you need to check that the address of the shared memory is the same in both processes; otherwise any "absolute" references (that is, pointers to something in the shared code) will point to the wrong place. And as with shared libraries, the bitness of the code has to be the same, and as with all shared memory, you need to make sure that you don't "mess up" the other process if you modify any of the shared memory.
Determining the size of a function ranges from "hard" to "nearly impossible", depending on the actual code generated, and the level of information you have available. Debug symbols will have the size of a function, but beware that I have seen compilers generate code where two functions share the same "return" piece of code (that is, the compiler generates a jump to another function that has the same bit of code to return the result, because it saves a few bytes of code, and there was already going to be a jump anyway [e.g. there is a if/else that the compiler has to jump around]).

not directly
that's what shared libraries are for
relocations
Oh no! Anyways...
Here's the insane, unreasonable, not-good, purely academic demonstration of this capability. It was fun for me, I hope it's fun for you.
Overview
Program A will use shm_open to create a shared memory object, and mmap to map it into its memory space. Then it will copy some code from a function defined in A to the shared memory. Then program B will open up the shared memory, execute the function, and just for kicks, make a very simple modification to the code. Then A will execute the code to demonstrate the change took effect.
Again, this is no recommendation for how to solve a problem, it's an academic demonstration.
// A.c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
int foo(int y) {
int x = 14;
return x + y;
}
int main(int argc, char *argv[]) {
const size_t mem_size = 0x1000;
// create shared memory objects
int shared_fd = shm_open("foobar2", O_RDWR | O_CREAT, 0777);
ftruncate(shared_fd, mem_size);
void *shared_mem =
mmap(NULL, mem_size, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_SHARED, shared_fd, 0);
// copy function to shared memory
const size_t fn_size = 24;
memcpy(shared_mem, &foo, fn_size);
// wait
getc(stdin);
// execute the shared function
int(*shared_foo)(int) = shared_mem;
printf("shared_foo(3) = %d\n", shared_foo(3));
// clean up
shm_unlink("foobar2");
}
Note the use of PROT_READ | PROT_WRITE | PROT_EXEC in the call to mmap. This program is compiled with
gcc A.c -lrt -o A
The constant fn_size was determined by looking at the output of objdump -dj .text A
...
000000000000088a <foo>:
88a: 55 push %rbp
88b: 48 89 e5 mov %rsp,%rbp
88e: 89 7d ec mov %edi,-0x14(%rbp)
891: c7 45 fc 0e 00 00 00 movl $0xe,-0x4(%rbp)
898: 8b 55 fc mov -0x4(%rbp),%edx
89b: 8b 45 ec mov -0x14(%rbp),%eax
89e: 01 d0 add %edx,%eax
8a0: 5d pop %rbp
8a1: c3 retq
...
I think that's 24 bytes, I dunno. I guess I could put anything larger than that and it would do the same thing. Anything shorter and I'll probably get an exception from the processor. Also, note that the value of x from foo (14, that's (apparently) 0e 00 00 00 in LE) is located at foo + 10. This will be the constant x_offset in program B.
// B.c
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
const int x_offset = 10;
int main(int argc, char *argv[]) {
// create shared memory objects
int shared_fd = shm_open("foobar2", O_RDWR | O_CREAT, 0777);
void *shared_mem = mmap(NULL, 0x1000, PROT_EXEC | PROT_WRITE, MAP_SHARED, shared_fd, 0);
int (*shared_foo)(int) = shared_mem;
int z = shared_foo(13);
printf("result: %d\n", z);
int *x_p = (int*)((char*)shared_mem + x_offset);
*x_p = 100;
shm_unlink("foobar2");
}
Anyways first I run A, then I run B. The output of B is:
result: 27
Then I go back to A and push enter, then I get:
shared_foo(3) = 103
Good enough for me.
/dev/shm/foobar2
To completely eliminate the mystique of all this, after running A you can do something like
xxd /dev/shm/foobar2 | vim -
Then, edit that constant 0e 00 00 00 just like before, then save the file with the 'ol
:w !xxd -r > /dev/shm/foobar2
and push enter in A and see similar results as above.

Related

How to define C functions with LuaJIT?

This:
local ffi = require "ffi"
ffi.cdef[[
int return_one_two_four(){
return 124;
}
]]
local function print124()
print(ffi.C.return_one_two_four())
end
print124()
Throws an error:
Error: main.lua:10: cannot resolve symbol 'return_one_two_four': The specified procedure could not be found.
I have a sort-of moderate grasp on C and wanted to use some of its good sides for a few things, but I couldn't find many examples on LuaJIT's FFI library. It seems like cdef is only used for function declarations and not definitions. How can I make functions in C and then use them in Lua?

LuaJIT is a Lua compiler, but not a C compiler. You have to compile your C code into a shared library first. For example with
gcc -shared -fPIC -o libtest.so test.c
luajit test.lua
with the files test.c and test.lua as below.
test.c
int return_one_two_four(){
return 124;
}
test.lua
local ffi = require"ffi"
local ltest = ffi.load"./libtest.so"
ffi.cdef[[
int return_one_two_four();
]]
local function print124()
print(ltest.return_one_two_four())
end
print124()
Live example on Wandbox
A JIT within LuaJIT
In the comments under the question, someone mentioned a workaround to write functions in machine code and have them executed within LuaJIT on Windows. Actually, the same is possible in Linux by essentially implementing a JIT within LuaJIT. While on Windows you can just insert opcodes into a string, cast it to a function pointer and call it, the same is not possible on Linux due to page protections: ordinary data memory is not executable. Hardened configurations may additionally refuse pages that are writable and executable at the same time, so the safest route is to allocate a page in read-write mode, insert the machine code and then change the mode to read-execute. To this end, simply use the libc wrappers getpagesize, mmap and mprotect to get the page size, map the memory and change its protection. However, if you make even the tiniest mistake, like a typo in one of the opcodes, the program will segfault. I'm using 64-bit assembly because I'm using a 64-bit operating system.
Important: Before executing this on your machine, check the magic numbers in <bits/mman-linux.h>. They are not the same on every system.
local ffi = require"ffi"
ffi.cdef[[
typedef unsigned char uint8_t;
typedef long int off_t;
// from <sys/mman.h>
void *mmap(void *addr, size_t length, int prot, int flags,
int fd, off_t offset);
int munmap(void *addr, size_t length);
int mprotect(void *addr, size_t len, int prot);
// from <unistd.h>
int getpagesize(void);
]]
-- magic numbers from <bits/mman-linux.h>
local PROT_READ = 0x1 -- Page can be read.
local PROT_WRITE = 0x2 -- Page can be written.
local PROT_EXEC = 0x4 -- Page can be executed.
local MAP_PRIVATE = 0x02 -- Changes are private.
local MAP_ANONYMOUS = 0x20 -- Don't use a file.
local page_size = ffi.C.getpagesize()
local prot = bit.bor(PROT_READ, PROT_WRITE)
local flags = bit.bor(MAP_ANONYMOUS, MAP_PRIVATE)
local code = ffi.new("uint8_t *", ffi.C.mmap(ffi.NULL, page_size, prot, flags, -1, 0))
local count = 0
local asmins = function(...)
for _,v in ipairs{ ... } do
assert(count < page_size)
code[count] = v
count = count + 1
end
end
asmins(0xb8, 0x7c, 0x00, 0x00, 0x00) -- mov rax, 124
asmins(0xc3) -- ret
ffi.C.mprotect(code, page_size, bit.bor(PROT_READ, PROT_EXEC))
local fun = ffi.cast("int(*)(void)", code)
print(fun())
ffi.C.munmap(code, page_size)
Live example on Wandbox
How to find opcodes
I see that this answer has attracted some interest, so I want to add something which I was having a hard time with at first, namely how to find opcodes for the instructions you want to perform. There are some resources online, most notably the Intel® 64 and IA-32 Architectures Software Developer Manuals, but nobody wants to go through thousands of PDF pages just to find out how to do mov rax, 124. Therefore some people have made tables which list instructions and corresponding opcodes, e.g. http://ref.x86asm.net/, but looking up opcodes in a table is cumbersome as well because even mov can have many different opcodes depending on what the target and source operands are. So what I do instead is I write a short assembly file, for example
mov rax, 124
ret
You might wonder, why there are no functions and no things like segment .text in my assembly file. Well, since I don't want to ever link it, I can just leave all of that out and save some typing. Then just assemble it using
$ nasm -felf64 -l test.lst test.s
The -felf64 option tells the assembler that I'm using 64-bit syntax, the -l test.lst option that I want to have a listing of the generated code in the file test.lst. The listing looks similar to this:
$ cat test.lst
1 00000000 B87C000000 mov rax, 124
2 00000005 C3 ret
The third column contains the opcodes I am interested in. Just split these into units of 1 byte and insert them into your program, i.e. B87C000000 becomes 0xb8, 0x7c, 0x00, 0x00, 0x00 (hexadecimal numbers are luckily case-insensitive in Lua and I like lowercase better).

Technically you can do the sorts of things you want to do without too much trouble (as long as the code is simple enough).
Using something like this:
https://github.com/nucular/tcclua
With tcc (which is very small, and you can even deploy with it easily) it's quite a nice way to have the best of both worlds, all in a single package :)

LuaJIT includes a recognizer for C declarations, but it isn't a full-fledged C compiler. The purpose of its FFI system is to be able to define what C functions a particular DLL exports so that it can load that DLL (via ffi.load) and allow you to call those functions from Lua.
LuaJIT can load pre-compiled code through a DLL C-based interface, but it cannot compile C itself.

Why does Windows require DLL data to be imported?

On Windows data can be loaded from DLLs, but it requires indirection through a pointer in the import address table. As a result, the compiler must know if an object that is being accessed is being imported from a DLL by using the __declspec(dllimport) type specifier.
This is unfortunate because it means that a header for a Windows library designed to be used as either a static library or a dynamic library needs to know which version of the library the program is linking to. This requirement is not applicable to functions, which are transparently emulated for DLLs with a stub function calling the real function, whose address is stored in the import address table.
On Linux the dynamic linker (ld.so) copies the values of all linked data objects from a shared object into a private mapped region for each process. This doesn't require indirection because the address of the private mapped region is local to the module, so its address is decided when the program is linked (and in the case of position independent executables, relative addressing is used).
Why doesn't Windows do the same? Is there a situation where a DLL might be loaded more than once, and thus require multiple copies of linked data? Even if that was the case, it wouldn't be applicable to read only data.
It seems that the MSVCRT handles this issue by defining the _DLL macro when targeting the dynamic C runtime library (with the /MD or /MDd flag), then using that in all standard headers to conditionally declare all exported symbols with __declspec(dllimport). I suppose you could reuse this macro if you only supported statically linking when using the static C runtime and dynamically linking when using the dynamic C runtime.
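A library that wants to support both static and dynamic linking usually does something similar with its own macro. Here is a rough sketch of that header pattern. All MYLIB_* names and the mylib_answer variable are made up for illustration, and the header and source parts are shown in one block only to keep the example self-contained.

```c
/* --- mylib.h (hypothetical) ------------------------------------- */
/* MYLIB_DLL      : defined by anyone building or using the DLL form
   MYLIB_BUILDING : defined only while compiling the library itself */
#if defined(_WIN32) && defined(MYLIB_DLL)
#  if defined(MYLIB_BUILDING)
#    define MYLIB_API __declspec(dllexport)
#  else
#    define MYLIB_API __declspec(dllimport)  /* required for data */
#  endif
#else
#  define MYLIB_API   /* static library, or a non-Windows platform */
#endif

extern MYLIB_API int mylib_answer;      /* exported data object */
MYLIB_API int mylib_get_answer(void);   /* exported function    */

/* --- mylib.c (hypothetical) ------------------------------------- */
int mylib_answer = 42;

int mylib_get_answer(void)
{
    return mylib_answer;
}
```

A program linking against the DLL compiles with MYLIB_DLL defined, so the declaration of mylib_answer carries __declspec(dllimport) and the compiler emits the extra load through the import address table. A program linking statically leaves MYLIB_DLL undefined and gets a plain extern declaration.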
References:
LNK4217 - Russ Keldorph's WebLog (emphasis mine)
__declspec(dllimport) can be used on both code and data, and its semantics are subtly different between the two. When applied to a routine call, it is purely a performance optimization. For data, it is required for correctness.
[...]
Importing data
If you export a data item from a DLL, you must declare it with __declspec(dllimport) in the code that accesses it. In this case, instead of generating a direct load from memory, the compiler generates a load through a pointer, resulting in one additional indirection. Unlike calls, where the linker will fix up the code correctly whether the routine was declared __declspec(dllimport) or not, accessing imported data requires __declspec(dllimport). If omitted, the code will wind up accessing the IAT entry instead of the data in the DLL, probably resulting in unexpected behavior.
Importing into an Application Using __declspec(dllimport)
Using __declspec(dllimport) is optional on function declarations, but the compiler produces more efficient code if you use this keyword. However, you must use __declspec(dllimport) for the importing executable to access the DLL's public data symbols and objects.
Importing Data Using __declspec(dllimport)
When you mark the data as __declspec(dllimport), the compiler automatically generates the indirection code for you.
Importing Using DEF Files (interesting historical notes about accessing the IAT directly)
How do I share data in my DLL with an application or with other DLLs?
By default, each process using a DLL has its own instance of all the DLLs global and static variables.
Linker Tools Warning LNK4217
What happens when you get dllimport wrong? (seems to be unaware of data semantics)
How do I export data from a DLL?
CRT Library Features (documents the _DLL macro)

Linux and Windows use different strategies for accessing data stored in dynamic libraries.
On Linux, an undefined reference to an object is resolved to a library at link time. The linker finds the size of the object and reserves space for it in the .bss or the .rdata segment of the executable. When executed, the dynamic linker (ld.so) resolves the symbol to a dynamic library (again), and copies the object from the dynamic library to the process's memory.
On Windows, an undefined reference to an object is resolved to an import library at link time, and no space is reserved for it. When the module is executed, the dynamic linker resolves the symbol to a dynamic library, and creates a copy on write memory map in the process, backed by a shared data segment in the dynamic library.
The advantage of a copy on write memory map is that if the linked data is unchanged, then it can be shared with other processes. In practice this is a trifling benefit which greatly increases complexity, both for the toolchain and programs using dynamic libraries. For objects which are actually written this is always less efficient.
I suspect, although I have no evidence, that this decision was made for a particular and now outdated use case. Perhaps it was common practice to use large (for the time) read only objects in dynamic libraries on 16-bit Windows (in official Microsoft programs or otherwise). Either way, I doubt anyone at Microsoft has the expertise and time to change it now.
In order to investigate the issue I created a program which writes to an object from a dynamic library. It writes one byte per page (4096 bytes) in the object, then writes the entire object, then retries the initial one byte per page write. If the object is reserved for the process before main is called, the first and third loops should take approximately the same time, and the second loop should take longer than both. If the object is a copy on write map to a dynamic library, the first loop should take at least as long as the second, and the third should take less time than both.
The results are consistent with my hypothesis, and analyzing the disassembly confirms that Linux accesses the dynamic library data at a link time address, relative to the program counter. Surprisingly, Windows not only indirectly accesses the data, the pointer to the data and its length are reloaded from the import address table every loop iteration, with optimizations enabled. This was tested with Visual Studio 2010 on Windows XP, so maybe things have changed, although I doubt they have.
Here are the results for Linux:
$ dd bs=1M count=16 if=/dev/urandom of=libdat.dat
$ xxd -i libdat.dat libdat.c
$ gcc -O3 -g -shared -fPIC libdat.c -o libdat.so
$ gcc -O3 -g -no-pie -L. -ldat dat.c -o dat
$ LD_LIBRARY_PATH=. ./dat
local = 0x1601060
libdat_dat = 0x601040
libdat_dat_len = 0x601020
dirty= 461us write= 12184us retry= 456us
$ nm dat
[...]
0000000000601040 B libdat_dat
0000000000601020 B libdat_dat_len
0000000001601060 B local
[...]
$ objdump -d -j.text dat
[...]
400693: 8b 35 87 09 20 00 mov 0x200987(%rip),%esi # 601020 <libdat_dat_len>
[...]
4006a3: 31 c0 xor %eax,%eax # zero loop counter
4006a5: 48 8d 15 94 09 20 00 lea 0x200994(%rip),%rdx # 601040 <libdat_dat>
4006ac: 0f 1f 40 00 nopl 0x0(%rax) # align loop for efficiency
4006b0: 89 c1 mov %eax,%ecx # store data offset in ecx
4006b2: 05 00 10 00 00 add $0x1000,%eax # add PAGESIZE to data offset
4006b7: c6 04 0a 00 movb $0x0,(%rdx,%rcx,1) # write a zero byte to data
4006bb: 39 f0 cmp %esi,%eax # test loop condition
4006bd: 72 f1 jb 4006b0 <main+0x30> # continue loop if data is left
[...]
Here are the results for Windows:
$ cl /Ox /Zi /LD libdat.c /link /EXPORT:libdat_dat /EXPORT:libdat_dat_len
[...]
$ cl /Ox /Zi dat.c libdat.lib
[...]
$ dat.exe # note low resolution timer means retry is too small to measure
local = 0041EEA0
libdat_dat = 1000E000
libdat_dat_len = 1100E000
dirty= 20312us write= 3125us retry= 0us
$ dumpbin /symbols dat.exe
[...]
9000 .data
1000 .idata
5000 .rdata
1000 .reloc
17000 .text
[...]
$ dumpbin /disasm dat.exe
[...]
004010BA: 33 C0 xor eax,eax # zero loop counter
[...]
004010C0: 8B 15 8C 63 42 00 mov edx,dword ptr [__imp__libdat_dat] # store data pointer in edx
004010C6: C6 04 02 00 mov byte ptr [edx+eax],0 # write a zero byte to data
004010CA: 8B 0D 88 63 42 00 mov ecx,dword ptr [__imp__libdat_dat_len] # store data length in ecx
004010D0: 05 00 10 00 00 add eax,1000h # add PAGESIZE to data offset
004010D5: 3B 01 cmp eax,dword ptr [ecx] # test loop condition
004010D7: 72 E7 jb 004010C0 # continue loop if data is left
[...]
Here is the source code used for both tests:
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
typedef FILETIME time_l;
time_l time_get(void) {
FILETIME ret; GetSystemTimeAsFileTime(&ret); return ret;
}
long long int time_diff(time_l const *c1, time_l const *c2) {
return 1LL*c2->dwLowDateTime/100-c1->dwLowDateTime/100+c2->dwHighDateTime*100000-c1->dwHighDateTime*100000;
}
#else
#include <unistd.h>
#include <time.h>
#include <stdlib.h>
typedef struct timespec time_l;
time_l time_get(void) {
time_l ret; clock_gettime(CLOCK_MONOTONIC, &ret); return ret;
}
long long int time_diff(time_l const *c1, time_l const *c2) {
return 1LL*c2->tv_nsec/1000-c1->tv_nsec/1000+c2->tv_sec*1000000-c1->tv_sec*1000000;
}
#endif
#ifndef PAGESIZE
#define PAGESIZE 4096
#endif
#ifdef _WIN32
#define DLLIMPORT __declspec(dllimport)
#else
#define DLLIMPORT
#endif
extern DLLIMPORT unsigned char volatile libdat_dat[];
extern DLLIMPORT unsigned int libdat_dat_len;
unsigned int local[4096];
int main(void) {
unsigned int i;
time_l t1, t2, t3, t4;
long long int d1, d2, d3;
t1 = time_get();
for(i=0; i < libdat_dat_len; i+=PAGESIZE) {
libdat_dat[i] = 0;
}
t2 = time_get();
for(i=0; i < libdat_dat_len; i++) {
libdat_dat[i] = 0xFF;
}
t3 = time_get();
for(i=0; i < libdat_dat_len; i+=PAGESIZE) {
libdat_dat[i] = 0;
}
t4 = time_get();
d1 = time_diff(&t1, &t2);
d2 = time_diff(&t2, &t3);
d3 = time_diff(&t3, &t4);
printf("%-15s= %18p\n%-15s= %18p\n%-15s= %18p\n", "local", local, "libdat_dat", libdat_dat, "libdat_dat_len", &libdat_dat_len);
printf("dirty=%9lldus write=%9lldus retry=%9lldus\n", d1, d2, d3);
return 0;
}
I sincerely hope someone else benefits from my research. Thanks for reading!

C linker and duplicate symbols

I did an experiment to see what kind of assembly language would be generated if I try to get the same function to compile in there twice. I did the following:
I created two simple test files and their corresponding headers. Let's call them a.c/a.h, and b.c/b.h. Here are the contents of those files:
a.h:
#ifndef __A_H__
#define __A_H__
int a( void );
#endif
b.h:
#ifndef __B_H__
#define __B_H__
int b( void );
#endif
a.c:
#include "a.h"
int a( void )
{
return 1;
}
b.c:
#include "b.h"
#include "a.h"
int b( void )
{
return 1 + a();
}
I then created a static archive for a:
gcc -c a.c -o a.o
ar -rsc a.a a.o
and the same for b, including the static archive for a this time:
gcc -c b.c -o b.o
ar -rsc b.a a.a b.o
At this point, I disassemble the static archive for b to verify that it has assembly code for both functions a() and b(). It does.
Now, I define one last file:
main.c:
#include <stdio.h>
#include "a.h"
#include "b.h"
int main( void )
{
printf( "%d %d\n", a(), b() );
return 0;
}
and I compile it thusly:
gcc main.c a.a b.a -o main
This works fine. When I disassemble it, I see the following definitions for a and b in the code:
0000000000400561 <a>:
400561: 55 push %rbp
400562: 48 89 e5 mov %rsp,%rbp
400565: b8 01 00 00 00 mov $0x1,%eax
40056a: 5d pop %rbp
40056b: c3 retq

000000000040056c <b>:
40056c: 55 push %rbp
40056d: 48 89 e5 mov %rsp,%rbp
400570: e8 ec ff ff ff callq 400561 <a>
400575: 83 c0 01 add $0x1,%eax
400578: 5d pop %rbp
400579: c3 retq
40057a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
As you can see, the code has clearly defined b as calling a rather than inlining it, however, there is only one definition of a in the code, no duplicates.
It seems that gcc has either:
Detected the duplicate object code and removed the duplicates
--or--
the b archive was used first, and it included the reference to int a(), so the a archive was ignored.
My question is: is this behavior circumstantial to my test or is it standard, and can I expect the same behavior from other compilers? Obviously duplicate code is one problem, however there could be duplicate global references as well. Is it safe/good practice to build a large application that has multiple dependency paths to the same static archive? Are there less obvious situations than just duplicate symbol names where issues can arise when doing this?
Asking this because I've been playing with this idea for a project I'm on, and want to make the right choices.

My question is: is this behavior circumstantial to my test or is it standard, and can I expect the same behavior from other compilers?
As far as the compiler itself is concerned, there is no issue: you have one definition for each function among your sources.
As far as ar is concerned, you also have no issue: neither of the archives you built contains any duplicate symbols.
Different linkers may exhibit different behaviors, however. It is conceivable that some would reject linking archives that contain duplicate external symbols. Typical UNIX linkers will handle the situation you present, but they may vary in some details, such as whether a duplicate copy of function a() is included in the binary.
Obviously duplicate code is one problem, however there could be duplicate global references as well. Is it safe/good practice to build a large application that has multiple dependency paths to the same static archive?
"Multiple paths to the same static archive" does not seem to be a good characterization of the situation you present. In neither case do you provide the same archive more than once. Rather, in the b case you provide different archives with duplicate members. Linkers generally do not have problems with specifying the same archive multiple times in the same link command. Under some circumstances it may even be necessary to do so; it should not present a problem.
Providing distinct archives with duplicate members probably will not present a problem, except possibly for bloating your code with duplicate function implementations. This is a bit less certain, but I doubt it would present a problem in practice.
Whether that's good practice is a matter of opinion, but I'm inclined to think not. It's also not clear to me what gain you see in such an approach. On the other hand, I won't be sharpening any stakes or preparing any kindling if you decide to go ahead anyway.

How to get c code to execute hex machine code?

I want a simple C method to be able to run hex bytecode on a Linux 64 bit machine. Here's the C program that I have:
char code[] = "\x48\x31\xc0";
#include <stdio.h>
int main(int argc, char **argv)
{
int (*func) ();
func = (int (*)()) code;
(int)(*func)();
printf("%s\n","DONE");
}
The code that I am trying to run ("\x48\x31\xc0") I obtained by writing this simple assembly program (it's not supposed to really do anything)
.text
.globl _start
_start:
xorq %rax, %rax
and then compiling and objdump-ing it to obtain the bytecode.
However, when I run my C program I get a segmentation fault. Any ideas?

Machine code has to be in an executable page. Your char code[] is in the read+write data section, without exec permission, so the code cannot be executed from there.
Here is a simple example of allocating an executable page with mmap:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
int (*sum) (int, int) = NULL;
// allocate executable buffer
sum = mmap (0, sizeof(code), PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// copy code to buffer
memcpy (sum, code, sizeof(code));
// doesn't actually flush cache on x86, but ensure memcpy isn't
// optimized away as a dead store.
__builtin___clear_cache ((char *)sum, (char *)sum + sizeof(code)); // GNU C; the range covers the copied bytes
// run code
int a = 2;
int b = 3;
int c = sum (a, b);
printf ("%d + %d = %d\n", a, b, c);
}
See another answer on this question for details about __builtin___clear_cache.

Until recent Linux kernel versions (sometime before 5.4), you could simply compile with gcc -z execstack - that would make all pages executable, including read-only data (.rodata), and read-write data (.data) where char code[] = "..." goes.
Now -z execstack only applies to the actual stack, so it currently works only for non-const local arrays. i.e. move char code[] = ... into main.
See Linux default behavior against `.data` section for the kernel change, and Unexpected exec permission from mmap when assembly files included in the project for the old behaviour: enabling Linux's READ_IMPLIES_EXEC process for that program. (In Linux 5.4, that Q&A shows you'd only get READ_IMPLIES_EXEC for a missing PT_GNU_STACK, like a really old binary; modern GCC -z execstack would set PT_GNU_STACK = RWX metadata in the executable, which Linux 5.4 would handle as making only the stack itself executable. At some point before that, PT_GNU_STACK = RWX did result in READ_IMPLIES_EXEC.)
The other option is to make system calls at runtime to copy into an executable page, or change permissions on the page it's in. That's still more complicated than using a local array to get GCC to copy code into executable stack memory.
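As a rough sketch of the "change permissions on the page it's in" variant: round the array's address down to a page boundary and call mprotect on that range. The function name run_from_data_array and the machine code bytes are made up for illustration; the code is x86-64 only and returns -1 elsewhere. Note that this makes everything else sharing those pages executable too, which is one reason it's only suitable for experiments.

```c
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* B8 2A 00 00 00 = mov eax,42 ; C3 = ret.  Lives in a data section. */
static unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

/* Hypothetical helper: returns 42 by executing `code` in place,
   or -1 on failure or on a non-x86-64 build. */
int run_from_data_array(void)
{
#if defined(__x86_64__)
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    /* mprotect needs a page-aligned start address. */
    uintptr_t start = (uintptr_t)code & ~((uintptr_t)page - 1);
    uintptr_t end   = (uintptr_t)code + sizeof code;

    /* Keep read + write (other data lives on the page) and add exec. */
    if (mprotect((void *)start, (size_t)(end - start),
                 PROT_READ | PROT_WRITE | PROT_EXEC) != 0)
        return -1;

    int (*fn)(void) = (int (*)(void))(void *)code;
    return fn();
#else
    return -1;
#endif
}
```

Systems with a policy against writable and executable pages will reject the mprotect call, in which case the function returns -1 instead of crashing.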
(I don't know if there's an easy way to enable READ_IMPLIES_EXEC under modern kernels. Having no GNU-stack attribute at all in an ELF binary does that for 32-bit code, but not 64-bit.)
Yet another option is __attribute__((section(".text"))) const char code[] = ...;
Working example: https://godbolt.org/z/draGeh.
If you need the array to be writeable, e.g. for shellcode that inserts some zeros into strings, you could maybe link with ld -N. But probably best to use -z execstack and a local array.
Two problems in the question:
exec permission on the page, because you used an array that will go in the noexec read+write .data section.
your machine code doesn't end with a ret instruction so even if it did run, execution would fall into whatever was next in memory instead of returning.
And BTW, the REX prefix is totally redundant. "\x31\xc0" xor eax,eax has exactly the same effect as xor rax,rax.
You need the page containing the machine code to have execute permission. x86-64 page tables have a bit for execute permission that is separate from read permission, unlike legacy 386 page tables.
The easiest way to get static arrays to be in read+exec memory was to compile with gcc -z execstack. (Used to make the stack and other sections executable, now only the stack).
Until recently (2018 or 2019), the standard toolchain (binutils ld) would put section .rodata into the same ELF segment as .text, so they'd both have read+exec permission. Thus using const char code[] = "..."; was sufficient for executing manually-specified bytes as data, without execstack.
But on my Arch Linux system with GNU ld (GNU Binutils) 2.31.1, that's no longer the case. readelf -a shows that the .rodata section went into an ELF segment with .eh_frame_hdr and .eh_frame, and it only has Read permission. .text goes in a segment with Read + Exec, and .data goes in a segment with Read + Write (along with the .got and .got.plt). (What's the difference of section and segment in ELF file format)
I assume this change is to make ROP and Spectre attacks harder by not having read-only data in executable pages where sequences of useful bytes could be used as "gadgets" that end with the bytes for a ret or jmp reg instruction.
// TODO: use char code[] = {...} inside main, with -z execstack, for current Linux
// Broken on recent Linux, used to work without execstack.
#include <stdio.h>
// can be non-const if you use gcc -z execstack. static is also optional
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3"; // xor eax,eax ; ret
// the compiler will append a 0 byte to terminate the C string,
// but that's fine. It's after the ret.
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// run code
int c = sum (2, 3);
return ret0();
}
On older Linux systems: gcc -O3 shellcode.c && ./a.out (Works because of const on global/static arrays)
On Linux before 5.5 (or so) gcc -O3 -z execstack shellcode.c && ./a.out (works because of -zexecstack regardless of where your machine code is stored). Fun fact: gcc allows -zexecstack with no space, but clang only accepts clang -z execstack.
These also work on Windows, where read-only data goes in .rdata instead of .rodata.
The compiler-generated main looks like this (from objdump -drwC -Mintel). You can run it inside gdb and set breakpoints on code and ret0_code
(I actually used gcc -no-pie -O3 -zexecstack shellcode.c, hence the addresses near 401000.)
0000000000401020 <main>:
401020: 48 83 ec 08 sub rsp,0x8 # stack aligned by 16 before a call
401024: be 03 00 00 00 mov esi,0x3
401029: bf 02 00 00 00 mov edi,0x2 # 2 args
40102e: e8 d5 0f 00 00 call 402008 <code> # note the target address in the next page
401033: 48 83 c4 08 add rsp,0x8
401037: e9 c8 0f 00 00 jmp 402004 <ret0_code> # optimized tailcall
Or use system calls to modify page permissions
Instead of compiling with gcc -zexecstack, you can instead use mmap(PROT_EXEC) to allocate new executable pages, or mprotect(PROT_EXEC) to change existing pages to executable. (Including pages holding static data.) You also typically want at least PROT_READ and sometimes PROT_WRITE, of course.
Using mprotect on a static array means you're still executing the code from a known location, maybe making it easier to set a breakpoint on it.
On Windows you can use VirtualAlloc or VirtualProtect.
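As a rough sketch of the mmap/mprotect route on Linux x86-64: the helper below (jit_sum is a made-up name, not part of any library) maps a page read+write, copies in the same lea/ret bytes used above, then switches the page to read+exec before calling it. It assumes the x86-64 System V calling convention and GCC or clang for __builtin___clear_cache.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: runs x86-64 machine code for "return a + b" from a
 * freshly mapped page. Returns 0 on success and stores the sum in *out;
 * returns -1 if a system call fails. */
int jit_sum(int a, int b, int *out)
{
    /* lea eax,[rdi+rsi] ; ret   (assumes the x86-64 System V ABI) */
    static const unsigned char code[] = { 0x8D, 0x04, 0x37, 0xC3 };

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    /* Start out writable but not executable. */
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memcpy(buf, code, sizeof code);
    /* Tell the optimizer (and non-x86 CPUs) that these bytes will be executed. */
    __builtin___clear_cache((char *)buf, (char *)buf + sizeof code);

    /* Drop write permission and add exec before calling into the page. */
    if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, len);
        return -1;
    }

    int (*fn)(int, int) = (int (*)(int, int))buf;
    *out = fn(a, b);

    munmap(buf, len);
    return 0;
}
```

Calling jit_sum(2, 3, &out) should leave 5 in out on such a system. Keeping the page writable only while copying avoids memory that is writable and executable at the same time, which some hardened systems refuse.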
Telling the compiler that data is executed as code
Normally compilers like GCC assume that data and code are separate. This is like type-based strict aliasing, but even using char* doesn't make it well-defined to store into a buffer and then call that buffer as a function pointer.
In GNU C, you also need to use __builtin___clear_cache(buf, buf + len) after writing machine code bytes to a buffer, because the optimizer doesn't treat dereferencing a function pointer as reading bytes from that address. Dead-store elimination can remove the stores of machine code bytes into a buffer, if the compiler proves that the store isn't read as data by anything. https://codegolf.stackexchange.com/questions/160100/the-repetitive-byte-counter/160236#160236 and https://godbolt.org/g/pGXn3B has an example where gcc really does do this optimization, because gcc "knows about" malloc.
(And on non-x86 architectures where I-cache isn't coherent with D-cache, it actually will do any necessary cache syncing. On x86 it's purely a compile-time optimization blocker and doesn't expand to any instructions itself.)
Re: the weird name with three underscores: It's the usual __builtin_name pattern, but name is __clear_cache.
My edit on @AntoineMathys's answer added this.
In practice GCC/clang don't "know about" mmap(MAP_ANONYMOUS) the way they know about malloc. So in practice the optimizer will assume that the memcpy into the buffer might be read as data by the non-inline function call through the function pointer, even without __builtin___clear_cache(). (Unless you declared the function type as __attribute__((const)).)
On x86, where I-cache is coherent with data caches, having the stores happen in asm before the call is sufficient for correctness. On other ISAs, __builtin___clear_cache() will actually emit special instructions as well as ensuring the right compile-time ordering.
It's good practice to include it when copying code into a buffer because it doesn't cost performance, and stops hypothetical future compilers from breaking your code. (e.g. if they do understand that mmap(MAP_ANONYMOUS) gives newly-allocated anonymous memory that nothing else has a pointer to, just like malloc.)
With current GCC, I was able to provoke GCC into really doing an optimization we don't want by using __attribute__((const)) to tell the optimizer sum() is a pure function (that only reads its args, not global memory). GCC then knows sum() can't read the result of the memcpy as data.
With another memcpy into the same buffer after the call, GCC does dead-store elimination into just the 2nd store after the call. This results in no store before the first call so it executes the 00 00 add [rax], al bytes, segfaulting.
// demo of a problem on x86 when not using __builtin___clear_cache
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
__attribute__((const)) int (*sum) (int, int) = NULL;
// copy code to executable buffer
sum = mmap (0,sizeof(code),PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON,-1,0);
memcpy (sum, code, sizeof(code));
//__builtin___clear_cache(sum, sum + sizeof(code));
int c = sum (2, 3);
//printf ("%d + %d = %d\n", a, b, c);
memcpy(sum, (char[]){0x31, 0xc0, 0xc3, 0}, 4); // xor-zero eax, ret, padding for a dword store
//__builtin___clear_cache(sum, sum + 4);
return sum(2,3);
}
Compiled on the Godbolt compiler explorer with GCC9.2 -O3
main:
push rbx
xor r9d, r9d
mov r8d, -1
mov ecx, 34
mov edx, 7
mov esi, 4
xor edi, edi
sub rsp, 16
call mmap
mov esi, 3
mov edi, 2
mov rbx, rax
call rax # call before store
mov DWORD PTR [rbx], 12828721 # 0xC3C031 = xor-zero eax, ret
add rsp, 16
pop rbx
ret # no 2nd call, CSEd away because const and same args
Passing different args would have gotten another call reg, but even with __builtin___clear_cache the two sum(2,3) calls can CSE. __attribute__((const)) doesn't respect changes to the machine code of a function. Don't do it. It's safe if you're going to JIT the function once and then call many times, though.
Uncommenting the first __clear_cache results in
mov DWORD PTR [rax], -1019804531 # lea; ret
call rax
mov DWORD PTR [rbx], 12828721 # xor-zero; ret
... still CSE and use the RAX return value
The first store is there because of __clear_cache and the sum(2,3) call. (Removing the first sum(2,3) call does let dead-store elimination happen across the __clear_cache.)
The second store is there because the side-effect on the buffer returned by mmap is assumed to be important, and that's the final value main leaves.
Godbolt's ./a.out option to run the program still seems to always fail (exit status of 255); maybe it sandboxes JITing? It works on my desktop with __clear_cache and crashes without.
mprotect on a page holding existing C variables.
You can also give a single existing page read+write+exec permission. This is an alternative to compiling with -z execstack
You don't need __clear_cache on a page holding read-only C variables because there's no store to optimize away. You would still need it for initializing a local buffer (on the stack). Otherwise GCC will optimize away the initializer for this private buffer that a non-inline function call definitely doesn't have a pointer to. (Escape analysis). It doesn't consider the possibility that the buffer might hold the machine code for the function unless you tell it that via __builtin___clear_cache.
#include <stdio.h>
#include <sys/mman.h>
#include <stdint.h>
// can be non-const if you want, we're using mprotect
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3";
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// hard-coding x86's 4k page size for simplicity.
// also assume that `code` doesn't span a page boundary and that ret0_code is in the same page.
uintptr_t page = (uintptr_t)code & -4096ULL; // round down
mprotect((void*)page, 4096, PROT_READ|PROT_EXEC|PROT_WRITE); // +write in case the page holds any writeable C vars that would crash later code.
// run code
int c = sum (2, 3);
return ret0();
}
I used PROT_READ|PROT_EXEC|PROT_WRITE in this example so it works regardless of where your variable is. If it was a local on the stack and you left out PROT_WRITE, call would fail after making the stack read only when it tried to push a return address.
Also, PROT_WRITE lets you test shellcode that self-modifies, e.g. to edit zeros into its own machine code, or other bytes it was avoiding.
$ gcc -O3 shellcode.c # without -z execstack
$ ./a.out
$ echo $?
0
$ strace ./a.out
...
mprotect(0x55605aa3f000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC) = 0
exit_group(0) = ?
+++ exited with 0 +++
If I comment out the mprotect, it does segfault with recent versions of GNU Binutils ld which no longer put read-only constant data into the same ELF segment as the .text section.
If I did something like ret0_code[2] = 0xc3;, I would need __builtin___clear_cache(ret0_code+2, ret0_code+3) after that to make sure the store wasn't optimized away, but if I don't modify the static arrays then it's not needed after mprotect. It is needed after mmap+memcpy or manual stores, because we want to execute bytes that have been written in C (with memcpy).
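The mprotect example above hard-codes 4096 and assumes the array sits within one page. A possible generalisation, sketched here with hypothetical helper names page_span and make_rwx, asks sysconf for the page size and rounds the range outward so that an array straddling a page boundary is fully covered.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: computes the page-aligned range covering
 * [addr, addr + len). page_size must be a power of two. The rounded-down
 * start goes in *start and the length (a whole number of pages) in *span. */
void page_span(uintptr_t addr, size_t len, uintptr_t page_size,
               uintptr_t *start, size_t *span)
{
    uintptr_t mask  = ~(page_size - 1);                      /* ~0xFFF for 4 KiB */
    uintptr_t first = addr & mask;                           /* round start down */
    uintptr_t last  = (addr + len + page_size - 1) & mask;   /* round end up */
    *start = first;
    *span  = (size_t)(last - first);
}

/* Hypothetical helper: gives every page touched by [p, p + len)
 * read+write+exec permission. Returns 0 on success, -1 on failure. */
int make_rwx(const void *p, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0)
        return -1;

    uintptr_t start;
    size_t span;
    page_span((uintptr_t)p, len, (uintptr_t)ps, &start, &span);
    return mprotect((void *)start, span, PROT_READ | PROT_WRITE | PROT_EXEC);
}
```

With these, the two lines that compute page and call mprotect in main could become make_rwx(code, sizeof code), plus a second call for ret0_code if it might live in a different page.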

You need to include the assembly in-line via a special compiler directive so that it'll properly end up in a code segment. See this guide, for example: http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html

Your machine code may be all right, but your CPU objects.
Modern CPUs manage memory in segments. In normal operation, the operating system loads a new program into a program-text segment and sets up a stack in a data segment. The operating system tells the CPU never to run code in a data segment. Your code is in code[], in a data segment. Thus the segfault.

This will take some effort.
Your code variable is stored in the .data section of your executable:
$ readelf -p .data exploit
String dump of section '.data':
[ 10] H1À
H1À is the value of your variable.
The .data section is not executable:
$ readelf -S exploit
There are 30 section headers, starting at offset 0x1150:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[...]
[24] .data PROGBITS 0000000000601010 00001010
0000000000000014 0000000000000000 WA 0 0 8
All 64-bit processors I'm familiar with support non-executable pages natively in the pagetables. Most newer 32-bit processors (the ones that support PAE) provide enough extra space in their pagetables for the operating system to emulate hardware non-executable pages. You'll need to run either an ancient OS or an ancient processor to get a .data section marked executable.
Because these are just flags in the executable, you ought to be able to set the X flag through some other mechanism, but I don't know how to do so. And your OS might not even let you have pages that are both writable and executable.
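One way to see these permissions from inside the running process on Linux, rather than from the section flags, is to look up an address in /proc/self/maps. This is only a sketch: page_perms is a hypothetical helper, and it assumes the usual "start-end perms ..." line format of that file.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: finds the mapping that contains addr in
 * /proc/self/maps and copies its permission string (for example "r-xp")
 * into perms, which must have room for 5 bytes.
 * Returns 0 on success, -1 if the file can't be read or addr isn't mapped. */
int page_perms(const void *addr, char perms[5])
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    uintptr_t target = (uintptr_t)addr;
    char line[8192];   /* generous, so long path names don't split a line */
    int found = -1;

    while (fgets(line, sizeof line, f)) {
        unsigned long lo, hi;
        char p[5];
        /* Each line begins with: start-end perms ... (addresses in hex) */
        if (sscanf(line, "%lx-%lx %4s", &lo, &hi, p) != 3)
            continue;
        if (target >= lo && target < hi) {
            memcpy(perms, p, sizeof p);
            found = 0;
            break;
        }
    }

    fclose(f);
    return found;
}
```

Passing the address of a non-const global array typically yields rw-p (no x), while passing the address of a function yields r-xp, which matches what readelf reports for .data and .text.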

You may need to set the page executable before you may call it.
On MS-Windows, see the VirtualProtect function.
URL: http://msdn.microsoft.com/en-us/library/windows/desktop/aa366898%28v=vs.85%29.aspx

Sorry, I couldn't follow above examples which are complicated.
So, I created an elegant solution for executing hex code from C.
Basically, you could use asm and .word keywords to place your instructions in hex format.
See below example:
asm volatile(".rept 1024\n"
CNOP
".endr\n");
where CNOP is defined as below:
#define CNOP ".word 0x00010001 \n"
Basically, the c.nop instruction was not supported by my current assembler. So I defined CNOP as the hex equivalent of c.nop with the proper syntax and used it inside asm, which I was already familiar with.
.rept <NUM> .endr will basically, repeat the instruction NUM times.
This solution is working and verified.

Function body on heap

A program has three sections: text, data and stack. The function body lives in the text section. Can we let a function body live on heap? Because we can manipulate memory on heap more freely, we may gain more freedom to manipulate functions.
In the following C code, I copy the text of hello function onto heap and then point a function pointer to it. The program compiles fine by gcc but gives "Segmentation fault" when running.
Could you tell me why?
If my program can not be repaired, could you provide a way to let a function live on heap?
Thanks!
Turing.robot
#include "stdio.h"
#include "stdlib.h"
#include "string.h"
void
hello()
{
printf( "Hello World!\n");
}
int main(void)
{
void (*fp)();
int size = 10000; // large enough to contain hello()
char* buffer;
buffer = (char*) malloc ( size );
memcpy( buffer,(char*)hello,size );
fp = buffer;
fp();
free (buffer);
return 0;
}

My examples below are for Linux x86_64 with gcc, but similar considerations should apply on other systems.
Can we let a function body live on heap?
Yes, absolutely we can. But usually that is called JIT (Just-in-time) compilation. See this for basic idea.
Because we can manipulate memory on heap more freely, we may gain more freedom to manipulate functions.
Exactly, that's why higher level languages like JavaScript have JIT compilers.
In the following C code, I copy the text of hello function onto heap and then point a function pointer to it. The program compiles fine by gcc but gives "Segmentation fault" when running.
Actually you have multiple "Segmentation fault"s in that code.
The first one comes from this line:
int size = 10000; // large enough to contain hello()
If you look at the x86_64 machine code gcc generates for your hello function, it compiles down to a mere 17 bytes:
0000000000400626 <hello>:
400626: 55 push %rbp
400627: 48 89 e5 mov %rsp,%rbp
40062a: bf 98 07 40 00 mov $0x400798,%edi
40062f: e8 9c fe ff ff call 4004d0 <puts@plt>
400634: 90 nop
400635: 5d pop %rbp
400636: c3 retq
So, when you try to copy 10,000 bytes, you run into memory that does not exist and get a "Segmentation fault".
Secondly, you allocate memory with malloc, which gives you a slice of memory that the CPU protects against execution on Linux x86_64, so this would give you another "Segmentation fault".
Under the hood malloc uses system calls like brk, sbrk, and mmap to allocate memory. What you need to do is allocate executable memory using mmap system call with PROT_EXEC protection.
Thirdly, when gcc compiles your hello function, you don't really know what optimisations it will use and what the resulting machine code looks like.
For example, if you see line 4 of the compiled hello function
40062f: e8 9c fe ff ff call 4004d0 <puts@plt>
gcc optimised it to use the puts function instead of printf, but that is not even the main problem.
On x86 architectures you normally call functions using the call assembly mnemonic. However, it is not a single instruction: there are actually many different machine instructions that call can assemble to; see Intel manual page Vol. 2A 3-123, for reference.
In your case the compiler has chosen to use relative addressing for the call assembly instruction.
You can see that, because your call instruction has e8 opcode:
E8 - Call near, relative, displacement relative to next instruction. 32-bit displacement sign extended to 64-bits in 64-bit mode.
Which basically means that instruction pointer will jump the relative amount of bytes from the current instruction pointer.
Now, when you relocate your code to the heap with memcpy, you copy that relative call unchanged. It will now jump relative to the new location of the code in the heap, so the target is no longer puts. That memory will most likely not exist, and you will get another "Segmentation fault".
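The arithmetic can be checked by hand. The sketch below (call_rel32_target is an illustrative name, not an existing API) decodes the displacement of an E8 call and adds it to the address of the next instruction. For the call in hello at 0x40062f with bytes e8 9c fe ff ff, the displacement is -0x164, so the target is 0x400634 - 0x164 = 0x4004d0, matching the disassembly.

```c
#include <stdint.h>

/* Hypothetical helper: given the address of an E8 (call rel32) instruction
 * and its 5 encoded bytes, returns the absolute address it would call. */
uint64_t call_rel32_target(uint64_t insn_addr, const unsigned char insn[5])
{
    /* The displacement is a little-endian signed 32-bit value after the opcode. */
    uint32_t raw = (uint32_t)insn[1]
                 | (uint32_t)insn[2] << 8
                 | (uint32_t)insn[3] << 16
                 | (uint32_t)insn[4] << 24;
    int32_t disp = (int32_t)raw;

    /* It is relative to the next instruction, which starts 5 bytes later. */
    return insn_addr + 5 + (uint64_t)(int64_t)disp;
}
```

Feeding the same five bytes in with a different insn_addr, as happens after memcpy to a new buffer, shifts the result by exactly the distance the code was moved, so it no longer points at puts@plt.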
If my program can not be repaired, could you provide a way to let a function live on heap? Thanks!
Below is working code. Here is what I do:
Execute printf once to make sure gcc includes it in our binary.
Copy the correct number of bytes to the heap, so as not to access memory that does not exist.
Allocate executable memory with mmap and the PROT_EXEC option.
Pass the printf function as an argument to our heap_function, to make sure that gcc calls it through an absolute address instead of a relative call.
Here is the code:
#include "stdio.h"
#include "string.h"
#include <stdint.h>
#include <sys/mman.h>
typedef int (*printf_t)(char* format, char* string);
typedef int (*heap_function_t)(printf_t myprintf, char* str, int a, int b);
int heap_function(printf_t myprintf, char* str, int a, int b) {
myprintf("%s", str);
return a + b;
}
int heap_function_end() {
return 0;
}
int main(void) {
// By printing something here, `gcc` will include `printf`
// function at some address (`0x4004d0` in my case) in our binary,
// with `printf_t` two argument signature.
printf("%s", "Just including printf in binary\n");
// Allocate the correct size of
// executable `PROT_EXEC` memory.
size_t size = (size_t) ((intptr_t) heap_function_end - (intptr_t) heap_function);
char* buffer = (char*) mmap(0, (size_t) size,
PROT_EXEC | PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
memcpy(buffer, (char*)heap_function, size);
// Call our function
heap_function_t fp = (heap_function_t) buffer;
int res = fp((void*) printf, "Hello world, from heap!\n", 1, 2);
printf("a + b = %i\n", res);
}
Save in main.c and run with:
gcc -o main main.c && ./main

In principle it is doable. However, you are copying from "hello", which contains machine instructions that may call, reference, or jump to other addresses. Some of these addresses get fixed up when the application loads. Just copying that code and calling into it would then crash. Also, some systems like Windows have data execution prevention, which stops code stored as data from being executed, as a security measure. Also, how large is "hello"? Trying to copy past the end of it would likely also crash. And you are also dependent on how the compiler implements "hello". Needless to say, this would be very compiler- and platform-dependent, if it worked at all.

I can imagine that this might work on a very simple architecture or with a compiler designed to make it easy.
A few of the many requirements for this work:
All memory references would need to be absolute ... no pc-relative addresses, except . . .
Certain control transfers would need to be pc-relative (so your copied function's local branches work) but it would be nice if other ones would just happen to be absolute, so your module's external control transfers, like printf(), would work.
There are more requirements. Add to this the weirdness of doing this in what is likely to already be a highly complex dynamically linked environment (did you static link it?) and you simply are not ever going to get this to work.
And as Adam points out, there are security mechanisms in place, at least for the stack, to prevent dynamically constructed code from executing at all. You may need to figure out how to turn these off.
You might also be getting clobbered with the memcpy().
You might learn something by tracing this through step-by-step and watching it shoot itself in the head. If the memcpy hack is the problem, perhaps try something like:
f() {
...
}
g() {
...
}
memcpy(dst, f, (intptr_t)g - (intptr_t)f)

Your program is segfaulting because you're memcpy'ing more than just "hello"; that function is not 10000 bytes long, so as soon as you get past hello itself, you segfault because you're accessing memory that doesn't belong to you.
You probably also need to use mmap() at some point to make sure the memory location you're trying to call is actually executable.
There are many systems that do what you seem to want (e.g., Java's JIT compiler creates native code in the heap and executes it), but your example will be way more complicated than that because there's no easy way to know the size of your function at runtime (and it's even harder at compile time, when the compiler hasn't yet decided what optimizations to apply). You can probably do what objdump does and read the executable at runtime to find the right "size", but I don't think that's what you're actually trying to achieve here.
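As a sketch of that last idea on Linux with glibc, the helper below (elf_symbol_size is a hypothetical name) maps /proc/self/exe and searches its .symtab for a function symbol, returning the size the linker recorded. It assumes the binary has not been stripped and does no bounds checking on the ELF structures, so treat it as an illustration only.

```c
#define _GNU_SOURCE
#include <elf.h>
#include <fcntl.h>
#include <link.h>       /* ElfW() picks Elf32_* or Elf64_* for this build */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns the st_size recorded for the function symbol
 * `name` in the running executable's .symtab, or 0 if it can't be found
 * (for example because the binary was stripped). */
size_t elf_symbol_size(const char *name)
{
    size_t result = 0;
    struct stat st;

    int fd = open("/proc/self/exe", O_RDONLY);
    if (fd < 0)
        return 0;
    if (fstat(fd, &st) != 0 || (size_t)st.st_size < sizeof(ElfW(Ehdr))) {
        close(fd);
        return 0;
    }

    const unsigned char *base =
        mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (base == MAP_FAILED)
        return 0;

    const ElfW(Ehdr) *eh = (const ElfW(Ehdr) *)base;
    if (memcmp(eh->e_ident, ELFMAG, SELFMAG) == 0 && eh->e_shoff != 0) {
        const ElfW(Shdr) *sh = (const ElfW(Shdr) *)(base + eh->e_shoff);

        for (size_t i = 0; i < eh->e_shnum && result == 0; i++) {
            if (sh[i].sh_type != SHT_SYMTAB)
                continue;

            /* sh_link of a symbol table names its string table section. */
            const ElfW(Sym) *syms = (const ElfW(Sym) *)(base + sh[i].sh_offset);
            size_t count = sh[i].sh_size / sizeof(ElfW(Sym));
            const char *strtab =
                (const char *)(base + sh[sh[i].sh_link].sh_offset);

            for (size_t j = 0; j < count; j++) {
                if (ELF64_ST_TYPE(syms[j].st_info) == STT_FUNC &&
                    strcmp(strtab + syms[j].st_name, name) == 0) {
                    result = (size_t)syms[j].st_size;
                    break;
                }
            }
        }
    }

    munmap((void *)base, (size_t)st.st_size);
    return result;
}
```

Something like elf_symbol_size("hello") could then replace the guessed 10000 in the question, although the relocation and permission problems described in the other answers would still remain.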

After malloc you should check that the pointer is not null:
buffer = (char*) malloc ( size );
memcpy( buffer,(char*)hello,size );
That might be your problem, since you are trying to allocate a big area in memory. Can you check that?

memcpy( buffer,(char*)hello,size );
hello is not data that can be copied to buffer. You are cheating the compiler, and it is taking its revenge at run time. By casting hello to char*, the program makes the compiler believe it is data, which is not actually the case. Never try to outsmart the compiler.