Sparc Function compilation alignment - c

I want my program such that each function in the binary has some space left after it ends. So that later if some minor change is required only that function is changed with the extra space acting as room for accounting the minor change. -falign-function can do the job but it will not give consistent space. Is there anyway to do it? Or better way to do it?

You could use an inline assembly statement to add a series of nops at the start (or end) of each function. Then later when you need to modify the function, you could remove some of the nops to keep the overall size of the function the same. For example:
int foo(...) {
__asm__ __volatile__("nop; nop; nop; nop;" ::);
...
}
Or you could even reserve large chunks of memory in the function with something like this:
__asm__ __volatile__("ba,a 1f; .skip 1000; 1: ;" ::);
This reserves a big chunk of memory and simply branches around it in the code.

If you are using a sufficiently new compiler, they have recently added a new option: -fprolog-pad=N and -fprolog-pad=M,N which means issue M nops before the function an N nops after it.

Related

How do I call hex data stored in an array with inline assembly?

I have an OS project that I am working on and I am trying to call data that I have read from the disk in C with inline assembly.
I have already tried reading the code and executing it with the assembly call instruction, using inline assembly.
void driveLoop() {
uint16_t sectors = 31;
uint16_t sector = 0;
uint16_t basesector = 40000;
uint32_t i = 40031;
uint16_t code[sectors][256];
int x = 0;
while(x==0) {
read(i);
for (int p=0; p < 256; p++) {
if (readOut[p] == 0) {
} else {
x = 1;
//kprint_int(i);
}
}
i++;
}
kprint("Found sector!\n");
kprint("Loading OS into memory...\n");
for (sector=0; sector<sectors; sector++) {
read(basesector+sector);
for (int p=0; p<256; p++) {
code[sector][p] = readOut[p];
}
}
kprint("Done loading.\n");
kprint("Attempting to call...\n");
asm volatile("call (%0)" : : "r" (&code));
When the inline assembly is called I expect it to run the code from the sectors I read from the "disk" (this is in a VM, because its a hobby OS). What it does instead is it just hangs.
I probably don't much understand how variables, arrays, and assembly work, so if you could fill me in, that would be nice.
EDIT: The data I am reading from the disk is a binary file that was added
to the disk image file with
cat kernel.bin >> disk.img
and the kernel.bin is compiled with
i686-elf-ld -o kernel.bin -Ttext 0x4C4B40 *insert .o files here* --oformat binary
What it does instead is it just hangs.
Run your OS inside BOCHS so you can use BOCHS's built-in debugger to see exactly where it's stuck.
Being able to debug lockups, including with interrupts disabled, is probably very useful...
asm volatile("call (%0)" : : "r" (&code)); is unsafe because of missing clobbers.
But even worse than that it will load a new EIP value from the first 4 bytes of the array, instead of setting EIP to that address. (Unless the data you're loading is an array of pointers, not actual machine code?)
You have the %0 in parentheses, so it's an addressing mode. The assembler will warn you about an indirect call without *, but will assemble it like call *(%eax), with EAX = the address of code[0][0]. You actually want a call *%eax or whatever register the compiler chooses, register-indirect not memory-indirect.
&code and code are both just a pointer to the start of the array; &code doesn't create an anonymous pointer object storing the address of another address. &code takes the address of the array as a whole. code in this context "decays" to a pointer to the first object.
https://gcc.gnu.org/wiki/DontUseInlineAsm (for this).
You can get the compiler to emit a call instruction by casting the pointer to a function pointer.
__builtin___clear_cache(&code[0][0], &code[30][255]); // don't optimize away stores into the buffer
void (*fptr)(void) = (void*)code; // casting to void* instead of the actual target type is simpler
fptr();
That will compile (with optimization enabled) to something like lea 16(%esp), %eax / call *%eax, for 32-bit x86, because your code[][] buffer is an array on the stack.
Or to have it emit a jmp instead, do it at the end of a void function, or return funcptr(); in a non-void function, so the compiler can optimize the call/ret into a jmp tailcall.
If it doesn't return, you can declare it with __attribute__((noreturn)).
Make sure the memory page / segment is executable. (Your uint16_t code[]; is a local, so gcc will allocate it on the stack. This might not be what you want. The size is a compile-time constant so you could make it static, but if you do that for other arrays in other sibling functions (not parent or child), then you lose out on the ability to reuse a big chunk of stack memory for different arrays.)
This is much better than your unsafe inline asm. (You forgot a "memory" clobber, so nothing tells the compiler that your asm actually reads the pointed-to memory). Also, you forgot to declare any register clobbers; presumably the block of code you loaded will have clobbered some registers if it returns, unless it's written to save/restore everything.
In GNU C you do need to use __builtin__clear_cache when casting a data pointer to a function pointer. On x86 it doesn't actually clear any cache, it's telling the compiler that the stores to that memory are not dead because it's going to be read by execution. See How does __builtin___clear_cache work?
Without that, gcc could optimize away the copying into uint16_t code[sectors][256]; because it looks like a dead store. (Just like with your current inline asm which only asks for the pointer in a register.)
As a bonus, this part of your OS becomes portable to other architectures, including ones like ARM without coherent instruction caches where that builtin expands to a actual instructions. (On x86 it purely affects the optimizer).
read(basesector+sector);
It would probably be a good idea for your read function to take a destination pointer to read into, so you don't need to bounce data through your readOut buffer.
Also, I don't see why you'd want to declare your code as a 2D array; sectors are an artifact of how you're doing your disk I/O, not relevant to using the code after it's loaded. The sector-at-a-time thing should only be in the code for the loop that loads the data, not visible in other parts of your program.
char code[sectors * 512]; would be good.

Hot Patching A Function

I'm trying to hot patch an exe in memory, the source is available but I'm doing this for learning purposes. (so please no comments suggesting i modify the original source or use detours or any other libs)
Below are the functions I am having problems with.
vm_t* VM_Create( const char *module, intptr_t (*systemCalls)(intptr_t *), vmInterpret_t interpret )
{
MessageBox(NULL, L"Oh snap! We hooked VM_Create!", L"Success!", MB_OK);
return NULL;
}
void Hook_VM_Create(void)
{
DWORD dwBackup;
VirtualProtect((void*)0x00477C3E, 7, PAGE_EXECUTE_READWRITE, &dwBackup);
//Patch the original VM_Create to jump to our detoured one.
BYTE *jmp = (BYTE*)malloc(5);
uint32_t offset = 0x00477C3E - (uint32_t)&VM_Create; //find the offset of the original function from our own
memset((void*)jmp, 0xE9, 1);
memcpy((void*)(jmp+1), &offset, sizeof(offset));
memcpy((void*)0x00477C3E, jmp, 5);
free(jmp);
}
I have a function VM_Create that I want to be called instead of the original function. I have not yet written a trampoline so it crashes (as expected). However the message box does not popup that I have detoured the original VM create to my own. I believe it is the way I'm overwriting the original instructions.
I can see a few issues.
I assume that 0x00477C3E is the address of the original VM_Create function. You really should not hard code this. Use &VM_Create instead. Of course this will mean that you need to use a different name for your replacement function.
The offset is calculated incorrectly. You have the sign wrong. What's more the offset is applied to the instruction pointer at the end of the instruction and not the beginning. So you need to shift it by 5 (the size of the instruction). The offset should be a signed integer also.
Ideally, if you take into account my first point the code would look like this:
int32_t offset = (int32_t)&New_VM_Create - ((int32_t)&VM_Create+5);
Thanks to Hans Passant for fixing my own silly sign error in the original version!
If you are working on a 64 bit machine you need to do your arithmetic in 64 bits and, once you have calculated the offset, truncate it to a 32 bit offset.
Another nuance is that you should reset the memory to being read-only after having written the new JMP instruction, and call FlushInstructionCache.

C preprocessor variable constant?

I'm writing a program where a constant is needed but the value for the constant will be determined during run time. I have an array of op codes from which I want to randomly select one and _emit it into the program's code. Here is an example:
unsigned char opcodes[] = {
0x60, // pushad
0x61, // popad
0x90 // nop
}
int random_byte = rand() % sizeof(opcodes);
__asm _emit opcodes[random_byte]; // optimal goal, but invalid
However, it seems _emit can only take a constant value. E.g, this is valid:
switch(random_byte) {
case 2:
__asm _emit 0x90
break;
}
But this becomes unwieldy if the opcodes array grows to any considerable length, and also essentially eliminates the worth of the array since it would have to be expressed in a less attractive manner.
Is there any way to neatly code this to facilitate the growth of the opcodes array? I've tried other approaches like:
#define OP_0 0x60
#define OP_1 0x61
#define OP_2 0x90
#define DO_EMIT(n) __asm _emit OP_##n
// ...
unsigned char abyte = opcodes[random_byte];
DO_EMIT(abyte)
In this case, the translation comes out as OP_abyte, so it would need a call like DO_EMIT(2), which forces me back to the switch statement and enumerating every element in the array.
It is also quite possible that I have an entirely invalid approach here. Helpful feedback is appreciated.
I'm not sure what compiler/assembler you are using, but you could do what you're after in GCC using a label. At the asm site, you'd write it as:
asm (
"target_opcode: \n"
".byte 0x90\n" ); /* Placeholder byte */
...and at the place where you want to modify that code, you'd use:
extern volatile unsigned char target_opcode[];
int random_byte = rand() % sizeof(opcodes);
target_opcode[0] = random_byte;
Perhaps you can translate this into your compiler's dialect of asm.
Note that all the usual caveats about self-modifying code apply: the code segment might not be writeable, and you may have to flush the I-cache before executing the modified code.
You won't be able to do any randomness in the C preprocessor AFAIK. The closest you could get is generating the random value outside. For instance:
cpp -DRND_VAL=$RANDOM ...
(possibly with a modulus to maintain the value within a range), at least in UNIX-based systems. Then, you can use the definition value, that will be essentially random.
How about
char operation[4]; // is it really only 1 byte all the time?
operation[0] = random_whatever();
operation[1] = 0xC3; // RET
void (*func)() = &operation[0];
func();
Note that in this example you'd need to add a RET instruction to the buffer, so that in the end you end up at the right instruction after calling func().
Using an _emit at runtime into your program code is kind of like compiling the program you're running while the program is running.
You should describe your end-goal rather than just your idea of using _emit at runtime- there might be abetter way to accomplish what you want. Maybe you can write your opcodes to a regular data array and somehow make that bit of memory executable. That might be a little tricky due to security considerations, but it can be done.

Performance of array of functions over if and switch statements

I am writing a very performance critical part of the code and I had this crazy idea about substituting case statements (or if statements) with array of function pointers.
Let me demonstrate; here goes the normal version:
while(statement)
{
/* 'option' changes on every iteration */
switch(option)
{
case 0: /* simple task */ break;
case 1: /* simple task */ break;
case 2: /* simple task */ break;
case 3: /* simple task */ break;
}
}
And here is the "callback function" version:
void task0(void) {
/* simple task */
}
void task1(void) {
/* simple task */
}
void task2(void) {
/* simple task */
}
void task3(void) {
/* simple task */
}
void (*task[4]) (void);
task[0] = task0;
task[1] = task1;
task[2] = task2;
task[3] = task3;
while(statement)
{
/* 'option' changes on every iteration */
/* and now we call the function with 'case' number */
(*task[option]) ();
}
So which version will be faster? Is the overhead of the function call eliminating speed benefit over normal switch (or if) statement?
Ofcourse the latter version is not so readable but I am looking for all the speed I can get.
I am about to benchmark this when I get things set up but if someone has an answer already, I wont bother.
I think at the end of the day your switch statements will be the fastest, because function pointers have the "overhead" of the lookup of the function and the function call itself. A switch is just a jmp table straight. It of course depends on different things which only testing can give you an answer to. That's my two cent worth.
The switch statement should be compiled into a branch table, which is essentially the same thing as your array of functions, if your compiler has at least basic optimization capability.
Which version will be faster depends. The naive implementation of switch is a huge if ... else if ... else if ... construction meaning it takes on average O(n) time to execute where n is the number of cases. Your jump table is O(1) so the more different cases there are and the more the later cases are used, the more likely the jump table is to be better. For a small number of cases or for switches where the first case is chosen more frequently than others, the naive implementation is better. The matter is complicated by the fact that the compiler may choose to use a jump table even when you have written a switch if it thinks that will be faster.
The only way to know which you should choose is to performance test your code.
First, I would randomly-pause it a few times, to make certain enough time is spent in this dispatching to even bother optimizing it.
Second, if it is, since each branch spends very few cycles, you want a jump table to get to the desired branch. The reason switch statements exist is to suggest to the compiler that it can generate one if the switch values are compact.
How long is the list of switch values? If it's short, the if-ladder could still be faster, especially if you put the most frequently used codes at the top. An alternative to an if-ladder (that I've never actually seen anyone use) is an if-tree, the code equivalent of a binary tree.
You probably don't want an array of function pointers. Yes, it's an array reference to get the function pointer, but there's several instructions' overhead in calling a function, and it sounds like that could overwhelm the small amount being done inside each function.
In any case, looking at the assembly language, or single-stepping at the instruction level, will give you a good idea how efficient it's being.
A good compiler will compile a switch with cases in a small numerical range as a single conditional to see if the value is in that range (which can sometimes be optimized out) followed by a jumptable jump. This will almost surely be faster than a function call (direct or indirect) because:
A jump is a lot less expensive than a call (which must save call-clobbered registers, adjust the stack, etc.).
The code in the switch statement cases can make use of expression values already cached in registers in the caller.
It's possible that an extremely advanced compiler could determine that the call-via-function pointer only refers to one of a small set of static-linkage functions, and thereby optimize things heavily, maybe even eliminating the calls and replacing them by jumps. But I wouldn't count on it.
I arrived at this post recently since I was wondering the same. I ended up taking the time to try it. It certainly depends greatly on what you're doing, but for my VM it was a decent speed up (15-25%), and allowed me to simplify some code (which is probably where a lot of the speedup came from). As an example (code simplified for clarity), a "for" loop was able to be easily implemented using a for loop:
void OpFor( Frame* frame, Instruction* &code )
{
i32 start = GET_OP_A(code);
i32 stop_value = GET_OP_B(code);
i32 step = GET_OP_C(code);
// instruction count (ie. block size)
u32 i_count = GET_OP_D(code);
// pointer to end of block (NOP if it branches)
Instruction* end = code + i_count;
if( step > 0 )
{
for( u32 i = start; i < stop_value; i += step )
{
// rewind instruction pointer
Instruction* cur = code;
// execute code inside for loop
while(cur != end)
{
cur->func(frame, cur);
++cur;
}
}
}
else
// same with <=
}

C - How to create a pattern in code segment to recognize it in memory dump?

I dump my RAM (a piece of it - code segment only) in order to find where is which C function being placed. I have no map file and I don't know what boot/init routines exactly do.
I load my program into RAM, then if I dump the RAM, it is very hard to find exactly where is what function. I'd like to use different patterns build in the C source, to recognize them in the memory dump.
I've tryed to start every function with different first variable containing name of function, like:
char this_function_name[]="main";
but it doesn't work, because this string will be placed in the data segment.
I have simple 16-bit RISC CPU and an experimental proprietary compiler (no GCC or any well-known). The system has 16Mb of RAM, shared with other applications (bootloader, downloader). It is almost impossible to find say a unique sequence of N NOPs or smth. like 0xABCD. I would like to find all functions in RAM, so I need unique identificators of functions visible in RAM-dump.
What would be the best pattern for code segment?
If it were me, I'd use the symbol table, e.g. "nm a.out | grep main". Get the real address of any function you want.
If you really have no symbol table, make your own.
struct tab {
void *addr;
char name[100]; // For ease of searching, use an array.
} symtab[] = {
{ (void*)main, "main" },
{ (void*)otherfunc, "otherfunc" },
};
Search for the name, and the address will immediately preceed it. Goto address. ;-)
If your compiler has inline asm you can use it to create a pattern. Write some NOP instructions which you can easily recognize by opcodes in memory dump:
MOV r0,r0
MOV r0,r0
MOV r0,r0
MOV r0,r0
How about a completely different approach to your real problem, which is finding a particular block of code: Use diff.
Compile the code once with the function in question included, and once with it commented out. Produce RAM dumps of both. Then, diff the two dumps to see what's changed -- and that will be the new code block. (You may have to do some sort of processing of the dumps to remove memory addresses in order to get a clean diff, but the order of instructions ought to be the same in either case.)
Numeric constants are placed in the code segment, encoded in the function's instructions. So you could try to use magic numbers like 0xDEADBEEF and so on.
I.e. here's the disassembly view of a simple C function with Visual C++:
void foo(void)
{
00411380 push ebp
00411381 mov ebp,esp
00411383 sub esp,0CCh
00411389 push ebx
0041138A push esi
0041138B push edi
0041138C lea edi,[ebp-0CCh]
00411392 mov ecx,33h
00411397 mov eax,0CCCCCCCCh
0041139C rep stos dword ptr es:[edi]
unsigned id = 0xDEADBEEF;
0041139E mov dword ptr [id],0DEADBEEFh
You can see the 0xDEADBEEF making it into the function's source. Note that what you actually see in the executable depends on the endianness of the CPU (tx. Richard).
This is a x86 example. But RISC CPUs (MIPS, etc) have instructions moving immediates into registers - these immediates can have special recognizable values as well (although only 16-bit for MIPS, IIRC).
Psihodelia - it's getting harder and harder to catch your intention. Is it just a single function you want to find? Then can't you just place 5 NOPs one after another and look for them? Do you control the compiler/assembler/linker/loader? What tools are at your disposal?
As you noted, this:
char this_function_name[]="main";
... will end up setting a pointer in your stack to a data segment containing the string. However, this:
char this_function_name[]= { 'm', 'a', 'i', 'n' };
... will likely put all these bytes in your stack so you will be able to recognize the string in your code (I just tried it on my platform).
Hope this helps
Why not get each function to dump its own address. Something like this:
void* fnaddr( char* fname, void* addr )
{
printf( "%s\t0x%p\n", fname, addr ) ;
return addr ;
}
void test( void )
{
static void* fnaddr_dummy = fnaddr( __FUNCTION__, test ) ;
}
int main (int argc, const char * argv[])
{
static void* fnaddr_dummy = fnaddr( __FUNCTION__, main ) ;
test() ;
test() ;
}
By making fnaddr_dummy static, the dump is done once per-function. Obviously you would need to adapt fnaddr() to support whatever output or logging means you have on your system. Unfortunately, if the system performs lazy initialisation, you'll only get the addresses of the functions that are actually called (which may be good enough).
You could start each function with a call to the same dummy function like:
void identifyFunction( unsigned int identifier)
{
}
Each of your functions would call the identifyFunction-function with a different parameter (1, 2, 3, ...). This will not give you a magic mapfile, but when you inspect the code dump you should be able to quickly find out where the identifyFunction is because there will be lots of jumps to that address. Next scan for those jump and check before the jump to see what parameter is passed. Then you can make your own mapfile. With some scripting this should be fairly automatic.

Resources