Multithreading with inline assembly and access to a c variable - c

I'm using inline assembly to construct a set of passwords, which I will use to brute force against a given hash. I used this website as a reference for the construction of the passwords.
This is working flawlessly in a singlethreaded environment. It produces an infinite amount of incrementing passwords.
As I have only basic knowledge of asm, I understand the idea. The gcc uses ATT, so I compile with -masm=intel
During the attempt to multithread the program, I realize that this approach might not work.
The following code uses 2 global C variables, and I assume that this might be the problem.
__asm__("pushad\n\t"
"mov edi, offset plaintext\n\t" <---- global variable
"mov ebx, offset charsetTable\n\t" <---- again
"L1: movzx eax, byte ptr [edi]\n\t"
" movzx eax, byte ptr [charsetTable+eax]\n\t"
" cmp al, 0\n\t"
" je L2\n\t"
" mov [edi],al\n\t"
" jmp L3\n\t"
"L2: xlat\n\t"
" mov [edi],al\n\t"
" inc edi\n\t"
" jmp L1\n\t"
"L3: popad\n\t");
It produces a non deterministic result in the plaintext variable.
How can i create a workaround, that every thread accesses his own plaintext variable? (If this is the problem...).
I tried modifying this code, to use extended assembly, but I failed every time. Probably due to the fact that all tutorials use ATT syntax.
I would really appreciate any help, as I'm stuck for several hours now :(
Edit: Running the program with 2 threads, and printing the content of plaintext right after the asm instruction, produces:
b
b
d
d
f
f
...
Edit2:
pthread_create(&thread[i], NULL, crack, (void *) &args[i]))
[...]
void *crack(void *arg) {
struct threadArgs *param = arg;
struct crypt_data crypt; // storage for reentrant version of crypt(3)
char *tmpHash = NULL;
size_t len = strlen(param->methodAndSalt);
size_t cipherlen = strlen(param->cipher);
crypt.initialized = 0;
for(int i = 0; i <= LIMIT; i++) {
// intel syntax
__asm__ ("pushad\n\t"
//mov edi, offset %0\n\t"
"mov edi, offset plaintext\n\t"
"mov ebx, offset charsetTable\n\t"
"L1: movzx eax, byte ptr [edi]\n\t"
" movzx eax, byte ptr [charsetTable+eax]\n\t"
" cmp al, 0\n\t"
" je L2\n\t"
" mov [edi],al\n\t"
" jmp L3\n\t"
"L2: xlat\n\t"
" mov [edi],al\n\t"
" inc edi\n\t"
" jmp L1\n\t"
"L3: popad\n\t");
tmpHash = crypt_r(plaintext, param->methodAndSalt, &crypt);
if(0 == memcmp(tmpHash+len, param->cipher, cipherlen)) {
printf("success: %s\n", plaintext);
break;
}
}
return 0;
}

Since you're already using pthreads, another option is making the variables that are modified by several threads into per-thread variables (threadspecific data). See pthread_getspecific OpenGroup manpage. The way this works is like:
In the main thread (before you create other threads), do:
static pthread_key_y tsd_key;
(void)pthread_key_create(&tsd_key); /* unlikely to fail; handle if you want */
and then within each thread, where you use the plaintext / charsetTable variables (or more such), do:
struct { char *plainText, char *charsetTable } *str =
pthread_getspecific(tsd_key);
if (str == NULL) {
str = malloc(2 * sizeof(char *));
str.plainText = malloc(size_of_plaintext);
str.charsetTable = malloc(size_of_charsetTable);
initialize(str.plainText); /* put the data for this thread in */
initialize(str.charsetTable); /* ditto */
pthread_setspecific(tsd_key, str);
}
char *plaintext = str.plainText;
char *charsetTable = str.charsetTable;
Or create / use several keys, one per such variable; in that case, you don't get the str container / double indirection / additional malloc.
Intel assembly syntax with gcc inline asm is, hm, not great; in particular, specifying input/output operands is not easy. I think to get that to use the pthread_getspecific mechanism, you'd change your code to do:
__asm__("pushad\n\t"
"push tsd_key\n\t" <---- threadspecific data key (arg to call)
"call pthread_getspecific\n\t" <---- gets "str" as per above
"add esp, 4\n\t" <---- get rid of the func argument
"mov edi, [eax]\n\t" <---- first ptr == "plainText"
"mov ebx, [eax + 4]\n\t" <---- 2nd ptr == "charsetTable"
...
That way, it becomes lock-free, at the expense of using more memory (one plaintext / charsetTable per thread), and the expense of an additional function call (to pthread_getspecific()). Also, if you do the above, make sure you free() each thread's specific data via pthread_atexit(), or else you'll leak.
If your function is fast to execute, then a lock is a much simpler solution because you don't need all the setup / cleanup overhead of threadspecific data; if the function is either slow or very frequently called, the lock would become a bottleneck though - in that case the memory / access overhead for TSD is justified. Your mileage may vary.

Protect this function with mutex outside of inline Assembly block.

Related

How are oversized struct returned on the stack?

It is said that returning an oversized struct by value (as opposed to returning a pointer to the struct) from a function incurs unnecessary copy on the stack. By "oversized", I mean a struct that cannot fit in the return registers.
However, to quote Wikipedia
When an oversized struct return is needed, another pointer to a caller-provided space is prepended as the first argument, shifting all other arguments to the right by one place.
and
When returning struct/class, the calling code allocates space and passes a pointer to this space via a hidden parameter on the stack. The called function writes the return value to this address.
It appears that at least on x86 architectures, the struct in question is directly written by the callee to the memory appointed by the caller, so why would there be a copy then? Does returning oversized structs really incur copy on the stack?
If the function inlines, the copying through the return-value object can be fully optimized away. Otherwise, maybe not, and arg copying definitely can't be.
It appears that at least on x86 architectures, the struct in question is directly written by the callee to the memory appointed by the caller, so why would there be a copy then? Does returning oversized structs really incur copy on the stack?
It depends what the caller does with the return value,; if it's assigned to a provably private object (escape analysis), that object can be the return-value object, passed as the hidden pointer.
But if the caller actually wants to assign the return value to other memory, then it does need a temporary.
struct large retval = some_func(); // no extra copying at all
*p = some_func() // caller will make space for a local return-value object & copy.
(Unless the compiler knows that p is just pointing to a local struct large tmp;, and escape analysis can prove that there's no way some global variable could have a pointer to that same tmp var.)
long version, same thing with more details:
In the C abstract machine, there's a "return value object", and return foo copies the named variable foo to that object, even if it's a large struct. Or return (struct lg){1,2}; copies an anonymous struct. The return-value object itself is anonymous; nothing can take its address. (You can't int *p = &foo(123);). This makes it easier to optimize away.
In the caller, that anonymous return-value object can be assigned to whatever you want, which would be another copy if compilers didn't optimize anything. (All of this applies for any type, even int). Of course, compilers that aren't total garbage will avoid some, ideally all, of that copying, when doing so can't possibly change the observable results. And that depends on the design of the calling convention. As you say, most conventions, including all the mainstream x86 and x86-64 conventions, pass a "hidden pointer" arg for return values they choose not to return in register(s) for whatever reason (size, C++ having a non-trivial constructor).
struct large retval = foo(...);
For such calling conventions, the above code is effectively transformed to
struct large retval;
foo(&retval, ...);
So it's C return-value object actually is a local in the stack-frame of its caller. foo() is allowed to store into that return-value object whenever it wants during execution, including before reading some other objects. This allows optimization within the callee (foo) as well, so a struct large tmp = ... / return tmp can be optimized away to just store into the return-value object.
So there's zero extra copying when the caller does just want to assign the function return value to a newly declared local var. (Or to a local var which it can prove is still private, via escape analysis. i.e. not pointed-to by any global vars).
But what if the caller wants to store the return value somewhere else?
void caller2(struct large *lgp) {
*lgp = foo();
}
Can *lgp be the return-value object, or do we need to introduce a local temporary?
void caller2(struct large *lgp) {
// foo_asm(lgp); // nope, possibly unsafe
struct large retval; foo(&retval); *lgp = retval; // safe
}
If you want functions to be able to write large structs to arbitrary locations, you have to "sign off" on it by making that effect visible in your source.
What prevents the usage of a function argument as hidden pointer? for more details about why *lgp can't be the return-value object / hidden pointer, and another example. "A function is allowed to assume its return-value object (pointed-to by a hidden pointer) is not the same object as anything else". Also details of whether struct large *restrict lgp would make it safe: probably yes if the function doesn't longjmp (otherwise stores to the supposedly anonymous retval object might end up as visible side effects without return having been reached), but GCC doesn't look for that optimization.
Why is tailcall optimization not performed for types of class MEMORY? - return bar() where bar returns the same struct should be possible as an optimized tailcall, causing extra copying. This can even introduce extra copying of the whole struct, as well as failing to optimize call bar / ret into jmp bar.
how c compiler treats a struct return value from a function, in ASM - thresholds for returning in registers. e.g. i386 System V always returns structs in memory, even struct {int x;};.
Is it possible within a function to get the memory address of the variable initialized by the return value?
C/C++ returning struct by value under the hood an actual example (but unfortunately using debug-mode compiler-generated asm, so it contains copying that isn't necessary).
How do objects work in x86 at the assembly level? example at the bottom of how x86-64 System V packs the bytes of a struct into RDX:RAX, or just RAX if less than 8 bytes.
An example showing early stores to the return-value object (instead of copying)
(all source + asm on the Godbolt compiler explorer)
// more or less extra size will get compilers to copy it around with SSE2 or not
struct large { int first, second; char pad[0];};
int *global_ptr;
extern int a;
NOINLINE // __attribute__((noinline))
struct large foo() {
struct large tmp = {1,2};
if (a)
tmp.second = *global_ptr;
return tmp;
}
(targeting GNU/Linux) clang -m32 -O3 -mregparm=1 creates an implementation that writes its return-value object before it's done reading everything else, exactly the case that would make it unsafe for the caller to pass a pointer to some globally-reachable memory.
The asm makes it clear that tmp is fully optimized away, or is the retval object.
# clang -O3 -m32 -mregparm=1
foo:
mov dword ptr [eax + 4], 2
mov dword ptr [eax], 1 # store tmp into the retval object
cmp dword ptr [a], 0
je .LBB0_2 # if (a == 0) goto ret
mov ecx, dword ptr [global_ptr] # load the global
mov ecx, dword ptr [ecx] # deref it
mov dword ptr [eax + 4], ecx # and store to the retval object
.LBB0_2:
ret
(-mregparm=1 means pass the first arg in EAX, less noisy and easier to quickly visually distinguish from stack space than passing on the stack. Fun fact: i386 Linux compiles the kernel with -mregparm=3. But fun fact #2: if a hidden pointer is passed on the stack (i.e. no regparm), that arg is callee pops, unlike the rest. The function will use ret 4 to do ESP+=4 after popping the return address into EIP.)
In a simple caller, the compiler just reserves some stack space, passes a pointer to it, and then can load member variables from that space.
int caller() {
struct large lg = {4, 5}; // initializer is dead, foo can't read its retval object
lg = foo();
return lg.second;
}
caller:
sub esp, 12
mov eax, esp
call foo
mov eax, dword ptr [esp + 4]
add esp, 12
ret
But with a less trivial caller:
int caller() {
struct large lg = {4, 5};
global_ptr = &lg.first;
// unknown(&lg); // or this: as a side effect, might set global_ptr = &tmp->first;
lg = foo(); // (except by inlining) the compiler can't know if foo() looks at global_ptr
return lg.second;
}
caller:
sub esp, 28 # reserve space for 2 structs, and alignment
mov dword ptr [esp + 12], 5
mov dword ptr [esp + 8], 4 # materialize lg
lea eax, [esp + 8]
mov dword ptr [global_ptr], eax # point global_ptr at it
lea eax, [esp + 16] # hidden first arg *not* pointing to lg
call foo
mov eax, dword ptr [esp + 20] # reload from the retval object
add esp, 28
ret
Extra copying with *lgp = foo();
int caller2(struct large *lgp) {
global_ptr = &lgp->first;
*lgp = foo();
return lgp->second;
}
# with GCC11.1 this time, SSE2 8-byte copying unlike clang
caller2: # incoming arg: struct large *lgp in EAX
push ebx #
mov ebx, eax # lgp, tmp89 # lgp needed after foo returns
sub esp, 24 # reserve space for a retval object (and waste 16 bytes)
mov DWORD PTR global_ptr, eax # global_ptr, lgp
lea eax, [esp+8] # hidden pointer to the retval object
call foo #
movq xmm0, QWORD PTR [esp+8] # 8-byte copy of both halves
movq QWORD PTR [ebx], xmm0 # *lgp_2(D), tmp86
mov eax, DWORD PTR [ebx+4] # lgp_2(D)->second, lgp_2(D)->second # reload int return value
add esp, 24
pop ebx
ret
The copy to *lgp needs to happen, but it's somewhat of a missed optimization to reload from there, instead of from [esp+12]. (Saves a byte of code size at the cost of more latency.)
Clang does the copy with two 4-byte integer register mov loads/stores, but one of them is into EAX so it already has the return value ready.
You might also want to look at the result of assigning to memory freshly allocated with malloc. Compilers know that nothing else can (legally) be pointing to the newly allocated memory: that would be use-after-free undefined behaviour. So they may allow passing on a pointer from malloc as the return-value object if it hasn't been passed to anything else yet.
Related fun fact: passing large structs by value always requires a copy (if the function doesn't inline). But as discussed in comments, the details depend on the calling convention. Windows differs from i386 / x86-64 System V calling conventions (all non-Windows OSes) on this:
SysV calling conventions copy the whole struct to the stack. (if they're too large to fit in a pair of registers for x86-64)
Windows x64 makes a copy and passes (like a normal arg) a pointer to that copy. The callee "owns" the arg and can modify it, so a tmp copy is still needed. (And no, const struct large foo has no effect.)
https://godbolt.org/z/ThMrE9rqT shows x86-64 GCC targeting Linux vs. x64 MSVC targeting Windows.
This really depends on your compiler, but in general the way this works is that the caller allocates the memory for the struct return value, but the callee also allocates stack space for any intermediate value of that structure. This intermediate allocation is used when the function is running, and then the struct is copied onto the caller's memory when the function returns.
For reference as to why your solution won't always work, consider a program which has two of the same struct and returns one based on some condition:
large_t returntype(int condition) {
large_t var1 = {5};
large_t var2 = {6};
// More intermediate code here
if(condition) return var1;
else return var2;
}
In this case, both may be required by the intermediate code, but the return value is not known at compile time, so the compiler doesn't know which to initialize on the caller's stack space. It's easier to just keep it local and copy on return.
EDIT: Your solution may be the case in simple functions, but it really depends on the optimizations performed by each individual compiler. If you're really interested in this, check out https://godbolt.org/

C99: compiler optimizations when accessing global variables and aliased memory pointers

I'm writing C code for an embedded system. In this system, there are memory mapped registers at some fixed address in the memory map, and of course some RAM where my data segment / heap is.
I'm finding problems generating optimal code when my code is intermixing accesses to global variables in the data segment and accesses to hardware registers. This is a simplified snippet:
#include <stdint.h>
uint32_t * const restrict HWREGS = 0x20000;
struct {
uint32_t a, b;
} Context;
void example(void) {
Context.a = 123;
HWREGS[0x1234] = 5;
Context.b = Context.a;
}
This is the code generated on x86 (see also on godbolt):
example:
mov DWORD PTR Context[rip], 123
mov DWORD PTR ds:149712, 5
mov eax, DWORD PTR Context[rip]
mov DWORD PTR Context[rip+4], eax
ret
As you can see, after having written the hardware register, Context.a is reloaded from RAM before being stored into Context.b. This doesn't make sense because Context is at a different memory address than HWREGS. In other words, the memory pointed by HWREGS and the memory pointed by &Context do not alias, but it looks like there is not way to tell that to the compiler.
If I change HWREGS definition as this:
extern uint32_t * const restrict HWREGS;
that is, I hide the fixed memory address to the compiler, I get this:
example:
mov rax, QWORD PTR HWREGS[rip]
mov DWORD PTR [rax+18640], 5
movabs rax, 528280977531
mov QWORD PTR Context[rip], rax
ret
Context:
.zero 8
Now the two writes to Context are optimized (even coalesced to a single write), but on the other hand the access to the hardware register does not happen anymore with a direct memory access but it goes through a pointer indirection.
Is there a way to obtain optimal code here? I would like GCC to know that HWREGS is at a fixed memory address and at the same time to tell it that it does not alias Context.
If you want to avoid compilers reloading regularly values from a memory region (possibly due to aliasing), then the best is not to use global variables, or at least not to use direct accesses to global variables. The register keyword seems ignored for global variables (especially here on HWREGS) for both GCC and Clang. Using the restrict keyword on function parameters solves this problem:
#include <stdint.h>
uint32_t * const HWREGS = 0x20000;
struct Context {
uint32_t a, b;
} context;
static inline void exampleWithLocals(uint32_t* restrict localRegs, struct Context* restrict localContext) {
localContext->a = 123;
localRegs[0x1234] = 5;
localContext->b = localContext->a;
}
void example() {
exampleWithLocals(HWREGS, &context);
}
Here is the result (see also on godbolt):
example:
movabs rax, 528280977531
mov DWORD PTR ds:149712, 5
mov QWORD PTR context[rip], rax
ret
context:
.zero 8
Please note that the strict aliasing rule do not help in this case since the type of read/written variables/fields is always uint32_t.
Besides this, based on its name, the variable HWREGS looks like a hardware register. Please note that it should be put volatile so that compiler do not keep it to registers nor perform any similar optimization (like assuming the pointed value is left unchanged if the code do not change it).

Why does this sentence eliminate the "unreferenced function argument warning"?

in SDCC Compiler User Guide I read the following:
void to_buffer( unsigned char c )
{
c; // to avoid warning: unreferenced function argument
__asm
; save used registers here.
; If we were still using r2,r3 we would have to push them here.
; if( head != (unsigned char)(tail-1) )
mov a,_tail
dec a
xrl a,_head
; we could do an ANL a,#0x0f here to use a smaller buffer (see below)
jz t_b_end$
;
; buf[ head++ ] = c;
mov a,dpl ; dpl holds lower byte of function argument
mov dpl,_head ; buf is 0x100 byte aligned so head can be used directly
mov dph,#(_buf>>8)
movx #dptr,a
inc _head
; we could do an ANL _head,#0x0f here to use a smaller buffer (see above)
t_b_end$:
; restore used registers here
__endasm;
}
I don't understand what does the sentence "c; // to avoid warning: unreferenced function argument" mean, is it a special usage of SDCC? Or a special usage of C language?
Compilers tend to warn you if you have an incoming function parameter that you don't use. In your case, c; is a void operation to access the variable and to avoid the warning. It is similar to
int func(char c)
{
(void)c;
//c is never used in the function
}
FWIW, in gcc, to enable the warning, -Wunused-parameter option is used. (enabled in -Wextra)

Which loop has better performance? Increment or decrement? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Is it faster to count down than it is to count up?
which loop has better performance? I have learnt from some where that second is better. But want to know reason why.
for(int i=0;i<=10;i++)
{
/*This is better ?*/
}
for(int i=10;i>=0;i--)
{
/*This is better ?*/
}
The second "may" be better, because it's easier to compare i with 0 than to compare i with 10 but I think you can use any one of these, because compiler will optimize them.
I do not think there is much difference between the performance of both loops.
I suppose, it becomes a different situation when the loops look like this.
for(int i = 0; i < getMaximum(); i++)
{
}
for(int i = getMaximum() - 1; i >= 0; i--)
{
}
As the getMaximum() function is called once or multiple times (assuming it is not an inline function)
Decrement loops down to zero can sometimes be faster if testing against zero is optimised in hardware. But it's a micro-optimisation, and you should profile to see whether it's really worth doing. The compiler will often make the optimisation for you, and given that the decrement loop is arguably a worse expression of intent, you're often better off just sticking with the 'normal' approach.
Incrementing and decrementing (INC and DEC, when translated into assembler commands) have the same speed of 1 CPU cycle.
However, the second can be theoretically faster on some (e.g. SPARC) architectures because no 10 has to be fetched from memory (or cache): most architectures have instructions that deal in an optimized fashion when compating with the special value 0 (usually having a special hardwired 0-register to use as operand, so no register has to be "wasted" to store the 10 for each iteration's comparison).
A smart compiler (especially if target instruction set is RISC) will itself detect this and (if your counter variable is not used in the loop) apply the second "decrement downto 0" form.
Please see answers https://stackoverflow.com/a/2823164/1018783 and https://stackoverflow.com/a/2823095/1018783 for further details.
The compiler should optimize both code to the same assembly, so it doesn't make a difference. Both take the same time.
A more valid discussion would be whether
for(int i=0;i<10;++i) //preincrement
{
}
would be faster than
for(int i=0;i<10;i++) //postincrement
{
}
Because, theoretically, post-increment does an extra operation (returns a reference to the old value). However, even this should be optimized to the same assembly.
Without optimizations, the code would look like this:
for ( int i = 0; i < 10 ; i++ )
0041165E mov dword ptr [i],0
00411665 jmp wmain+30h (411670h)
00411667 mov eax,dword ptr [i]
0041166A add eax,1
0041166D mov dword ptr [i],eax
00411670 cmp dword ptr [i],0Ah
00411674 jge wmain+68h (4116A8h)
for ( int i = 0; i < 10 ; ++i )
004116A8 mov dword ptr [i],0
004116AF jmp wmain+7Ah (4116BAh)
004116B1 mov eax,dword ptr [i]
004116B4 add eax,1
004116B7 mov dword ptr [i],eax
004116BA cmp dword ptr [i],0Ah
004116BE jge wmain+0B2h (4116F2h)
for ( int i = 9; i >= 0 ; i-- )
004116F2 mov dword ptr [i],9
004116F9 jmp wmain+0C4h (411704h)
004116FB mov eax,dword ptr [i]
004116FE sub eax,1
00411701 mov dword ptr [i],eax
00411704 cmp dword ptr [i],0
00411708 jl wmain+0FCh (41173Ch)
so even in this case, the speed is the same.
Again, the answer to all micro-performance questions is measure, measure in context of use and don't extrapolate to other contexts.
Counting instruction execution time hasn't been possible without extraordinary sophistication for quite a long time.
The mismatch between processors and memory speed and the introduction of cache to hide part of the latency (but not the bandwidth) makes the execution of a group of instructions very sensitive to memory access pattern. That is something you still can optimize for with a quite high level thinking. But it also means that something apparently worse if one doesn't take the memory access pattern into account can be better once that is done.
Then superscalar (the fact that the processor can do several things at once) and out of order execution (the fact that processor can execute an instruction before a previous one in the flow) makes basic counting meaningless even if you ignore memory access. You have to know which instructions need to be executed (so ignoring part of the structure isn't wise) and how the processor can group instructions if you want to get good a priori estimate.

C - Inline asm patching at runtime

I am writing a program in C and i use inline asm. In the inline assembler code is have some addresses where i want to patch them at runtime.
A quick sample of the code is this:
void __declspec(naked) inline(void)
{
mov eax, 0xAABBCCDD
call 0xAABBCCDD
}
An say i want to modify the 0xAABBCCDD value from the main C program.
What i tried to do is to Call VirtualProtect an is the pointer of the function in order to make it Writeable, and then call memcpy to add the appropriate values to the code.
DWORD old;
VirtualProtect(inline, len, PAGE_EXECUTE_READWRITE, &old);
However VirtualProtect fails and GetLastError() returns 487 which means accessing invalid address. Anyone have a clue about this problem??
Thanks
Doesn't this work?
int X = 0xAABBCCDD;
void __declspec(naked) inline(void)
{
mov eax, [X]
call [X]
}
How to do it to another process at runtime,
Create a variable that holds the program base address
Get the target RVA (Relative Virtual Address)
Then calculate the real address like this PA=RVA + BASE
then call it from your inline assembly
You can get the base address like this
DWORD dwGetModuleBaseAddress(DWORD dwProcessID)
{
TCHAR zFileName[MAX_PATH];
ZeroMemory(zFileName, MAX_PATH);
HANDLE hProcess = OpenProcess(PROCESS_ALL_ACCESS, true, dwProcessID);
HANDLE hSnapshot = CreateToolhelp32Snapshot(TH32CS_SNAPMODULE, dwProcessID);
DWORD dwModuleBaseAddress = 0;
if (hSnapshot != INVALID_HANDLE_VALUE)
{
MODULEENTRY32 ModuleEntry32 = { 0 };
ModuleEntry32.dwSize = sizeof(MODULEENTRY32);
if (Module32First(hSnapshot, &ModuleEntry32))
{
do
{
if (wcscmp(ModuleEntry32.szModule, L"example.exe") == 0)
{
dwModuleBaseAddress = (DWORD_PTR)ModuleEntry32.modBaseAddr;
break;
}
} while (Module32Next(hSnapshot, &ModuleEntry32));
}
CloseHandle(hSnapshot);
CloseHandle(hProcess);
}
return dwModuleBaseAddress;
}
Assuming you have a local variable and your base address
mov dword ptr ss : [ebp - 0x14] , eax;
mov eax, dword ptr BaseAddress;
add eax, PA;
call eax;
mov eax, dword ptr ss : [ebp - 0x14] ;
You have to restore the value of your Register after the call returns, since this value may be used somewhere down the code execution, assuming you're trying to patch an existing application that may depend on the eax register after your call. Although this method has it disadvantages, but at least it will give anyone idea on what to do.

Resources