How do I properly implement threads in a Windows kernel driver? - c

I am trying to learn how to write Windows kernel drivers.
In my driver I have 2 threads, which are created at some point with PsCreateSystemThread.
I have a global variable called Kill which signals the threads to terminate, like this:
VOID AThread(IN PVOID Context)
{
for (;;)
{
if (Kill == TRUE)
break;
KmWriteProcessMemory(rProcess, &NewValue, dwAAddr, sizeof(NewValue));
}
PsTerminateSystemThread(STATUS_SUCCESS);
}
In my unload function I am setting Kill = TRUE:
VOID f_DriverUnload(PDRIVER_OBJECT pDriverObject)
{
Kill = TRUE;
IoDeleteSymbolicLink(&SymLinkName);
IoDeleteDevice(pDeviceObject);
DbgPrint("Driver Unloaded successfully..\r\n");
}
Most of the time there's no problem, but sometimes the machine will crash when I try to unload the driver. It happens more frequently when I have some kind of sleep function being used in the threads, so I'm assuming it's crashing because the threads have not yet terminated before the driver tries to unload.
I'm not too sure how to use synchronisation and such, and there's not a lot of clear information out there that I can find. So how do I properly implement threads and ensure they're terminated before the driver is unloaded?

Once the thread is created, you have a HANDLE threadHandle as the result. Convert this handle to a PETHREAD ThreadObject:
ObReferenceObjectByHandle(threadHandle,
THREAD_ALL_ACCESS,
NULL,
KernelMode,
&ThreadObject,
NULL );
and close threadHandle:
ZwClose(threadHandle);
When you want to stop the thread, set the flag and wait for thread completion:
Kill = TRUE;
KeWaitForSingleObject(ThreadObject,
Executive,
KernelMode,
FALSE,
NULL );
ObDereferenceObject(ThreadObject);
After that, the f_DriverUnload function may return.
You can see all this stuff here: https://github.com/Microsoft/Windows-driver-samples/tree/master/general/cancel/sys
See the cancel.h and cancel.c files. Additionally, this code uses a semaphore instead of a global flag to stop the thread.
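The ordering above (set the flag, wait for the thread, only then tear things down) can be tried outside the kernel. This is a minimal user-mode sketch, not driver code: it assumes POSIX threads as stand-ins, with pthread_create in place of PsCreateSystemThread and pthread_join in place of KeWaitForSingleObject on the thread object. The names kill_flag, worker, start_worker and stop_worker are hypothetical, and the flag is an atomic rather than a plain global so that the worker is guaranteed to observe the store.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool kill_flag;      /* plays the role of the global Kill */
static atomic_long iterations;     /* stand-in for the thread's real work */
static pthread_t worker_thread;

static void *worker(void *context)
{
    (void)context;
    while (!atomic_load(&kill_flag))
        atomic_fetch_add(&iterations, 1);
    return NULL;                   /* returning from the routine ends the thread */
}

/* Analogous to creating the thread and keeping a reference to it. */
int start_worker(void)
{
    atomic_store(&kill_flag, false);
    return pthread_create(&worker_thread, NULL, worker, NULL);
}

/* Analogous to the unload path: signal first, then block until the thread is gone. */
long stop_worker(void)
{
    atomic_store(&kill_flag, true);
    int rc = pthread_join(worker_thread, NULL);
    assert(rc == 0);
    (void)rc;
    /* Only past this point is it safe to release anything the thread used. */
    return atomic_load(&iterations);
}
```

Calling start_worker() and later stop_worker() mirrors thread creation and the unload routine; once stop_worker() has returned, the counter no longer changes, which is the property the driver needs before it frees its device and code.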

When you create a thread that runs code in your driver, the driver must of course not be unloaded until that thread exits. To ensure this, call ObfReferenceObject on your driver object before creating the thread. If thread creation fails, call ObfDereferenceObject. When the thread exits, ObfDereferenceObject must be called again. But here is the problem: how, and from where, do you call it? Calling ObfDereferenceObject at the end of the thread routine makes no sense: the driver can be unloaded inside ObfDereferenceObject, and the call would then return to memory that no longer exists. Ideally, external code (Windows itself) would make this call just after the thread routine returns.
Look at IoAllocateWorkItem for a good example. A work item is like a thread, and the driver must not be unloaded until the WorkerRoutine returns. Here the system takes care of this, which is why we pass a DeviceObject to IoAllocateWorkItem: "Pointer to the caller's driver object or to one of the caller's device objects." The system references this object (device or driver) when we call IoQueueWorkItem, which guarantees that the driver will not be unloaded while the WorkerRoutine executes. When it returns, Windows calls ObfDereferenceObject on the passed device or driver object. This is safe because control returns to system kernel code (not to the driver) afterwards. Unfortunately, PsCreateSystemThread does not take a pointer to a driver object and does not implement this functionality.
Another good example is FreeLibraryAndExitThread. A driver is in fact a kernel-mode DLL, which can be loaded and unloaded, and FreeLibraryAndExitThread implements exactly the functionality we need, but for user-mode DLLs only. Again, there is no such API in kernel mode.
A solution is possible anyway: jump (not call) to ObfDereferenceObject yourself at the end of thread execution. This requires assembler code; the trick cannot be done in C/C++.
First, declare a pointer to the driver object in a global variable; we initialize it to a valid value in the driver entry point.
extern "C" PVOID g_DriverObject;
Then some macros for getting the mangled C++ names, which are needed in the asm file:
#if 0
#define __ASM_FUNCTION __pragma(message(__FUNCDNAME__" proc\r\n" __FUNCDNAME__ " endp"))
#define _ASM_FUNCTION {__ASM_FUNCTION;}
#define ASM_FUNCTION {__ASM_FUNCTION;return 0;}
#define CPP_FUNCTION __pragma(message("extern " __FUNCDNAME__ " : PROC ; " __FUNCSIG__))
#else
#define _ASM_FUNCTION
#define ASM_FUNCTION
#define CPP_FUNCTION
#endif
In C++ we declare 2 functions for the thread:
VOID _AThread(IN PVOID Context)_ASM_FUNCTION;
VOID __fastcall AThread(IN PVOID Context)
{
CPP_FUNCTION;
// some code here
// but not call PsTerminateSystemThread !!
}
(Don't forget __fastcall on AThread; it is needed for x86.)
Now we create the thread with the following code:
ObfReferenceObject(g_DriverObject);
HANDLE hThread;
if (0 > PsCreateSystemThread(&hThread, 0, 0, 0, 0, _AThread, ctx))
{
ObfDereferenceObject(g_DriverObject);
}
else
{
NtClose(hThread);
}
So you set the thread entry point to _AThread, which is implemented in the asm file. At the beginning you call ObfReferenceObject(g_DriverObject);. _AThread calls your actual thread implementation, AThread, in C++. When that finishes it returns to _AThread (which is why you must not call PsTerminateSystemThread; calling this API is optional anyway, because it is called automatically when the thread routine returns control to the system). At the end, _AThread dereferences g_DriverObject and returns to the system.
So the main trick is in the asm files. Here are 2 asm files, for x86 and x64:
x86:
.686p
extern _g_DriverObject:DWORD
extern __imp_@ObfDereferenceObject@4:DWORD
extern ?AThread@@YIXPAX@Z : PROC ; void __fastcall AThread(void *)
_TEXT segment
?_AThread@@YGXPAX@Z proc
pop ecx
xchg ecx,[esp]
call ?AThread@@YIXPAX@Z
mov ecx,_g_DriverObject
jmp __imp_@ObfDereferenceObject@4
?_AThread@@YGXPAX@Z endp
_TEXT ends
END
x64:
extern g_DriverObject:QWORD
extern __imp_ObfDereferenceObject:QWORD
extern ?AThread@@YAXPEAX@Z : PROC ; void __cdecl AThread(void *)
_TEXT segment 'CODE'
?_AThread@@YAXPEAX@Z proc
sub rsp,28h
call ?AThread@@YAXPEAX@Z
add rsp,28h
mov rcx,g_DriverObject
jmp __imp_ObfDereferenceObject
?_AThread@@YAXPEAX@Z endp
_TEXT ENDS
END

Related

Lazy-init an array with multi-threaded readers: is it safe without barriers or atomics?

I've been having an implementation discussion where the idea that a CPU can choose to completely reorder the storing of memory has come up.
I was initializing a static array in C using code similar to:
static int array[10];
static int array_initialized = 0;
void initialize () {
array[0] = 1;
array[1] = 2;
...
array_initialized = -1;
}
and it is used later similar to:
int get_index(int index) {
if (!array_initialized) initialize();
if (index < 0 || index > 9) return -1;
return array[index];
}
Is it possible for the CPU to reorder memory access on a multi-core Intel architecture (or another architecture) such that it sets array_initialized before the initialize function has finished setting the array elements? Or such that another thread of execution can see array_initialized as non-zero before the entire array has been initialized in its view of memory?
TL:DR: to make lazy-init safe if you don't do it before starting multiple threads, you need an _Atomic flag.
is it possible for the CPU to reorder memory access in a multi-core Intel (x86) architecture
No, such reordering is possible at compile time only. x86 asm effectively has acquire/release semantics for normal loads/stores. (seq_cst + a store buffer with store forwarding).
https://preshing.com/20120625/memory-ordering-at-compile-time/
(or other architecture)
Yes, most other ISAs have a weaker asm memory model that does allow StoreStore reordering and LoadLoad reordering. (Effectively memory_order_relaxed, or sort of like memory_order_consume on ISAs other than Alpha AXP, but compilers don't try to maintain data dependencies.)
None of this really matters from C because the C memory model is very weak, allowing compile-time reordering and simultaneous read/write or write+write of any object is data-race UB.
Data Race UB is what lets a compiler keep static variables in registers for the life of a function / inside a loop when compiling for "normal" ISAs.
Having 2 threads run this function is C data-race UB if array_initialized isn't already set before either of them runs (e.g. by having the main thread run it once before starting any more threads). In that case you can remove the array_initialized flag entirely, unless you have a use for the lazy-init feature before starting any more threads.
It's 100% safe for a single thread, regardless of how many other threads are running: the C programming model guarantees that a single thread always sees its own operations in program order. (Just like asm for all normal ISAs; other than explicit parallelism in ISAs like Itanium, you always see your own operations in order. It's only other threads seeing your operations where things get weird).
Starting a new thread is (I think) always a "full barrier", or in C terms "synchronizes with" the new thread. Stuff in the new thread can't happen before anything in the parent thread. So just calling get_index once from the main thread makes it safe with no further barriers for other threads to run get_index after that.
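To make that concrete, here is a minimal sketch of the "init once from the main thread, then start readers" approach, assuming POSIX threads. The array contents (1..10), the reader function and run_readers are assumptions for illustration; the flag stays a plain int as in the question, which is only acceptable because no thread exists yet when it is written.

```c
#include <assert.h>
#include <pthread.h>

static int array[10];
static int array_initialized = 0;      /* plain int, as in the question */

static void initialize(void)
{
    for (int i = 0; i < 10; i++)
        array[i] = i + 1;              /* assumed contents: 1..10 */
    array_initialized = -1;
}

int get_index(int index)
{
    if (!array_initialized) initialize();
    if (index < 0 || index > 9) return -1;
    return array[index];
}

static void *reader(void *arg)
{
    int *sum = arg;
    for (int i = 0; i < 10; i++)
        *sum += get_index(i);
    return NULL;
}

/* Returns 1 if every reader thread saw the fully initialized array. */
int run_readers(void)
{
    enum { NTHREADS = 4 };
    pthread_t tid[NTHREADS];
    int sums[NTHREADS] = {0};
    int ok = 1;

    get_index(0);                      /* init happens here, single-threaded */

    for (int t = 0; t < NTHREADS; t++) {
        int rc = pthread_create(&tid[t], NULL, reader, &sums[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        if (sums[t] != 55) ok = 0;     /* 1 + 2 + ... + 10 */
    }
    return ok;
}
```

After the first call, the reader threads only ever read array and array_initialized, so there is no concurrent write left to race with.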
You could make lazy init thread-safe with an _Atomic flag
This is similar to what gcc does for function-local static variables with non-constant initializers. Check out the code-gen for that if you're curious: a read-only check of an already-init flag and then a call to an init function that makes sure only one thread runs the initializer.
This requires an acquire load in the fast-path for the already-initialized state. That's free on x86 and SPARC-TSO (same asm as a normal load), but not on weaker ISAs. AArch64 has an acquire load instruction, other ISAs need some barrier instructions.
Turn your array_initialized flag into a 3-state _Atomic variable:
init not started (e.g. init == 0). Check for this with an acquire load.
init started but not finished (e.g. init == -1)
init finished (e.g. init == 1)
You can leave static int array[10]; itself non-atomic by making sure exactly 1 thread "claims" responsibility for doing the init, using atomic_compare_exchange_strong (which will succeed for exactly one thread). And then have other threads spin-wait for the INIT_FINISHED state.
Using initial state == 0 lets it be in the BSS, hopefully next to the data. Otherwise we might prefer INIT_FINISHED=0 for ISAs where branching on an int from memory being (non)zero is slightly more efficient than other numbers. (e.g. AArch64 cbnz, MIPS bne $reg, $zero).
We could get the best of both worlds (cheapest possible fast-path for the already-init case) while still having the flag in the BSS: Have the main thread write it with INIT_NOTSTARTED = -1 before starting any more threads.
Having the flag next to the array is helpful for a small array where the flag is probably in the same cache line as the data we want to index. Or at least the same 4k page.
#include <stdatomic.h>
#include <stdbool.h>
#ifdef __x86_64__
#include <immintrin.h>
#define SPINLOOP_BODY _mm_pause()
#else
#define SPINLOOP_BODY /**/
#endif
#ifdef __GNUC__
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr) __builtin_expect(!!(expr), 1)
#define NOINLINE __attribute__((noinline))
#else
#define unlikely(expr) (expr)
#define likely(expr) (expr)
#define NOINLINE /**/
#endif
enum init_states {
INIT_NOTSTARTED = 0,
INIT_STARTED = -1,
INIT_FINISHED = 1 // optional: make this 0 to speed up the fast-path on some ISAs, and store an INIT_NOTSTARTED before the first call
};
static int array[10];
static _Atomic int array_initialized = INIT_NOTSTARTED;
// called either before or during init.
// One thread claims responsibility for doing the init, others spin-wait
NOINLINE // this is rare, make sure it doesn't bloat the fast-path
void initialize(void) {
bool winner = false;
// check read-only if another thread has already claimed init
if (array_initialized == INIT_NOTSTARTED) {
int expected = INIT_NOTSTARTED;
winner = atomic_compare_exchange_strong(&array_initialized, &expected, INIT_STARTED);
// seq_cst memory order is fine. Weaker might be ok but it only has to run once
}
if (winner) {
array[0] = 1;
// ...
atomic_store_explicit(&array_initialized, INIT_FINISHED, memory_order_release);
} else {
// spin-wait for the winner in other threads
// yield(); optional.
// Or use some kind of mutex or condition var if init is really slow
// otherwise just spin on a seq_cst load. (Or acquire is fine.)
while(array_initialized != INIT_FINISHED)
SPINLOOP_BODY; // x86 only
// winner's release store syncs with our load:
// array[] stores Happened Before this point so we can read it without UB
}
}
int get_index(int index) {
// atomic acquire load is fine, doesn't need seq_cst. Cheaper than seq_cst on PowerPC
if (unlikely(atomic_load_explicit(&array_initialized, memory_order_acquire) != INIT_FINISHED))
initialize();
if (unlikely(index < 0 || index > 9)) return -1;
return array[index];
}
This does compile to correct-looking and efficient asm on Godbolt. Without unlikely() macros, gcc/clang think that at least the stand-alone version of get_index has initialize() and/or return -1 as the most likely fast-path.
And compilers wanted to inline the init function, which would be silly because it only runs once per thread at most. Hopefully profile-guided optimization would correct that.
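If spin-waiting is not wanted (the code comments mention a mutex or condition variable for slow init), a library once-primitive is another option. The following is a sketch using POSIX pthread_once instead of the hand-rolled 3-state flag; it is a swapped-in technique, not what the code above does. The array contents, init_calls, caller and run_once_demo are assumptions for illustration.

```c
#include <assert.h>
#include <pthread.h>

static int array[10];
static pthread_once_t array_once = PTHREAD_ONCE_INIT;
static int init_calls;                 /* counts how often the initializer ran */

static void initialize(void)
{
    for (int i = 0; i < 10; i++)
        array[i] = i + 1;              /* assumed contents: 1..10 */
    init_calls++;
}

int get_index(int index)
{
    /* Runs initialize() exactly once; late callers block until it has finished. */
    pthread_once(&array_once, initialize);
    if (index < 0 || index > 9) return -1;
    return array[index];
}

static void *caller(void *arg)
{
    int *out = arg;
    *out = get_index(9);
    return NULL;
}

/* Starts several threads that all race to the first get_index() call
   and returns how many times the initializer actually ran. */
int run_once_demo(void)
{
    enum { NTHREADS = 4 };
    pthread_t tid[NTHREADS];
    int results[NTHREADS] = {0};

    for (int t = 0; t < NTHREADS; t++) {
        int rc = pthread_create(&tid[t], NULL, caller, &results[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        assert(results[t] == 10);
    }
    return init_calls;
}
```

The trade-off is that the fast path now goes through a library call instead of a single inline acquire load, so the hand-written version can still be the better choice when get_index is hot.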

Hooking a function I don't know the parameters to

Let's say there is a DLL A.DLL with a known entry point DoStuff that I have in some way hooked out with my own DLL fakeA.dll such that the system is calling my DoStuff instead. How do I write such a function so that it can then call the same entry point of the hooked DLL (A.DLL) without knowing the arguments of the function? I.e. my function in fakeA.DLL would look something like
LONG DoStuff(
// don't know what to put here
)
{
FARPROC pfnHooked;
HINSTANCE hHooked;
LONG lRet;
// get hooked library and desired function
hHooked = LoadLibrary("A.DLL");
pfnHooked = GetProcAddress(hHooked, "DoStuff");
// how do I call the desired function without knowing the parameters?
lRet = pfnHooked( ??? );
return lRet;
}
My current thinking is that the arguments are on the stack so I'm guessing I would have to have a sufficiently large stack variable (a big ass struct for example) to capture whatever the arguments are and then just pass it along to pfnHooked? I.e.
// actual arg stack limit is >1MB but we'll assume 1024 bytes is sufficient
typedef struct { char unknownData[1024]; } ARBITARY_ARG;
ARBITARY_ARG DoStuff(ARBITARY_ARG args){
ARBITARY_ARG aRet;
...
aRet = pfnHooked(args);
return aRet;
}
Would this work? If so, is there a better way?
UPDATE: After some rudimentary (and non-conclusive) testing passing in the arbitrary block as arguments DOES work (which is not surprising, as the program will just read what it needs off the stack). However collecting the return value is harder as if it's too large it can cause an access violation. Setting the arbitrary return size to 8 bytes (or maybe 4 for x86) may be a solution to most cases (including void returns) however that's still guesswork. If I had some way of knowing the return type from the DLL (not necessarily at runtime) that would be grand.
This should be a comment, but the meta answer is yes: you can hook the function without knowing the calling convention and arguments on an x64/x86 platform. Can it be done purely in C? No; it also needs a good understanding of the various calling conventions and of assembly programming. The hooking framework will have some of its bits written in assembly.
Most hooking frameworks inherently do that by creating a trampoline that redirects the execution flow from the called function's preamble to stub code that is generally independent of the function being hooked. In user mode the stack is guaranteed to always be present, so you can push your own local variables onto the same stack too, as long as you pop them and restore the stack to its original state.
You don't really need to copy the existing arguments to your own stack variable; you can just inspect the stack. Definitely read a bit about calling conventions and how stacks are constructed on different architectures for the various types of invocation in assembly before you attempt anything.
Yes, generic hooking can be done 100% correctly: one common hook for multiple functions with different argument counts and calling conventions, on both x86 and x64 (amd64) platforms.
It requires small asm stubs, which of course differ between x86 and x64, but they are very small, only several lines of code: 2 small stub procedures, one for the pre-call filter and one for the post-call filter. Most of the implementation (95%+) is platform independent and written in C++ (it can of course be done in C too, but compared to C++ the C source will be larger, uglier and harder to write).
My solution allocates a small executable block of code for every hooked API (one block per hooked API). The block stores the function name, the original address (or the address to transfer control to after the pre-call; this depends on the hooking method) and one relative call instruction to the common asm pre-call stub. The magic of this call is not only that it transfers control to the common stub, but also that the return address on the stack points to the block itself (with some offset, but if we use C++ and inheritance it points exactly to a base class from which we derive our executable block class). As a result, the common pre-call stub knows which API call is being hooked and passes this information to the common C++ handler.
One note: because on x64 a relative call can only reach the range [rip-0x80000000, rip+0x7fffffff], these code blocks must be declared (allocated) inside our PE in a separate bss section, and that section must be marked RWE. We cannot simply use VirtualAlloc to allocate the storage, because the returned address can be too far from our common pre-call stub.
The common asm pre-call stub must save the rcx, rdx, r8 and r9 registers on x64 (this is absolutely mandatory) and the ecx and edx registers on x86. The latter is needed in case the function uses the __fastcall calling convention. The Windows API, however, hardly uses __fastcall: only a few __fastcall functions exist among thousands of Win API functions (to verify this and find those functions, go to the LIB folder and search for the string __imp_@, the common __fastcall prefix). The stub then calls the common C++ handler, which must return to the stub the address of the original function (where to transfer control). The stub restores the rcx, rdx, r8 and r9 (or ecx and edx) registers and jumps (not calls!) to this address.
If we only want to filter the pre-call, this is all we need. In most cases, however, we also need to filter (hook) the post-call, to view or modify the function's return value and out parameters. This is also possible, but needs a little more coding.
To hook the post-call we obviously must replace the return address of the hooked API. But what do we change the return address to, and where do we save the original return address? We cannot use a global variable for this. We cannot even use a thread-local one (__declspec( thread ) or thread_local), because the call can be recursive. We cannot use a volatile register (because it changes during the API call), and we cannot use a non-volatile register, because then we would have to save it in order to restore it later, which raises the same question: where?
There is only one (and nice) solution: allocate a small block of executable memory (RWE) containing one relative call instruction to the common post-call asm stub, plus some data: the saved original return address, the function parameters (for checking out parameters in the post handler) and the function name.
Here again there is an issue on x64: this block must not be too far from the common post stub (+/- 2GB), so it is best to allocate these stubs in the separate .bss section as well (with the pre-call stubs).
How many of these ret-stubs are needed? One per API call (if we want to control the post-call), so no more than the number of API calls active at any one time. Usually, say, 256 pre-allocated blocks are more than enough. Even if we fail to allocate a block in the pre-call, we merely lose control of that post-call; nothing crashes. And we may want to control the post-call for only some of the hooked APIs, not all of them.
To allocate and free these blocks very quickly and in an interlocked way, build stack semantics over them: allocate with an interlocked pop and free with an interlocked push. Pre-initialize the blocks (the call instruction) at startup, while pushing them all onto the stack, so they don't have to be reinitialized in every pre-call.
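The "interlocked pop to allocate, interlocked push to free" idea can be sketched portably. This is a minimal C11 illustration of a fixed pool with a lock-free free list, not the SLIST-based code used in this answer. POOL_SIZE, block_t, pool_init, pool_alloc and pool_free are hypothetical names, blocks are addressed by index, and a tag counter packed next to the head index is an assumption of this sketch to guard against the ABA problem.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define POOL_SIZE 4u                 /* hypothetical; a real pool would be larger */
#define NIL 0xFFFFFFFFu              /* "no block" marker */

typedef struct {
    _Atomic uint32_t next;           /* index of the next free block, or NIL */
    void *saved_return;              /* payload a ret-stub block would carry */
} block_t;

static block_t pool[POOL_SIZE];
/* Head packs {tag:32, index:32}; the tag changes on every update. */
static _Atomic uint64_t head;

static uint64_t pack(uint32_t tag, uint32_t idx)
{
    return ((uint64_t)tag << 32) | idx;
}

/* Free = interlocked push. */
void pool_free(uint32_t idx)
{
    assert(idx < POOL_SIZE);
    uint64_t old = atomic_load(&head);
    uint64_t desired;
    do {
        atomic_store(&pool[idx].next, (uint32_t)old);
        desired = pack((uint32_t)(old >> 32) + 1, idx);
    } while (!atomic_compare_exchange_weak(&head, &old, desired));
}

/* Alloc = interlocked pop; returns NIL when the pool is exhausted. */
uint32_t pool_alloc(void)
{
    uint64_t old = atomic_load(&head);
    uint64_t desired;
    do {
        uint32_t idx = (uint32_t)old;
        if (idx == NIL)
            return NIL;
        desired = pack((uint32_t)(old >> 32) + 1, atomic_load(&pool[idx].next));
    } while (!atomic_compare_exchange_weak(&head, &old, desired));
    return (uint32_t)old;
}

/* Push every block once at startup, as the pre-initialization step describes. */
void pool_init(void)
{
    atomic_store(&head, pack(0, NIL));
    for (uint32_t i = 0; i < POOL_SIZE; i++)
        pool_free(i);
}
```

When pool_alloc() returns NIL, the caller simply skips the post-call hook for that one call, which matches the "we only lose control of the post-call, but don't crash" behaviour described above.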
The common post-call stub in asm is very simple; here we don't need to save any registers. We simply call the C++ post handler with the address of the block (popped from the stack, where the block's call instruction left it) and with the original return value (rax or eax). Strictly speaking, an API function can return a pair rax+rdx or eax+edx, but 99.9%+ of Windows APIs return the value in a single register, and I assume we will hook only such APIs. If desired, the code can be adjusted a little to handle this too (in most cases it is simply not needed).
The C++ post-call handler restores the original return address (by using _AddressOfReturnAddress()), can log the call and/or modify out parameters, and finally returns to the original caller of the API. Whatever our handler returns becomes the final return value of the API call. Usually we must return the original value.
c++ code
#if 0
#define __ASM_FUNCTION __pragma(message(__FUNCDNAME__" proc\r\n" __FUNCDNAME__ " endp"))
#define _ASM_FUNCTION {__ASM_FUNCTION;}
#define ASM_FUNCTION {__ASM_FUNCTION;return 0;}
#define CPP_FUNCTION __pragma(message("extern " __FUNCDNAME__ " : PROC ; " __FUNCTION__))
#else
#define _ASM_FUNCTION
#define ASM_FUNCTION
#define CPP_FUNCTION
#endif
class CODE_STUB
{
#ifdef _WIN64
PVOID pad;
#endif
union
{
DWORD code;
struct
{
BYTE cc[3];
BYTE call;
};
};
int offset;
public:
void Init(PVOID stub)
{
// int3; int3; int3; call stub
code = 0xe8cccccc;
offset = RtlPointerToOffset(&offset + 1, stub);
C_ASSERT(sizeof(CODE_STUB) == RTL_SIZEOF_THROUGH_FIELD(CODE_STUB, offset));
}
PVOID Function()
{
return &call;
}
// implemented in .asm
static void __cdecl retstub() _ASM_FUNCTION;
static void __cdecl callstub() _ASM_FUNCTION;
};
struct FUNC_INFO
{
PVOID OriginalFunc;
PCSTR Name;
void* __fastcall OnCall(void** stack);
};
struct CALL_FUNC : CODE_STUB, FUNC_INFO
{
};
C_ASSERT(FIELD_OFFSET(CALL_FUNC,OriginalFunc) == sizeof(CODE_STUB));
struct RET_INFO
{
union
{
struct
{
PCSTR Name;
PVOID params[7];
};
SLIST_ENTRY Entry;
};
INT_PTR __fastcall OnCall(INT_PTR r);
};
struct RET_FUNC : CODE_STUB, RET_INFO
{
};
C_ASSERT(FIELD_OFFSET(RET_FUNC, Entry) == sizeof(CODE_STUB));
#pragma bss_seg(".HOOKS")
RET_FUNC g_rf[1024];//max call count
CALL_FUNC g_cf[16];//max hooks count
#pragma bss_seg()
#pragma comment(linker, "/SECTION:.HOOKS,RWE")
class RET_FUNC_Manager
{
SLIST_HEADER _head;
public:
RET_FUNC_Manager()
{
PSLIST_HEADER head = &_head;
InitializeSListHead(head);
RET_FUNC* p = g_rf;
DWORD n = RTL_NUMBER_OF(g_rf);
do
{
p->Init(CODE_STUB::retstub);
InterlockedPushEntrySList(head, &p++->Entry);
} while (--n);
}
RET_FUNC* alloc()
{
return static_cast<RET_FUNC*>(CONTAINING_RECORD(InterlockedPopEntrySList(&_head), RET_INFO, Entry));
}
void free(RET_INFO* p)
{
InterlockedPushEntrySList(&_head, &p->Entry);
}
} g_rfm;
void* __fastcall FUNC_INFO::OnCall(void** stack)
{
CPP_FUNCTION;
// in case __fastcall function in x86 - param#1 at stack[-1] and param#2 at stack[-2]
// this need for filter post call only
if (RET_FUNC* p = g_rfm.alloc())
{
p->Name = Name;
memcpy(p->params, stack, sizeof(p->params));
*stack = p->Function();
}
return OriginalFunc;
}
INT_PTR __fastcall RET_INFO::OnCall(INT_PTR r)
{
CPP_FUNCTION;
*(void**)_AddressOfReturnAddress() = *params;
PCSTR name = Name;
char buf[8];
if (IS_INTRESOURCE(name))
{
sprintf(buf, "#%04x", (ULONG)(ULONG_PTR)name), name = buf;
}
DbgPrint("%p %s(%p, %p, %p ..)=%p\r\n", *params, name, params[1], params[2], params[3], r);
g_rfm.free(this);
return r;
}
struct DLL_TO_HOOK
{
PCWSTR szDllName;
PCSTR szFuncNames[];
};
void DoHook(DLL_TO_HOOK** pp)
{
PCSTR* ppsz, psz;
DLL_TO_HOOK *p;
ULONG n = RTL_NUMBER_OF(g_cf);
CALL_FUNC* pcf = g_cf;
while (p = *pp++)
{
if (HMODULE hmod = LoadLibraryW(p->szDllName))
{
ppsz = p->szFuncNames;
while (psz = *ppsz++)
{
if (pcf->OriginalFunc = GetProcAddress(hmod, psz))
{
pcf->Name = psz;
pcf->Init(CODE_STUB::callstub);
// do hook: pcf->OriginalFunc -> pcf->Function() - code for this skipped
DbgPrint("hook: (%p) <- (%p)%s\n", pcf->Function(), pcf->OriginalFunc, psz);
if (!--n)
{
return;
}
pcf++;
}
}
}
}
}
asm x64 code:
extern ?OnCall@FUNC_INFO@@QEAAPEAXPEAPEAX@Z : PROC ; FUNC_INFO::OnCall
extern ?OnCall@RET_INFO@@QEAA_J_J@Z : PROC ; RET_INFO::OnCall
?retstub@CODE_STUB@@SAXXZ proc
pop rcx
mov rdx,rax
call ?OnCall@RET_INFO@@QEAA_J_J@Z
?retstub@CODE_STUB@@SAXXZ endp
?callstub@CODE_STUB@@SAXXZ proc
mov [rsp+10h],rcx
mov [rsp+18h],rdx
mov [rsp+20h],r8
mov [rsp+28h],r9
pop rcx
mov rdx,rsp
sub rsp,18h
call ?OnCall@FUNC_INFO@@QEAAPEAXPEAPEAX@Z
add rsp,18h
mov rcx,[rsp+8]
mov rdx,[rsp+10h]
mov r8,[rsp+18h]
mov r9,[rsp+20h]
jmp rax
?callstub@CODE_STUB@@SAXXZ endp
asm x86 code
extern ?OnCall@FUNC_INFO@@QAIPAXPAPAX@Z : PROC ; FUNC_INFO::OnCall
extern ?OnCall@RET_INFO@@QAIHH@Z : PROC ; RET_INFO::OnCall
?retstub@CODE_STUB@@SAXXZ proc
pop ecx
mov edx,eax
call ?OnCall@RET_INFO@@QAIHH@Z
?retstub@CODE_STUB@@SAXXZ endp
?callstub@CODE_STUB@@SAXXZ proc
xchg [esp],ecx
push edx
lea edx,[esp + 8]
call ?OnCall@FUNC_INFO@@QAIPAXPAPAX@Z
pop edx
pop ecx
jmp eax
?callstub@CODE_STUB@@SAXXZ endp
You may ask where I got decorated names like ?OnCall@FUNC_INFO@@QAIPAXPAPAX@Z. Look at the very beginning of the C++ code, at the macros there: compile the first time with #if 1 and look in the output window. (You will probably need to use the names you get there rather than mine; the decoration can be different.)
And how do you call void DoHook(DLL_TO_HOOK** pp)? Like this:
DLL_TO_HOOK dth_kernel32 = { L"kernel32", { "VirtualAlloc", "VirtualFree", "HeapAlloc", 0 } };
DLL_TO_HOOK dth_ntdll = { L"ntdll", { "NtOpenEvent", 0 } };
DLL_TO_HOOK* ghd[] = { &dth_ntdll, &dth_kernel32, 0 };
DoHook(ghd);
Let's say there is a DLL A.DLL with a known entry point DoStuff
If the entry point DoStuff is known it ought to be documented somewhere, at the very least in some C header file. So a possible approach might be to parse that header to get its signature (i.e. the C declaration of DoStuff). Maybe you could fill some database with the signature of all functions declared in all system header files, etc... Or perhaps use debug information if you have it.
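Once the signature has been recovered, the forwarding hook is plain C. The sketch below assumes a made-up signature (an int and a string, returning long); do_stuff_fn, set_real_do_stuff, hook_calls and fake_real_do_stuff are hypothetical names, and the stand-in function replaces the pointer that GetProcAddress would normally provide so the sketch runs without A.DLL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature, as if recovered from a header for A.DLL. */
typedef long (*do_stuff_fn)(int count, const char *name);

static do_stuff_fn real_do_stuff;      /* would normally come from GetProcAddress */
static int hook_calls;                 /* what the hook records before forwarding */

void set_real_do_stuff(do_stuff_fn fn)
{
    real_do_stuff = fn;
}

/* The hook: same signature as the real export, so the compiler generates
   the right argument passing for whatever convention is in use. */
long DoStuff(int count, const char *name)
{
    hook_calls++;
    assert(real_do_stuff != NULL);
    return real_do_stuff(count, name);
}

/* Stand-in for the real export so the sketch runs without A.DLL. */
static long fake_real_do_stuff(int count, const char *name)
{
    return count + (name ? (long)name[0] : 0);
}
```

With a wrong guess at the signature the same code compiles but misbehaves at run time, which is why the declaration has to come from a header or debug information rather than from trial and error.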
If you call some function (in C) and don't give all the required parameters, the calling convention & ABI will still be used, and these (missing) parameters get garbage values (if the calling convention defines that parameter to be passed in a register, the garbage inside that register; if the convention defines that parameter to be passed on the call stack, the garbage inside that particular call stack slot). So you are likely to crash and surely have some undefined behavior (which is scary, since your program might seem to work but still be very wrong).
However, look also into libffi. Once you know (at runtime) what to pass to some arbitrary function, you can construct a call to it passing the right number and types of arguments.
My current thinking is that the arguments are on the stack
I think it is wrong (at least on many x86-64 systems). Some arguments are passed thru registers. Read about x86 calling conventions.
Would this work?
No, it won't work because some arguments are passed thru registers, and because the calling convention depends upon the signature of the called function (floating point values might be passed in different registers, or always on the stack; variadic functions have specific calling conventions; etc....)
BTW, some recent C optimizing compilers are able to do tail call optimizations, which might complicate things.
There is no standard way of doing this, because a lot of things, like calling conventions, pointer sizes etc., matter when passing arguments. You will have to read the ABI for your platform and write an implementation, which I fear again won't be possible in C. You will need some inline assembly.
One simple way to do it would be (for a platform like X86_64) -
MyDoStuff:
jmpq *__real_DoStuff
This hook does nothing but jump to the real function. If you want to do anything useful while hooking, you will have to save and restore some registers around the call (again, what to save depends on the ABI).

How to fix a Hook in a C program (stack's restoration)

It's a kind of training task, because nowadays these methods (I guess) don't work anymore.
Win XP and the MinGW compiler are used. No special compiler options are involved (just gcc given one source file).
First of all, saving the address used to exit from the program and jumping to some Hook function:
// Our system uses 4 bytes for addresses.
typedef unsigned long int DWORD;
// To save an address of the exit from the program.
DWORD addr_ret;
// An entry point.
int main()
{
// To make a direct access to next instructions.
DWORD m[1];
// Saving an address of the exit from the program.
addr_ret = (DWORD) m[4];
// Replacing the exit from the program with a jump to some Hook function.
m[4] = (DWORD) Hook;
// Status code of the program's execution.
return 0;
}
The goal of this code is to gain access to the system's privilege level: when we return (or should return) to the system, we instead redirect our program to one of our own functions. The code of this function:
// Label's declaration to make a jump.
jmp_buf label;
void Hook()
{
printf ("Test\n");
// Trying to restore the stack by jumping directly (without preparing the stack) into the function (we'll see it later).
longjmp(label, 1);
// Just to make sure that we won't return here after jump's (from above) finish, because we are not getting stuck in the infinite loop.
while(1) {}
}
And finally I'll state a function which (in my opinion) should fix the stack pointer - ESP register:
void FixStack()
{
// A label to make a jump to here.
setjmp(label);
// A replacement of the exit from this function with an exit from the whole program.
DWORD m[1];
m[2] = addr_ret;
}
Of course we should use these includes for the stated program:
#include <stdio.h>
#include <setjmp.h>
The whole logic of the program works correctly in my system, but I can not restore my stack (ESP), so the program returns an incorrect return code.
Before the solution described above, I didn't use jumps or the FixStack function. That is, these lines were in the Hook function instead of the jump and the while loop:
DWORD m[1];
m[2] = addr_ret;
But with this variant I was getting an incorrect value in the ESP register before exiting the program (it was 8 bytes larger than the register's value on entry to the program). So I decided to add these 8 bytes somehow (avoiding any ASM code inside the C program). That is the purpose of the jump into the FixStack function with an appropriate exit from it (to remove some values from the stack). But, as I stated, the program doesn't return a correct execution status when checked with this command:
echo %ErrorLevel%
So my question is very broad: from recommendations on using debugging utilities (I have only used OllyDbg) to possible solutions for the Hook implementation described above.
OK, I finally got my program to work as intended. The compiled program (I use MinGW on Win XP) now runs without any errors and with the correct return code.
Maybe this will be helpful for someone:
#include <stdio.h>
#include <setjmp.h>
typedef unsigned long int DWORD;
DWORD addr_ret;
int FixStack()
{
DWORD m[1];
m[2] = addr_ret;
// This line is very necessary for correct running!
return 0;
}
void Hook()
{
printf("Test\n");
FixStack();
}
int main()
{
DWORD m[1];
addr_ret = (DWORD) m[4];
m[4] = (DWORD) Hook;
}
Of course it seems that you've realized that this will only work with a very specific build environment. It most definitely won't work on a 64-bit target (because the addresses aren't DWORD-ish).
Is there any reason why you don't want to use the facilities provided by the C standard library to do exactly this? (Or something very similar to this.)
#include <stdio.h>
#include <stdlib.h>
void Hook(void)
{
printf("Test\n");
}
int main(void)
{
atexit(Hook); /* Hook runs when main returns or exit() is called */
return 0;
}

How to create an interrupt stack?

I want my interrupt service routine to use a different stack (maybe one of its own) and not the caller thread's stack.
thread_entry (){
do_something();
--> Interrupt occurs
do_otherstuff();
}
void interrupt_routine ()
{
uint8_t read_byte; // I don't want this to be part of caller thread's stack
read_byte= hw_read();
}
Is it possible & how to achieve this?
The stacks required for the OS and the interrupt handlers are set up at initialization. This is again architecture-specific code. ARM processors, for example, have a separate (banked) R13, the stack pointer, that is used when the processor is in interrupt mode. This register is also initialized at boot. What is the problem you want to address with this design?
The GNU C library for Linux has functions to control the stack on which a signal handler executes. Refer to the documentation for full details.
The basic idea is that you allocate memory for the stack and then call sigaltstack() to specify that this stack is available for signal handling. You then use sigaction() to register a handler for a particular signal and pass the flag value SA_ONSTACK so that this handler runs on the alternate stack.
Here is a code snippet showing the pattern; it's "borrowed" from the examples in The Linux Programming Interface:
sigstack.ss_sp = malloc(SIGSTKSZ);
if (sigstack.ss_sp == NULL)
errExit("malloc");
sigstack.ss_size = SIGSTKSZ;
sigstack.ss_flags = 0;
if (sigaltstack(&sigstack, NULL) == -1)
errExit("sigaltstack");
printf("Alternate stack is at %10p-%p\n",
sigstack.ss_sp, (char *) sbrk(0) - 1);
sa.sa_handler = sigsegvHandler; /* Establish handler for SIGSEGV */
sigemptyset(&sa.sa_mask);
sa.sa_flags = SA_ONSTACK; /* Handler uses alternate stack */
if (sigaction(SIGSEGV, &sa, NULL) == -1)
errExit("sigaction");
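For reference, here is a self-contained sketch of the same pattern. It is not the book's code: SIGUSR1, the handler name on_signal and the helper run_handler_on_alt_stack are names chosen here for illustration, and the handler does nothing except record whether one of its local variables lies inside the alternate stack. It assumes a POSIX system such as Linux with glibc.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static void *alt_base;                        /* start of the alternate stack */
static size_t alt_size;                       /* its size in bytes */
static volatile sig_atomic_t on_alt = -1;     /* set by the handler */

/* Hypothetical handler: checks where its own local variable lives. */
static void on_signal(int sig)
{
    unsigned char local;
    uintptr_t here = (uintptr_t)&local;
    uintptr_t base = (uintptr_t)alt_base;

    (void)sig;
    on_alt = (here >= base && here < base + alt_size);
}

/* Returns 1 if the handler ran on the alternate stack, 0 if not, -1 on error. */
int run_handler_on_alt_stack(void)
{
    stack_t ss;
    struct sigaction sa;

    alt_size = SIGSTKSZ;
    alt_base = malloc(alt_size);
    if (alt_base == NULL)
        return -1;

    /* Tell the kernel that this memory may be used as a signal stack. */
    ss.ss_sp = alt_base;
    ss.ss_size = alt_size;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) == -1)
        return -1;

    /* SA_ONSTACK asks for this handler to run on the alternate stack. */
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_ONSTACK;
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    /* raise() returns only after the handler has finished. */
    if (raise(SIGUSR1) != 0)
        return -1;

    /* Clean up: restore the default action and disable the alternate stack. */
    signal(SIGUSR1, SIG_DFL);
    ss.ss_flags = SS_DISABLE;
    sigaltstack(&ss, NULL);
    free(alt_base);

    return on_alt;
}
```

Note that this applies to signal handlers in a user-space process, not to hardware interrupt service routines; the stack for those is set up by the kernel or by the startup code.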
Here's a simple x86 inline assembly implementation. You have a wrapper function which changes the stack, and calls your real routine.
const uint32_t interrupt_stack_size = 4096;
uint8_t interrupt_stack[interrupt_stack_size];
void interrupt_routine_wrap()
{
static int thread_esp;
// Stack grows towards lower addresses, so start at the highest address (the end of the array)
static int irq_esp = (int) interrupt_stack + interrupt_stack_size;
// Store the old esp
asm mov dword ptr thread_esp, esp;
// Set the new esp
asm mov esp, dword ptr irq_esp;
// Execute the real interrupt routine
interrupt_routine();
// Restore old esp
asm mov esp, dword ptr thread_esp;
}
I'm completely ignoring the segment register here (ss), but different memory models may need to store that along with sp.
You can get rid of the inline assembly by using setjmp/longjmp to read/write all registers. That's a more portable way to do it.
Also note that I'm not preserving any registers here, and inline assembly may confuse the compiler. Perhaps it'd be worth adding a pusha/popa pair around the wrapper routine. The compiler may do this for you if you declare the function as an interrupt handler. Check the resulting binary to be certain.
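As a sketch of the same wrapper idea without any inline assembly: on systems that provide the ucontext family (an assumption here; glibc on Linux does), makecontext() and swapcontext() can run a routine on a stack you supply. This swaps in a different technique from the setjmp/longjmp suggestion above, and it is a user-space illustration rather than a real interrupt handler. The names routine, private_stack and run_on_private_stack are hypothetical.

```c
#include <stdint.h>
#include <ucontext.h>

static unsigned char private_stack[64 * 1024];   /* stack for the routine */
static ucontext_t caller_ctx, routine_ctx;
static int ran_on_private_stack;

/* Stand-in for the real routine: records where its local variable lives. */
static void routine(void)
{
    unsigned char local;
    uintptr_t here = (uintptr_t)&local;
    uintptr_t base = (uintptr_t)private_stack;

    ran_on_private_stack =
        (here >= base && here < base + sizeof private_stack);
}

/* Returns 1 if routine() ran on private_stack, 0 if not, -1 on error. */
int run_on_private_stack(void)
{
    if (getcontext(&routine_ctx) == -1)
        return -1;

    /* Give the new context its own stack and say where to go afterwards. */
    routine_ctx.uc_stack.ss_sp = private_stack;
    routine_ctx.uc_stack.ss_size = sizeof private_stack;
    routine_ctx.uc_link = &caller_ctx;
    makecontext(&routine_ctx, routine, 0);

    /* Save the current context, switch stacks, and run routine(). */
    if (swapcontext(&caller_ctx, &routine_ctx) == -1)
        return -1;

    /* Control returns here once routine() has finished. */
    return ran_on_private_stack;
}
```

Unlike the hand-written wrapper, the library saves and restores the registers for you, at the cost of a heavier context switch.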

Does Linux kernel have main function?

I am learning device driver and kernel programming. According to Jonathan Corbet's book, we do not have a main() function in device drivers.
#include <linux/init.h>
#include <linux/module.h>
static int my_init(void)
{
return 0;
}
static void my_exit(void)
{
return;
}
module_init(my_init);
module_exit(my_exit);
Here I have two questions:
Why do we not need a main() function in device drivers?
Does the kernel have a main() function?
start_kernel
On 4.2, start_kernel from init/main.c is a considerable initialization process and could be compared to a main function.
It is the first arch independent code to run, and sets up a large part of the kernel. So much like main, start_kernel is preceded by some lower level setup code (done in the crt* objects in userland main), after which the "main" generic C code runs.
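The userland half of that comparison can be observed directly. A minimal sketch, assuming GCC or Clang (the constructor attribute is a compiler extension, and the names early_setup and setup_ran_before_main are made up for this example): the startup code runs constructor functions before it calls main, so main is not the first code of the program to execute.

```c
static int early_setup_done;

/* GCC/Clang extension: the startup code calls this before main(). */
__attribute__((constructor))
static void early_setup(void)
{
    early_setup_done = 1;
}

/* Call this from main(): it returns 1, because early_setup() already ran. */
int setup_ran_before_main(void)
{
    return early_setup_done;
}
```

In the kernel the roles are similar: the architecture-specific assembly plays the part of the startup code, and start_kernel plays the part of main.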
How start_kernel gets called in x86_64
arch/x86/kernel/vmlinux.lds.S, a linker script, sets:
ENTRY(phys_startup_64)
and
phys_startup_64 = startup_64 - LOAD_OFFSET;
and:
#define LOAD_OFFSET __START_KERNEL_map
arch/x86/include/asm/page_64_types.h defines __START_KERNEL_map as:
#define __START_KERNEL_map _AC(0xffffffff80000000, UL)
which is the kernel entry address. TODO how is that address reached exactly? I have to understand the interface Linux exposes to bootloaders.
arch/x86/kernel/vmlinux.lds.S sets the very first bootloader section as:
.text : AT(ADDR(.text) - LOAD_OFFSET) {
_text = .;
/* bootstrapping code */
HEAD_TEXT
include/asm-generic/vmlinux.lds.h defines HEAD_TEXT:
#define HEAD_TEXT *(.head.text)
arch/x86/kernel/head_64.S defines startup_64. That is the very first x86 kernel code that runs. It does a lot of low level setup, including segmentation and paging.
That is then the first thing that runs because the file starts with:
.text
__HEAD
.code64
.globl startup_64
and include/linux/init.h defines __HEAD as:
#define __HEAD .section ".head.text","ax"
so the same as the very first thing of the linker script.
At the end it calls x86_64_start_kernel, a bit awkwardly, with an lretq:
movq initial_code(%rip),%rax
pushq $0 # fake return address to stop unwinder
pushq $__KERNEL_CS # set correct cs
pushq %rax # target address in negative space
lretq
and:
.balign 8
GLOBAL(initial_code)
.quad x86_64_start_kernel
arch/x86/kernel/head64.c defines x86_64_start_kernel which calls x86_64_start_reservations which calls start_kernel.
arm64 entry point
The very first arm64 instruction that runs on a v5.7 uncompressed kernel is defined at https://github.com/cirosantilli/linux/blob/v5.7/arch/arm64/kernel/head.S#L72 so it is either the add x13, x18, #0x16 or the b stext, depending on CONFIG_EFI:
__HEAD
_head:
/*
* DO NOT MODIFY. Image header expected by Linux boot-loaders.
*/
#ifdef CONFIG_EFI
/*
* This add instruction has no meaningful effect except that
* its opcode forms the magic "MZ" signature required by UEFI.
*/
add x13, x18, #0x16
b stext
#else
b stext // branch to kernel start, magic
.long 0 // reserved
#endif
le64sym _kernel_offset_le // Image load offset from start of RAM, little-endian
le64sym _kernel_size_le // Effective size of kernel image, little-endian
le64sym _kernel_flags_le // Informative flags, little-endian
.quad 0 // reserved
.quad 0 // reserved
.quad 0 // reserved
.ascii ARM64_IMAGE_MAGIC // Magic number
#ifdef CONFIG_EFI
.long pe_header - _head // Offset to the PE header.
This is also the very first byte of an uncompressed kernel image.
Both of those cases jump to stext which starts the "real" action.
As mentioned in the comment, these two instructions are the first 8 bytes of a documented 64-byte header described at: https://github.com/cirosantilli/linux/blob/v5.7/Documentation/arm64/booting.rst#4-call-the-kernel-image
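The layout of that header can be written down as a C struct that mirrors the fields emitted in head.S above. This is only a sketch: the field names follow my reading of the booting document, and the helper looks_like_arm64_image is a hypothetical name introduced here. It checks the "ARM\x64" magic at byte offset 56 of a buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the header emitted at _head in head.S (64 bytes in total). */
struct arm64_image_header {
    uint32_t code0;        /* first instruction (add ... or b stext) */
    uint32_t code1;        /* second instruction (b stext) or reserved */
    uint64_t text_offset;  /* image load offset, little-endian */
    uint64_t image_size;   /* effective image size, little-endian */
    uint64_t flags;        /* informative flags, little-endian */
    uint64_t res2;         /* reserved */
    uint64_t res3;         /* reserved */
    uint64_t res4;         /* reserved */
    uint32_t magic;        /* the bytes 'A' 'R' 'M' 0x64 */
    uint32_t res5;         /* reserved, or offset to the PE header */
};

_Static_assert(sizeof(struct arm64_image_header) == 64,
               "the arm64 image header is 64 bytes");

/* Returns 1 if buf starts with something shaped like an arm64 Image header. */
int looks_like_arm64_image(const void *buf, size_t len)
{
    static const unsigned char magic[4] = { 'A', 'R', 'M', 0x64 };
    const unsigned char *bytes = buf;

    if (bytes == NULL || len < sizeof(struct arm64_image_header))
        return 0;

    /* Compare raw bytes so the check does not depend on host endianness. */
    return memcmp(bytes + offsetof(struct arm64_image_header, magic),
                  magic, sizeof magic) == 0;
}
```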
arm64 first MMU enabled instruction: __primary_switched
I think it is __primary_switched in head.S:
/*
* The following fragment of code is executed with the MMU enabled.
*
* x0 = __PHYS_OFFSET
*/
__primary_switched:
At this point, the kernel appears to create page tables + maybe relocate itself such that the PC addresses match the symbols of the vmlinux ELF file. Therefore at this point you should be able to see meaningful function names in GDB without extra magic.
arm64 secondary CPU entry point
secondary_holding_pen defined at: https://github.com/cirosantilli/linux/blob/v5.7/arch/arm64/kernel/head.S#L691
Entry procedure further described at: https://github.com/cirosantilli/linux/blob/v5.7/arch/arm64/kernel/head.S#L691
Fundamentally, there is nothing special about a routine being named main(). As alluded to above, main() serves as the entry point for an executable load module. However, you can define different entry points for a load module. In fact, you can define more than one entry point, for example, refer to your favorite dll.
From the operating system's (OS) point of view, all it really needs is the address of the entry point of the code that will function as a device driver. The OS will pass control to that entry point when the device driver is required to perform I/O to the device.
A system programmer defines (each OS has its own method) the connection between a device, a load module that functions as the device's driver, and the name of the entry point in the load module.
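A toy model of that arrangement, written as ordinary user-space C rather than real kernel code: the "OS" keeps a table of entry-point addresses, and a "driver" is just a structure of function pointers registered under a device name. Every name here (driver_ops, register_driver, device_read, null_driver) is invented for the illustration and does not correspond to any real kernel API.

```c
#include <stddef.h>
#include <string.h>

/* The entry points a driver hands to the OS; no function is called main. */
struct driver_ops {
    const char *name;                    /* device name the driver serves */
    int (*init)(void);                   /* called once at registration */
    int (*read)(char *buf, size_t len);  /* called when I/O is requested */
};

#define MAX_DRIVERS 4
static const struct driver_ops *drivers[MAX_DRIVERS];
static size_t driver_count;

/* "OS" side: remember the driver's entry points after a successful init. */
int register_driver(const struct driver_ops *ops)
{
    if (ops == NULL || ops->init == NULL || driver_count == MAX_DRIVERS)
        return -1;
    if (ops->init() != 0)
        return -1;
    drivers[driver_count++] = ops;
    return 0;
}

/* "OS" side: pass control to the driver's read entry point by device name. */
int device_read(const char *name, char *buf, size_t len)
{
    for (size_t i = 0; i < driver_count; i++) {
        if (strcmp(drivers[i]->name, name) == 0 && drivers[i]->read != NULL)
            return drivers[i]->read(buf, len);
    }
    return -1;
}

/* A trivial example driver: reads always yield zero bytes of data value 0. */
static int null_init(void) { return 0; }
static int null_read(char *buf, size_t len)
{
    memset(buf, 0, len);
    return (int)len;
}
const struct driver_ops null_driver = { "null", null_init, null_read };
```

After register_driver(&null_driver) succeeds, a call such as device_read("null", buf, sizeof buf) reaches null_read purely through the stored address.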
Each OS has its own kernel (obviously), and some might start with main(), but I would be surprised to find a kernel that used main() other than a simple one, such as UNIX! By the time you are writing kernel code you have long moved past the requirement to name the entry point of every module you write main().
Hope this helps?
Found this code snippet from the kernel for Unix Version 6. As you can see main() is just another program, trying to get started!
main()
{
extern schar;
register i, *p;
/*
* zero and free all of core
*/
updlock = 0;
i = *ka6 + USIZE;
UISD->r[0] = 077406;
for(;;) {
if(fuibyte(0) < 0) break;
clearsig(i);
maxmem++;
mfree(coremap, 1, i);
i++;
}
if(cputype == 70)
for(i=0; i<62; i=+2) {
UBMAP->r[i] = i<<12;
UBMAP->r[i+1] = 0;
}
// etc. etc. etc.
Several ways to look at it:
Device drivers are not programs. They are modules that are loaded into another program (the kernel). As such, they do not have a main() function.
The fact that all programs must have a main() function is only true for userspace applications. It does not apply to the kernel, nor to device drivers.
By main() you probably mean what main() is to a program, namely its "entry point".
For a module that is init_module().
From Linux Device Drivers, 2nd Edition:
Whereas an application performs a single task from beginning to end, a module registers itself in order to serve future requests, and its "main" function terminates immediately. In other words, the task of the function init_module (the module's entry point) is to prepare for later invocation of the module's functions; it's as though the module were saying, "Here I am, and this is what I can do." The second entry point of a module, cleanup_module, gets invoked just before the module is unloaded. It should tell the kernel, "I'm not there anymore; don't ask me to do anything else."
Yes, the Linux kernel has a main function; it is located in the arch/x86/boot/main.c file. But kernel execution starts from the arch/x86/boot/header.S assembly file, and the main() function is called from there by the "calll main" instruction.
Here is that main function:
void main(void)
{
/* First, copy the boot header into the "zeropage" */
copy_boot_params();
/* Initialize the early-boot console */
console_init();
if (cmdline_find_option_bool("debug"))
puts("early console in setup code.\n");
/* End of heap check */
init_heap();
/* Make sure we have all the proper CPU support */
if (validate_cpu()) {
puts("Unable to boot - please use a kernel appropriate "
"for your CPU.\n");
die();
}
/* Tell the BIOS what CPU mode we intend to run in. */
set_bios_mode();
/* Detect memory layout */
detect_memory();
/* Set keyboard repeat rate (why?) and query the lock flags */
keyboard_init();
/* Query Intel SpeedStep (IST) information */
query_ist();
/* Query APM information */
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
query_apm_bios();
#endif
/* Query EDD information */
#if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE)
query_edd();
#endif
/* Set the video mode */
set_video();
/* Do the last things and invoke protected mode */
go_to_protected_mode();
}
While the function name main() is just a common convention (there is no real reason to use it in kernel mode), the Linux kernel does have a main() function for many architectures, and of course User-mode Linux has a main function.
Note that for an application, the OS runtime calls main() to start it. When an operating system boots there is no runtime: the kernel is simply loaded to an address by the boot loader, which is loaded by the MBR, which is loaded by the hardware. So while a kernel may contain a function called main, it need not be the entry point.
See Also:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms633559%28v=vs.85%29.aspx
Linux kernel source:
x86: linux-3.10-rc6/arch/x86/boot/main.c
arm64: linux-3.10-rc6/arch/arm64/kernel/asm-offsets.c

Resources