I'm writing an program which enumerates hooks created by SetWindowsHookEx() Here is the process:
Use GetProcAddress() to obtain gSharedInfo exported in User32.dll(works, verified)
Read User-Mode memory at gSharedInfo + 8, the result should be a pointer of first handle entry. (works, verified)
Read User-Mode memory at [gSharedInfo] + 8, the result should be countof handles to enumerate. (works, verified)
Read data from address obtained in step 2, repeat count times
Check if HANDLEENTRY.bType is 5(which means it's a HHOOK). If so, print informations.
The problem is, although step 1-3 only mess around with user mode memory, step 4 requires the program to read kernel memory. After some research I found that ZwSystemDebugControl can be used to access Kernel Memory from user mode. So I wrote the following function:
BOOL GetKernelMemory(PVOID pKernelAddr, PBYTE pBuffer, ULONG uLength)
{
MEMORY_CHUNKS mc;
ULONG uReaded = 0;
mc.Address = (UINT)pKernelAddr; //Kernel Memory Address - input
mc.pData = (UINT)pBuffer;//User Mode Memory Address - output
mc.Length = (UINT)uLength; //length
ULONG st = -1;
ZWSYSTEMDEBUGCONTROL ZwSystemDebugControl = (ZWSYSTEMDEBUGCONTROL)GetProcAddress(
GetModuleHandleA("ntdll.dll"), "NtSystemDebugControl");
st = ZwSystemDebugControl(SysDbgCopyMemoryChunks_0, &mc, sizeof(MEMORY_CHUNKS), 0, 0, &uReaded);
return st == 0;
}
But the function above didn't work. uReaded is always 0 and st is always 0xC0000002. How do I resolve this error?
my full program:
http://pastebin.com/xzYfGdC5
MSFT did not implement NtSystemDebugControl syscall after windows XP.
The Meltdown vulnerability makes it possible to read Kernel memory from User Mode on most Intel CPUs with a speed of approximately 500kB/s. This works on most unpatched OS'es.
Related
I'm new to OpenCL and I'm working on converting an existing algorithm to OpenCL.
In this process, I am experiencing a phenomenon that I cannot solve on my own, and I would like to ask some help.
Here's details.
My kernel is applied to images of different size (to be precise, each layer of the Laplacian pyramid).
I get normal results for images of larger size such as 3072 x 3072, 1536 x 1536.
But I get abnormal results for smaller images such as 12 x 12, 6 x 6, 3 x 3, 2 x 2.
At first, I suspected that clEnqueueNDRangeKernel had a bottom limit for dimensions, causing this problem. So, I added printf to the beginning of the kernel as follows. It is confirmed that all necessary kernel instances are executed.
__kernel void GetValueOfB(/* parameters */)
{
uint xB = get_global_id(0);
uint yB = get_global_id(1);
printf("(%d, %d)\n", xB, yB);
// calculation code is omitted
}
So after wandering for a while, I added the same printf to the end of the kernel. When I did this, it was confirmed that printf works only for some pixel positions. For pixel positions not output by printf, the calculated values in the resulting image are incorrect, and as a result, I concluded that some kernel instances terminate abnormally before completing the calculations.
__kernel void GetValueOfB(/* parameters */)
{
uint xB = get_global_id(0);
uint yB = get_global_id(1);
printf("(%d, %d)\n", xB, yB);
// calculation code is omitted
printf("(%d, %d, %f)\n", xB, yB, result_for_this_position);
}
It seems that there is no problem with the calculation of the kernel. If I compile the kernel turning off the optimization with the -cl-opt-disable option, I get perfectly correct results for all images regardless of their size. In addition to that, with NVIDA P4000, it works correct. Of course, in theses cases, I confirmed that the printf added at the bottom of the Kernel works for all pixels.
Below I put additional information and attach a part of the code I wrote.
Any advice is welcomed and appreciated.
Thank you.
SDK: Intel® SDK For OpenCL™ Applications 2020.3.494
Platform: Intel(R) OpenCL HD Graphics
for all images
{
...
const size_t globalSize[2] = { size_t(vtMatB_GPU_LLP[nLayerIndex].cols), size_t(vtMatB_GPU_LLP[nLayerIndex].rows) };
err = clEnqueueNDRangeKernel(_pOpenCLManager->GetCommandQueue(), kernel, 2,
NULL, globalSize, NULL, 0, NULL, NULL);
if (CL_SUCCESS != err)
return -1;
// I tried with this but it didn't make any difference
//std::this_thread::sleep_for(std::chrono::seconds(1));
err = clFinish(_pOpenCLManager->GetCommandQueue());
if (CL_SUCCESS != err)
return -1;
err = clEnqueueReadBuffer(_pOpenCLManager->GetCommandQueue(), memMatB, CL_TRUE,
0, sizeof(float) * vtMatB_GPU_LLP[nLayerIndex].cols *
vtMatB_GPU_LLP[nLayerIndex].rows, vtMatB_GPU_LLP[nLayerIndex].data, 0, nullptr, nullptr);
if (CL_SUCCESS != err)
return -1;
...
}
And I tried with event, too, but it works the same way.
for all images
{
...
const size_t globalSize[2] = { size_t(vtMatB_GPU_LLP[nLayerIndex].cols), size_t(vtMatB_GPU_LLP[nLayerIndex].rows) };
cl_event event;
err = clEnqueueNDRangeKernel(_pOpenCLManager->GetCommandQueue(), kernel, 2, NULL, globalSize, NULL, 0, NULL, &event);
if (CL_SUCCESS != err)
return -1;
err = clWaitForEvents(1, &event);
if (CL_SUCCESS != err)
return -1;
err = clFinish(_pOpenCLManager->GetCommandQueue());
if (CL_SUCCESS != err)
return -1;
err = clEnqueueReadBuffer(_pOpenCLManager->GetCommandQueue(), memMatB, CL_TRUE,
0, sizeof(float) * vtMatB_GPU_LLP[nLayerIndex].cols *
vtMatB_GPU_LLP[nLayerIndex].rows, vtMatB_GPU_LLP[nLayerIndex].data, 0, nullptr, nullptr);
if (CL_SUCCESS != err)
return -1;
...
}
/////// Added contents ////////////////////////////////////////////
Would you guys please take look at this issue in the aspect of clFinsh, or clWaitEvent. Am I missing something in this regard?
Sometimes I get less correct values and sometimes I get more correct values.
To be more specific, let's say I'm applying the kernel to 12 x 12 size image. So there're 144 pixel values.
Sometime I get correct values for 56 pixels.
Sometime I get correct values for 89 pixels.
Some other time I get correct value for n(less then 144) pixels.
If I turn off the OpenCL optimization when compiling the kernel by specifying -cl-opt-disable option, I get correct values for all 144 pixels.
The other thing that makes me think the calculation code is correct is that the same OpenCL code with no modification(other then device select code) runs perfectly correctly with NVIDIA P4000.
At first, I was really suspicious about the calculation code, but more I inspect code, more I'm confident there's nothing wrong with calculation code.
I know there's still a chance that there is an error in the calculation code so that there happen some exceptions anywhere during calculations.
I have plain C++ code for same task. I'm comparing results from those two.
/////// Another added contents ////////////////////////////////////////////
I made a minimum code(except projects template) to reproduce the phenomenon.
What's odd more is that if I install "Intel® Distribution for GDB Target" I get correct results.
https://github.com/heysweetethan/GPUOpenCLProjectforWindows
OpenCL kernels run threads in parallel on a specified global range, which in your case is the image size, with one thread per pixel.
The threads are grouped in workgroups, Workgroup size should be a multiple of 32; ideally 64 to make full use of the hardware, or 8x8 pixels in 2D. These workgroups cannot be split, so the global range must be a multiple of workgroup size.
What happens if global range is not clearly divisible by workgroup size, or smaller than workgroup size, like 3x3 pixels? Then the last workgroup is still executed with all 8x8 threads. The first 3x3 work on valid data in memory, but all the other threads read/write unallocated memory. This can cause undefined behavior or even crashes.
If you cannot have global size as a multiple of workgroup size, there is still a solution: a guard clause in the very beginning of the kernel:
if(xB>=xImage||yB>=yImage) return;
This ensures that no threads access unallocated memory.
As you don't supply a complete reproducible code sample, here's a loose collection of comments/suggestions/advice:
1. printf in kernel code
Don't rely on large amounts of printf output from kernels. It's necessarily buffered, and some implementations don't guarantee delivery of messages - often there's a fixed size buffer and when that's full, messages are dropped.
Note that your post-calculation printf increases the total amount of output, for example.
The reliable way to check or print kernel output is to write it to a global buffer and print it in host code. For example, if you want to verify each work-item reaches a specific point in the code, consider creating a zero-initialised global buffer where you can set a flag in each work-item.
2. Events
As you asked about events, flushing, etc. Your clFinish call certainly should suffice to ensure everything has executed - if anything, it's overkill, but especially while you're debugging other issues it's a good way to rule out queuing issue.
The clWaitForEvents() call preceeding it is not a great idea, as you haven't called clFlush() after queueing the kernel whose event you're waiting for. It's fairly minor, but could be a problem on some implementations.
3. Small image sizes
You've not actually posted any of the code that deals with the images themselves, so I can only guess at potential issues there. It looks like you're not using workgroups, so you shouldn't be running into the usual multiple-of-group-size pitfall.
However, are you sure you're loading the source data correctly, and you're correctly indexing into it? There could be all sorts of pitfalls here, from alignment of pixel rows in the source data, enqueueing the kernel before filling the source buffers has completed, creating source buffers with the wrong flags, etc.
So in summary, I'd suggest:
Don't believe in-kernel-printf if something strange is going on. Switch to something more reliable for observing the behaviour of your kernel code.
At minimum, post all your OpenCL API calling host code. Buffer creation, setting arguments, etc. Any fragments of kernel code accessing the buffers are probably not a bad idea either.
Thanks to a person from intel community, I could understand the phenomenon.
Briefly, if you spend to much time on a single kernel instance, 'Timeout Detection and Recovery(TDR)' stops the kernel instance.
For more information about this, you could refer to the followings.
https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
https://www.pugetsystems.com/labs/hpc/Working-around-TDR-in-Windows-for-a-better-GPU-computing-experience-777/
https://community.intel.com/t5/GPU-Compute-Software/It-s-like-OpenCL-kernel-instance-ends-abruptly/m-p/1386883#M478
I appreciate for all the people who gave me advices.
This question has been asked at least a dozen times but I cannot figure out where is my issue.
I am writing a kernel module that must read data from a reserved memory range. These data are written by an external device.
To control the the device, we have a second register within which I want to write some data.
And this is where I start to get lost...
This is the part of the code that, from my understanding, should create a virtual mapping from the reg input in my device tree:
// Read the control memory and map to virtual address
res_ctrl = platform_get_resource(pdev, IORESOURCE_MEM, 0);
if (!res_ctrl) {
dev_err(&pdev->dev, "can't get device resources\n");
return -ENOENT;
}
p3chvideo_device->pchv_ctrl.paddr = res_ctrl->start;
p3chvideo_device->pchv_ctrl.size = resource_size(res_ctrl);
struct resource* res_d = request_mem_region(p3chvideo_device->pchv_ctrl.paddr, p3chvideo_device->pchv_ctrl.size, "p3chv");
p3chvideo_device->pchv_ctrl.vaddr = ioremap_nocache(p3chvideo_device->pchv_ctrl.paddr, p3chvideo_device->pchv_ctrl.size);
if (!p3chvideo_device->pchv_ctrl.vaddr) {
pr_info("Control buffer allocated vaddr: 0x%0llX paddr: 0x%0llX (size: 0x%0llX)\n", p3chvideo_device->pchv_ctrl.vaddr, p3chvideo_device->pchv_ctrl.paddr, p3chvideo_device->pchv_ctrl.size);
return -EADDRNOTAVAIL;
}
//p3chvideo_device->pchv_ctrl.vaddr = devm_ioremap(&pdev->dev, p3chvideo_device->pchv_ctrl.paddr, p3chvideo_device->pchv_ctrl.size);//ioremap(p3chvideo_device->pchv_ctrl.paddr, p3chvideo_device->pchv_ctrl.size);
pr_info("Control buffer allocated vaddr: 0x%0llX paddr: 0x%0llX (size: 0x%0llX)\n", p3chvideo_device->pchv_ctrl.vaddr, p3chvideo_device->pchv_ctrl.paddr, p3chvideo_device->pchv_ctrl.size);
From the messages in the kernel, the register is correctly detected (offset, size). In addition, I do see in /proc/iomem the reserved memory.
However, when I try to write then read the results, it doesn't work, the value I read is different from the value I wrote... It is as if the register value wasn't altered by the write operation.
static void buffer_loaded_enable_interrupt(void) {
pr_info("buffer loaded enable interrupt 0x%0llX 0x%0X\n", p3chvideo_device->pchv_ctrl.vaddr + IRQ_ENABLE_BUFFER_LOADED, (u32)(1 << 0));
// Clear buffer Loaded Interrupt
//wmb();
pr_info("Stored value before: 0x%0X", readl(p3chvideo_device->pchv_ctrl.vaddr + IRQ_ENABLE_BUFFER_LOADED));
//*(p3chvideo_device->pchv_ctrl.vaddr + IRQ_ENABLE_BUFFER_LOADED) = (u32)(1 << 0);
iowrite32((u32)(1 << 0), p3chvideo_device->pchv_ctrl.vaddr + IRQ_ENABLE_BUFFER_LOADED);
udelay(100);
pr_info("Stored value: 0x%0X", readl(p3chvideo_device->pchv_ctrl.vaddr + IRQ_ENABLE_BUFFER_LOADED));
}
If I use a devmem approach, I can write ahd check the read and it works...
What am I missing?
I finally found the issue and I am posting it so if anyone has the same kind of problem, maybe this could help!
The problem was not in the code I put above but in the type of vaddr. It was set as a ssize_t.
ssize_t vaddr;
As soon as I set it to void __iomem *, everything worked as expected!
void __iomem *vaddr;
Let's hope this post will help someone like me :)
Thanks
I developing block layered device driver. So, I intercept WRITE request and encrypt data, and decrypt data in the end_bio() routine (during processing and READ request).
So all works fine in single stream. But I getting buffers content corruption if have tried to performs I/O from two and more processes simultaneously. I have not any local storage for buffers.
Do I'm need to count a BIO merging in my driver?
Is the Linux I/O subsystem have some requirements related to the a number of concurrent I/O request?
Is there some tips and tricks related stack using or compilation?
This is under kernel 4.15.
At the time I use next constriction to run over disk sectors:
/*
* A portion of the bio_copy_data() ...
*/
for (vcnt = 0, src_iter = src->bi_iter; ; vcnt++)
{
if ( !src_iter.bi_size)
{
if ( !(src = src->bi_next) )
break;
src_iter = src->bi_iter;
}
src_bv = bio_iter_iovec(src, src_iter);
src_p = bv_page = kmap_atomic(src_bv.bv_page);
src_p += src_bv.bv_offset;
nlbn = src_bv.bv_len512;
for ( ; nlbn--; lbn++ , src_p += 512 )
{
{
/* Simulate a processing of data in the I/O buffer */
char *srcp = src_p, *dstp = src_p;
int count = DUDRV$K_SECTORSZ;
while ( count--)
{
*(dstp++) = ~ (*(srcp++));
}
}
}
kunmap_atomic(bv_page);
**bio_advance_iter**(src, &src_iter, src_bv.bv_len);
}
Is this correct ? Or I'm need to use something like **bio_for_each_segment(bvl, bio, iter) ** ?
The root of the problem is a "feature" of the Block I/O methods. In particularly (see description at Linex site reference )
** Biovecs can be shared between multiple bios - a bvec iter can represent an
arbitrary range of an existing biovec, both starting and ending midway
through biovecs. This is what enables efficient splitting of arbitrary
bios. Note that this means we only use bi_size to determine when we've
reached the end of a bio, not bi_vcnt - and the bio_iovec() macro takes
bi_size into account when constructing biovecs.*
So, in my case this is a cause of the buffer with disk sector overrun.
The trick is set REQ_NOMERGE_FLAGS in the .bi_opf before sending BIO to backed device driver.
The second reason is the non-actual .bi_iter is returned by backed device driver. So, we need to save it (before submiting BIO request to backend) and restore it in the our "bio_endio()" routine.
Have you considered the use of vmap with global synchronization, instead?
The use of kmap_atomic has some restrictions:
Since the mapping is restricted to the CPU that issued it, it
performs well, but the issuing task is therefore required to stay on that
CPU until it has finished, lest some other task displace its mappings.
kmap_atomic() may also be used by interrupt contexts, since it is does not
sleep and the caller may not sleep until after kunmap_atomic() is called.
Reference: https://www.kernel.org/doc/Documentation/vm/highmem.txt
I'm working with stm32+rtos to implement a file system based on spi flash. For freertos, I adopted heap_1 implementation. This is how i create my task.
osThreadDef(Task_Embedded, Task_VATEmbedded, osPriorityNormal, 0, 2500);
VATEmbeddedTaskHandle = osThreadCreate(osThread(Task_Embedded), NULL);
I allocated 10000 bytes of memory to this thread.
and in this thread. I tried to write data into flash. In the first few called it worked successfully. but somehow it crash when i tried more time of write.
VATAPI_RESULT STM32SPIWriteSector(void *writebuf, uint8_t* SectorAddr, uint32_t buff_size){
if(STM32SPIEraseSector(SectorAddr) == VAT_SUCCESS){
DBGSTR("ERASE SECTOR - 0x%2x %2x %2x", SectorAddr[0], SectorAddr[1], SectorAddr[2]);
}else return VAT_UNKNOWN;
if(STM32SPIProgram_multiPage(writebuf, SectorAddr, buff_size) == VAT_SUCCESS){
DBGSTR("WRTIE SECTOR SUCCESSFUL");
return VAT_SUCCESS;
}else return VAT_UNKNOWN;
return VAT_UNKNOWN;
}
.
VATAPI_RESULT STM32SPIProgram_multiPage(uint8_t *writebuf, uint8_t *writeAddr, uint32_t buff_size){
VATAPI_RESULT nres;
uint8_t tmpaddr[3] = {writeAddr[0], writeAddr[1], writeAddr[2]};
uint8_t* sectorBuf = malloc(4096 * sizeof(uint8_t));
uint8_t* pagebuf = malloc(255* sizeof(uint8_t));
memset(§orBuf[0],0,4096);
memset(&pagebuf[0],0,255);
uint32_t i = 0, tmp_convert1, times = 0;
if(buff_size < Page_bufferSize)
times = 1;
else{
times = buff_size / (Page_bufferSize-1);
if((times%(Page_bufferSize-1))!=0)
times++;
}
/* Note : According to winbond flash feature, the last bytes of every 256 bytes should be 0, so we need to plus one byte on every 256 bytes*/
i = 0;
while(i < times){
memset(&pagebuf[0], 0, Page_bufferSize - 1);
memcpy(&pagebuf[0], &writebuf[i*255], Page_bufferSize - 1);
memcpy(§orBuf[i*Page_bufferSize], &pagebuf[0], Page_bufferSize - 1);
sectorBuf[((i+1)*Page_bufferSize)-1] = 0;
i++;
}
i = 0;
while(i < times){
if((nres=STM32SPIPageProgram(§orBuf[Page_bufferSize*i], &tmpaddr[0], Page_bufferSize)) != VAT_SUCCESS){
DBGSTR("STM32SPIProgram_allData write data fail on %d times!",i);
free(sectorBuf);
free(pagebuf);
return nres;
}
tmp_convert1 = (tmpaddr[0]<<16 | tmpaddr[1]<<8 | tmpaddr[2]) + Page_bufferSize;
tmpaddr[0] = (tmp_convert1&0xFF0000) >> 16;
tmpaddr[1] = (tmp_convert1&0xFF00) >>8;
tmpaddr[2] = 0x00;
i++;
}
free(sectorBuf);
free(pagebuf);
return nres;
}
I open the debugger and it seems like it crash when i malloced "sectorbuf" in function "STM32SPIProgram_multiPage", what Im confused is that i did free the memory after "malloc". anyone has idea about it?
arm-none-eabi-size "RTOS.elf"
text data bss dec hex filename
77564 988 100756 179308 2bc6c RTOS.elf
Reading the man
Memory Management
[...]
If RTOS objects are created dynamically then the standard C library malloc() and free() functions can sometimes be used for the purpose, but ...
they are not always available on embedded systems,
they take up valuable code space,
they are not thread safe, and
they are not deterministic (the amount of time taken to execute the function will differ from call to call)
... so more often than not an alternative memory allocation implementation is required.
One embedded / real time system can have very different RAM and timing requirements to another - so a single RAM allocation algorithm will only ever be appropriate for a subset of applications.
To get around this problem, FreeRTOS keeps the memory allocation API in its portable layer. The portable layer is outside of the source files that implement the core RTOS functionality, allowing an application specific implementation appropriate for the real time system being developed to be provided. When the RTOS kernel requires RAM, instead of calling malloc(), it instead calls pvPortMalloc(). When RAM is being freed, instead of calling free(), the RTOS kernel calls vPortFree().
[...]
(Emphasis mine.)
So the meaning is that if you use directly malloc, FreeRTOS is not able to handle the heap consumed by the system function. Same if you choose heap_3 management that is a simple malloc wrapper.
Take also note that the memory management you choose has no free capability.
heap_1.c
This is the simplest implementation of all. It does not permit memory to be freed once it has been allocated. Despite this, heap_1.c is appropriate for a large number of embedded applications. This is because many small and deeply embedded applications create all the tasks, queues, semaphores, etc. required when the system boots, and then use all of these objects for the lifetime of program (until the application is switched off again, or is rebooted). Nothing ever gets deleted.
The implementation simply subdivides a single array into smaller blocks as RAM is requested. The total size of the array (the total size of the heap) is set by configTOTAL_HEAP_SIZE - which is defined in FreeRTOSConfig.h. The configAPPLICATION_ALLOCATED_HEAP FreeRTOSConfig.h configuration constant is provided to allow the heap to be placed at a specific address in memory.
The xPortGetFreeHeapSize() API function returns the total amount of heap space that remains unallocated, allowing the configTOTAL_HEAP_SIZE setting to be optimised.
The heap_1 implementation:
Can be used if your application never deletes a task, queue, semaphore, mutex, etc. (which actually covers the majority of applications in which FreeRTOS gets used).
Is always deterministic (always takes the same amount of time to execute) and cannot result in memory fragmentation.
Is very simple and allocated memory from a statically allocated array, meaning it is often suitable for use in applications that do not permit true dynamic memory allocation.
(Emphasis mine.)
Side note: You have always to check malloc return value != NULL.
I am trying to implement the known method of "Dynamic Forking of Win32 EXE", which is knows as RunPE.
My problem is that i am can't get the right result of the "base address" as it mentioned in the 3rd point at http://www.security.org.sg/code/loadexe.html
This is my code:
DWORD* peb;
DWORD* baseAddress;
...snip...
GetThreadContext(hTarget, &contx)
peb = (DWORD *) contx.Ebx;
baseAddress = (DWORD *) contx.Ebx+8;
_tprintf(_T("The EBX [PEB] is: 0x%08X\nThe base address is: 0x%08X\nThe Entry Point is: 0x%08X\n"), peb, baseAddress, contx.Eax);
and the output is as follwos:
The EBX [PEB] is: 0x7FFD4000
The base address is: 0x7FFD4020
The Entry Point is: 0x00401000
I think that my problem is with the implementation of my baseAddress pointer, but i can't figure out exactly what is the issue. Or could be that i havn't understand the above article correctly and baseAddress isn't ImageBase, if so what is baseAddress ?
I have tried to run it under Win 7 64b and Win-XP and on both i am get the same incorrect results.
Note that the instructions say "at [EBX+8]". The brackets mean the value at that address location. There are several problems with
baseAddress = (DWORD *) contx.Ebx+8;
First, the compiler doesn't pay attention spacing, only to parenthesizing, so this means
baseAddress = ((DWORD *)contx.Ebx) + 8;
which is wrong because the 8 is counting DWORDs, rather than bytes. You want
baseAddress = (DWORD *)(contx.Ebx + 8);
but this just gets you the address where the baseAddress is stored, not the value of the baseAddress. For that you need
baseAddress = *(DWORD *)(contx.Ebx + 8);
However, this only works if contx.Ebx refers to an address in your process, but every process has its own address space, and you need to access the address space of the suspended process; for that you need to use ReadProcessMemory ( http://msdn.microsoft.com/en-us/library/windows/desktop/ms680553%28v=vs.85%29.aspx ):
ok = ReadProcessMemory(hTarget, (LPCVOID)(contx.Ebx + 8), (LPVOID)&baseAddress, sizeof baseAddress, NULL);
You're just doing pointer arithmetic, you're not actually dereferencing the memory, in this line:
baseAddress = (DWORD *) contx.Ebx+8;
You're just adding 8*sizeof(DWORD) = 32 to the value of contx.Ebx. What you really want to do is read the data at the address of contx.Ebx+8 in the new process's address space. In order to do that, you need to use ReadProcessMemory, and don't bother to cast -- you want to use raw offsets, not offsets multiplied by sizeof(DWORD), which is what happens when you do pointer arithmetic with DWORD* values.
However, I'd strongly caution you against digging into implementation details like this, which can and do change between different versions of Windows. Keep in mind that the article you linked to was written in 2004, and it was just a proof-of-concept, so there are likely to be lots of hidden gotchas and unexpected problems in Vista, Windows 7, Windows 8, and future versions.
The Windows API does not have a function that behaves like Unix's fork(2) function, and as a result, you should try as much as possible to avoid needing to fork -- use CreateProcess instead of fork+exec etc. The Cygwin implementation of fork is ugly and slow, and it can fail unexpectedly due to DLL memory mapping issues.
In addition to what others have said, there is a much simplier way to retreive the base address of the process's PEB structure. Use NtQueryInformationProcess() instead, setting its ProcessInformationClass parameter set to ProcessBasicInformation. The output will be a PROCESS_BASIC_INFORMATION structure:
typedef struct _PROCESS_BASIC_INFORMATION {
PVOID Reserved1;
PPEB PebBaseAddress;
PVOID Reserved2[2];
ULONG_PTR UniqueProcessId;
PVOID Reserved3;
} PROCESS_BASIC_INFORMATION;
The second member is what you are looking for.