Accessing data from a file path with spaces on Windows (C)

I'm facing a weird problem on Windows.
I'm using a library called STDCL, which runs well on Linux, but on Windows there is an error if the path of the output .exe file contains spaces.
Example:
c:\my file\my file.exe //won't work
c:\my_file\my file.exe //will work
// in both cases the .exe accesses data from a DLL (located anywhere) that contains the STDCL library
I have the library's source code, so I can recompile it if needed.
Or is there an easier way to force my .dll to accept the path of the .exe?
edit: sample code
/* hello_stdcl.c */
#include <stdio.h>
#include <stdlib.h> /* system() */
#include <stdcl.h>
int main()
{
stdcl_init(); // this is only necessary for Windows
cl_uint n = 64;
#if(1)
/* use default contexts, if no GPU use CPU */
CLCONTEXT* cp = (stdgpu)? stdgpu : stdcpu;
unsigned int devnum = 0;
void* clh = clopen(cp,"matvecmult.cl",CLLD_NOW);
cl_kernel krn = clsym(cp,clh,"matvecmult_kern",0);
/* allocate OpenCL device-sharable memory */
cl_float* aa = (float*)clmalloc(cp,n*n*sizeof(cl_float),0);
cl_float* b = (float*)clmalloc(cp,n*sizeof(cl_float),0);
cl_float* c = (float*)clmalloc(cp,n*sizeof(cl_float),0);
clndrange_t ndr = clndrange_init1d( 0, n, 64);
/* initialize vectors a[] and b[], zero c[] */
int i,j;
for(i=0;i<n;i++) for(j=0;j<n;j++) aa[i*n+j] = 1.1f*i*j;
for(i=0;i<n;i++) b[i] = 2.2f*i;
for(i=0;i<n;i++) c[i] = 0.0f;
/* define the computational domain and workgroup size */
//clndrange_t ndr = clndrange_init1d( 0, n, 64);
/* non-blocking sync vectors a and b to device memory (copy to GPU)*/
clmsync(cp,devnum,aa,CL_MEM_DEVICE|CL_EVENT_NOWAIT);
clmsync(cp,devnum,b,CL_MEM_DEVICE|CL_EVENT_NOWAIT);
/* set the kernel arguments */
clarg_set(cp,krn,0,n);
clarg_set_global(cp,krn,1,aa);
clarg_set_global(cp,krn,2,b);
clarg_set_global(cp,krn,3,c);
/* non-blocking fork of the OpenCL kernel to execute on the GPU */
clfork(cp,devnum,krn,&ndr,CL_EVENT_NOWAIT);
/* non-blocking sync vector c to host memory (copy back to host) */
clmsync(cp,0,c,CL_MEM_HOST|CL_EVENT_NOWAIT);
/* force execution of operations in command queue (non-blocking call) */
clflush(cp,devnum,0);
/* block on completion of operations in command queue */
clwait(cp,devnum,CL_ALL_EVENT);
for(i=0;i<n;i++) printf("%d %f %f\n",i,b[i],c[i]);
clfree(aa);
clfree(b);
clfree(c);
clclose(cp,clh);
#endif
system("pause");
}
edit 2:
When I compile the code above and put the resulting .exe file in a path without spaces (a short path), it works.
If I put it in a path with spaces, it simply crashes. When I debugged it, it looked like a memory issue (so it crashes with a long path).
When I contacted the library's creator, he told me:
"windows getcwd() call returns an unusable path with spaces"
As I said before, this library works fine on Linux. What might be the solution for this on Windows?
System: Windows 7, 64-bit

Use quotes around the binary name/path, e.g. "my file.exe"
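As a sketch of what quoting means in practice when the path is handed to something that splits on spaces (a shell, system(), or a command-line parser), the helper below wraps a path in double quotes. The name quote_path is hypothetical and not part of STDCL, and the sketch does not handle paths that already contain quote characters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of STDCL): writes path wrapped in double
 * quotes into out. Returns 0 on success, -1 if out is too small. */
int quote_path(const char *path, char *out, size_t out_size)
{
    assert(path != NULL && out != NULL);
    size_t len = strlen(path);
    if (len + 3 > out_size)        /* two quotes plus the terminating NUL */
        return -1;
    out[0] = '"';
    memcpy(out + 1, path, len);
    out[len + 1] = '"';
    out[len + 2] = '\0';
    return 0;
}
```

The quoted result can then be appended to a command line; the size check keeps a long path from overrunning the buffer.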

Program with while loop causes stack overflow, but only in x86 and only when injected into another process

I have an unfortunately convoluted problem that I am hopeful someone might be able to help me with.
I have written a reasonably large program that I have converted into position independent code (see here for reference: https://bruteratel.com/research/feature-update/2021/01/30/OBJEXEC/). This means that the resulting exe (compiled using mingw) contains data only in the .text section, and thus can be injected into and run from an arbitrary place in memory. I have successfully ported the program to this format and can compile it for both x86 and x64.
I created two "helper" exe's to run the PIC program, a local injector and a remote injector. The local injector runs the program by calling VirtualAlloc, memcpy, and CreateThread. The remote injector runs the program by calling CreateProcess (suspended), VirtualAllocEx, WriteProcessMemory, QueueAPCThread, and ResumeThread (the last two api's being called on pi.hThread which was returned from CreateProcess).
I am experiencing inconsistent results in the program depending on the architecture and method of execution.
x64 local: works
x64 inject: works
x86 local: works
x86 inject: fails; stack overflow
I have determined that my program is crashing in a while loop in a particular function. This function is used to format data contained in buffers (heap allocated) that are passed in as function args. The raw data buffer (IOBuf) contains a ~325k long string containing Base64 characters with spaces randomly placed throughout. The while loop in question iterates over this buffer and copies non-space characters to a second buffer (IntermedBuf), with the end goal being that IntermedBuf contains the full Base64 string in IOBuf minus the random spaces.
A few notes about the following code snippet:
Because the code is written to be position independent, all api's must be manually resolved which is why you see things like (SPRINTF)(Apis.sprintfFunc). I have resolved the addresses of each API in their respective DLL and have created typedef's for each API that is called. While odd, this is not in itself causing the issue as the code works fine in 3/4 of the situations.
Because this program is failing when injected, I cannot use print statements to debug, so I have added calls to MessageBoxA to pop up at certain places to determine contents of variables and/or if execution is reaching that part of the code.
The relevant code snippet is as follows:
char inter[] = {'I','n','t',' ',0};
char tools[100] = {0};
if (((STRCMP)Apis.strcmpFunc)(IntermedBuf, StringVars->b64Null) != 0)
{
int i = 0, j = 0, strLen = 0, lenIOBuf = ((STRLEN)Apis.strlenFunc)(IOBuf);
((SPRINTF)Apis.sprintfFunc)(tools, StringVars->poi, IOBuf);
((MESSAGEBOXA)Apis.MessageBoxAFunc)(NULL, tools, NULL, NULL);
((MEMSET)Apis.memsetFunc)(tools, 0, 100 * sizeof(char));
((SPRINTF)Apis.sprintfFunc)(tools, StringVars->poi, IntermedBuf);
((MESSAGEBOXA)Apis.MessageBoxAFunc)(NULL, tools, NULL, NULL);
char* locSpace;
while (j < lenIOBuf)
{
locSpace = ((STRSTR)Apis.strstrFunc)(IOBuf + j, StringVars->space);
if (locSpace == 0)
locSpace = IOBuf + lenIOBuf;
strLen = locSpace - IOBuf - j;
((MEMCPY)Apis.memcpyFunc)(IntermedBuf + i, IOBuf + j, strLen);
i += strLen, j += strLen + 1;
}
((MESSAGEBOXA)Apis.MessageBoxAFunc)(NULL, StringVars->here, NULL, NULL);
((MEMSET)Apis.memsetFunc)(IOBuf, 0, BUFFSIZE * sizeof(char));
The first two MessageBoxA calls successfully execute, each containing the address of IOBuf and IntermedBuf respectively. The last call to MessageBoxA, after the while loop, never comes, meaning the program is crashing in the while loop as it copies data from IOBuf to IntermedBuf.
I ran remote.exe which spawned a new WerFault.exe (I have tried with calc, notepad, several other processes with the same result) containing the PIC program, and stuck it into Windbg to try and get a better sense of what was happening. I found that after receiving the first two message boxes and clicking through them, WerFault crashes with a stack overflow caused by a call to strstr:
Examining the contents of the stack at crash time shows this:
Looking at the contents of IntermedBuf (which is one of the arguments passed to the strstr call) I can see that the program IS copying data from IOBuf to IntermedBuf and removing spaces as intended, however the program crashes after copying ~80k.
[Screenshots: IOBuf (raw data); IntermedBuf (after removing spaces)]
My preliminary understanding of what is happening here is that strstr (and potentially memcpy) are pushing data to the stack with each call, and given the length of the loop (lenIOBuf is ~325K, with spaces occurring randomly every 2-11 characters throughout) the stack is overflowing before the while loop finishes and the stack unwinds. However, this doesn't explain why this succeeds in x64 in both cases, and in x86 when the PIC program is running in a user-made program as opposed to injected into a legitimate process.
I have run the x86 PIC program in the local injector, where it succeeds, and also attached Windbg to it in order to examine what is happening differently there. The stack similarly contains the same sort of pattern of characters as seen in the above screenshot, however later in the loop (because again the program succeeds), the stack appears to... jump? I examined the contents of the stack early into the while loop (having set bp on strstr) and see that it contains much the same pattern seen in the stack in the remote injector session:
I also added another MessageBox this time inside the while loop, set to pop when j > lenIOBuf - 500 in order to catch the program as it neared completion of the while loop.
char* locSpace;
while (j < lenIOBuf)
{
if (j > lenIOBuf - 500)
{
((MEMSET)Apis.memsetFunc)(tools, 0, 100 * sizeof(char));
((SPRINTF)Apis.sprintfFunc)(tools, StringVars->poi, IntermedBuf);
((MESSAGEBOXA)Apis.MessageBoxAFunc)(NULL, tools, NULL, NULL);
}
locSpace = ((STRSTR)Apis.strstrFunc)(IOBuf + j, StringVars->space);
if (locSpace == 0)
locSpace = IOBuf + lenIOBuf;
strLen = locSpace - IOBuf - j;
((MEMCPY)Apis.memcpyFunc)(IntermedBuf + i, IOBuf + j, strLen);
i += strLen, j += strLen + 1;
}
When this MessageBox popped, I paused execution and found that ESP was now 649fd80; previously it was around 13beb24?
So it appears that the stack relocated, or the local injector added more memory to the stack or something (I am embarrassingly naive about this stuff). Looking at the "original" stack location at this stage in execution shows that the data there previously is still there at this point when the loop is near completion:
So bottom line, this code which runs successfully by all accounts in x64 local/remote and x86 local is crashing when run in another process in x86. It appears that in the local injector case the stack fills in a similar fashion as in the remote injector where it crashes, however the local injector is relocating the stack or adding more stack space or something which isn't happening in the remote injector. Does anyone have any ideas why, or more importantly, how I could alter the code to achieve the goal of removing spaces from a large, arbitrary buffer in a different way where I might not encounter the overflow that I am currently?
Thanks for any help
typedef void*(WINAPI* MEMCPY)(void * destination, const void * source, size_t num);
typedef char*(WINAPI* STRSTR)(const char *haystack, const char *needle);
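These declarations are wrong. Both of these APIs use the __cdecl calling convention, which means the caller must clean up the stack (add esp, 4*param_count) after the call. Because you declared them as __stdcall (== WINAPI), the compiler does not generate the add esp, 4*param_count instruction, so the parameters pushed for every call are never popped and the stack grows on each loop iteration. On x64 both keywords are ignored because there is only one calling convention, which is why the x64 builds are not affected.
You need to use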
typedef void * (__cdecl * MEMCPY)(void * _Dst, const void * _Src, _In_ size_t _MaxCount);
typedef char* (__cdecl* STRSTR)(_In_z_ char* const _String, _In_z_ char const* const _SubString);
and so on..
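To illustrate the corrected declarations, here is a minimal sketch that calls memcpy and strstr through correctly typed pointers. In the question the pointers come from manually resolved exports (Apis.memcpyFunc and so on); here the standard library functions are passed in directly, and CRT_CALL and copy_until_space are names made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* __cdecl only exists on Windows compilers; elsewhere this macro expands to
 * nothing. CRT_CALL is a hypothetical name used for this sketch. */
#if defined(_WIN32)
#define CRT_CALL __cdecl
#else
#define CRT_CALL
#endif

/* Pointer types matching the C runtime's own declarations. */
typedef void *(CRT_CALL *MEMCPY)(void *dst, const void *src, size_t n);
typedef char *(CRT_CALL *STRSTR)(const char *haystack, const char *needle);

/* Copies the text before the first space in src into dst via the pointers.
 * Returns the number of characters copied (dst is NUL-terminated). */
size_t copy_until_space(MEMCPY copy, STRSTR find, char *dst, const char *src)
{
    assert(copy && find && dst && src);
    const char *space = find(src, " ");
    size_t n = space ? (size_t)(space - src) : strlen(src);
    copy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

Because the pointer types carry the same calling convention as the functions they point to, the compiler emits the caller-side stack cleanup after each call on x86.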
I'm familiar with what you are doing, and frankly I moved on to compiling the required functions (memcpy, etc.) into the program instead of manually looking them up and making external calls.
For example:
inline void* _memcpy(void* dest, const void* src, size_t count)
{
char *char_dest = (char *)dest;
char *char_src = (char *)src;
if ((char_dest <= char_src) || (char_dest >= (char_src+count)))
{
/* non-overlapping buffers */
while(count > 0)
{
*char_dest = *char_src;
char_dest++;
char_src++;
count--;
}
}
else
{
/* overlapping buffers */
char_dest = (char *)dest + count - 1;
char_src = (char *)src + count - 1;
while(count > 0)
{
*char_dest = *char_src;
char_dest--;
char_src--;
count--;
}
}
return dest;
}
inline char * _strstr(const char *s, const char *find)
{
char c, sc;
size_t len;
if ((c = *find++) != 0)
{
len = strlen(find);
do {
do {
if ((sc = *s++) == 0)
return 0;
} while (sc != c);
} while (strncmp(s, find, len) != 0);
s--;
}
return (char *)((size_t)s);
}
Credit for the above code goes to ReactOS. You can look up the other functions you need (strlen, etc.) there.
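In the same spirit, the space-removal loop from the question can be written as a single pass that needs no strstr or memcpy at all. This is a minimal sketch, not the asker's code: strip_spaces is a made-up name, and the assert is a debug-only check that can be removed in a position-independent build.

```c
#include <assert.h>
#include <stddef.h>

/* Copies src to dst, skipping every space character, and NUL-terminates dst.
 * dst must have room for at least as many bytes as src (including the NUL).
 * Returns the number of characters written, not counting the NUL.
 * strip_spaces is a hypothetical name, not taken from the question. */
size_t strip_spaces(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);   /* debug-only sanity check */
    size_t written = 0;
    for (; *src != '\0'; ++src) {
        if (*src != ' ')
            dst[written++] = *src;
    }
    dst[written] = '\0';
    return written;
}
```

In terms of the question's variables, strip_spaces(IntermedBuf, IOBuf) would replace the whole while loop, with a fixed stack footprint regardless of the buffer length.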

MSP430 SD card application running on top of FATFS appears too restrictive. Is my understanding correct?

I am working my way through an SD card application code example provided by TI for their MSP430 LaunchPad microcontroller development kit. It appears that the example restricts the number of directories and number of files to 10 each (a total of 100 files) which seems overly restrictive for a 32GB SD card. The current code compiles to use less than half of the program space and less than half of available RAM. I am wondering if I misunderstand the code, or if the code is limited by some other reason, such as the available stack size in memory. Below is the code and my comments.
There are several layers: SDCardLogMode, sdcard (SDCardLib), and ff (HAL layer). I've reduced the code below to illustrate the constructions, so it is not runnable. I am more interested in whether I understand it correctly and whether my solution to increase the number of allowed files and directories is flawed.
In SDCardLogMode.c there are two places of interest. The first is the declaration of char dirs[10][MAX_DIR_LEN] and files[10][MAX_FILE_LEN]. MAX_DIR_LEN and MAX_FILE_LEN are 8 and 12 respectively, the maximum allowed length of a name.
/*******************************************************************************
*
* SDCardLogMode.c
* ******************************************************************************/
#include "stdlib.h"
#include "string.h"
#include "SDCardLogMode.h"
#include "driverlib.h"
#include "sdcard.h"
#include "HAL_SDCard.h"
#pragma PERSISTENT(numLogFiles)
uint8_t numLogFiles = 0;
SDCardLib sdCardLib;
char dirs[10][MAX_DIR_LEN];
char files[10][MAX_FILE_LEN]; //10 file names, each up to MAX_FILE_LEN characters
uint8_t dirNum = 0;
uint8_t fileNum = 0;
#define MAX_BUF_SIZE 32
char buffer[MAX_BUF_SIZE];
// FatFs Static Variables
static FIL fil; /* File object */
static char filename[31];
static FRESULT rc;
//....
Later in the same SDCardLogMode.c file is the following function (also reduced for readability). The interesting thing here is that the code calls SDCardLib_getDirectory(&sdCardLib, "data_log", dirs, &dirNum, files, &fileNum), which consumes the "data_log" path, fills dirs and files, and updates dirNum and fileNum. I do not believe &sdCardLib (which holds a handle to the FATFS and an interface pointer) is used in this function, at least not that I can tell.
What is puzzling is the point of calling SDCardLib_getDirectory() and then not using anything it produces. I did not find any downstream use of the dirs and files char arrays, nor any use of dirNum and fileNum.
In the code snippets I show the code for SDCardLib_getDirectory(). I could not find where the SDCardLib parameter is used. I can see where the file and directory counts could be used to generate new names, but there are already static variables to hold the file count. Can anyone see a reason why SDCardLib_getDirectory() was called?
/*
* Store TimeStamp from PC when logging starts to SDCard
*/
void storeTimeStampSDCard()
{
int i = 0;
uint16_t bw = 0;
unsigned long long epoch;
// FRESULT rc;
// Increment log file number
numLogFiles++;
//Detect SD card
SDCardLib_Status st = SDCardLib_detectCard(&sdCardLib);
if (st == SDCARDLIB_STATUS_NOT_PRESENT) {
SDCardLib_unInit(&sdCardLib);
mode = '0';
noSDCard = 1; //jn added
return;
}
// Read directory and file
rc = SDCardLib_getDirectory(&sdCardLib, "data_log", dirs, &dirNum, files, &fileNum);
//Create the directory under the root directory
rc = SDCardLib_createDirectory(&sdCardLib, "data_log");
if (rc != FR_OK && rc != FR_EXIST) {
SDCardLib_unInit(&sdCardLib);
mode = '0';
return;
}
//........
}
Now jumping to sdcard.c (SDCardLib layer), SDCardLib_getDirectory() is interesting. It takes each two-dimensional array as a pointer to a row (e.g. char (*fileList)[MAX_FILE_LEN]) and advances that pointer each time it writes a name. This code seems fragile, since SDCardLib_createDirectory() simply returns f_mkdir(directoryName) and does not check how many files already exist. Perhaps TI assumes this checking should be done at the application layer above SDCardLogMode.
void SDCardLib_unInit(SDCardLib * lib)
{
/* Unregister work area prior to discard it */
f_mount(0, NULL);
}
FRESULT SDCardLib_getDirectory(SDCardLib * lib,
char * directoryName,
char (*dirList)[MAX_DIR_LEN], uint8_t *dirNum,
char (*fileList)[MAX_FILE_LEN], uint8_t *fileNum)
{
FRESULT rc; /* Result code */
DIRS dir; /* Directory object */
FILINFO fno; /* File information object */
uint8_t dirCnt = 0; /* track current directory count */
uint8_t fileCnt = 0; /* track current file count */
rc = f_opendir(&dir, directoryName);
for (;;)
{
rc = f_readdir(&dir, &fno); // Read a directory item
if (rc || !fno.fname[0]) break; // Error or end of dir
if (fno.fattrib & AM_DIR) //this is a directory
{
strcat(*dirList, fno.fname); //add this to our list of names
dirCnt++;
dirList++;
}
else //this is a file
{
strcat(*fileList, fno.fname); //add this to our list of names
fileCnt++;
fileList++;
}
}
*dirNum = dirCnt;
*fileNum = fileCnt;
return rc;
}
Below is SDCardLib_createDirectory(SDCardLib *lib, char *directoryName). It just creates a directory; it does not check the existing number of files.
FRESULT SDCardLib_createDirectory(SDCardLib * lib, char * directoryName)
{
return f_mkdir(directoryName);
}
So coming back to my questions:
Did I understand this code correctly? Does it really limit the number of directories and files to 10 each?
If so, why would the number of files and directories be so limited? The particular MSP430 that this example code came with has 256KB of program space and 8KB of RAM. The compiled code consumes less than half of the available resources (68KB of program space and about 2.5KB of RAM). Is it because any larger would overflow the stack segment?
I want to increase the number of files that can be stored. If I look at the underlying FATFS code, it does not seem to impose a limit on the number of files or directories (at least not until the SD card is full). If I never intend to display or search the contents of a directory on the MSP430, my thought is to remove SDCardLib_getDirectory() and the two char arrays (files and dirs). Would there be a reason why this would be a bad idea?
There are other microcontrollers with less memory.
The SDCardLib_getDirectory() function treats its dirList and fileList parameters as simple strings, i.e., it calls strcat() on the same pointers. This means that it can read as many names as fit into 10*8 or 10*12 bytes.
And calling strcat() without adding a delimiter means that it is impossible to get the individual names out of the string.
This code demonstrates that it is possible to use the FatFs library, and in which order its functions need to be called, but it is not necessarily a good example of how to do that. I recommend that you write your own code.
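If you do write your own, it is worth bounding both the number of entries and the length of each name before copying. A minimal sketch, assuming short 8.3 names; add_name, NAME_LEN and MAX_NAMES are made-up names, not part of SDCardLib or FatFs:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NAME_LEN 13   /* 8.3 name: 12 characters plus the terminating NUL */
#define MAX_NAMES 10  /* capacity of the list; an assumption for the sketch */

/* Hypothetical helper: stores name in the next free row of list.
 * Returns 1 if stored, 0 if the list is full or the name does not fit. */
int add_name(char (*list)[NAME_LEN], uint8_t *count, const char *name)
{
    assert(list != NULL && count != NULL && name != NULL);
    if (*count >= MAX_NAMES || strlen(name) >= NAME_LEN)
        return 0;
    strcpy(list[*count], name);   /* safe: length was checked above */
    (*count)++;
    return 1;
}
```

Inside a directory-reading loop, a call such as add_name(files, &fileCnt, fno.fname) would then stop adding entries once the array is full instead of writing past its end.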

CUDA How to access constant memory in device kernel when the constant memory is declared in the host code?

For the record this is homework so help as little or as much with that in mind. We are using constant memory to store a "mask matrix" that will be used to perform a convolution on a larger matrix. When I am in the host code I am copying the mask to constant memory using the cudaMemcpyToSymbol().
My question is: once this is copied over and I launch my device kernel code, how does the device know where to access the constant memory mask matrix? Is there a pointer that I need to pass in on kernel launch? Most of the code that the professor gave us is not supposed to be changed (there is no pointer to the mask passed in), but there is always the possibility that he made a mistake (although it is most likely my understanding of something).
Is the constant memory declaration supposed to be included in the separate kernel.cu file?
I am minimizing the code to just show the things having to do with the constant memory. As such, please don't point out if something is not initialized, etc. There is code for that, but that is not of concern at this time.
main.cu:
#include <stdio.h>
#include "kernel.cu"
__constant__ float M_d[FILTER_SIZE * FILTER_SIZE];
int main(int argc, char* argv[])
{
Matrix M_h, N_h, P_h; // M: filter, N: input image, P: output image
/* Allocate host memory */
M_h = allocateMatrix(FILTER_SIZE, FILTER_SIZE);
N_h = allocateMatrix(imageHeight, imageWidth);
P_h = allocateMatrix(imageHeight, imageWidth);
/* Initialize filter and images */
initMatrix(M_h);
initMatrix(N_h);
cudaError_t cudda_ret = cudaMemcpyToSymbol(M_d, M_h.elements, M_h.height * M_h.width * sizeof(float), 0, cudaMemcpyHostToDevice);
//char* cudda_ret_pointer = cudaGetErrorString(cudda_ret);
if( cudda_ret != cudaSuccess){
printf("\n\ncudaMemcpyToSymbol failed\n\n");
printf("%s, \n\n", cudaGetErrorString(cudda_ret));
}
// Launch kernel ----------------------------------------------------------
printf("Launching kernel..."); fflush(stdout);
//INSERT CODE HERE
//block size is 16x16
// \\\\\\\\\\\\\**DONE**
dim_grid = dim3(ceil(N_h.width / (float) BLOCK_SIZE), ceil(N_h.height / (float) BLOCK_SIZE));
dim_block = dim3(BLOCK_SIZE, BLOCK_SIZE);
//KERNEL Launch
convolution<<<dim_grid, dim_block>>>(N_d, P_d);
return 0;
}
kernel.cu: THIS IS WHERE I DO NOT KNOW HOW TO ACCESS THE CONSTANT MEMORY.
//__constant__ float M_c[FILTER_SIZE][FILTER_SIZE];
__global__ void convolution(Matrix N, Matrix P)
{
/********************************************************************
Determine input and output indexes of each thread
Load a tile of the input image to shared memory
Apply the filter on the input image tile
Write the compute values to the output image at the correct indexes
********************************************************************/
//INSERT KERNEL CODE HERE
//__shared__ float N_shared[BLOCK_SIZE][BLOCK_SIZE];
//int row = (blockIdx.y * blockDim.y) + threadIdx.y;
//int col = (blockIdx.x * blockDim.x) + threadIdx.x;
}
In "classic" CUDA compilation you must define all code and symbols (textures, constant memory, device functions) and any host API calls which access them (including kernel launches, binding to textures, copying to symbols) within the same translation unit. This means, effectively, in the same file (or via multiple include statements within the same file). This is because "classic" CUDA compilation doesn't include a device code linker.
Since CUDA 5 was released, there is the possibility of using separate compilation mode and linking different device code objects into a single fatbinary payload on architectures which support it. In that case, you need to declare any __constant__ variables using the extern keyword and define the symbol exactly once.
If you can't use separate compilation, then the usual workaround is to define the __constant__ symbol in the same .cu file as your kernel, and include a small host wrapper function which just calls cudaMemcpyToSymbol to set the __constant__ symbol in question. You would probably do the same with kernel calls and texture operations.
Below is a "minimum-sized" example showing the use of __constant__ symbols. You do not need to pass any pointer to the __global__ function.
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
__constant__ float test_const;
__global__ void test_kernel(float* d_test_array) {
d_test_array[threadIdx.x] = test_const;
}
#include <conio.h>
int main(int argc, char **argv) {
float test = 3.f;
int N = 16;
float* test_array = (float*)malloc(N*sizeof(float));
float* d_test_array;
cudaMalloc((void**)&d_test_array,N*sizeof(float));
cudaMemcpyToSymbol(test_const, &test, sizeof(float));
test_kernel<<<1,N>>>(d_test_array);
cudaMemcpy(test_array,d_test_array,N*sizeof(float),cudaMemcpyDeviceToHost);
for (int i=0; i<N; i++) printf("%i %f\n",i,test_array[i]);
getch();
return 0;
}

CUDA kernel launch parameters explained right?

Here I tried to explain the CUDA launch parameters model (or execution configuration model) to myself using some pseudocode, but I don't know if there are any big mistakes, so I hope someone can review it and give me some advice. Thanks in advance.
Here it is:
/*
normally, we write kernel function like this.
note, __global__ means this function will be called from host codes,
and executed on device. and a __global__ function could only return void.
if there's any parameter passed into __global__ function, it should be stored
in shared memory on device. so, kernel function is so different from the *normal*
C/C++ functions. if I was the CUDA author, I should make the kernel function more
different from a normal C function.
*/
__global__ void
kernel(float *arr_on_device, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
arr_on_device[idx] = arr_on_device[idx] * arr_on_device[idx];
}
}
/*
after this definition, we could call this kernel function in our normal C/C++ codes !!
do you feel something weird ? inconsistent ?
normally, when I write C codes, I will think a lot about the execution process down to
the metal in my mind, and this one...it's like some fragile codes. break the sequential
thinking process in my mind.
in order to make things normal, I found a way to explain: I expand the *__global__ * function
to some pseudo codes:
*/
#define __foreach(var, start, end) for (var = start; var < end; ++var)
__device__ int
__indexing() {
const int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;
return
blockId * (blockDim.x * blockDim.y * blockDim.z) +
threadIdx.z * (blockDim.x * blockDim.y) +
threadIdx.y * blockDim.x +
threadIdx.x;
}
global_config =:
{
/*
global configuration.
note the default values are all 1, so in the kernel codes,
we could just ignore those dimensions.
*/
gridDim.x = gridDim.y = gridDim.z = 1;
blockDim.x = blockDim.y = blockDim.z = 1;
};
kernel =:
{
/*
I thought CUDA did some bad evil-detail-covering things here.
it's said that CUDA C is an extension of C, but in my mind,
CUDA C is more like C++, and the *<<<>>>* part is too tricky.
for example:
kernel<<<10, 32>>>(); means kernel will execute in 10 blocks each have 32 threads.
dim3 dimG(10, 1, 1);
dim3 dimB(32, 1, 1);
kernel<<<dimG, dimB>>>(); this is exactly the same thing with above.
it's not C style, and C++ style ? at first, I thought this could be done by
C++'s constructor stuff, but I checked structure *dim3*, there's no proper
constructor for this. this just broke the semantics of both C and C++. I thought
force user to use *kernel<<<dim3, dim3>>>* would be better. So I'd like to keep
this rule in my future codes.
*/
gridDim = dimG;
blockDim = dimB;
__foreach(blockIdx.z, 0, gridDim.z)
__foreach(blockIdx.y, 0, gridDim.y)
__foreach(blockIdx.x, 0, gridDim.x)
__foreach(threadIdx.z, 0, blockDim.z)
__foreach(threadIdx.y, 0, blockDim.y)
__foreach(threadIdx.x, 0, blockDim.x)
{
const int idx = __indexing();
if (idx < n) {
arr_on_device[idx] = arr_on_device[idx] * arr_on_device[idx];
}
}
};
/*
so, for me, gridDim & blockDim is like some boundaries.
e.g. gridDim.x is the upper bound of blockIdx.x, this is not that obvious for people like me.
*/
/* the declaration of dim3 from vector_types.h of CUDA/include */
struct __device_builtin__ dim3
{
unsigned int x, y, z;
#if defined(__cplusplus)
__host__ __device__ dim3(unsigned int vx = 1, unsigned int vy = 1, unsigned int vz = 1) : x(vx), y(vy), z(vz) {}
__host__ __device__ dim3(uint3 v) : x(v.x), y(v.y), z(v.z) {}
__host__ __device__ operator uint3(void) { uint3 t; t.x = x; t.y = y; t.z = z; return t; }
#endif /* __cplusplus */
};
typedef __device_builtin__ struct dim3 dim3;
CUDA DRIVER API
The CUDA Driver API v4.0 and above uses the following functions to control a kernel launch:
cuFuncSetCacheConfig
cuFuncSetSharedMemConfig
cuLaunchKernel
The following CUDA Driver API functions were used prior to the introduction of cuLaunchKernel in v4.0.
cuFuncSetBlockShape()
cuFuncSetSharedSize()
cuParamSet{Size,i,fv}()
cuLaunch
cuLaunchGrid
Additional information on these functions can be found in cuda.h.
CUresult CUDAAPI cuLaunchKernel(CUfunction f,
unsigned int gridDimX,
unsigned int gridDimY,
unsigned int gridDimZ,
unsigned int blockDimX,
unsigned int blockDimY,
unsigned int blockDimZ,
unsigned int sharedMemBytes,
CUstream hStream,
void **kernelParams,
void **extra);
cuLaunchKernel takes as parameters the entire launch configuration.
See the NVIDIA Driver API documentation (Execution Control) for more details.
CUDA KERNEL LAUNCH
cuLaunchKernel will
1. verify the launch parameters
2. change the shared memory configuration
3. change the local memory allocation
4. push a stream synchronization token into the command buffer to make sure two commands in the stream do not overlap
5. push the launch parameters into the command buffer
6. push the launch command into the command buffer
7. submit the command buffer to the device (on WDDM drivers this step may be deferred)
8. on WDDM the kernel driver will page all memory required in device memory
The GPU will
1. verify the command
2. send the commands to the compute work distributor
3. dispatch launch configuration and thread blocks to the SMs
When all thread blocks have completed the work distributor will flush the caches to honor the CUDA memory model and it will mark the kernel as completed so the next item in the stream can make forward progress.
The order that thread blocks are dispatched differs between architectures.
Compute capability 1.x devices store the kernel parameters in shared memory.
Compute capability 2.0-3.5 devices store the kernel parameters in constant memory.
CUDA RUNTIME API
The CUDA Runtime is a C++ software library and build tool chain on top of the CUDA Driver API. The CUDA Runtime uses the following functions to control a kernel launch:
cudaConfigureCall
cudaFuncSetCacheConfig
cudaFuncSetSharedMemConfig
cudaLaunch
cudaSetupArgument
See the NVIDIA Runtime API documentation (Execution Control).
The <<<>>> CUDA language extension is the most common method used to launch a kernel.
During compilation nvcc will create a new CPU stub function for each kernel function called using <<<>>> and it will replace the <<<>>> with a call to the stub function.
For example
__global__ void kernel(float* buf, int j)
{
// ...
}
kernel<<<blocks,threads,0,myStream>>>(d_buf,j);
generates
void __device_stub__Z6kernelPfi(float *__par0, int __par1)
{
__cudaSetupArgSimple(__par0, 0U);
__cudaSetupArgSimple(__par1, 4U);
__cudaLaunch(((char *)((void ( *)(float *, int))kernel)));
}
You can inspect the generated files by adding --keep to your nvcc command line.
cudaLaunch calls cuLaunchKernel.
CUDA DYNAMIC PARALLELISM
CUDA Dynamic Parallelism (CDP) works similarly to the CUDA Runtime API described above.
By using <<<...>>>, you are launching a number of threads on the GPU. These threads are grouped into blocks, and the blocks form a large grid. All the threads execute the invoked kernel function code.
In the kernel function, built-in variables like threadIdx and blockIdx let the code know which thread it is running as, so that it can do its scheduled part of the work.
edit
Basically, <<<...>>> simplifies the configuration procedure to launch a kernel. Without it, one may have to call 4~5 APIs for a single kernel launch, just as in OpenCL, which uses only C99 syntax.
In fact you could check CUDA driver APIs. It may provide all those APIs so you don't need to use <<<>>>.
Basically, the GPU is divided into separate "device" GPUs (e.g. GeForce GTX 690 has 2) -> multiple SMs (streaming multiprocessors) -> multiple CUDA cores. As far as I know, the dimensionality of a block or grid is just a logical assignment independent of the hardware, but the total size of a block (x*y*z) is very important.
Threads in a block HAVE TO be on the same SM, to use its facilities of shared memory and synchronization. The SM executes a block's threads in warps of 32, and the maximum block size is a per-device limit (1024 threads on current hardware) rather than the number of CUDA cores in an SM.
If we have a simple scenario where we have 16 SMs with 32 CUDA cores each, and we have 31x1x1 block size, and 20x1x1 grid size, we will forfeit at least 1/32 of the processing power of the card. Every time a block is run, a SM will have only 31 of its 32 cores busy. Blocks will load to fill up the SMs, we will have 16 blocks finish at roughly the same time, and as the first 4 SMs free up, they will start processing the last 4 blocks (NOT necessarily blocks #17-20).
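The arithmetic in that scenario can be written down as a small host-side calculation. The sketch assumes a warp size of 32, which holds for current NVIDIA GPUs; warps_per_block and idle_lanes are made-up helpers, not CUDA APIs.

```c
#include <assert.h>

/* Number of warps needed for a block of the given size (rounded up). */
unsigned int warps_per_block(unsigned int threads_per_block,
                             unsigned int warp_size)
{
    assert(warp_size > 0);
    return (threads_per_block + warp_size - 1) / warp_size;
}

/* Lanes that are allocated to the block but have no thread to run. */
unsigned int idle_lanes(unsigned int threads_per_block,
                        unsigned int warp_size)
{
    return warps_per_block(threads_per_block, warp_size) * warp_size
           - threads_per_block;
}
```

For the 31x1x1 block, idle_lanes(31, 32) is 1, i.e. one lane in 32 does no work, while a block of 33 threads needs two warps and leaves 31 lanes idle. Block sizes that are multiples of the warp size avoid this waste.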
Comments and corrections are welcome.

Finding the address range of the data segment

As a programming exercise, I am writing a mark-and-sweep garbage collector in C. I wish to scan the data segment (globals, etc.) for pointers to allocated memory, but I don't know how to get the range of the addresses of this segment. How could I do this?
If you're working on Windows, the Windows API can help you.
// store the base address of the loaded module
char *dllImageBase = (char*)hModule; // suppose hModule is the handle to the loaded module (.exe or .dll)
// get the address of the NT headers
IMAGE_NT_HEADERS *pNtHdr = ImageNtHeader(hModule);
// the section table comes right after the NT headers, so get its address
IMAGE_SECTION_HEADER *pSectionHdr = (IMAGE_SECTION_HEADER *) (pNtHdr + 1);
ImageSectionInfo *pSectionInfo = NULL;
// iterate through the list of all sections and check each section name
for ( int i = 0 ; i < pNtHdr->FileHeader.NumberOfSections ; i++ )
{
    char *name = (char*) pSectionHdr->Name;
    if ( memcmp(name, ".data", 5) == 0 )
    {
        pSectionInfo = new ImageSectionInfo(".data"); // note: this snippet is C++ (new, constructor)
        pSectionInfo->SectionAddress = dllImageBase + pSectionHdr->VirtualAddress;
        // range of the data segment - the thing you're looking for
        pSectionInfo->SectionSize = pSectionHdr->Misc.VirtualSize;
        break;
    }
    pSectionHdr++;
}
Define ImageSectionInfo as,
struct ImageSectionInfo
{
    char SectionName[IMAGE_SIZEOF_SHORT_NAME]; // the macro is defined in WinNT.h
    char *SectionAddress;
    int SectionSize;
    ImageSectionInfo(const char* name)
    {
        strcpy(SectionName, name); // name must be shorter than IMAGE_SIZEOF_SHORT_NAME
    }
};
Here's a complete, minimal WIN32 console program you can run in Visual Studio that demonstrates the use of the Windows API:
#include <stdio.h>
#include <string.h>
#include <Windows.h>
#include <DbgHelp.h>
#pragma comment( lib, "dbghelp.lib" )

void print_PE_section_info(HANDLE hModule) // hModule is the handle to a loaded module (.exe or .dll)
{
    // get the location of the module's IMAGE_NT_HEADERS structure
    IMAGE_NT_HEADERS *pNtHdr = ImageNtHeader(hModule);
    // section table immediately follows the IMAGE_NT_HEADERS
    IMAGE_SECTION_HEADER *pSectionHdr = (IMAGE_SECTION_HEADER *)(pNtHdr + 1);
    const char* imageBase = (const char*)hModule;
    char scnName[sizeof(pSectionHdr->Name) + 1];
    scnName[sizeof(scnName) - 1] = '\0'; // enforce nul-termination for scn names that are the whole length of pSectionHdr->Name[]
    for (int scn = 0; scn < pNtHdr->FileHeader.NumberOfSections; ++scn)
    {
        // Note: pSectionHdr->Name[] is 8 bytes long. If the scn name is 8 bytes long, ->Name[] will
        // not be nul-terminated. For this reason, copy it to a local buffer that's nul-terminated
        // to be sure we only print the real scn name, and no extra garbage beyond it.
        strncpy(scnName, (const char*)pSectionHdr->Name, sizeof(pSectionHdr->Name));
        printf(" Section %3d: %p...%p %-10s (%lu bytes)\n",
               scn,
               (const void*)(imageBase + pSectionHdr->VirtualAddress),
               (const void*)(imageBase + pSectionHdr->VirtualAddress + pSectionHdr->Misc.VirtualSize - 1),
               scnName,
               (unsigned long)pSectionHdr->Misc.VirtualSize);
        ++pSectionHdr;
    }
}

// For demo purposes, create an extra constant data section whose name is exactly 8 bytes long (the max)
#pragma const_seg(".t_const") // begin allocating const data in a new section whose name is 8 bytes long (the max)
const char const_string1[] = "This string is allocated in a special const data segment named \".t_const\".";
#pragma const_seg() // resume allocating const data in the normal .rdata section

int main(int argc, const char* argv[])
{
    print_PE_section_info(GetModuleHandle(NULL)); // print section info for "this process's .exe file" (NULL)
    return 0;
}
This page may be helpful if you're interested in additional uses of the DbgHelp library.
You can read about the PE image format here to learn it in detail. Once you understand the PE format, you'll be able to work with the above code and even modify it to meet your needs.
PE Format
Peering Inside the PE: A Tour of the Win32 Portable Executable File Format
An In-Depth Look into the Win32 Portable Executable File Format, Part 1
An In-Depth Look into the Win32 Portable Executable File Format, Part 2
Windows API and Structures
IMAGE_SECTION_HEADER Structure
ImageNtHeader Function
IMAGE_NT_HEADERS Structure
I think this will help you to a great extent, and the rest you can research yourself :-)
By the way, you can also see this thread, as all of these are somehow related to this:
Scenario: Global variables in DLL which is used by Multi-threaded Application
The bounds for text (program code) and data on Linux (and other Unixes):
#include <stdio.h>
#include <stdlib.h>

/* These symbols are declared in no header file, and on some
   systems they have a _ prepended.
   They have to be given a type to keep the compiler happy.
   Also check out brk() and sbrk() for information
   about the heap. */
extern char etext, edata, end;

int
main(int argc, char **argv)
{
    printf("First address beyond:\n");
    /* %p expects a void *, so cast the addresses explicitly */
    printf(" program text segment(etext) %10p\n", (void *)&etext);
    printf(" initialized data segment(edata) %10p\n", (void *)&edata);
    printf(" uninitialized data segment (end) %10p\n", (void *)&end);
    return EXIT_SUCCESS;
}
Where those symbols come from: Where are the symbols etext ,edata and end defined?
Since you'll probably have to make your garbage collector the environment in which the program runs, you can get the range from the ELF file directly.
For Win32, load the file that the executable came from and parse the PE headers. I have no idea how to do this on other OSes. Remember that if your program consists of multiple files (e.g. DLLs), you may have multiple data segments.
For iOS you can use this solution. It shows how to find the text segment range but you can easily change it to find any segment you like.

Resources