CUDA device pointers - c

Quick question about the standard CUDA memory allocation model:
double* x_device;
cudaMalloc(&x_device,myArraySize);
The variable x_device is a pointer-to-double. After I call cudaMalloc, does x_device now point to a memory location on the CUDA device? In other words, would *x_device result in a segfault because we can't directly access device memory from the host?
Incidental question: the compiler doesn't complain when I leave out the (void**)&x_device cast. Is the cast required? I sometimes see it in examples, sometimes not.
Thanks!

You are right: cudaMalloc allocates memory on the device. You can't dereference this pointer on the host; you can only pass it as an argument to functions like cudaMemcpy and to kernel calls.
More recent CUDA versions support unified memory. There you can use cudaMallocManaged to allocate memory that is accessible from both the host and the device through the same pointer.
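For the second question: C++ doesn't allow implicit conversions between unrelated pointer types, so against the plain C API declaration of cudaMalloc, leaving out the explicit cast (void**)&x_device results in a compiler error. If the compiler doesn't complain, that is most likely because cuda_runtime.h, which nvcc includes for C++ code, also provides a templated cudaMalloc overload that accepts a T** directly.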

Related

Pointer Array shared between OpenCL kernels

Is it possible to share an array of pointers between multiple kernels in OpenCL? If so, how would I go about implementing it? If I am not completely mistaken (which may well be the case), the only way of sharing things between kernels is a shared cl_mem, but I also think these cannot contain pointers.
This is not possible in OpenCL 1.x because host and device have completely separate memory spaces, so a buffer containing host pointers makes no sense on the device side.
However, OpenCL 2.0 supports Shared Virtual Memory (SVM) and so memory containing pointers is legal because the host and device share an address space. There are three different levels of granularity though, which will limit what you can have those pointers point to. In the coarsest case they can only refer to locations within the same buffer or other SVM buffers currently owned by the device. Yes, cl_mem is still the way to pass in a buffer to a kernel, but in OpenCL 2.0 with SVM that buffer may contain pointers.
Edit/Addition: OP points out they just want to share pointers between kernels. If these are just device pointers, then you can store them in the buffer in one kernel and read them from the buffer in another kernel. They can only refer to __global, not __local memory. And without SVM they can't be used on the host. The host will of course need to allocate the buffer and pass it to both kernels for their use. As far as the host is concerned, it's just opaque memory. Only the kernels know they are __global pointers.
I ran into a similar problem, but I managed to get around it by using a simple pointer structure. I have doubts about the claim that buffers change their position in memory; perhaps this is true for some special cases, but it definitely cannot happen while the kernel is working with the buffer. I have not tested it on different video cards, but on NVIDIA (OpenCL 1.2) it works perfectly, so I can access data from an array that was not even passed as an argument to the kernel.
typedef struct
{
    __global volatile point_dataT* point; // pointer to another struct in a different buffer
} pointerBufT;
__kernel void tester(__global pointerBufT * pointer_buf){
    // coord and img_width are assumed to be defined elsewhere in the original code
    printf("Test id: %u\n", pointer_buf[coord.x + coord.y * img_width].point->id); // retrieving information from an array not passed to the kernel
}
I know that this is a late reply, but for some reason I have only come across negative answers to similar questions, or suggestions to use indexes instead of pointers, while a structure with a pointer inside works great.

Passing information from UEFI to the OS

I am familiar with BIOS int 15 - E820 function, where you could choose a fixed physical location, put there whatever you wanted, the OS would not overwrite it, and you could just access that fixed memory address (may map it to a virtual pointer first etc).
But in the UEFI case, as far as I am aware, there is no memory area reserved for the user, so I can't rely on allocating a buffer at a specific memory address (if that's even possible?). Therefore I have to use a UEFI memory function, which returns a pointer that is not fixed.
So my questions are:
Is it possible to allocate a buffer that will not be overwritten once the OS comes up?
How can I pass the OS a pointer to the allocated buffer so that I can access it from the OS (again, since the allocation is not at a fixed location, and assuming the buffer itself is not overwritten)?
Thank you!
Yes. Allocate memory of a non-reclaimable type, such as EfiRuntimeServicesData.
The mechanism UEFI uses is called configuration tables.
Note: EfiPersistentMemory is something completely different.
Configuration tables are installed by calling InstallConfigurationTable during boot services, with the two parameters being a GUID and a pointer to the physical address of the data structure you want to pass. This pair is then linked into an array pointed to by the UEFI System Table.
How you extract that information in Windows, I do not know. In Linux, the UEFI system table is globally accessible in kernel space (efi->systab), so the pointer can be extracted from there.

malloc in an embedded system without an operating system

This query is regarding allocation of memory using malloc.
Generally what we say is malloc allocates memory from heap.
Now say I have a plain embedded system (no operating system) with an ordinary program loaded that calls malloc.
In this case, where is the memory allocated from?
malloc() is a function that is usually implemented by the runtime-library. You are right, if you are running on top of an operating system, then malloc will sometimes (but not every time) trigger a system-call that makes the OS map some memory into your program's address space.
If your program runs without an operating system, then you can think of your program as being the operating system. You have access to all addresses, meaning you can just assign an address to a pointer, then de-reference that pointer to read/write.
Of course you have to make sure that no other part of your program uses the same memory, so you write your own memory manager:
To put it simply, you can set aside a range of addresses in which your "memory manager" records which address ranges are already in use (the data structures stored there can be as simple as a linked list or much, much more complex). Then you write a function, called e.g. malloc(), which forms the functional part of your memory manager. It looks into the mentioned data structure to find a free address range that is at least as long as the argument specifies and returns a pointer to it.
Now, if every function in your program calls your malloc() instead of randomly writing to custom addresses, you've done the first step. You can write a free() function which looks up the pointer it is given in the mentioned data structure and adapts the data structure (in the naive linked list it would merge two links).
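As a rough illustration of the memory manager described above, here is a minimal sketch assuming a fixed static arena and a first-fit linked list of block headers (C11). The names my_malloc, my_free, block_t and ARENA_SIZE are hypothetical, chosen only for this example. It is not thread-safe, merges a freed block only with its immediate successor, and follows the common allocator practice of carving typed headers out of a byte array.

```c
#include <assert.h>
#include <stddef.h>

#define ARENA_SIZE 4096u   /* assumed size of the address range set aside */

/* Header stored in front of every block; the headers form a linked list. */
typedef struct block {
    size_t size;           /* payload size in bytes */
    int used;              /* 1 if handed out, 0 if free */
    struct block *next;    /* next block in address order, or NULL */
} block_t;

/* The arena; the union forces an alignment suitable for any object. */
static union { max_align_t align; unsigned char bytes[ARENA_SIZE]; } arena;
static block_t *head = NULL;

/* Round n up to a multiple of the strictest fundamental alignment. */
static size_t round_up(size_t n)
{
    size_t a = _Alignof(max_align_t);
    return (n + a - 1) / a * a;
}

void *my_malloc(size_t n)
{
    size_t hdr = round_up(sizeof(block_t));

    if (n == 0 || n > ARENA_SIZE)
        return NULL;
    n = round_up(n);

    if (head == NULL) {                /* first call: one big free block */
        head = (block_t *)arena.bytes;
        head->size = ARENA_SIZE - hdr;
        head->used = 0;
        head->next = NULL;
    }

    for (block_t *b = head; b != NULL; b = b->next) {
        if (b->used || b->size < n)
            continue;                  /* first fit: skip unsuitable blocks */
        if (b->size >= n + hdr + round_up(1)) {
            /* Split: the tail of this block becomes a new free block. */
            block_t *rest = (block_t *)((unsigned char *)b + hdr + n);
            rest->size = b->size - n - hdr;
            rest->used = 0;
            rest->next = b->next;
            b->next = rest;
            b->size = n;
        }
        b->used = 1;
        return (unsigned char *)b + hdr;
    }
    return NULL;                       /* no free range is large enough */
}

void my_free(void *p)
{
    size_t hdr = round_up(sizeof(block_t));

    if (p == NULL)
        return;
    block_t *b = (block_t *)((unsigned char *)p - hdr);
    assert(b->used);                   /* catches a double free in debug builds */
    b->used = 0;
    if (b->next != NULL && !b->next->used) {   /* merge with a free successor */
        b->size += hdr + b->next->size;
        b->next = b->next->next;
    }
}
```

A caller would use my_malloc(100) and my_free(p) exactly like the standard functions. A production allocator would also merge with the preceding block and validate the pointer passed to my_free.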
The only real answer is "Wherever your compiler/library-implementation puts it".
In the embedded system I use, there is no heap, since we haven't written one.
From the heap as you say. The difference is that the heap is not provided by the OS. Your application's linker script will no doubt include an allocation for the heap. The run-time library will manage this.
In the case of the Newlib C library, often used in GCC-based embedded systems not running an OS (or at least not running Linux), the library has a stub syscall function called sbrk(). It is the responsibility of the developer to implement sbrk(), which must provide more memory to the heap manager on request. Typically it merely increments a pointer and returns a pointer to the start of the new block; thereafter the library's heap manager manages and maintains the new block, which may or may not be contiguous with previous blocks. The previous link includes an example implementation.
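To make the "increments a pointer" description concrete, here is a minimal sketch of such a function. It assumes the heap is a fixed static array; heap_region, HEAP_SIZE and the name my_sbrk are hypothetical. In a real Newlib port the function is typically named _sbrk or sbrk, depending on the configuration, and the heap bounds usually come from symbols defined in the linker script.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define HEAP_SIZE 1024u   /* assumed; normally fixed by the linker script */

static unsigned char heap_region[HEAP_SIZE];   /* stand-in for the linker-provided heap */
static unsigned char *heap_end = heap_region;  /* current end of the heap ("break") */

/* Move the break by incr bytes and return the previous break.
   On failure, set errno to ENOMEM and return (void *)-1. */
void *my_sbrk(ptrdiff_t incr)
{
    unsigned char *prev = heap_end;
    ptrdiff_t used = heap_end - heap_region;

    assert(used >= 0 && used <= (ptrdiff_t)HEAP_SIZE);

    if (incr > (ptrdiff_t)HEAP_SIZE - used || incr < -used) {
        errno = ENOMEM;            /* request would leave the heap region */
        return (void *)-1;
    }
    heap_end += incr;
    return prev;
}
```

The library's malloc() would call this whenever its own free lists cannot satisfy a request, so the function itself needs no bookkeeping beyond the single pointer.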

Can we assign a value to a given memory location?

I want to assign some value (say 2345) to a memory location(say 0X12AED567). Can this be done?
In other words, how can I implement the following function?
void AssignValToPointer(uint32_t pointer, int value)
{
}
The fact that you are asking this question kind of indicates that you're in over your head. But here you go:
*(int *)0x12AED567 = 2345;
The answer depends on some factors. Is your program running on an operating system? If yes, does the OS implement memory segmentation?
If you answered yes to both questions, trying to access a memory area that is not mapped, or that your program doesn't have permission to write, will cause a memory access violation (SIGSEGV on POSIX-based systems). To make the write work, you have to use a system-specific function to map the region of memory that contains this exact address before trying to access it.
Just treat the memory location as a pointer
int *pMemory = (int *)0x12AED567;
*pMemory = 2345;
Note: This will only work if that memory location is accessible and writable by your program. Writing to an arbitrary memory location like this is inherently dangerous.
C99 standard draft
This is likely not possible without implementation defined behavior.
About casts like:
*(uint32_t *)0x12AED567 = 2345;
the C99 N1256 standard draft "6.3.2.3 Pointers" says:
5 An integer may be converted to any pointer type. Except as previously specified, the
result is implementation-defined, might not be correctly aligned, might not point to an
entity of the referenced type, and might be a trap representation.
GCC implementation
GCC documents its int to pointer implementation at: https://gcc.gnu.org/onlinedocs/gcc-5.4.0/gcc/Arrays-and-pointers-implementation.html#Arrays-and-pointers-implementation
A cast from integer to pointer discards most-significant bits if the pointer representation is smaller than the integer type, extends according to the signedness of the integer type if the pointer representation is larger than the integer type, otherwise the bits are unchanged.
so the cast will work as expected for this implementation. I expect other compilers to do similar things.
mmap
On Linux you can request allocation of a specific virtual memory address with the first argument of mmap. man mmap reads:
If addr is NULL, then the kernel chooses the (page-aligned) address at which to create the mapping; this is the most portable method of creating a new mapping. If addr is not NULL, then the kernel takes it as a hint about where to place the mapping; on Linux, the kernel will pick a nearby page boundary (but always above or equal to the value specified by /proc/sys/vm/mmap_min_addr) and attempt to create the mapping there. If another mapping already exists there, the kernel picks a new address that may or may not depend on the hint. The address of the new mapping is returned as the result of the call.
So you could request an address and assert that you got what you wanted.
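The idea of requesting an address and checking the result could look roughly like the sketch below, for Linux. The function name store_at_fixed_address is hypothetical. Since the address is only a hint, the function reports failure instead of aborting when the kernel places the mapping elsewhere. It also refuses addresses that are not aligned for int, which includes the odd address 0x12AED567 from the question.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Try to map the page containing addr at exactly that location and store
   value there. Returns 1 on success, 0 if addr is misaligned for int or
   the kernel placed the mapping somewhere else. */
int store_at_fixed_address(uintptr_t addr, int value)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    if (addr % sizeof(int) != 0)
        return 0;                      /* refuse misaligned int stores */

    uintptr_t base = addr & ~((uintptr_t)page - 1);   /* round down to page start */
    void *p = mmap((void *)base, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 0;
    if ((uintptr_t)p != base) {        /* the hint was not honoured */
        munmap(p, (size_t)page);
        return 0;
    }
    *(volatile int *)addr = value;     /* the page stays mapped for later reads */
    return 1;
}
```

A call such as store_at_fixed_address(0x12AED564, 2345) succeeds only if nothing else already occupies that page. The sketch deliberately avoids MAP_FIXED, which would silently replace an existing mapping.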
As far as C is concerned, that's undefined behaviour. My following suggestion is also undefined behaviour, but avoids all the type-based and aliasing-based problems: Use chars.
int a = get_value();
char const * const p = (const char * const)&a;
char * q = (char *)0x12345;
memcpy(q, p, sizeof(int));
Alternatively, you can access bytes q[i] directly. (This is the part that is UB: the pointer q was not obtained as the address-of an actual object or as the result of an allocation function. Sometimes this is OK; for instance if you're writing a free-standing program that runs in real mode and accesses the graphics hardware, you can write to the graphics memory directly at a well-known, hard-coded address.)
You've indicated that the address is a physical address and that your code is running in a process.
What you need to do depends on where the code runs:
In a high-level operating system, e.g. Linux, you have to get a mapping into the physical address space. In Linux, /dev/mem does that for you.
Within the kernel, or without an operating system but with an MMU, you have to translate the physical address into a virtual address. In the Linux kernel, phys_to_virt() does that for you. In the kernel, I assume, this address is always mapped.
Within the kernel, or without an operating system and without an MMU, you write directly to that address. There's no mapping to consider at all.
Now you have a valid mapping or the physical address itself, which you pass to your function.
void AssignValToPointer(uint32_t pointer, int value)
{
    *((volatile int *)pointer) = value;
}
You might want to add the volatile keyword as the compiler might optimize the write-operation away if you do not read from that location afterwards (likely case when writing to a register of a memory-mapped hardware).
You might also want to use the uintptr_t data type instead of uint32_t for the pointer.
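A minimal sketch of the uintptr_t variant, assuming the caller has already made sure that the address is mapped and writable. The name AssignValToAddress is hypothetical. uintptr_t (from stdint.h) is wide enough to hold a converted pointer on platforms that provide it, whereas uint32_t would truncate addresses on a 64-bit target.

```c
#include <assert.h>
#include <stdint.h>

/* Store value at the address held in addr. The caller must guarantee that
   addr is mapped, writable and suitably aligned for int. */
void AssignValToAddress(uintptr_t addr, int value)
{
    assert(addr % _Alignof(int) == 0);   /* misaligned int stores are not portable */
    *(volatile int *)addr = value;       /* volatile keeps the store from being elided */
}
```

To try it without special hardware, pass the address of an ordinary int variable, for example AssignValToAddress((uintptr_t)&target, 2345), and read the variable back.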
With the proviso that it's not portable or safe (at all):
*((int *)0x12AED567) = 2345;

CUDA Pointer Dereferencing Issue

I am developing a program using the CUDA SDK and a 9600 1 GB NVIDIA card. In this program:
0) A kernel receives a pointer to a 2D int array of size 3000x6 as an input argument.
1) The kernel has to sort it up to 3 levels (1st, 2nd and 3rd column).
2) For this purpose, the kernel declares an array of int pointers of size 3000.
3) The kernel then populates the pointer array with pointers to the locations of the input array in sorted order.
4) Finally, the kernel copies the input array into an output array by dereferencing the pointer array.
This last step fails and halts the PC.
Q1) What are the guidelines for pointer dereferencing in CUDA to fetch the contents of memory? Even the smallest array of 20x2 is not working correctly. The same code works outside CUDA device memory (i.e., in a standard C program).
Q2) Isn't it supposed to work the same way as in standard C, using the '*' operator, or is there some CUDA API to be used for it?
I just started looking into CUDA, but I literally just read this in a book. It sounds like it directly applies to you.
"You can pass pointers allocated with cudaMalloc() to functions that execute on the device. (kernels, right?)
You can use pointers allocated with cudaMalloc() to read or write memory from code that executes on the device. (kernels again)
You can pass pointers allocated with cudaMalloc() to functions that execute on the host. (regular C code)
You CANNOT use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host."
^^ from "CUDA by Example" by Jason Sanders and Edward Kandrot, published by Addison-Wesley, yadda yadda, no plagiarism here.
Since you are dereferencing inside the kernel, maybe the opposite of the last rule is also true, i.e. you cannot use pointers allocated by the host to read or write memory from code that executes on the device.
Edit: I also just noticed a function called cudaMemcpy.
It looks like you would need to declare the 3000-element int array twice in host code: once by calling malloc, and once by calling cudaMalloc. Pass the cudaMalloc'ed one to the kernel, as well as the input array to be sorted. Then, after calling the kernel function:
cudaMemcpy(malloced_array, cudaMallocedArray, 3000*sizeof(int), cudaMemcpyDeviceToHost);
I literally just started looking into this, like I said, so maybe there's a better solution.
CUDA code can use pointers in exactly the same manner as host code (e.g. dereference with * or [], normal pointer arithmetic and so on). However, the location being accessed (i.e. the location to which the pointer points) must be visible to the GPU.
If you allocate host memory, using malloc() or std::vector for example, then that memory will not be visible to the GPU, it is host memory not device memory. To allocate device memory you should use cudaMalloc() - pointers to memory allocated using cudaMalloc() can be freely accessed from the device but not from the host.
To copy data between the two, use cudaMemcpy().
When you get more advanced, the lines can be blurred a little: using "mapped memory", it is possible to let the GPU access parts of host memory, but this must be handled in a particular way. See the CUDA Programming Guide for more information.
I'd strongly suggest you look at the CUDA SDK samples to see how all this works. Start with the vectorAdd sample perhaps, and any that are specific to your domain of expertise. Matrix multiplication and transpose are probably easy to digest too.
All the documentation, the toolkit and the code samples (SDK) are available on the CUDA developer web site.
