Can someone please help me with a very simple example on how to use shared memory? The example included in the Cuda C programming guide seems cluttered by irrelevant details.
For example, if I copy a large array to the device global memory and want to square each element, how can shared memory be used to speed this up? Or is it not useful in this case?
In the specific case you mention, shared memory is not useful, for the following reason: each data element is used only once. For shared memory to help, you must reuse data that has been transferred to shared memory several times, with good access patterns. The reason is simple: just reading from global memory requires 1 global memory read and zero shared memory accesses; staging the data through shared memory first requires 1 global memory read, 1 shared memory write, and 1 shared memory read, which takes longer.
Here's a simple example, where each thread in the block computes the corresponding value, squared, plus the average of both its left and right neighbors, squared:
__global__ void compute_it(float *data)
{
    int tid = threadIdx.x;
    __shared__ float myblock[1024];
    float tmp;

    // load the thread's data element into shared memory
    myblock[tid] = data[tid];

    // ensure that all threads have loaded their values into
    // shared memory; otherwise, one thread might be computing
    // on uninitialized data.
    __syncthreads();

    // compute the average of this thread's left and right neighbors
    tmp = (myblock[tid > 0 ? tid - 1 : 1023] + myblock[tid < 1023 ? tid + 1 : 0]) * 0.5f;
    // square the previous result and add my value, squared
    tmp = tmp * tmp + myblock[tid] * myblock[tid];

    // write the result back to global memory
    data[tid] = tmp;
}
Note that this is envisioned to work using only one block. The extension to more blocks should be straightforward. Assumes block dimension (1024, 1, 1) and grid dimension (1, 1, 1).
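For reference, a minimal host-side launch sketch for the kernel above; the array contents are arbitrary, and error checking is omitted, so treat this as an illustrative assumption rather than part of the original answer:
int main()
{
    const int N = 1024;                        // matches the block dimension assumed above
    float h_data[N];
    for (int i = 0; i < N; i++) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    compute_it<<<1, N>>>(d_data);              // one block of 1024 threads

    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return 0;
}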
Think of shared memory as an explicitly managed cache - it's only useful if you need to access data more than once, either within the same thread or from different threads within the same block. If you're only accessing data once then shared memory isn't going to help you.
The atomic operation in my program works correctly as long as I don't increase the grid size or call the kernel again. How can this be? Perhaps shared memory isn't automatically freed?
__global__ void DevTest() {
    __shared__ int* k1;
    k1 = new int(0);
    atomicAdd( k1, 1);
}
int main()
{
    for (int i = 0; i < 100; i++) DevTest<<<50, 50>>>();
}
This:
__shared__ int* k1;
creates storage in shared memory for a pointer. The pointer is uninitialized there; it doesn't point to anything.
This:
k1 = new int(0);
sets that pointer to point to a location on the device heap, not in shared memory. The device heap is limited by default to 8MB. Furthermore, there is an allocation granularity, so a single int allocation will consume more than 4 bytes of device heap space.
It's generally good practice in C++ to have a corresponding delete for every new operation. Your kernel code does not have this, so as you increase the grid size, you will use up more and more of the device heap memory. You will eventually run into the 8MB limit.
So there are at least 2 options to fix this:
delete the allocation created with new at the end of the thread code
increase the limit on the device heap with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...); a sketch follows this list
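For the second option, a minimal sketch of raising the device heap limit from host code; the 128 MB figure is just an arbitrary example value, not a recommendation:
// must be called before the first kernel launch that uses the device heap (new/malloc in device code)
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);
DevTest<<<50, 50>>>();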
As an aside, shared memory is shared by all the threads in the threadblock. So for your kernel launch of <<<50,50>>> you have 50 threads in each threadblock. Each of those 50 threads will see the same k1 pointer, and each will try to set it to a separate location/allocation, as each executes the new operation. This doesn't make any sense, and it will prevent item 1 above from working correctly (49 of the 50 allocated pointer values will be lost).
So your code doesn't really make sense as written; what you are showing here is not a sensible thing to do, and there is no simple way to fix it. You could do something like this instead:
__global__ void DevTest() {
    __shared__ int* k1;
    // one thread per block performs the allocation
    if (threadIdx.x == 0) k1 = new int(0);
    __syncthreads();
    atomicAdd(k1, 1);
    __syncthreads();
    // and the same thread frees it once all threads are done
    if (threadIdx.x == 0) delete k1;
}
Even such an arrangement could theoretically (perhaps, on some future large GPU) eventually run into the 8MB limit if you launched enough blocks (i.e. if there were enough resident blocks, and taking into account an undefined allocation granularity). So the correct approach is probably to do both 1 and 2.
Who handles the heap deallocation when Eigen::Map is used with a heap memory segment to create a Matrix?
I couldn't find any information about how the underlying Matrix data memory is managed when Eigen::Map is invoked to build a Matrix.
Here is the doc I went through : https://eigen.tuxfamily.org/dox/group__TutorialMapClass.html
Should I handle the memory segment deletion when I'm done with my Matrix "mf" in the code below?
int rows(3), cols(3);
scomplex *m1Data = new scomplex[rows * cols];
for (int i = 0; i < cols * rows; i++)
{
    m1Data[i] = scomplex(i, 2 * i);
}
Map<MatrixXcf> mf(m1Data, rows, cols);
For now, if I set a breakpoint in this function (./Eigen/src/Core/util/Memory.h):
EIGEN_DEVICE_FUNC inline void aligned_free(void *ptr)
it is not triggered when main exits.
May I ask whether I should consider that I must delete the memory segment myself when I no longer use my matrix?
Cheers
Sylvain
The Map object does not take ownership/responsibility of the memory that is passed to it. It could just be a view into another matrix. In that case, you definitely would not want it to release the memory.
To quote the tutorial page you linked:
Occasionally you may have a pre-defined array of numbers that you want to use within Eigen as a vector or matrix. While one option is to make a copy of the data, most commonly you probably want to re-use this memory as an Eigen type.
So, bottom line, you have to delete the memory you allocated and used with the Map.
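To make the ownership explicit, here is a minimal sketch built from the code in the question; the scomplex typedef as std::complex<float> is an assumption:
#include <complex>
#include <Eigen/Dense>
using scomplex = std::complex<float>;
using Eigen::Map;
using Eigen::MatrixXcf;

int main()
{
    int rows(3), cols(3);
    scomplex *m1Data = new scomplex[rows * cols];   // we own this buffer
    for (int i = 0; i < cols * rows; i++)
        m1Data[i] = scomplex(i, 2 * i);

    {
        Map<MatrixXcf> mf(m1Data, rows, cols);      // just a view; takes no ownership
        // ... use mf here ...
    }                                               // mf goes out of scope; nothing is freed

    delete[] m1Data;                                // still our responsibility
    return 0;
}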
Suppose I want to use malloc() to allocate some memory in the process:
for (int i = 0; i < SOME_NUM; ++i) {
    int *x = malloc(sizeof(int *));
}
What is the biggest number that I can set SOME_NUM to?
In xv6 the physical memory is limited, and you can see the constant PHYSTOP, which is set to 224MB for simplicity. Some of that memory accommodates kernel code and other data, so the rest can be used by a process if it needs to consume the remaining physical memory.
Note: PHYSTOP could be changed, but then you will have to modify the mappages function to map all pages.
Note 2: memory is handed out in pages, so you could use roughly PHYSTOP / page size as the loop bound. Well, I'm cheating here because, again, kernel data structures and code already occupy a portion of physical memory.
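As a rough illustration, a user program along these lines could count how many page-sized allocations succeed before malloc() gives up. This is a sketch in plain C, so you would adapt the headers and printf/exit conventions to your xv6 version's user library, and the count you see depends on kernel overhead and allocator bookkeeping:
#include <stdio.h>
#include <stdlib.h>

#define PGSIZE 4096   /* xv6 page size */

int main(void)
{
    int pages = 0;
    /* keep grabbing one page at a time until the allocator fails */
    while (malloc(PGSIZE) != NULL)
        pages++;
    printf("allocated %d pages before failure\n", pages);
    return 0;
}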
I allocate a big region of memory, let's say x, of 1000 bytes.
// I am using C and all of this is just pseudocode (function prototypes mostly) so far.
pointer = malloc( size(1000 units) ); // this pointer points to region of memory we created.
Now we take this region, through the pointer, and allocate smaller blocks inside it, like:
void *allocate_from_region( size_of_block1(300) ); //1000-300=700 (left free)
void *allocate_from_region( size_of_block2(100) ); //700-100 =600
void *allocate_from_region( size_of_block3(300) ); //600-300 =300
void *allocate_from_region( size_of_block4(100) ); //300-100 =200
void *allocate_from_region( size_of_block5(150) ); //200-150 =50
// here we almost finished space we have in region (only 50 is left free in region)
boolean free_from_region(pointer_to_block2); //free 100 more
//total free = 100+50 but are not contiguous in memory
void *allocate_from_region( size_of_block6(150) ); // this one will fail and gives null as it cant find 150 units memory(contiguous) in region.
boolean free_from_region(pointer_to_block3); // this free 300 more so total free = 100+300+50 but contiguous free is 100+300 (from block 2 and 3)
void *allocate_from_region( size_of_block6(150) ); // this time it is successful
Are there any examples about how to manage memory like this?
So far I have only done examples where I allocate blocks next to each other in a region of memory and stop once I run out of memory inside the region.
But how do I search for free blocks inside the region and then check whether enough contiguous memory is available?
I am sure there is some documentation or example in C which shows how to do it.
Sure. What you are proposing is more-or-less exactly what some malloc implementations do. They maintain a "free list". Initially the single large block is on this list. When you make a request, the algorithm to allocate n bytes is:
Search the free list to find a block at address B of size m >= n.
Remove B from the free list.
Return the block from B+n through B+m-1 (size m-n) to the free list (unless m-n == 0).
Return a pointer to B.
To free a block at B of size n, we must put it back on the free list. However this isn't the end. We must also "coalesce" it with adjacent free blocks, if any, either above or below or both. This is the algorithm.
Let p = B; m = n; // pointer to base and size of block to be freed
If there is a block of size x on the free list and at the address B + n,
remove it, set m=m+x. // coalescing block above
If there is a block of size y on the free list and at address B - y,
remove it and set p=B-y; m=m+y; // coalescing block below
Return block at p of size m to the free list.
The remaining question is how to set up the free list to make it quick to find blocks of the right size during allocation and to find adjacent blocks for coalescing during free operations. The simplest way is a singly linked list. But there are many possible alternatives that can yield better speed, usually at some cost of additional space for data structures.
Additionally there is the choice of which block to allocate when more than one is big enough. The usual choices are "first fit" and "best fit". For first fit, just take the first one discovered. Often the best technique is (rather than starting at the lowest addresses every time) to remember the free block after the one just allocated and use that as the starting point for the next search. This is called "rotating first fit" (also known as "next fit").
For best fit, traverse as many blocks as necessary to find the one that most closely matches the requested size.
If allocations are random, first fit actually performs a bit better than best fit in terms of memory fragmentation. Fragmentation is the bane of all non-compacting allocators.
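To make this concrete, here is a minimal sketch of such a region allocator: a first-fit free list kept sorted by address, with coalescing on free. The names (region_init, region_alloc, region_free) are made up for illustration, alignment is ignored for brevity, and because each block carries a small header the byte counts will not match the question's arithmetic exactly:
#include <stddef.h>
#include <stdio.h>

/* Header of a free block; free blocks form a singly linked list sorted by address. */
typedef struct free_block {
    size_t size;              /* total size of this free block, header included */
    struct free_block *next;  /* next free block (higher address), or NULL */
} free_block;

/* Header kept in front of every allocated block so free() knows its size. */
typedef struct {
    size_t size;              /* total size of the allocation, header included */
} alloc_header;

static free_block *free_list;

/* Hand the whole region to the allocator as one big free block. */
void region_init(void *region, size_t size)
{
    free_list = (free_block *)region;
    free_list->size = size;
    free_list->next = NULL;
}

/* First fit: carve n usable bytes out of the first free block that is large enough. */
void *region_alloc(size_t n)
{
    size_t need = n + sizeof(alloc_header);
    if (need < sizeof(free_block))      /* every block must later be able to hold a free header */
        need = sizeof(free_block);
    free_block **prev = &free_list;
    for (free_block *b = free_list; b; prev = &b->next, b = b->next) {
        if (b->size < need)
            continue;
        if (b->size - need >= sizeof(free_block)) {
            /* split: the tail of b stays on the free list */
            free_block *rest = (free_block *)((char *)b + need);
            rest->size = b->size - need;
            rest->next = b->next;
            *prev = rest;
        } else {
            need = b->size;             /* remainder too small to track: hand out the whole block */
            *prev = b->next;
        }
        ((alloc_header *)b)->size = need;
        return (alloc_header *)b + 1;   /* user memory starts right after the header */
    }
    return NULL;                        /* no contiguous block large enough */
}

/* Put a block back on the address-sorted list and coalesce with adjacent free blocks. */
void region_free(void *p)
{
    /* the size field sits at offset 0 in both headers, so it carries over */
    free_block *b = (free_block *)((alloc_header *)p - 1);
    free_block *prev = NULL, *cur = free_list;
    while (cur && cur < b) { prev = cur; cur = cur->next; }

    b->next = cur;
    if (prev) prev->next = b; else free_list = b;

    /* coalesce with the neighbor above, if adjacent */
    if (cur && (char *)b + b->size == (char *)cur) {
        b->size += cur->size;
        b->next = cur->next;
    }
    /* coalesce with the neighbor below, if adjacent */
    if (prev && (char *)prev + prev->size == (char *)b) {
        prev->size += b->size;
        prev->next = b->next;
    }
}

int main(void)
{
    static char region[1000];
    region_init(region, sizeof(region));

    void *b1 = region_alloc(300);
    void *b2 = region_alloc(100);
    void *b3 = region_alloc(300);

    region_free(b2);                 /* frees space, but not contiguous with the tail */
    void *big = region_alloc(350);   /* fails: no contiguous 350-byte run yet */
    printf("before coalescing: %s\n", big ? "got it" : "NULL");

    region_free(b3);                 /* neighbors coalesce into one larger free block */
    big = region_alloc(350);         /* now it succeeds */
    printf("after coalescing:  %s\n", big ? "got it" : "NULL");

    (void)b1;
    return 0;
}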
I have a vector called d_index calculated in the CUDA device memory and I want to change just one value, like this...
d_index[columnsA-rowsA]=columnsA;
How can I do this without having to copy it to the system memory and then back to the device memory?
You could either launch a kernel on a <<<1,1>>> grid that changes only the desired element:
__global__ void change_elem(int *arr, int idx, int val) {
    arr[idx] = val;
}
// ....
// Somewhere in CPU code
change_elem<<<1,1>>>(d_index, columnsA-rowsA, columnsA);
Or use something like:
int tmp = columnsA;
cudaMemcpy(&d_index[columnsA-rowsA], &tmp, sizeof(int), cudaMemcpyHostToDevice);
If you only do this once, I think there is no big difference which version you use. If you call this code often, you should consider folding this array modification into some other kernel to avoid invocation overhead.
Host (CPU) code cannot directly access device memory, so you have a few choices:
Launch a single thread kernel (e.g. update_array<<<1,1>>>(index, value))
Use cudaMemcpy() to the location
Use Thrust, e.g. a device_vector or device_ptr (see the sketch after this answer)
Of course updating a single value in an array is very inefficient, hopefully you've considered whether this is necessary or perhaps it could be avoided? For example, could you update the array as part of the GPU code?
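As a sketch of the Thrust route, wrapping the raw device pointer from the question; thrust::device_ptr's operator[] performs the tiny copy for you:
#include <thrust/device_ptr.h>

// d_index is assumed to be a raw int* pointing to device memory, as in the question
thrust::device_ptr<int> p(d_index);
p[columnsA - rowsA] = columnsA;   // issues a small host-to-device copy under the hood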
I think that since the d_index array is in device memory, it can be accessed directly by every thread from within a kernel.