The atomic operation in my program works correctly as long as I don't increase the grid size or call the kernel again. How can this be? Perhaps shared memory isn't automatically freed?
__global__ void DevTest() {
    __shared__ int* k1;
    k1 = new int(0);
    atomicAdd(k1, 1);
}

int main()
{
    for (int i = 0; i < 100; i++) DevTest<<<50, 50>>>();
}
This:
__shared__ int* k1;
creates storage in shared memory for a pointer. The pointer is uninitialized there; it doesn't point to anything.
This:
k1 = new int(0);
sets that pointer to point to a location on the device heap, not in shared memory. The device heap is limited by default to 8MB. Furthermore, there is an allocation granularity, such that a single int allocation may use up more than 4 bytes of device heap space (it will).
It's generally good practice in C++ to have a corresponding delete for every new operation. Your kernel code does not have this, so as you increase the grid size, you will use up more and more of the device heap memory. You will eventually run into the 8MB limit.
So there are at least 2 options to fix this:
1. delete the allocation created with new at the end of the thread code
2. increase the limit on the device heap (the documented mechanism for this is cudaDeviceSetLimit with cudaLimitMallocHeapSize)
As an aside, shared memory is shared by all the threads in the threadblock. So for your kernel launch of <<<50,50>>> you have 50 threads in each threadblock. Each of those 50 threads will see the same k1 pointer, and each will try to set it to a separate location/allocation, as each executes the new operation. This doesn't make any sense, and it will prevent item 1 above from working correctly (49 of the 50 allocated pointer values will be lost).
So, as written, your code doesn't really make sense: it isn't a sensible thing to do, and there is no simple way to fix it while keeping the same structure. You could do something like this:
__global__ void DevTest() {
    __shared__ int* k1;
    if (threadIdx.x == 0) k1 = new int(0);   // one allocation per block, made by thread 0
    __syncthreads();                         // make sure k1 is set before anyone uses it
    atomicAdd(k1, 1);
    __syncthreads();                         // make sure all adds are finished before freeing
    if (threadIdx.x == 0) delete k1;         // matching delete for the new
}
Even such an arrangement could theoretically (perhaps, on some future large GPU) eventually run into the 8MB limit if you launched enough blocks (i.e. if there were enough resident blocks, and taking into account an undefined allocation granularity). So the correct approach is probably to do both 1 and 2.
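For completeness, here is a minimal sketch of what doing both could look like; the only new ingredient is cudaDeviceSetLimit with cudaLimitMallocHeapSize, which is the documented way to grow the device heap, and the 32 MB value here is just an arbitrary example:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void DevTest() {
    __shared__ int* k1;
    if (threadIdx.x == 0) k1 = new int(0);   // option 1: one allocation per block...
    __syncthreads();
    atomicAdd(k1, 1);
    __syncthreads();
    if (threadIdx.x == 0) delete k1;         // ...with a matching delete
}

int main()
{
    // option 2: raise the device heap limit (default 8 MB) before the first launch
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32 * 1024 * 1024);
    for (int i = 0; i < 100; i++) DevTest<<<50, 50>>>();
    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));  // report any launch or heap errors
    return 0;
}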
Related
I have this function written in C that reverses an array:
for (int i=0; i<N;i++){
A[i] = B[N-i-1];
}
I have to write a kernel function that does not suffer from uncoalesced memory access (caused by B[N-i-1]), using tiling and local memory. So the idea is: do the reverse in local memory and write the result back into array A. How can I do it? I'm a newbie.
Assumption: the input size matches the global size.
You already have the fastest solution.
To understand why, we need to dig a bit deeper.
The memory bandwidth of a GPU differs between coalesced and misaligned access, and it also differs between read and write operations. On most GPUs, misaligned reads are almost as fast as coalesced reads, but misaligned writes carry a large performance penalty. So misaligned reads are OK, but misaligned writes should be avoided.
In your example, you have coalesced writes to A and misaligned reads from B, so you already get peak memory bandwidth.
kernel void reverse_kernel(global float* A, global float* B) { // equivalent to "for(uint i=0u; i<N; i++) {", but executed in parallel
    const uint i = get_global_id(0);
    const uint N = get_global_size(0); // assumes the global size equals the array length N
    A[i] = B[N-1-i]; // coalesced write to A, misaligned read from B
}
For coalesced memory access, contiguous threads generally must access contiguous memory addresses. But there are some special cases where coalesced access is still allowed with different access patterns: strided access and broadcasting. I am not sure whether reversed access also falls into that class, but for the reads it doesn't matter anyway.
As for your initial question about local/shared memory: this is how you would do it. But there probably won't be any significant speedup here. The speedup from shared memory is only large if the data in shared memory is accessed multiple times; here you write and read it only once.
kernel void reverse_kernel(global float* A, global float* B) { // equivalent to "for(uint i=0u; i<N; i++) {", but executed in parallel
    const uint lid = get_local_id(0);
    const uint gid = get_group_id(0);
    const uint N = get_global_size(0); // assumes the global size equals the array length N
    local float cache[64]; // workgroup size on the C++ side must be set to 64 as well
    cache[lid] = B[64*gid+lid]; // coalesced read from B into local memory
    barrier(CLK_LOCAL_MEM_FENCE); // wait until the whole tile is loaded
    A[N-64*(gid+1)+lid] = cache[64-1-lid]; // coalesced write of the reversed tile to its mirrored position in A
}
Suppose I want to use malloc() to allocate some memory in the process:
for (i = 0; i < SOME_NUM; ++i) {
    int *x = malloc(sizeof(int *));
}
What is the biggest number that I can set SOME_NUM to?
In xv6 the physical memory is limited, and you can see the constant PHYSTOP, which is 224MB for simplicity reasons. Some of that memory holds kernel code and other data, so the rest can be used by a process if it needs to consume the remaining physical memory.
Note: PHYSTOP could be changed, but then you will have to modify the mappages function to map all pages.
Note 2: memory is handed out in pages, so you could use roughly PHYSTOP / page size as the loop bound. Well, I'm cheating here because, again, kernel data structures and code already occupy a portion of physical memory.
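To put a rough number on that estimate (assuming xv6's 4 KB pages and one page consumed per iteration): 224 MB / 4 KB = 57,344 pages, so SOME_NUM would top out somewhat below 57,344 once the kernel's own footprint is subtracted.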
I am using shared memory in Linux and I have some questions.
Since the memory attached with shmat is accessed through demand paging, resident memory grows every time the pointer into the shared memory advances.
However, this is a problem because the shared memory segment is considerably large.
So, is there a way to save memory, for example by freeing the pages that have already been read and processed before moving the pointer on?
The pseudo code is shown below.
My_struct *mem = NULL; // the structure size is 4K
// mem = shmat(...);   // attach the shared memory segment (call elided)

// For example, VmRSS starts at about 10K and then grows by 4K on each
// iteration. (I know this is due to demand paging, since My_struct is 4K.)
for (i = 0; i < 10000; i++)
{
    printf("%s\n", mem[i].name); // memory usage increases on every iteration
    sleep(1);                    // just a pause to check VmRSS
    // After 10,000 iterations this needs about 40M of memory.
    // (In the real program it dies from OOM inside this loop.)
}
shmdt(mem);
Thank you in advance.
According to the answer to this question:
Difference between malloc and calloc?
Isak Savo explains that:
calloc does indeed touch the memory (it writes zeroes on it) and thus you'll be sure the OS is backing the allocation with actual RAM (or swap). This is also why it is slower than malloc (not only does it have to zero it, the OS must also find a suitable memory area by possibly swapping out other processes)
So, I decided to try it myself:
#include <stdlib.h>
#include <stdio.h>

#define ONE_MB 1048576

int main() {
    int *p = calloc(ONE_MB, sizeof(int));
    int n;
    for (n = 0; n != EOF; n = getchar()) ; /* gives me time to inspect the process */
    free(p);
    return 0;
}
After executing this application, Windows's Task Manager would tell me that only 352 KB were being used out of RAM.
It appears that the 1MB block I allocated is not being backed with RAM by the OS.
On the other hand, however, if I would call malloc and initialize the array manually:
#include <stdlib.h>
#include <stdio.h>

#define ONE_MB 1048576

int main() {
    int *p = malloc(sizeof(int) * ONE_MB);
    int n;
    /* manual initialization */
    for (n = 0; n < ONE_MB; n++)
        p[n] = n;
    for (n = 0; n != EOF; n = getchar()) ; /* gives me time to inspect the process */
    free(p);
    return 0;
}
Task Manager would show me that the application is actually using 4,452 KB of RAM.
Was Isak incorrect about his argument? If so, what does calloc do then? Doesn't it zero the whole memory block, and therefore "touch" it, just as I did?
If that's the case, why isn't RAM being used in the first sample?
He was wrong on the point that calloc is much slower because it has to write 0 into the block first.
Any sensibly implemented OS prepares zeroed blocks of memory ahead of time for exactly such purposes (and calloc() is not the only consumer of them),
and when you call calloc() it simply assigns one of these pre-zeroed blocks to your process, instead of the uninitialized block it could hand out for malloc().
So the OS handles both kinds of blocks the same way, and if the OS decides you don't (yet) need the full 1MB, it also doesn't give you a full 1MB block of zeroed memory up front.
Where he was right:
If you call calloc() heavily and also actually use the memory, the OS can run out of zeroed blocks that were prepared in idle time.
That would indeed make the system a bit slower, since a call to calloc() then forces the OS to write the zeroes into the block on the spot.
But above all: there is no rule about whether malloc/calloc must commit the memory at the time of the call or only as you actually use it. So your particular example depends on how the OS treats it.
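If you want to watch the lazy commitment directly instead of through Task Manager, here is a minimal Linux-flavoured sketch (my own illustration, not part of the original answer) that reads VmRSS from /proc/self/status before and after touching every page of a calloc'd block:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ONE_MB 1048576

/* print the VmRSS line from /proc/self/status (Linux-specific) */
static void print_rss(const char *label)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) return;
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%s %s", label, line);
    fclose(f);
}

int main(void)
{
    int *p = calloc(ONE_MB, sizeof(int));
    print_rss("after calloc:  ");  /* typically small: the 4 MB is not committed yet */

    for (size_t n = 0; n < ONE_MB; n++)
        p[n] = (int)n;             /* touching every page forces the OS to commit it */
    print_rss("after touching:");  /* now roughly 4 MB larger */

    free(p);
    return 0;
}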
Can someone please help me with a very simple example on how to use shared memory? The example included in the Cuda C programming guide seems cluttered by irrelevant details.
For example, if I copy a large array to the device global memory and want to square each element, how can shared memory be used to speed this up? Or is it not useful in this case?
In the specific case you mention, shared memory is not useful, for the following reason: each data element is used only once. For shared memory to help, you must use the data transferred to shared memory several times, with good access patterns. The reason is simple: just reading from global memory costs 1 global memory read and zero shared memory reads; reading it into shared memory first would cost 1 global memory read and 1 shared memory read, which takes longer.
Here's a simple example, where each thread in the block computes the corresponding value, squared, plus the average of both its left and right neighbors, squared:
__global__ void compute_it(float *data)
{
    int tid = threadIdx.x;
    __shared__ float myblock[1024];
    float tmp;

    // load the thread's data element into shared memory
    myblock[tid] = data[tid];

    // ensure that all threads have loaded their values into
    // shared memory; otherwise, one thread might be computing
    // on uninitialized data.
    __syncthreads();

    // compute the average of this thread's left and right neighbors
    tmp = (myblock[tid > 0 ? tid - 1 : 1023] + myblock[tid < 1023 ? tid + 1 : 0]) * 0.5f;
    // square the previous result and add my value, squared
    tmp = tmp * tmp + myblock[tid] * myblock[tid];

    // write the result back to global memory
    data[tid] = tmp;
}
Note that this is envisioned to work using only one block. The extension to more blocks should be straightforward. Assumes block dimension (1024, 1, 1) and grid dimension (1, 1, 1).
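To make that extension concrete, here is one possible sketch (my own, not from the original answer) of a multi-block variant. It assumes the array length n is a multiple of blockDim.x, that the kernel is launched with (blockDim.x + 2) * sizeof(float) bytes of dynamic shared memory for a tile plus two halo cells, and it writes to a separate output array so that one block's writes cannot race with another block's halo reads:

__global__ void compute_it_multiblock(const float *in, float *out, int n)
{
    int tid = threadIdx.x;                   // index within the block
    int gid = blockIdx.x * blockDim.x + tid; // global index
    extern __shared__ float tile[];          // blockDim.x elements plus 2 halo cells

    // each thread loads its own element into the interior of the tile
    tile[tid + 1] = in[gid];
    // the first and last thread of each block also load the halo cells,
    // wrapping around at the ends of the whole array like the original kernel
    if (tid == 0)              tile[0]              = in[gid > 0     ? gid - 1 : n - 1];
    if (tid == blockDim.x - 1) tile[blockDim.x + 1] = in[gid < n - 1 ? gid + 1 : 0];
    __syncthreads();

    // average of left and right neighbors, squared, plus my own value, squared
    float tmp = (tile[tid] + tile[tid + 2]) * 0.5f;
    out[gid] = tmp * tmp + tile[tid + 1] * tile[tid + 1];
}

A launch for an array of n = 4096 floats could then look like compute_it_multiblock<<<4, 1024, (1024 + 2) * sizeof(float)>>>(d_in, d_out, 4096);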
Think of shared memory as an explicitly managed cache - it's only useful if you need to access data more than once, either within the same thread or from different threads within the same block. If you're only accessing data once then shared memory isn't going to help you.