I have this piece of C code that reverses an array:
for (int i = 0; i < N; i++) {
    A[i] = B[N - i - 1];
}
I have to write a kernel function that does not suffer from uncoalesced memory access (caused by the B[N-i-1] read), using tiling and local memory. So the idea is: do the reverse in local memory and write the result back to array A. How can I do it? I'm a newbie.
Assumption: the input size matches the global size.
You already have the fastest solution.
To understand why, we need to dig a bit deeper.
The memory bandwidth of a GPU is different for coalesced and misaligned access, and it also differs between read and write operations. On most GPUs, misaligned reads are almost as fast as coalesced reads, but misaligned writes come with a large performance penalty. So misaligned reads are ok, but misaligned writes should be avoided.
In your example, you have coalesced writes to A and misaligned reads from B, so you already get peak memory bandwidth.
kernel void reverse_kernel(global float* A, global float* B) { // equivalent to "for(uint i=0u; i<N; i++) {", but executed in parallel
    const uint i = get_global_id(0);
    const uint N = get_global_size(0); // assumption from above: the input size matches the global size
    A[i] = B[N-1-i]; // coalesced write to A, misaligned (reversed) read from B
}
For coalesced memory access, contiguous threads generally must access contiguous memory addresses. But there are some special cases where access is still coalesced with different access patterns: strided access and broadcasting. I am not sure whether reversed access also falls into that class, but for the reads it doesn't matter anyway.
Regarding your initial question about shared (local) memory: this is how you would do it. But there probably won't be any significant speedup here. The speedup from shared memory is only large if the shared memory is accessed multiple times; here you write and read it only once.
kernel void reverse_kernel(global float* A, global float* B) { // equivalent to "for(uint i=0u; i<N; i++) {", but executed in parallel
    const uint i = get_global_id(0);
    const uint lid = get_local_id(0);
    const uint gid = get_group_id(0);
    const uint N = get_global_size(0); // assumption from above: input size matches global size (and is a multiple of 64)
    local float cache[64]; // workgroup size on the host side must be set to 64 as well
    cache[lid] = B[N-64*(gid+1)+lid]; // coalesced read of the mirrored tile from B
    barrier(CLK_LOCAL_MEM_FENCE);
    A[i] = cache[64-1-lid]; // reversed read from local memory, coalesced write to A
}
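On the host side, the only extra requirement of the kernel above is that the local work size passed to clEnqueueNDRangeKernel is 64 and that the global size (= N) is a multiple of it. A minimal sketch, assuming the queue, kernel, and buffers (here called A_buf and B_buf) were already created elsewhere:

#include <CL/cl.h>

// Hypothetical helper: enqueue reverse_kernel with a workgroup size of 64.
// queue, kernel, A_buf and B_buf are assumed to exist already; N must be a
// multiple of the workgroup size for this simple kernel.
void enqueue_reverse(cl_command_queue queue, cl_kernel kernel,
                     cl_mem A_buf, cl_mem B_buf, size_t N)
{
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &A_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &B_buf);
    const size_t global_size = N;  // one work-item per array element
    const size_t local_size = 64;  // must match "local float cache[64]" in the kernel
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
}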
I don't know how OpenMP works, but I presume calling a function with restricted pointer arguments inside a parallel for loop doesn't work if the objects could be shared by multiple threads? Take the following example of serial code meant to perform a weighted sum across matrix columns:
const int n = 10;
const double x[n][n] = {...}; // matrix, containing some numbers
const double w[n] = {...}; // weights, containing some numbers
// my weighted sum function
double mywsum(const double *restrict px, const double *restrict pw, const int n) {
    double tmp = 0.0;
    for(int i = 0; i < n; ++i) tmp += px[i] * pw[i];
    return tmp;
}
double res[n]; // results vector
const double *pw = &w[0]; // creating pointer to w
// loop doing column-wise weighted sum
for(int j = 0; j < n; ++j) {
    res[j] = mywsum(&x[0][j], pw, n);
}
Now I want to parallelize this loop using OpenMP, e.g.:
#pragma omp parallel for
for(int j = 0; j < n; ++j) {
    res[j] = mywsum(&x[0][j], pw, n);
}
I believe the *restrict px could still be valid, as the particular elements pointed to can only be accessed by one thread at a time, but the *restrict pw should cause problems since the elements of w are accessed concurrently by multiple threads, so the restrict qualifier should be removed here?
I presume calling a function with restricted pointer arguments inside a parallel for loop doesn't work if the objects could be shared by multiple threads?
The restrict keyword is totally independent of the use of multiple threads. It tells the compiler that the pointer targets an object that is not aliased, that is, not referenced by any other pointer in the function. It is meant to express the absence of aliasing in C. The fact that other threads can call the function is not a problem. In fact, if threads write to the same location, you have a much bigger problem: a race condition. If multiple threads read from the same location, this is not a problem (with or without the restrict keyword). The compiler basically does not care about multi-threading when the function mywsum is compiled. It can ignore the effect of other threads since there are no locks, atomic operations, or memory barriers.
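To see what restrict buys you independently of threading, here is a minimal single-threaded sketch (the function name is made up for illustration):

// Without restrict, the compiler would have to assume that *sum may alias
// px[i] and re-store the accumulator on every iteration. With restrict it
// can keep it in a register, just like the local tmp in mywsum.
double sum_into(double *restrict sum, const double *restrict px, const int n) {
    for(int i = 0; i < n; ++i) *sum += px[i];
    return *sum;
}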
I believe the *restrict px could still be valid as the particular elements pointed to can only be accessed by one thread at a time, but the *restrict pw should cause problems as the elements of w are accessed concurrently by multiple threads, so the restrict clause should be removed here?
It should be removed because it is not useful, not because it causes any issue.
The use of the restrict keyword is not very useful here, since the compiler can easily see that there is no possible overlap. Indeed, the only store done in the loop is to tmp, which is a local variable, and the input arguments cannot point to tmp because it is a local variable. In fact, compilers will keep tmp in a register if optimizations are enabled (so it does not even have an address in practice).
One should keep in mind that restrict is bound to the function scope where it is defined (i.e. the function mywsum). Thus, inlining or the use of the function in a multithreaded context has no impact on the result with respect to the restrict keyword.
I think &x[0][j] is wrong, because the loop in the function iterates over n items while the pointer starts at the j-th item. This means the loop accesses the item x[0][j+n-1], theoretically causing out-of-bounds accesses. In practice you will observe no error, because 2D C arrays are flattened in memory and &x[0][n] is equal to &x[1][0] in your case. But the result will certainly not be what you want: the function sums n consecutive elements starting at x[0][j], not the j-th column.
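If a column-wise weighted sum is really what is wanted, one possible fix (a sketch, with a made-up name mywsum_strided) is to pass the row stride explicitly so the function steps down a column of the row-major matrix:

// element i of column j is at px[i*stride] when px points to &x[0][j]
double mywsum_strided(const double *restrict px, const double *restrict pw, const int n, const int stride) {
    double tmp = 0.0;
    for(int i = 0; i < n; ++i) tmp += px[i*stride] * pw[i];
    return tmp;
}

// usage inside the (possibly OpenMP-parallel) loop:
// res[j] = mywsum_strided(&x[0][j], pw, n, n);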
The atomic operation in my program works correctly only as long as I don't increase the grid size or call the kernel again. How can this be? Perhaps shared memory isn't automatically freed?
__global__ void DevTest() {
    __shared__ int* k1;
    k1 = new int(0);
    atomicAdd(k1, 1);
}

int main()
{
    for (int i = 0; i < 100; i++) DevTest<<<50, 50>>>();
}
This:
__shared__ int* k1;
creates storage in shared memory for a pointer. The pointer is uninitialized there; it doesn't point to anything.
This:
k1 = new int(0);
sets that pointer to point to a location on the device heap, not in shared memory. The device heap is limited by default to 8MB. Furthermore, there is an allocation granularity, such that a single int allocation may use up more than 4 bytes of device heap space (it will).
It's generally good practice in C++ to have a corresponding delete for every new operation. Your kernel code does not have this, so as you increase the grid size, you will use up more and more of the device heap memory. You will eventually run into the 8MB limit.
So there are at least 2 options to fix this:
delete the allocation created with new at the end of the thread code
increase the limit on the device heap with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...); see the sketch at the end of this answer
As an aside, shared memory is shared by all the threads in the threadblock. So for your kernel launch of <<<50,50>>> you have 50 threads in each threadblock. Each of those 50 threads will see the same k1 pointer, and each will try to set it to a separate location/allocation, as each executes the new operation. This doesn't make any sense, and it will prevent item 1 above from working correctly (49 of the 50 allocated pointer values will be lost).
So your code doesn't really make any sense. What you are showing here is not a sensible thing to do, and there is no simple way to fix it. You could do something like this:
__global__ void DevTest() {
    __shared__ int* k1;
    if (threadIdx.x == 0) k1 = new int(0); // only one thread per block allocates
    __syncthreads();                       // make the pointer visible to all threads in the block
    atomicAdd(k1, 1);
    __syncthreads();                       // make sure every thread is done with the allocation
    if (threadIdx.x == 0) delete k1;       // free it, so the device heap is not used up
}
Even such an arrangement could theoretically (perhaps, on some future large GPU) eventually run into the 8MB limit if you launched enough blocks (i.e. if there were enough resident blocks, and taking into account an undefined allocation granularity). So the correct approach is probably to do both 1 and 2.
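For option 2, this is roughly what raising the device heap limit looks like from host code; the 64MB value here is just an example:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    // Raise the device-side new/malloc heap from the default 8MB to 64MB.
    // This must happen before launching any kernel that uses new/malloc on the device.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);

    size_t heap_size = 0;
    cudaDeviceGetLimit(&heap_size, cudaLimitMallocHeapSize);
    printf("device heap size: %zu bytes\n", heap_size);
    return 0;
}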
I'm trying to implement a simple OS and now have to implement memory management.
At first, we wrote the simple code below to check the memory size.
The problem I ran into is that the result of this function depends on the increment size.
If I set the increment to 1024, the function returns 640KB.
However, if I set the increment to 1024*1024, the function returns 120MB.
(My system's (Bochs) memory is set to 120MB.)
I checked the optimization options and the A20 gate.
Does anyone know why my function doesn't work well?
unsigned int memtest_sub(unsigned int start, unsigned int end)
{
    unsigned int i;
    unsigned int* ptr;
    unsigned int orgValue;
    const unsigned int testValue = 0xbfbfbfbf;

    for (i = start; i <= end; i += 1024*1024) {
        ptr = (unsigned int*) i;
        orgValue = *ptr;
        *ptr = testValue;
        if (*ptr != testValue) {
            break;
        }
        *ptr = orgValue;
    }
    return i;
}
You can't do probes like that.
First the memory isn't necessarily contiguous as you've already discovered. It almost never is. The hole at 640k is for legacy reasons, but even further in the memory is usually split up. You have to ask your firmware for the memory layout.
Second some memory banks might be double mapped into the physical space and you'll end up in real trouble if you start using them. This isn't very common, but it's a real pain to deal with it.
Third, and probably most important, there are devices mapped into that space. By writing to random addresses you're potentially writing to registers of important hardware. Writing back whatever you read won't do you any good, because some hardware registers have side effects as soon as you write to them. As a matter of fact, some hardware registers have side effects when you read them. Some of that hardware isn't necessarily protected and you might do permanent damage. I've bricked ethernet hardware in the past by having pointer errors in a 1:1 mapped kernel, because the EEPROM/flash was unprotected. Other places you write to might actually change the layout of the memory itself.
Since you're most likely on i386, read this: http://wiki.osdev.org/Detecting_Memory_(x86)
Also, consider using a boot loader that detects memory for you and communicates that and other important information you need to know in a well defined API. The boot loader is better debugged with respect to all weird variants of hardware.
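For reference, the entries that the INT 15h, EAX=E820h call described on that page hands back look roughly like this (a sketch of the layout documented there; issuing the real-mode call itself is a separate exercise):

#include <stdint.h>

// One entry of the E820 memory map returned by INT 15h / EAX=E820h.
// Only regions with type == 1 are usable RAM; everything else (reserved,
// ACPI, ...) must not be handed to your allocator.
struct e820_entry {
    uint64_t base;       // physical start address of the region
    uint64_t length;     // size of the region in bytes
    uint32_t type;       // 1 = usable RAM, 2 = reserved, 3 = ACPI reclaimable, ...
    uint32_t acpi_attrs; // extended attributes (ACPI 3.0 entries only)
} __attribute__((packed));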
The following assignments are buggy:
ptr = (unsigned int*) i;
orgValue = *ptr;
*ptr = testValue;
ptr is not pointing to any valid memory; you can't treat i's value as an address where you can perform read/write operations - undefined behaviour.
In my algorithm I now work with static arrays, not dynamic ones. But I sometimes reach the limit of the stack. Am I right that static arrays are stored on the stack?
Which parameters affect my maximum stack size for one C program?
Are there many system parameters which affect the maximal array size? Does the maximum number of elements depend on the array type? Does it depend on the total system RAM? Or does every C program have a static maximum stack size?
Am I right that static arrays are stored on the stack?
No, static arrays are stored in the static storage area. The automatic ones (i.e. ones declared inside functions, with no static storage specifier) are allocated on the stack.
Which parameters affect my maximum stack size for one C program?
This is system-dependent. On some operating systems you can change stack size programmatically.
Running out of stack space due to automatic storage allocation is a clear sign that you need to reconsider your memory strategy: you should either allocate the buffer in the static storage area if re-entrancy is not an issue, or use dynamic allocation for the largest of your arrays.
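For example, a large automatic array that blows the stack can simply be moved to the heap (a minimal sketch):

#include <stdlib.h>

void work(void) {
    // double big[10*1000*1000];  // ~80MB of automatic storage: likely to overflow the stack
    double *big = malloc(10*1000*1000 * sizeof *big); // heap allocation instead
    if (big == NULL) return; // dynamic allocation can fail, so check it
    // ... use big[i] exactly like the array ...
    free(big);
}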
Actually, it depends on the C compiler for the platform you use.
As an example, there are even systems which don't have a real stack so recursion won't work.
A static array is compiled as one contiguous memory area, accessed through pointers. The pointers might be two or four bytes in size (or maybe even only one on exotic platforms).
There are platforms which use memory pages and have "near" and "far" pointers, which differ in size (and speed, of course). So it could be the case that the pointers representing the array and the objects need to fit into the same memory page.
On embedded systems, static data usually is collected in the memory area which will later be represented by the read-only memory. So your array will have to fit in there.
On platforms which run arbitrary applications, RAM is the limiting factor if none of the above applies.
Most of your questions have been answered, but just to give an answer that made my life a lot easier:
Qualitatively, the maximum size of a non-dynamically allocated array depends on the amount of RAM that you have. It also depends on the type of the array: e.g. an int may be 4 bytes while a double may be 8 bytes (sizes are system dependent), so you will be able to have an array with twice as many elements if you use int instead of double.
Having said that, and keeping in mind that sometimes numbers are indeed important, here is a very noobish code snippet to help you find the maximum in your system.
#include <stdio.h>
#include <stdlib.h>

#define UPPER_LIMIT 10000000000000 // a very big number

int main (int argc, const char * argv[])
{
    long int_size = sizeof(int);

    for (long i = 1; i < UPPER_LIMIT; i++) // long, so the counter itself cannot overflow
    {
        int c[i]; // variable-length array on the stack; grows until the stack overflows
        for (long j = 0; j < i; j++)
        {
            c[j] = j;
        }
        printf("You can set the array size at %ld, which means %ld bytes. \n", i, int_size*i);
    }
}
P.S.: It may take a while until you reach your system's maximum and produce the expected segmentation fault, so you may want to change the initial value of i to something closer to your system's limit (in number of array elements).
Can someone please help me with a very simple example of how to use shared memory? The example included in the CUDA C Programming Guide seems cluttered with irrelevant details.
For example, if I copy a large array to the device global memory and want to square each element, how can shared memory be used to speed this up? Or is it not useful in this case?
In the specific case you mention, shared memory is not useful, for the following reason: each data element is used only once. For shared memory to help, you must use the data transferred to shared memory several times, with good access patterns. The reason for this is simple: just reading from global memory requires 1 global memory read and zero shared memory reads; reading it into shared memory first would require 1 global memory read and 1 shared memory read, which takes longer.
Here's a simple example, where each thread in the block computes the corresponding value, squared, plus the average of both its left and right neighbors, squared:
__global__ void compute_it(float *data)
{
    int tid = threadIdx.x;
    __shared__ float myblock[1024];
    float tmp;

    // load the thread's data element into shared memory
    myblock[tid] = data[tid];

    // ensure that all threads have loaded their values into
    // shared memory; otherwise, one thread might be computing
    // on uninitialized data.
    __syncthreads();

    // compute the average of this thread's left and right neighbors
    tmp = (myblock[tid > 0 ? tid - 1 : 1023] + myblock[tid < 1023 ? tid + 1 : 0]) * 0.5f;

    // square the previous result and add my value, squared
    tmp = tmp * tmp + myblock[tid] * myblock[tid];

    // write the result back to global memory
    data[tid] = tmp;
}
Note that this is envisioned to work using only one block; the extension to more blocks should be straightforward. It assumes a block dimension of (1024, 1, 1) and a grid dimension of (1, 1, 1).
Think of shared memory as an explicitly managed cache - it's only useful if you need to access data more than once, either within the same thread or from different threads within the same block. If you're only accessing data once then shared memory isn't going to help you.