I have a vector called d_index computed in CUDA device memory, and I want to change just one value in it, like this...
d_index[columnsA-rowsA]=columnsA;
How can I do this without having to copy it to the system memory and then back to the device memory?
You could either launch a kernel on a <<<1,1>>> grid that changes only the desired element:
__global__ void change_elem(int *arr, int idx, int val) {
arr[idx] = val;
}
// ....
// Somewhere in CPU code
change_elem<<<1,1>>>(d_index, columnsA-rowsA, columnsA);
or use something like:
int tmp = columnsA;
cudaMemcpy(&d_index[columnsA-rowsA], &tmp, sizeof(int), cudaMemcpyHostToDevice);
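Since this runs from host code anyway, it costs little to check the copy's return status; a minimal sketch using the standard CUDA runtime API:

#include <stdio.h>   /* for fprintf */

int tmp = columnsA;
cudaError_t err = cudaMemcpy(&d_index[columnsA - rowsA], &tmp,
                             sizeof(int), cudaMemcpyHostToDevice);
if (err != cudaSuccess)
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));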
If you only do this once, I don't think it makes much difference which version you use. If you run this code often, you should consider folding this array modification into some other kernel to avoid the invocation overhead.
Host (CPU) code cannot directly access device memory, so you have a few choices:
Launch a single-thread kernel (e.g. update_array<<<1,1>>>(index, value))
Use cudaMemcpy() to copy the new value to the right location
Use a Thrust device_vector, whose element assignment from host code does the small device copy for you
Of course, updating a single value in an array is very inefficient; hopefully you've considered whether this is necessary, or whether it could be avoided. For example, could you update the array as part of the GPU code?
I think that since the d_index array is in device memory, it can be accessed directly by every thread.
Suppose we have a C library my_image_library that provides the functions
image* image_new_from_file(const char* path);
void image_free(image*);
and Lua bindings like this:
typedef struct my_image_t {
image* inner;
} my_image_t;
int lua_image_new_from_file(lua_State* L) {
my_image_t* img = lua_newuserdata(L, sizeof(my_image_t));
const char* path = lua_tostring(L, 1);
img->inner = image_new_from_file(path);
// Assume set up of a metatable with `__gc` that calls `image_free` on `inner`
return 1;
}
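For reference, the __gc handler assumed above would look something along these lines (a sketch of the usual pattern; the metatable registration itself is elided here):

static int lua_image_gc(lua_State* L) {
    my_image_t* img = lua_touserdata(L, 1);
    if (img->inner) {
        image_free(img->inner);
        img->inner = NULL;  /* guard against double-free */
    }
    return 0;
}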
As far as I understand, to Lua the userdatum looks like it has a size of 8 bytes (the single pointer), whereas the application's actual heap consumption could be dozens of megabytes per image.
That means that if a program using these bindings were to create and subsequently release many of those values, the GC would only know about a couple hundred bytes allocated and probably wouldn't trigger for a while, while the "external" image data keeps piling up.
Is there any way of making the GC aware of the "actual" size of that userdata?
Or is there any other way to improve this situation?
I guess one could change the member to image inner; and copy the entire memory region into the userdata, but the double allocation and copying seem very inefficient. And I'm not experienced enough in C to be sure that this would actually work for every library that returns an internally allocated pointer.
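One thing that might help (a sketch, under the assumption that the library exposes some size query; image_size_bytes() below is hypothetical): the C API lets you drive the collector manually, so the binding could perform extra GC work in rough proportion to the external allocation:

// After img->inner = image_new_from_file(path); nudge the GC so its
// idea of memory pressure tracks the real allocation.
// image_size_bytes() is hypothetical; substitute whatever size
// information my_image_library actually provides.
lua_gc(L, LUA_GCSTEP, (int)(image_size_bytes(img->inner) / 1024));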
Our company bought a proprietary C function: we have a compiled library ProcessData.a and an interface file to call it:
// ProcessData.h
void ProcessData(char* pointer_to_data, int data_len);
We want to use this function on an ARM embedded CPU and we want to know how much stack space it might use.
Question: how to measure the stack usage of an arbitrary function?
What I've tried so far is implementing the following helper functions:
extern int _sstack;  // assumed provided by the linker script: lowest address of the stack
static int* stackPointerBeforeCall;
void StartStackMeasurement(void) {
asm ("mov %0, sp" : "=r"(stackPointerBeforeCall));
// For some reason I can't overwrite values immediately below the
// stack pointer. I suspect a return address is placed there.
static int* pointer;
pointer = stackPointerBeforeCall - 4;
// Filling all unused stack space with a fixed constant
while (pointer != &_sstack) {
*pointer = 0xEEEEEEEE;
pointer--;
}
*pointer = 0xEEEEEEEE;
}
void FinishStackMeasurement(void) {
int* lastUnusedAddress = &_sstack;
while (*lastUnusedAddress == 0xEEEEEEEE) {
lastUnusedAddress++;
}
// Printing how many stack bytes a function has used
printf("STACK: %d\n", (stackPointerBeforeCall-lastUnusedAddress)*sizeof(int));
}
And then use them just before and after the function call:
StartStackMeasurement();
ProcessData(array, sizeof(array));
FinishStackMeasurement();
But this seems like a dangerous hack, especially the part where I subtract 4 from stackPointerBeforeCall and overwrite everything below it. Is there a better way?
Compile the program and analyze the assembly or machine code for the function in question. Many functions use the stack in a static manner, and this static size can be determined by analyzing the compiled code; on ARM, for example, the prologue's push {...} and sub sp, sp, #N instructions together give the fixed frame size. Some functions dynamically allocate stack space based on some computation, usually derived from an input parameter. In those cases you'll see different instructions being used to allocate stack space, and you'll have to work backwards to figure out how the dynamic stack size is derived.
Of course, this analysis has to be redone whenever the function (library) is updated.
You can use getrusage, a function that reports your process's resource usage; in particular the ru_isrss field, which is
An integral value expressed the same way, which is the amount of unshared memory used for stack space
(source)
You can then compare it to the stack usage of your program with a mocked call to the library.
However, this will only work if your system actually implements ru_isrss; Linux does not, and simply leaves the field set to 0.
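A minimal sketch of that comparison: run the program once with the real ProcessData and once with a mocked one, and compare the printed values (remember that on Linux this always prints 0):

#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    /* ... run the workload, calling either the real or a mocked ProcessData ... */
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0)
        printf("ru_isrss: %ld\n", ru.ru_isrss);
    return 0;
}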
This is my situation, I have an array of pointers that point to arrays of some data... Let's say:
Data** array = malloc(100 * sizeof(Data*));
for(i = 0; i < 100; i++) array[i] = malloc(20 * sizeof(Data));
Inside a parallel region, I make some operations that use that data. For instance:
#pragma omp parallel num_threads(4) firstprivate(array)
{
function(array[0], array[omp_get_thread_num()]);
}
The first parameter is read-only, and it is the same across all threads...
The problem is that if I use a different block of data as the first parameter, i.e. array[omp_get_thread_num()+1], each function call takes 1 second. But when I use the same block of data, array[0], each call takes 4 seconds.
My theory is that there is no way to know whether array[0] will be changed by the function, so each thread asks for its own copy and invalidates the copies that the other threads hold, and that would explain the delay...
I tried to make a local copy of array[0] like this:
#pragma omp parallel num_threads(4) firstprivate(array)
{
Data* tempData = malloc(20 * sizeof(Data));
memcpy(tempData,array[0], 20*sizeof(Data));
function(tempData, array[omp_get_thread_num()]);
}
But I get the same result... It's as if a thread doesn't 'release' the Data block so that the other threads can use it...
I should note that the first parameter is not always array[0], so I can't use firstprivate(array[0]) in the pragma line...
Questions are:
Am I doing something wrong?
Is there a way to 'release' a shared block of memory so other threads could use it?
It was very difficult to make myself understood, so if you need further information, please let me know!
Thanks in advance... Javier
EDIT: I can't change the function declaration because it comes from a library! (ACML)
I think you are right in your analysis that the compiler has no way to know that the pointed-to arrays don't change behind its back. Actually, it knows that they might change, since thread 0 receives the same array[0] also as a modifiable argument.
So it has to reload the values far too often. First, you should declare your function something like
void function(Data const*restrict A, Data*restrict B);
This tells the compiler, first, that the values in A can't be changed, and second, that neither pointer aliases the other (or any other pointer), so it knows that the values in the arrays will only be changed by the function itself.
For thread number 0 that assertion wouldn't hold: arrays A and B would actually be the same. So you'd best copy array[0] to a common temparray before you enter the #pragma omp parallel, and pass that same temparray as the first argument to every thread:
Data const* tempData = memcpy(malloc(20 * sizeof(Data)), array[0], 20*sizeof(Data));
#pragma omp parallel num_threads(4)
function(tempData, array[omp_get_thread_num()]);
I think you are wrong in your analysis. If the data is not changed, IMHO it will not be synchronized between the cores. There are two more probable reasons for the slowdown.
Core #0 gets function(array[0], array[0]). You said the first parameter is read-only, but the second is not. So core #0 will change the data in array[0], and the CPU will have to synchronize this data between the cores all the time.
The second possible reason is the small size of your arrays (20 elements). Core #1 gets a pointer to a 20-element array, and core #2 gets a pointer to the next array, which probably sits right after the first one in memory. Thus there is a high probability that they lie on the same cache line. The CPU does not track changes to each particular element: if it sees that anything on a cache line has changed, it synchronizes the whole line between the cores (false sharing). The solution is to pad each array so that arrays touched by different threads never share a cache line; cache lines are typically 64 or 128 bytes, so a little unused space after the 20 elements is enough, as in the sketch just below.
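Something along these lines (assuming 64-byte cache lines; check your CPU and adjust CACHE_LINE accordingly):

#include <stdlib.h>
#define CACHE_LINE 64

/* Give each row its own cache-line-aligned block, rounded up to a whole
   number of lines, so rows written by different threads never share a line. */
size_t row_bytes = ((20 * sizeof(Data) + CACHE_LINE - 1) / CACHE_LINE) * CACHE_LINE;
for (i = 0; i < 100; i++) {
    void *p;
    if (posix_memalign(&p, CACHE_LINE, row_bytes) != 0)
        abort();
    array[i] = p;
}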
My guess is that you have both problems, #1 and #2, in your code.
How can I make a parameter vector be treated as a local variable in each thread in CUDA?
__global__ void kern(char *text, int N){
// if I change text[0] = '0'; here, I want it to affect only the current thread, not the other threads
}
Thanks!
Every thread will receive the same input parameters, so in this case char *text is the same in every thread - that's a fundamental part of the programming model. Since the pointer points to global memory, if one thread changes data through the pointer (i.e. modifies the global memory) then the change affects all threads (ignoring hazards).
This is exactly the same as standard C, except now you have multiple threads accessing through the pointer. In other words, if you modify text[0] inside a standard C function then the changes are visible outside the function.
If I understand correctly, you're asking for every thread to have a local copy of the contents of text. Well the solution is exactly the same as for standard C if you don't want changes visible outside the function:
__global__ void kern(char* text, int N) {
    // Option A: if you have an upper bound NMAX for N
    char localtext[NMAX];

    // Option B: if you don't know the range of N, allocate per thread
    // (requires sm_20+; device-side malloc has a performance cost):
    // char* localtext = (char*)malloc(N * sizeof(char));

    // Copy from text to localtext. Since every thread reads the same
    // address at the same time, the loads are broadcast from the L1 cache.
    for (int i = 0; i < N; i++)
        localtext[i] = text[i];
    //...
}
Note that I'm assuming you have sm_20 or later. Also note that while using malloc in device code is possible, you will pay a performance price.
Suppose that the function
void foo(int n, double x[])
sorts the n-vector x, does some operations on x, and then restores the original ordering of x before returning. So internally, foo needs some temporary storage, e.g. at least an n-vector of integers, so that it can store the original ordering.
What's the best way to handle this temporary storage? I can think of two obvious approaches:
foo allocates its own workspace by declaring an internal (variable-length) array, i.e., at the top of foo we have
int temp[n];
in the main calling routine, dynamically allocate the n-vector of ints once and pass that storage in at each call to a version of foo that accepts the temporary storage as a third argument, i.e.,
int *temp = malloc(n * sizeof(int));
foo(n, x, temp);
I'm worried that option 1 is inefficient (the function foo will get called many times with the same n), and option 2 is just plain ugly, since I have to carry this temporary storage around so that it's always available wherever I happen to need a call to foo(n, x).
Are there other more elegant options?
If you end up using option 2 – that is, the function uses memory that is allocated elsewhere – use proper encapsulation.
In a nutshell, don’t pass in a raw array, pass in a context object which has matching init and release functions.
Then the user must still pass in the context and properly set it up and tear it down but the details are hidden from her and she doesn’t care about the details of the allocation. This is a common pattern in C.
typedef struct {
double* storage;
} foo_context;
void foo_context_init(foo_context*, int n);
void foo_context_free(foo_context*);
void foo(foo_context* context, int n, double x[]);
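A minimal sketch of the matching implementations and usage (my illustration of the pattern; error handling elided):

#include <stdlib.h>

void foo_context_init(foo_context* context, int n) {
    context->storage = malloc(n * sizeof(double));
}

void foo_context_free(foo_context* context) {
    free(context->storage);
    context->storage = NULL;
}

/* One allocation amortized over many calls: */
foo_context ctx;
foo_context_init(&ctx, n);
foo(&ctx, n, x);            /* ...repeated as often as needed... */
foo_context_free(&ctx);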
Now, for a very simple case like this, that is clearly a tremendous overhead, and I agree with Oli that option 1 is called for.
Option 1 is clearly the cleanest (because it's completely encapsulated). So go with Option 1 until profiling has determined that this is a bottleneck.
Update
R.'s comment below is correct; this could blow your stack if n is large. The pre-C99 "encapsulated" method would be to malloc the local array rather than putting it on the stack.
On most architectures option 1 is very efficient, since it allocates memory on the stack and typically costs just an adjustment of the stack and/or frame pointer. Just be careful not to make n too large.
As Oli said in his answer, the best approach is for the function to be autonomous about this temporary array. A single allocation is not going to cost a lot unless the function is called in a very tight loop... so get it right first, then profile, and then decide whether the optimization is worth it.
That said, in a few cases, after profiling and when the temp data structure needed was a bit more complex than a single int array, I have adopted the following approach:
#include <stdio.h>   /* fprintf */
#include <stdlib.h>  /* realloc, abort */

void foo(int n, ... other parameters ...)
{
    static int *temp_array;
    static int temp_array_size;
    if (n > temp_array_size)
    {
        /* The temp array we have is not big enough; grow it.
           realloc(NULL, ...) acts like malloc on the first call. */
        temp_array = realloc(temp_array, n * sizeof(int));
        if (!temp_array) {
            fprintf(stderr, "Out of memory\n");
            abort();
        }
        temp_array_size = n;
    }
    ... use temp_array ...
}
Note that using static storage rules out, for example, multithreading and recursion, and this should be clearly stated in the documentation.
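If multithreading is what rules this out for you, one variation (my own sketch; requires C11) makes the cached workspace per-thread:

void foo(int n, ... other parameters ...)
{
    /* C11 thread-local statics: each thread keeps its own cached
       workspace, making the realloc trick above thread-safe. The
       memory is only reclaimed when each thread exits. */
    static _Thread_local int *temp_array;
    static _Thread_local int temp_array_size;
    /* ... rest as above ... */
}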