CUDA shared memory not faster than global? - arrays

Hi, I have a kernel function where I need to compare bytes. The area I want to search is divided into blocks, so an array of 4k bytes is divided into 4k/256 = 16 blocks. Each thread in a block reads the array at idx and compares it with another array that holds what I am searching for. I've done this in two ways:
1. Compare the data in global memory, but often threads in a block need to read the same address.
2. Copy the data from global memory to shared memory, and compare the bytes in shared memory the same way as above. The same-address reads remain.
The copy to shared memory looks like this:
myArray[idx] = global[someIndex-idx];
whatToSearch[idx] = global[someIndex+idx];
The rest of the code is the same; in option 2 the comparisons simply operate on the shared arrays.
But the first option is about 10% faster than the shared-memory version. Why? Thank you for any explanations.

If you are only using the data once and there is no data reuse between different threads in a block, then using shared memory will actually be slower. The reason is that when you copy data from global memory to shared, it still counts as a global transaction. Reads are faster when you read from shared memory, but it doesn't matter because you already had to read the memory once from global, and the second step of reading from shared memory is just an extra step that doesn't provide anything of value.
So, the key point is that using shared memory is only useful when you need to access the same data more than once (whether from the same thread, or from different threads in the same block).
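To make the contrast concrete, here is a minimal sketch of the kind of access pattern where shared memory does pay off: every thread in the block rereads the same small search pattern, so it is staged in shared memory once and then reused many times. All names and sizes below are hypothetical, and bounds checks are omitted.

__global__ void searchKernel(const unsigned char *haystack,
                             const unsigned char *pattern,
                             int patternLen, int *matches)
{
    __shared__ unsigned char sPattern[256];      // assumes patternLen <= 256

    // Cooperative copy: each thread loads at most one byte of the pattern.
    if (threadIdx.x < patternLen)
        sPattern[threadIdx.x] = pattern[threadIdx.x];
    __syncthreads();                             // pattern now visible to the whole block

    int start = blockIdx.x * blockDim.x + threadIdx.x;
    int hit = 1;
    for (int i = 0; i < patternLen; ++i)         // every thread rereads the same bytes
        if (haystack[start + i] != sPattern[i]) { hit = 0; break; }
    matches[start] = hit;
}

Here the pattern bytes are fetched from global memory only once per block instead of once per thread per comparison. In the question's code, each byte that lands in shared memory is used only once, so the extra copy cannot win anything back.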

You are using shared memory to save on accesses to global memory, but each thread is still making two accesses to global memory, so it won't be faster. The speed drop is probably because the threads that access the same location in global memory within a block try to read it into the same location in shared memory, and this needs to be serialized.
I'm not sure of exactly what you are doing from the code you posted, but you should ensure that the number of times global is read from and written to, aggregated across all the threads in a block, is significantly lower when you use shared memory. Otherwise you won't see a performance improvement.

Related

Lock memory to physical RAM in C for a dynamically allocated pointer

I want to lock the memory to physical RAM in C with mlock and munlock, but I'm unsure about the correct usage.
Allow me to explain in a step by step scenario:
Let's assume that I dynamically allocate a pointer using calloc:
char * data = (char *)calloc(12, sizeof(char*));
Should I do mlock right after that?
Let's also assume that I later attempt to resize the memory block with realloc:
(char *)realloc(data, 100 * sizeof(char*));
Note that the new size above (100) is arbitrary, and sometimes I will shrink the memory block instead.
Should I first do munlock and then mlock again to address the changes made?
Also when I want to free the pointer data later, should I munlock first?
I hope someone can please explain the correct steps to me so I can understand better.
From the POSIX specification of mlock() and munlock():
The mlock() function shall cause those whole pages containing any part
of the address space of the process starting at address addr and
continuing for len bytes to be memory-resident until unlocked or until
the process exits or execs another process image. The implementation
may require that addr be a multiple of {PAGESIZE}.
The munlock() function shall unlock those whole pages containing any
part of the address space of the process starting at address addr and
continuing for len bytes, regardless of how many times mlock() has
been called by the process for any of the pages in the specified
range. The implementation may require that addr be a multiple of
{PAGESIZE}.
Note that:
Both functions work on whole pages
Both functions might require addr to be a multiple of page size
munlock doesn't use any reference counting to track lock lifetime
This makes them almost impossible to use correctly with pointers returned by malloc/calloc/realloc, because:
You can accidentally lock or unlock nearby pages (and thereby unlock pages that must stay memory-resident)
The allocator may hand back pointers that are not suitable arguments for these functions
You should use mmap instead, or some other OS-specific mechanism. For example, Linux has mremap, which allows you to "reallocate" memory. Whatever you use, make sure the mlock behavior is well-defined for it. From the Linux man pages:
If the memory segment specified by old_address and old_size is locked
(using mlock(2) or similar), then this lock is maintained when the
segment is resized and/or relocated. As a consequence, the amount of
memory locked by the process may change.
Note Nate Eldredge's comment below:
Another problem with using realloc with locked memory is that the data
will be copied to the new location before you have a chance to find
out where it is and lock it. If your purpose in using mlock is to
ensure that sensitive data never gets written out to swap, this
creates a window of time where that might happen.
TL;DR
Memory locking doesn't mix with general-purpose memory allocation using the C language runtime.
Memory locking does mix with page-oriented virtual memory mapping OS-level APIs.
The above hold unless special circumstances arise (that's my way out of this :)
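For illustration, here is a minimal sketch of locking a buffer obtained directly from mmap rather than from calloc/realloc, so the locked range is exactly the mapped pages. This assumes Linux/POSIX, and error handling is abbreviated.

#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t len = 4096;  /* one page; mmap always returns page-aligned memory */

    /* Anonymous mapping instead of calloc: page-aligned and zero-filled. */
    char *data = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Lock exactly the pages we mapped so they stay resident (never swapped). */
    if (mlock(data, len) != 0) { perror("mlock"); return 1; }

    strcpy(data, "sensitive");

    /* When done: wipe, unlock, then unmap. */
    memset(data, 0, len);
    munlock(data, len);
    munmap(data, len);
    return 0;
}

To grow such a buffer on Linux you would use mremap, which (as quoted above) preserves the lock, instead of realloc.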

Shared memory addresses

I am using C on Linux, and allocating 2 shared memory segments.
The first segment is created in a main program, then I call a subprocess and create the second segment.
In the subprocess, I place the address of the second segment in a pointer I set aside in the first segment.
Upon returning to the main program, when I attach to the second segment and compare the pointers (the one returned from shmat, and the one previously stored by the subprocess) I find they are different.
Is this expected?
Thanks, Mark.
Yes, this is expected. Forcing the mapping to a common address in the virtual address space of both processes would be a very constraining limitation. Among other things, the memory manager would have to know about every process that wants to map the segment at the same time, so that it could find a free area common to all of them. This would defeat the very principle of virtual memory (every process sees its own blank address space) and create configurations that are impossible to arbitrate.
Sharing at a common address is indeed possible, but it only makes sense when the mapping targets some reserved section of the address space, so that the region cannot end up being used for anything else.
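The usual workaround is to store offsets rather than raw pointers in the shared segment, so the stored value is meaningful no matter where each process happens to attach the segment. A minimal sketch (hypothetical layout, using the same System V shmget/shmat calls as in the question; error handling omitted):

#include <sys/ipc.h>
#include <sys/shm.h>
#include <stddef.h>
#include <stdio.h>

struct first_segment {
    size_t data_offset;   /* offset into the second segment, never a pointer */
};

int main(void)
{
    int id1 = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    int id2 = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);

    struct first_segment *hdr = shmat(id1, NULL, 0);
    char *base = shmat(id2, NULL, 0);   /* this address can differ per process */

    hdr->data_offset = 128;             /* meaningful in every process */
    char *item = base + hdr->data_offset;
    printf("segment 2 at %p, item at %p\n", (void *)base, (void *)item);

    shmdt(base);
    shmdt(hdr);
    return 0;
}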

What does "cacheline aligned" mean?

I read this article about PostgreSQL performance: http://akorotkov.github.io/blog/2016/05/09/scalability-towards-millions-tps/
One optimization was "cacheline alignment".
What is this? How does it help and how to apply this in code?
CPU caches transfer data from and to main memory in chunks called cache lines; a typical size for this seems to be 64 bytes.
Data that are located closer to each other than this may end up on the same cache line.
If these data are needed by different CPU cores, the cache-coherency hardware has to work hard to keep the copies in the cores' caches consistent. Essentially, while one core modifies the data, the other core's cached copy of the line is invalidated, and that core has to wait for the line before it can access the data again.
The article you reference talks about one such problem that was found in PostgreSQL in a data structure in shared memory that is frequently updated by different processes. By introducing padding into the structure to inflate it to 64 bytes, it is guaranteed that no two such data structures end up in the same cache line, and the processes that access them are not stalled more than absolutely necessary.
This is only relevant if your program parallelizes execution and accesses a shared memory region, either by multithreading or by multiprocessing with shared memory. In this case you can benefit by making sure that data that are frequently accessed by different execution threads are not located close enough in memory that they can end up in the same cache line.
The typical way to do that is by adding “dead” padding space at the end of a data structure.
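For example, a per-thread counter can be padded out to a full cache line so that neighbouring counters never share one. This is only a sketch; the 64-byte line size is an assumption you should verify for your target CPU.

#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache-line size; check your target CPU */

/* Without the padding, eight 8-byte counters would share a single cache
 * line, and cores updating different counters would still invalidate each
 * other's cached copies (false sharing). */
struct padded_counter {
    uint64_t value;
    char pad[CACHE_LINE - sizeof(uint64_t)];   /* "dead" space filling the line */
};

struct padded_counter per_thread_counters[8];  /* one cache line per counter */

C11 _Alignas(64) (or a compiler-specific attribute) can be combined with this to guarantee that the structure also starts on a line boundary.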
I found some interesting articles on the topic that you may want to read:
http://www.drdobbs.com/parallel/maximize-locality-minimize-contention/208200273?pgno=3
http://www.drdobbs.com/tools/memory-constraints-on-thread-performance/231300494
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206

When would you use mmap

So, I understand that if you need some dynamically allocated memory, you can use malloc(). For example, your program reads a variable-length file into a char[]. You don't know in advance how big to make your array, so you allocate the memory at runtime.
I'm trying to understand when you would use mmap(). I have read the man page and to be honest, I don't understand what the use case is.
Can somebody explain a use case to me in simple terms? Thanks in advance.
mmap can be used for a few things. First, a file-backed mapping. Instead of allocating memory with malloc and reading the file, you map the whole file into memory without explicitly reading it. Now when you read from (or write to) that memory area, the operations act on the file, transparently. Why would you want to do this? It lets you easily process files that are larger than the available memory using the OS-provided paging mechanism. Even for smaller files, mmapping reduces the number of memory copies.
mmap can also be used for an anonymous mapping. This mapping is not backed by a file, and is basically a request for a chunk of memory. If that sounds similar to malloc, you are right. In fact, most implementations of malloc will internally use an anonymous mmap to provide a large memory area.
Another common use case is to have multiple processes map the same file as a shared mapping to obtain a shared memory region. The file doesn't have to be actually written to disk. shm_open is a convenient way to make this happen.
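As an illustration of the file-backed case, here is a minimal sketch that maps a file and reads it through the mapping instead of calling read(). The file name is just a placeholder and error handling is trimmed.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);   /* "data.bin" is an example name */
    struct stat st;
    fstat(fd, &st);

    /* Map the whole file; no read() calls, pages are faulted in on demand. */
    const unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("first byte: %d, last byte: %d\n", p[0], p[st.st_size - 1]);

    munmap((void *)p, st.st_size);
    close(fd);
    return 0;
}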
Whenever you need to read/write blocks of data of a fixed size, it's much simpler (and faster) to map the data file on disk into memory using mmap and access it directly, rather than allocate memory, read the file, access the data, potentially write the data back to disk, and free the memory.
Consider the famous producer-consumer problem: the producer creates a shared memory object using shm_open(), and since our goal is to have the producer and consumer share data, we use the mmap syscall to map that shared memory region into the process's address space. The consumer can then open the same shared memory object (shared memory objects are referred to by name) and, after its own call to mmap, read what the producer wrote.
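A minimal sketch of the producer side of that pattern, assuming POSIX shared memory; the object name "/demo_shm" and the 4096-byte size are arbitrary, and error handling is omitted:

#include <sys/mman.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Create (or open) a named shared memory object and size it. */
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);

    /* Map it; a consumer that shm_open()s the same name and mmap()s it
     * with MAP_SHARED will see these bytes. */
    char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    strcpy(shared, "hello from the producer");

    munmap(shared, 4096);
    close(fd);
    /* shm_unlink("/demo_shm") removes the object once both sides are done. */
    return 0;
}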

Creating a queue in shared memory (POSIX)

For my implementation I am using mmap to allocate shared memory for interprocess communication. In this shared memory I initialize a queue (I set the first and last pointers to NULL).
The problem is how to push a new item onto the queue. Normally I would use malloc to allocate my 'queue item struct' and then point to it, but I can't do that here, can I? I need to allocate it somehow inside the shared memory. I could probably use another mmap, push the item there, and then point to it, but that doesn't seem right, because I would have to do it many times.
Can this be done simply, or should I think about a different solution?
Thanks for any ideas.
General rules to create a queue in shared memory:
1) Never use pointers as shared elements, because the OS may choose different virtual addresses in different processes. Always use offsets from the shared memory view base address, or array indices, or anyway something that is position-independent.
2) You have to manually partition your shared memory. E.g. you must know how many items your queue may contain, and dimension the shared area so it can contain the "header" (insertion index, extraction index...) and the item array. It's often enough to define a structure that contains both the header and an item array of the correct size: the memory size is sizeof(your_structure), and its address is the one returned by mmap (see the sketch after this list).
3) Carefully consider multithreading and multiprocessing issues. Protect the access to the shared memory with a mutex if it's acceptable that the accessing threads may block. But if you want to create a "non-blocking" queue, you must at least use atomic operations to change the relevant fields, and consider any possible timing issue.
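Putting the three rules together, here is a minimal sketch of such a queue laid out directly in one mapped region. The capacity and item type are hypothetical, and the synchronization required by rule 3 is omitted for brevity.

#include <sys/mman.h>
#include <stddef.h>

#define QUEUE_CAPACITY 64

struct item {
    int payload;
};

/* Header and item array live together in one shared mapping; the queue is
 * addressed with indices, never with pointers. */
struct shared_queue {
    size_t head;                        /* extraction index */
    size_t tail;                        /* insertion index  */
    struct item items[QUEUE_CAPACITY];  /* fixed, pre-partitioned storage */
};

int main(void)
{
    struct shared_queue *q = mmap(NULL, sizeof *q, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    q->head = q->tail = 0;

    /* Push: write into the slot at tail, then advance the index. */
    q->items[q->tail % QUEUE_CAPACITY].payload = 42;
    q->tail++;

    munmap(q, sizeof *q);
    return 0;
}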
Regards
