Re-use threads in CUDA - arrays

I have a large series of numbers, in an array, about 150MB of numbers, and I need to find consecutive sequences of numbers, the sequences might be from 3 to 160 numbers. so to make it simple, I decided the each thread should start such as ThreadID = CellID
So thread0 looks at cell0, and if the number in cell0 matches my sequence, then, thread0 = cell1 and so on, and if the numbed does not match, the thread is stopped and I do that for my 20000 threads.
So that works out, fine but I wanted to know how to reuse threads, because the array in which i'm looking for the series of number is much bigger.
So should I divide my array in smaller arrays, and load them into shared memory, and loop over the number of smaller arrays and (eventually pad the last one). Or should I keep the big array in global memory, and have my thread to be to ThreadID = cellID and then ThreadID = cellID+20000 etc. or is there a better way to go through.
To clarify : At the moment i use 20 000 threads, 1 Array of numbers in Global Memory (150MB), and a sequence of numbers in shared memory (eg: 1,2,3,4,5), represented as an array. Thread0 start at Cell0, and look if the cell0 in global memory, is equal to cell0 in shared memory, if yes, thread0 compare cell1 in global memory, to cell1 in shared memory, and so on until there is a full match.
If the numbers in both (global and shared memory) cells are not equal, that thread is simply discarded. Since, most of the numbers in the Global memory Array will not match the first number of my sequence. I thought it was a good idea to use one thread to match Cell_N in GM to Cell_N in ShM and overlap the threads. and this technique allows coalesced memory access the first time, since every thread from 0 to 19 999 will access contiguous memory.
But what I would like to know, is "what would be the best way to re-use the threads" that have been discarded, or the threads that finished to match. To be able to match the entire array of 150MB instead of simply match (20000 numbers + (length of sequence -1)).

"what would be the best way to re-use the threads" that have been discarded, or the threads that finished to match. To be able to match the entire array of 150MB instead of simply match (20000 numbers + (length of sequence -1)).
You can re-use threads in a fashion similar to the canonical CUDA reduction sample (using the final implementation as a reference).
int idx = threadIdx.x+blockDim.x*blockIdx.x;
while (idx < DSIZE){
perform_sequence_matching(idx);
idx += gridDim.x*blockDim.x;
}
In this way, with an arbitrary number of threads in your grid, you can cover an arbitrary problem size (DSIZE);

Related

OpenCL - For Loops and Relationship with GlobalWorkSize

While reading over some code I had a couple of questions that popped into my head.
Let's assume we had a globalWorkSize of one million elements in an array.
Assume the purpose of the kernel was to simply take summation of 100 elements at a time and store these values in an output. Ex) First time the kernel would sum elements 0-99, then it would do 1-100, then 2-101 and so on. All the summed values get stored in an array.
Now we know that there are 1 million elements, when we pass this to clEnqueueNDRangeKernel, does that mean the kernel will execute close to one million times?
I noticed that the for loop in the kernel only loops to one hundred elements, then the value is just stored in another array. So by just examining the for loop, one would just think that after 100 elements it would stop. How does the computer know when we have reached 1 million elements? Is it because we passed the parameter in clEnqueueNDRangeKernel and at an atomic level it knows that more elements need to be processed?
The device has no way to know that there are one million elements in the array. So if you set the global_work_size to be one million the last 99 kernels will happily overindex the array which may or may not seg fault depending on the device.
When you call the clEnqueueNDRangeKernel API call with a worksize of N then the information is sent to the device and the device will execute anough uniform sized workgroups untill it has executed the kernel N times.
Hope this answers your question.

Performance implications of a large number of mutexes

Suppose I have an array of 1,000,000 elements, and a number of worker threads each manipulating data in this array. The worker threads might be updating already populated elements with new data, but each operation is limited to a single array element, and is independent of the values of any other element.
Using a single mutex to protect the entire array would clearly result in high contention. On the other extreme, I could create an array of mutexes that is the same length as the original array, and for each element array[i] I would lock mutex[i] while operating on it. Assuming an even distribution of data, this would mostly eliminate lock contention, at the cost of a lot of memory.
I think a more reasonable solution would be to have an array of n mutexes (where 1 < n < 1000000). Then for each element array[i] I would lock mutex[i % n] while operating on it. If n is sufficiently large, I can still minimize contention.
So my question is, is there a performance penalty to using a large (e.g. >= 1000000) number of mutexes in this manner, beyond increased memory usage? If so, how many mutexes can you reasonably use before you start to see degradation?
I'm sure the answer to this is somewhat platform specific; I'm using pthreads on Linux. I'm also working on setting up my own benchmarks, but the scale of data that I'm working on makes that time consuming, so some initial guidance would be appreciated.
That was the initial question. For those asking for more detailed information regarding the problem, I have 4 multiple GB binary data files describing somewhere in the neighborhood of half a billion events that are being analyzed. The array in question is actually the array of pointers backing a very large chained hash table. We read the four data files into the hash table, possibly aggregating them together if they share certain characteristics. The existing implementation has 4 threads, each reading one file and inserting records from that file into the hash table. The hash table has 997 locks and 997*9973 = ~10,000,000 pointers. When inserting an element with hash h, I first lock mutex[h % 997] before inserting or modifying the element in bucket[h % 9943081]. This works all right, and as far as I can tell, we haven't had too many issues with contention, but there is a performance bottleneck in that we're only using 4 cores of a 16 core machine. (And even fewer as we go along since the files generally aren't all the same size.) Once all of the data has been read into memory, then we analyze it, which uses new threads and a new locking strategy tuned to the different workload.
I'm attempting to improve the performance of the data load stage by switching to a thread pool. In the new model, I still have one thread for each file which simply reads the file in ~1MB chunks and passes each chunk to a worker thread in the pool to parse and insert. The performance gain so far has been minimal, and the profiling that I did seemed to indicate that the time spent locking and unlocking the array was the likely culprit. The locking is built into the hash table implementation we are using, but it does allow specifying the number of locks to use independently of the size of the table. I'm hoping to speed things up without changing the hash table implementation itself.
(A very partial & possibly indirect answer to your question.)
Have once scored a huge performance hit trying this (on a CentOS) raising the number of locks from a prime of ~1K to a prime of ~1M. While I never fully understood its reason, I eventually figured out (or just convinced myself) that it's the wrong question.
Suppose you have an array of length M, with n workers. Furthermore, you use a hash function to protect the M elements with m < M locks (e.g., by some random grouping). Then, using the Square Approximation to the Birthday Paradox, the chance of a collision between two workers - p - is given by:
p ~ n2 / (2m)
It follows that the number of mutexes you need, m, does not depend on M at all - it is a function of p and n only.
Under Linux there is no cost other than the memory associated with more mutexes.
However, remember that the memory used by your mutexes must be included in your working set - and if your working set size exceeds the relevant cache size, you'll see a significant performance drop. This means that you don't want an excessively sized mutex array.
As Ami Tavory points out, the contention depends on the number of mutexes and number of threads, not the number of data elements protected - so there's no reason to link the number of mutexes to the number of data elements (with the obvious proviso that it never makes sense to have more mutexes than elements).
In the general scenario, I would advise
Simply locking the whole array (simple, very often "good enough" if your application is mostly doing "other stuff" besides accessing the array)
... or ...
Implementing a read/write lock on the entire array (assuming reads equal or exceed writes)
Apparently your scenario doesn't match either case.
Q: Have you considered implementing some kind of a "write queue"?
Worst case, you'd only need one mutex. Best case, you might even be able to use a lock-less mechanism to manage your queue. Look here for some ideas that might be applicable: https://msdn.microsoft.com/en-us/library/windows/desktop/ee418650%28v=vs.85%29.aspx

Necessity of pthread mutex

I have an int array[100] and I want 5 threads to calculate the sum of all array elements.
Each thread iterates through 20 elements within its dedicated range and writes the sum into a global sum variable.
Is a mutex necessary here? There is no synchronization needed since all threads are reading from independent sources.
for(i=offset; i<offset+range; i++){
// not used pthread_mutex_lock(&mutex);
sum += array[i];
// not used pthread_mutex_unlock(&mutex);
}
Can this lead to unpredictable behavior or does the OS actually handle this?
Is it advisable to leave out the mutex in this case? I've noticed that those algorithms run a lot faster without it.
Yes, you need synchronization, because all thread are modifying the sum at the same time. Here's example:
You have array of 4 elements [a1, a2, a3, a4] and 2 threads t1 and t2 and sum. To begin let's say t1 get value a1 and adds it to sum. But it's not an atomic operation, so he copy current value of sum (it's 0) to his local space, let's call it t1_s, adds to it a1 and then write sum = t1_s. But at the same time t2 do the same, he get sum value (which is 0, because t1 have not completed it operation) to t2_s, adds a3 and write to sum. So we got in the sum value of a3 insted of a1 + a3. This is called data race.
There are multiple solutions to this is:
You can use mutex as you already did in your code, but as you mentioned it can be slow, since mutex locks are expensive and all other threads are waiting for it.
Create array (with size of number of threads) to calculte local sums for all threads and then do the last reduction on this array in the one thread. No synchronization needed.
Without array calculate local sum_local for each thread and in the end add all these sums to shared variable sum using mutex. I guess it will be faster (however it need to be checked).
However as #gavinb mentioned all of it make sense only for larger amount of data.
I have an int array[100] and I want 5 threads to calculate the sum of all array elements. Each thread iterates through 20 elements within its dedicated range and writes the sum into a global sum variable.
First of all, it's worth pointing out that the overhead of this many threads processing this small amount of data would probably not be an advantage. There is a cost to creating threads, serialising access, and waiting for them to finish. With a dataset this small, an well-optimised sequential algorithm is probably faster. It would be an interesting exercise to measure the speedup with varying number of threads.
Is a mutex necessary here? There is no synchronization needed since all threads are reading from independent sources.
Yes - the reading of the array variable is independent, however updating the sum variable is not, so you would need a mutex to serialise access to sum, according to your description above.
However, this is a very inefficient way of calculating the sum, as each thread will be competing (and waiting, hence wasting time) for access to increment sum. If you calculate intermediate sums for each subset (as #Werkov also mentioned), then wait for them to complete and add the intermediate sums to create the final sum, there will be no contention reading or writing, so you wouldn't need a mutex and each thread could run as quickly as possible. The limiting factor on performance would then likely be memory access pattern and cache behaviour.
Can this lead to unpredictable behavior or does the OS actually handle this?
Yes, definitely. The OS will not handle this for you as it cannot predict how/when you will access different parts of memory, and for what reason. Shared data must be protected between threads whenever any one of them may be writing to the data. So you would almost certainly get the wrong result as threads trip over each other updating sum.
Is it advisable to leave out the mutex in this case? I've noticed that those algorithms run a lot faster without it.
No, definitely not. It might run faster, but it will almost certainly not give you the correct result!
In the case where it is possible to partition data in such a way there aren't dependencies (i.e. reads/writes) across partitions. In your example, there is the dependency of the sum variable and mutex is necessary. However, you can have partial sum accumulator for each thread and then only sum these sub results without need of a mutex.
Of course, you needn't to do this by hand. There are various implementations of this, for instance see OpenMP's parallel for and reduction.

CUDA threads appending variable amounts of data to common array

My application takes millions of input records, each 8 bytes, and hashes each one into two or more output bins. That is, each input key K creates a small number of pairs (B1,K), (B2,K), ... The number of output bins per key is not known until the key is processed. It's usually 2 but could occasionally be 10 or more.
All those output pairs need to be eventually stored in one array since all the keys in each bin will later be processed together. How to do this efficiently?
Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow. Another obvious method would be to init a hash table as an array of pointers to some sort of storage per bin. That looks slower.
I'm thinking of pre-reserving 2 pairs per input record in a block shared array, then grabbing more space as needed (i.e., a reimplementation of the STL vector reserve operation), then having the last thread in each block copying the block shared array to global memory.
However I'm not looking forward to implementing that. Help? Thanks.
Using an atomic increment to repeatedly reserve a pair from a global
array sounds horribly slow.
You could increment bins of a global array instead of one entry at a time. In other words, you could have a large array, each thread could start with 10 possible output entries. If the thread over flows it requests for the next available bin from the global array. If you're worried about slow speed with the 1 atomic number, you could use 10 atomic numbers to 10 portions of the array and distribute the accesses. If one gets full, find another one.
I'm also considering processing the data twice: the 1st time just to
determine the number of output records for each input record. Then
allocate just enough space and finally process all the data again.
This is another valid method. The bottleneck is calculating the offset of each thread into the global array once you have the total number of results for each thread. I haven't figured a reasonable parallel way to do that.
The last option I can think of, would be to allocate a large array, distribute it based on blocks, used a shared atomic int (would help with slow global atomics). If you run out of space, mark that the block didn't finish, and mark where it left off. On your next iteration complete the work that hasn't been finished.
Downside of course of the distributed portions of global memory is like talonmies said... you need a gather or compaction to make the results dense.
Good luck!

Memory compactor in C?

I have a program that simulates best-fit memory management.
Basically, while there are available holes in the holes list, large enough for a process we are trying to allocate, processes are allocated and added to the process list. However, eventually, we get to a point where holes become very fragmented and we need to perform compaction.
The easiest way to do this, is obviously to create a new list and add all the processes in sequential order. However, that is not very realistic, since in real world, you wouldn't have space to move things to and create a new list.
Can you think of a way to push all the processes to one end of memory and free space to the other? Basically it is set up like this array of holes (holes are structs that contain starting index and size) and an array of processes (processes are also structs that contain process id, starting index and size).
You could shuffle the the allocated memory to the end of available memory one after another.
Pseudocode:
sort procs by start_index (descending)
avail_end = END_OF_MEM - p[0].size (adjust alignment)
for p in procs
memmove( avail_end, p.start_index , p.size )
avail_end = avail_end - p.size
this should lead to one free memory block at the beginning of available memory. You could also stop this function after one region has moved (after a time threshold has reached) and continue later (By testing whter there is a gap beetween subsequent process alocated regions, to skip unneccesary moves).
If the array of holes is sorted on starting index, you could iterate over the array and take two holes and move the memory chunk in between to the start index of the first hole.
Further explanation:
Each hole has a starting index and an ending index equal to start_index + size.
By comparing the ending index of the first hole to the starting index of the second, you get the size of the memory chunk in between. Then you can do a memmove of the memory chunk to the first starting index.
When freeing a memory (and creating a hole), do you already merge it with adjacent holes? Otherwise that is a good idea to lessen fragmentation.

Resources