CUDA threads appending variable amounts of data to a common array

My application takes millions of input records, each 8 bytes, and hashes each one into two or more output bins. That is, each input key K creates a small number of pairs (B1,K), (B2,K), ... The number of output bins per key is not known until the key is processed. It's usually 2 but could occasionally be 10 or more.
All those output pairs need to be eventually stored in one array since all the keys in each bin will later be processed together. How to do this efficiently?
Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow. Another obvious method would be to init a hash table as an array of pointers to some sort of storage per bin. That looks slower.
I'm thinking of pre-reserving 2 pairs per input record in a block-shared array, then grabbing more space as needed (i.e., a reimplementation of the STL vector reserve operation), then having the last thread in each block copy the block-shared array to global memory.
However I'm not looking forward to implementing that. Help? Thanks.

Using an atomic increment to repeatedly reserve a pair from a global
array sounds horribly slow.
You could increment bins of a global array instead of one entry at a time. In other words, you could have a large array, and each thread could start with 10 possible output entries. If a thread overflows, it requests the next available bin from the global array. If you're worried about the speed of a single atomic counter, you could use 10 atomic counters for 10 portions of the array and distribute the accesses. If one fills up, find another one.
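A rough sketch of that chunked-reservation idea, assuming the output array is preallocated for the worst case; Pair, CHUNK, MAX_PAIRS, g_out, and g_next are all hypothetical names:

struct Pair { unsigned int bin; unsigned long long key; };

#define CHUNK      10            // slots reserved per atomicAdd
#define MAX_PAIRS  (1 << 24)     // worst-case output size (hypothetical)

__device__ Pair         g_out[MAX_PAIRS];   // common output array
__device__ unsigned int g_next = 0;         // next free slot in g_out

// Each thread keeps 'base' and 'used' in registers; start with used == CHUNK
// so the first emit reserves a chunk.
__device__ void emit_pair(Pair p, unsigned int &base, unsigned int &used)
{
    if (used == CHUNK) {                     // current chunk exhausted
        base = atomicAdd(&g_next, CHUNK);    // one global atomic per CHUNK pairs
        used = 0;
    }
    g_out[base + used++] = p;
}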
I'm also considering processing the data twice: the 1st time just to
determine the number of output records for each input record. Then
allocate just enough space and finally process all the data again.
This is another valid method. The bottleneck is calculating each thread's offset into the global array once you have the total number of results for each thread; I haven't figured out a reasonable parallel way to do that.
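For what it's worth, turning per-record output counts into write offsets is exactly an exclusive prefix sum (scan), which Thrust provides out of the box. A minimal sketch, with hypothetical names:

#include <thrust/device_vector.h>
#include <thrust/scan.h>

// counts[i] holds the number of output pairs for input record i (from the
// first pass); fills 'offsets' with each record's starting position in the
// output array and returns the total output size.
size_t compute_offsets(const thrust::device_vector<unsigned int> &counts,
                       thrust::device_vector<unsigned int> &offsets)
{
    offsets.resize(counts.size());
    thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
    if (counts.empty()) return 0;
    unsigned int last_offset = offsets.back();   // copies one value back to the host
    unsigned int last_count  = counts.back();
    return (size_t)last_offset + (size_t)last_count;
}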
The last option I can think of would be to allocate a large array, distribute it based on blocks, and use a shared atomic int (which helps with the slowness of global atomics). If you run out of space, mark that the block didn't finish, and mark where it left off. On your next iteration, complete the work that hasn't been finished.
The downside of the distributed portions of global memory is, as talonmies said, that you need a gather or compaction pass to make the results dense.
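A rough sketch of that per-block variant, where each block owns a fixed slice of the output array and reserves slots with a cheap shared-memory atomic; Pair, SLOTS_PER_BLOCK, out, and overflow_flags are hypothetical, and the actual hashing is elided:

struct Pair { unsigned int bin; unsigned long long key; };

#define SLOTS_PER_BLOCK 4096

__global__ void hash_keys(const unsigned long long *keys, int n,
                          Pair *out, int *overflow_flags)
{
    __shared__ unsigned int used;               // slots consumed by this block
    if (threadIdx.x == 0) used = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        Pair p = { 0u, keys[i] };               // placeholder; real code hashes keys[i]
        unsigned int slot = atomicAdd(&used, 1u);   // shared-memory atomic, not global
        if (slot < SLOTS_PER_BLOCK)
            out[blockIdx.x * SLOTS_PER_BLOCK + slot] = p;
        else
            overflow_flags[blockIdx.x] = 1;     // ran out of space; finish on a later pass
    }
}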
Good luck!

Related

Is there an efficient way of storing the most recent part of a continuous stream in an array?

I have a never-ending stream of data coming in to a program I'm writing. I would like to have a fixed-size buffer array which only stores the T most recent observations of that stream. However, it's not obvious to me how to implement that efficiently.
What I have done so far is to first allocate the buffer of length T and place incoming observations in consecutive order from the top as they arrive: data_0->index 0, data_1->index 1…data_T->index T.
Which works fine until the buffer is full. But when observation data_T+1 arrives, index 0 needs to be removed from the buffer and all T-1 rows need to be moved up one step in the array/matrix in order to place the newest data point at index T.
That seems to be a very inefficient approach when the buffer is large and hundreds of thousands of elements need to be pushed one row up all the time.
How is this normally solved?
This data structure is called a FIFO queue, usually implemented as a circular (ring) buffer.
Look at the Java Queue API; it has several code examples.
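A minimal sketch of that idea in C++ (the class and member names are hypothetical): instead of shifting elements, keep a wrapping write index and overwrite the oldest slot in place.

#include <vector>
#include <cstddef>

// Fixed-capacity buffer that keeps only the T most recent observations.
// No element is ever moved; the write position simply wraps around.
class RecentBuffer {
public:
    explicit RecentBuffer(std::size_t T) : buf_(T), head_(0), count_(0) {}

    void push(double x) {
        buf_[head_] = x;                     // overwrites the oldest slot once full
        head_ = (head_ + 1) % buf_.size();
        if (count_ < buf_.size()) ++count_;
    }

    // i = 0 is the oldest stored observation, i = count()-1 the newest.
    double at(std::size_t i) const {
        std::size_t start = (head_ + buf_.size() - count_) % buf_.size();
        return buf_[(start + i) % buf_.size()];
    }

    std::size_t count() const { return count_; }

private:
    std::vector<double> buf_;
    std::size_t head_, count_;
};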

OpenCL - For Loops and Relationship with GlobalWorkSize

While reading over some code I had a couple of questions that popped into my head.
Let's assume we had a globalWorkSize of one million elements in an array.
Assume the purpose of the kernel was to simply take summation of 100 elements at a time and store these values in an output. Ex) First time the kernel would sum elements 0-99, then it would do 1-100, then 2-101 and so on. All the summed values get stored in an array.
Now we know that there are 1 million elements, when we pass this to clEnqueueNDRangeKernel, does that mean the kernel will execute close to one million times?
I noticed that the for loop in the kernel only loops to one hundred elements, then the value is just stored in another array. So by just examining the for loop, one would just think that after 100 elements it would stop. How does the computer know when we have reached 1 million elements? Is it because we passed the parameter in clEnqueueNDRangeKernel and at an atomic level it knows that more elements need to be processed?
The device has no way to know that there are one million elements in the array. So if you set the global_work_size to one million, the last 99 work-items will happily over-index the array, which may or may not segfault depending on the device.
When you call the clEnqueueNDRangeKernel API with a work size of N, that information is sent to the device, and the device will execute enough uniformly sized work-groups until the kernel has executed N times.
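The usual fix is to pass the real element count to the kernel and guard against running off the end. A sketch of that guard, written here as a CUDA kernel (in OpenCL C the index would come from get_global_id(0) instead), with illustrative names:

// Sum a 100-element window starting at each valid position. 'n' is the true
// number of elements; threads whose window would run past the end just return,
// so launching roughly one million threads is safe.
__global__ void window_sum(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // get_global_id(0) in OpenCL
    if (i + 100 > n) return;                         // guard against over-indexing

    float s = 0.0f;
    for (int k = 0; k < 100; ++k)
        s += in[i + k];
    out[i] = s;
}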
Hope this answers your question.

Performance implications of a large number of mutexes

Suppose I have an array of 1,000,000 elements, and a number of worker threads each manipulating data in this array. The worker threads might be updating already populated elements with new data, but each operation is limited to a single array element, and is independent of the values of any other element.
Using a single mutex to protect the entire array would clearly result in high contention. On the other extreme, I could create an array of mutexes that is the same length as the original array, and for each element array[i] I would lock mutex[i] while operating on it. Assuming an even distribution of data, this would mostly eliminate lock contention, at the cost of a lot of memory.
I think a more reasonable solution would be to have an array of n mutexes (where 1 < n < 1000000). Then for each element array[i] I would lock mutex[i % n] while operating on it. If n is sufficiently large, I can still minimize contention.
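A minimal sketch of that striping scheme with pthreads (array sizes, types, and names are hypothetical):

#include <pthread.h>
#include <stddef.h>

#define NUM_ELEMENTS 1000000
#define NUM_MUTEXES  4096              /* n: tune for contention vs. memory */

static double          data[NUM_ELEMENTS];
static pthread_mutex_t locks[NUM_MUTEXES];

void init_locks(void)                  /* call once before starting the workers */
{
    for (size_t i = 0; i < NUM_MUTEXES; ++i)
        pthread_mutex_init(&locks[i], NULL);
}

void update_element(size_t i, double value)
{
    pthread_mutex_t *m = &locks[i % NUM_MUTEXES];   /* element i -> mutex i mod n */
    pthread_mutex_lock(m);
    data[i] = value;                   /* the independent single-element operation */
    pthread_mutex_unlock(m);
}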
So my question is, is there a performance penalty to using a large (e.g. >= 1000000) number of mutexes in this manner, beyond increased memory usage? If so, how many mutexes can you reasonably use before you start to see degradation?
I'm sure the answer to this is somewhat platform specific; I'm using pthreads on Linux. I'm also working on setting up my own benchmarks, but the scale of data that I'm working on makes that time consuming, so some initial guidance would be appreciated.
That was the initial question. For those asking for more detailed information regarding the problem, I have four multi-gigabyte binary data files describing somewhere in the neighborhood of half a billion events that are being analyzed. The array in question is actually the array of pointers backing a very large chained hash table. We read the four data files into the hash table, possibly aggregating them together if they share certain characteristics. The existing implementation has 4 threads, each reading one file and inserting records from that file into the hash table. The hash table has 997 locks and 997*9973 = ~10,000,000 pointers. When inserting an element with hash h, I first lock mutex[h % 997] before inserting or modifying the element in bucket[h % 9943081]. This works all right, and as far as I can tell, we haven't had too many issues with contention, but there is a performance bottleneck in that we're only using 4 cores of a 16-core machine. (And even fewer as we go along since the files generally aren't all the same size.) Once all of the data has been read into memory, we analyze it, which uses new threads and a new locking strategy tuned to the different workload.
I'm attempting to improve the performance of the data load stage by switching to a thread pool. In the new model, I still have one thread for each file which simply reads the file in ~1MB chunks and passes each chunk to a worker thread in the pool to parse and insert. The performance gain so far has been minimal, and the profiling that I did seemed to indicate that the time spent locking and unlocking the array was the likely culprit. The locking is built into the hash table implementation we are using, but it does allow specifying the number of locks to use independently of the size of the table. I'm hoping to speed things up without changing the hash table implementation itself.
(A very partial & possibly indirect answer to your question.)
I once took a huge performance hit trying this (on CentOS), raising the number of locks from a prime of ~1K to a prime of ~1M. While I never fully understood the reason, I eventually figured out (or just convinced myself) that it's the wrong question.
Suppose you have an array of length M, with n workers. Furthermore, you use a hash function to protect the M elements with m < M locks (e.g., by some random grouping). Then, using the Square Approximation to the Birthday Paradox, the chance of a collision between two workers - p - is given by:
p ≈ n² / (2m)
It follows that the number of mutexes you need, m, does not depend on M at all - it is a function of p and n only.
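For example, with n = 16 worker threads and a target collision probability of p = 0.01, this gives m ≈ n² / (2p) = 256 / 0.02 = 12,800 mutexes, whether the array holds one million elements or one billion.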
Under Linux there is no cost other than the memory associated with more mutexes.
However, remember that the memory used by your mutexes must be included in your working set - and if your working set size exceeds the relevant cache size, you'll see a significant performance drop. This means that you don't want an excessively sized mutex array.
As Ami Tavory points out, the contention depends on the number of mutexes and number of threads, not the number of data elements protected - so there's no reason to link the number of mutexes to the number of data elements (with the obvious proviso that it never makes sense to have more mutexes than elements).
In the general scenario, I would advise
Simply locking the whole array (simple, very often "good enough" if your application is mostly doing "other stuff" besides accessing the array)
... or ...
Implementing a read/write lock on the entire array (assuming reads equal or exceed writes)
Apparently your scenario doesn't match either case.
Q: Have you considered implementing some kind of a "write queue"?
Worst case, you'd only need one mutex. Best case, you might even be able to use a lock-less mechanism to manage your queue. Look here for some ideas that might be applicable: https://msdn.microsoft.com/en-us/library/windows/desktop/ee418650%28v=vs.85%29.aspx

Re-use threads in CUDA

I have a large series of numbers in an array, about 150MB of numbers, and I need to find consecutive sequences of numbers; a sequence might be from 3 to 160 numbers long. To keep it simple, I decided that each thread should start such that ThreadID = CellID.
So thread0 looks at cell0, and if the number in cell0 matches my sequence, then thread0 moves on to cell1, and so on; if the number does not match, the thread is stopped. I do that for my 20,000 threads.
So that works out fine, but I wanted to know how to reuse threads, because the array in which I'm looking for the sequence of numbers is much bigger.
Should I divide my array into smaller arrays, load them into shared memory, and loop over the number of smaller arrays (eventually padding the last one)? Or should I keep the big array in global memory and set ThreadID = cellID, then ThreadID = cellID + 20000, and so on? Or is there a better way to go through it?
To clarify: at the moment I use 20,000 threads, one array of numbers in global memory (150MB), and a sequence of numbers in shared memory (e.g. 1,2,3,4,5), represented as an array. Thread0 starts at cell0 and checks whether cell0 in global memory is equal to cell0 in shared memory; if yes, thread0 compares cell1 in global memory to cell1 in shared memory, and so on until there is a full match.
If the numbers in the two cells (global and shared memory) are not equal, that thread is simply discarded, since most of the numbers in the global memory array will not match the first number of my sequence. I thought it was a good idea to use one thread to match Cell_N in GM to Cell_N in ShM and to overlap the threads; this technique also allows coalesced memory access the first time, since every thread from 0 to 19,999 accesses contiguous memory.
But what I would like to know is: what would be the best way to re-use the threads that have been discarded, or that have finished matching, so that I can match the entire 150MB array instead of just (20,000 numbers + (sequence length - 1))?
"what would be the best way to re-use the threads" that have been discarded, or the threads that finished to match. To be able to match the entire array of 150MB instead of simply match (20000 numbers + (length of sequence -1)).
You can re-use threads in a fashion similar to the canonical CUDA reduction sample (using the final implementation as a reference).
int idx = threadIdx.x + blockDim.x * blockIdx.x;   // global thread index
while (idx < DSIZE) {                              // DSIZE = total problem size
    perform_sequence_matching(idx);                // each thread handles one starting position per pass
    idx += gridDim.x * blockDim.x;                 // stride by the total number of threads in the grid
}
In this way, with an arbitrary number of threads in your grid, you can cover an arbitrary problem size (DSIZE).
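A slightly fuller sketch of the same pattern applied to the sequence-matching case; the kernel name, parameters, and the match output array are all hypothetical:

__global__ void match_sequence(const int *data, int data_len,
                               const int *seq,  int seq_len,
                               unsigned char *match)   // match[i] = 1 if seq starts at data[i]
{
    for (int i = threadIdx.x + blockDim.x * blockIdx.x;
         i + seq_len <= data_len;                      // stop at the last valid starting position
         i += gridDim.x * blockDim.x)                  // grid-stride re-use of threads
    {
        bool ok = true;
        for (int j = 0; j < seq_len && ok; ++j)
            ok = (data[i + j] == seq[j]);
        match[i] = ok ? 1 : 0;
    }
}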

Differentiating between queue full and queue empty

I am using an array with a write index and a read index to implement a straightforward FIFO queue. I do the usual MOD ArraySize when incrementing the write and read indexes.
Is there a way to differentiate between the queue-full and queue-empty conditions (wrIndex == rdIndex) without using an additional queue count and also without wasting an array entry (i.e., declaring the queue full when (WrIndex + 1) MOD ArraySize == ReadIndex)?
I'd go with 'wasting' an array entry to detect the queue full condition, especially if you're dealing with different threads/tasks being producers and consumers. Having another flag keep track of that situation increases the locking necessary to keep things consistent and increases the likelihood of some sort of bug that introduces a race condition. This is even more true in the case where you can't use a critical section (as you mention in a comment) to ensure that things are in-sync.
You'll need at least a bit somewhere to keep track of that condition, and that probably means at least a byte. Assuming that your queue contains ints you're only saving 3 bytes of RAM and you're going to chew up several more bytes of program image (which might not be as precious, so that might not matter). If you keep a flag bit inside a byte used to store other flag bits, then you have to additionally deal with setting/testing/clearing that flag bit in a thread safe manner to ensure that the other bits don't get corrupted.
If you're queuing bytes, then you probably save nothing - you can consider the sentinel element to be the flag that you'd otherwise have to put somewhere else, and you need no extra code to deal with that flag.
Consider carefully whether you really need that extra queue item, and keep in mind that if you're queuing bytes, the extra queue item probably isn't really extra space.
Instead of a read and write index, you could use a read index and a queue count. From the queue count, you can easily tell if the queue is empty or full. And the write index can be computed as (read index + queue count) mod array_size.
What's wrong with a queue count? It sounds like you're going for maximum efficiency and minimal logic, and while I would do the same, I think I'd still use a queue count variable. Otherwise, one other potential solution would be to use a linked list: low memory usage, and removing the first element would be easy; just make sure that you have pointers to the head and tail of the list.
Basically you only need a single additional bit somewhere to signal that the queue is currently empty. You can probably stash that away somewhere, e.g., in the most significant bit of one of your indices (and then AND-ing the lower bits creatively in places where you need to work only on the actual index into your array).
But honestly, I'd go with a queue count first and only cut that if I really need that space, instead of putting up with bit fiddling.
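One concrete way to get that extra bit without a separate count: let the read and write indices run freely as unsigned integers and only mask them when touching the array. This needs a power-of-two array size so the unsigned wrap-around stays consistent. A sketch with hypothetical names:

#include <stdint.h>

#define QSIZE 256u                      /* must be a power of two */

static int      q[QSIZE];
static uint32_t wr = 0, rd = 0;         /* free-running; masked only on access */

int q_empty(void) { return wr == rd; }
int q_full(void)  { return wr - rd == QSIZE; }   /* full and empty now differ */

int q_put(int v)
{
    if (q_full()) return 0;
    q[wr & (QSIZE - 1)] = v;            /* mask only when indexing the array */
    wr++;
    return 1;
}

int q_get(int *v)
{
    if (q_empty()) return 0;
    *v = q[rd & (QSIZE - 1)];
    rd++;
    return 1;
}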
