OpenCL - For Loops and Relationship with GlobalWorkSize - arrays

While reading over some code I had a couple of questions that popped into my head.
Let's assume we had a globalWorkSize of one million elements in an array.
Assume the purpose of the kernel is simply to take the sum of 100 elements at a time and store these values in an output. For example, the first time the kernel would sum elements 0-99, then it would do 1-100, then 2-101 and so on. All the summed values get stored in an array.
Now, we know that there are one million elements; when we pass this to clEnqueueNDRangeKernel, does that mean the kernel will execute close to one million times?
I noticed that the for loop in the kernel only loops over one hundred elements, and then the value is just stored in another array. So, by just examining the for loop, one would think that it stops after 100 elements. How does the device know when we have reached 1 million elements? Is it because we passed that parameter to clEnqueueNDRangeKernel, so that at a lower level it knows that more elements need to be processed?

The device has no way of knowing that there are one million elements in the array. So if you set the global_work_size to one million, the last 99 work-items will happily index past the end of the array, which may or may not segfault depending on the device.
When you call clEnqueueNDRangeKernel with a global work size of N, that information is sent to the device, and the device launches enough uniformly sized work-groups to execute the kernel N times.
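For the sliding-window example above, here is a minimal sketch of what the kernel and the enqueue call could look like. The kernel name, buffer names, queue/kernel handles and the WINDOW constant are illustrative, not taken from the code in the question; the point is the explicit bounds check, which is what keeps the last work-items from reading past the end of the array.

// Hypothetical kernel: each work-item sums WINDOW consecutive input elements.
__kernel void window_sum(__global const float *in,
                         __global float *out,
                         const uint n)            // total number of input elements
{
    const uint WINDOW = 100;
    uint gid = get_global_id(0);                  // which window this work-item handles
    if (gid + WINDOW > n)                         // guard: the last 99 ids would over-index
        return;
    float sum = 0.0f;
    for (uint i = 0; i < WINDOW; ++i)             // the 100-element loop the question refers to
        sum += in[gid + i];
    out[gid] = sum;
}

// Host side (sketch): the device runs the kernel once per work-item,
// i.e. roughly one million times for a global work size of one million.
cl_int err;
size_t global_work_size = 1000000;
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                             &global_work_size, NULL, 0, NULL, NULL);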
Hope this answers your question.

Related

Is there an efficient way of storing the most recent part of a continuous stream in an array?

I have a never-ending stream of data coming into a program I'm writing. I would like to have a fixed-size buffer array which only stores the T most recent observations of that stream. However, it's not obvious to me how to implement that efficiently.
What I have done so far is to first allocate the buffer of length T and place incoming observations in consecutive order from the top as they arrive: data_0 -> index 0, data_1 -> index 1 … data_T -> index T.
That works fine until the buffer is full. But when observation data_T+1 arrives, index 0 needs to be removed from the buffer and all T-1 rows need to be moved up one step in the array/matrix in order to place the newest data point at index T.
That seems to be a very inefficient approach when the buffer is large and hundreds of thousands of elements need to be shifted up one row all the time.
How is this normally solved?
The structure you are describing is a FIFO queue with a fixed capacity, usually implemented as a circular (ring) buffer.
Look, for example, at Java's Queue API; it has several implementations and code examples.
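For reference, a minimal C sketch of the fixed-capacity ring-buffer idea (names and types are illustrative): instead of shifting T-1 elements on every insert, you overwrite the oldest slot and advance a head index modulo T, so each insertion is O(1).

#include <stdlib.h>

/* Fixed-size ring buffer holding the T most recent observations. */
typedef struct {
    double *data;     /* storage of length capacity             */
    size_t capacity;  /* T                                       */
    size_t head;      /* index where the next observation goes   */
    size_t count;     /* number of valid entries (<= capacity)   */
} ring_buffer;

void rb_init(ring_buffer *rb, size_t capacity) {
    rb->data = malloc(capacity * sizeof *rb->data);
    rb->capacity = capacity;
    rb->head = 0;
    rb->count = 0;
}

/* O(1) insert: overwrite the oldest slot once the buffer is full. */
void rb_push(ring_buffer *rb, double observation) {
    rb->data[rb->head] = observation;
    rb->head = (rb->head + 1) % rb->capacity;
    if (rb->count < rb->capacity)
        rb->count++;
}

/* i = 0 returns the oldest stored observation, i = count-1 the newest. */
double rb_get(const ring_buffer *rb, size_t i) {
    size_t start = (rb->head + rb->capacity - rb->count) % rb->capacity;
    return rb->data[(start + i) % rb->capacity];
}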

MPI QuickSort Program

I am a newbie trying to edit a program. I have an MPI program that divides an array into subsets; the master sends the subsets to the slaves, they do a quicksort, and then they return the sorted numbers to the master so it can write them to a file.
What I am trying to do is make the quicksort happen even faster. My idea is to have the master divide the array and send subsets to the slaves, but keep one for himself. Those are then divided again into new subsets (for example, if we have the numbers from 1 to 100 in the array, the new subsets would be 1 to 25, 26 to 50, 51 to 75 and 76 to 100), and the master keeps the first subset (1 to 25) for himself, sends the second (26 to 50) to the first slave, the third one (51 to 75) to the second slave, and so on. The slaves should do the same. Then each performs a quicksort and the slaves return the sorted numbers to the master. I am hoping that this way the sort will be faster. The problem is that, as I said, I am a newbie and I need help with ideas, advice and even code so I can achieve my goal.
For this answer I am going to stick with the assumption that this should be done with Quicksort, and that the data is read on a single process. Just keep in mind that there are many sophisticated parallel sorting techniques.
Your idea of separating the numbers by subsets is problematic, because it makes assumptions about the distribution of the data. For non-uniformly distributed data sets it won't even help to know the minimum and maximum. It is better to simply send an equal number of elements to each process, let them sort, and afterwards merge the data.
For the merge you start with ntasks sorted sub-lists and want to end up with a single one. A naive merge would repeatedly look for the minimal element in each sub-list, remove it and append it to the final list. This needs ntasks * N comparisons, N swaps and 2 * N memory. You can optimize the comparisons to log2(ntasks) * N by doing an actual merge sort, but that also needs log2(ntasks) * N swaps. You can refine that further by keeping the sub-lists (or pointers to their first elements) in a priority queue, which should give you log2(ntasks) * N comparisons and N swaps. A sketch of that last variant is shown below.
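Here is a minimal C sketch of that priority-queue merge. The layout is an assumption for illustration: ntasks sorted runs of equal length len, stored back to back in src, are merged into dst; a binary min-heap over the heads of the runs plays the role of the priority queue.

#include <stdlib.h>

struct heap_item { int value; int list; };   /* current head of one sorted run */

static void sift_down(struct heap_item *h, int n, int i)
{
    for (;;) {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < n && h[l].value < h[m].value) m = l;
        if (r < n && h[r].value < h[m].value) m = r;
        if (m == i) break;
        struct heap_item tmp = h[i]; h[i] = h[m]; h[m] = tmp;
        i = m;
    }
}

/* Merge ntasks sorted runs of length len from src into dst (size ntasks * len). */
void kway_merge(const int *src, int *dst, int ntasks, int len)
{
    struct heap_item *heap = malloc(ntasks * sizeof *heap);
    int *pos = calloc(ntasks, sizeof *pos);       /* next unread index per run */
    int heap_size = 0;

    for (int t = 0; t < ntasks; ++t) {            /* seed the heap with the run heads */
        heap[heap_size].value = src[t * len];
        heap[heap_size].list  = t;
        ++heap_size;
    }
    for (int i = heap_size / 2 - 1; i >= 0; --i)  /* heapify */
        sift_down(heap, heap_size, i);

    for (int out = 0; out < ntasks * len; ++out) {
        int t = heap[0].list;
        dst[out] = heap[0].value;                 /* take the global minimum */
        if (++pos[t] < len)                       /* refill the root from the same run */
            heap[0].value = src[t * len + pos[t]];
        else                                      /* run exhausted: shrink the heap */
            heap[0] = heap[--heap_size];
        sift_down(heap, heap_size, 0);
    }
    free(heap);
    free(pos);
}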
About the usage of MPI:
Do not use MPI_Isend & MPI_Wait right after each other. In this case use MPI_Send instead. Use the immediate variants only if you can actually do something useful between the MPI_Isend and MPI_Wait.
Use collective operations whenever possible. To distribute data from the root to all slaves, use MPI_Scatter or MPI_Scatterv. The first requires all ranks to receive the same number of elements, which can also be achieved by padding. To collect data from the slaves on the master, use MPI_Gather or MPI_Gatherv.1 Collectives are easier to get right, because they describe the high-level operation, and their implementations are usually highly optimized. (A sketch combining these calls follows after the footnote below.)
To receive an unknown-size message, you can also send the message directly and use MPI_Probe at the receiver side to determine the size. You are even allowed to MPI_Recv with a buffer that is larger than the sent buffer, if you know an upper bound.
1 You could also consider the merge step as a reduction and parallelize the necessary computation for that.
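Putting those pieces together, here is a rough sketch of the scatter / local-sort / gather pattern. The element count and the assumption that it is divisible by the number of ranks are illustrative; otherwise use MPI_Scatterv/MPI_Gatherv or padding, as noted above. Error handling is omitted.

#include <mpi.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, ntasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    const int N = 1000000;                 /* assumed divisible by ntasks */
    int chunk = N / ntasks;
    int *all = NULL, *part = malloc(chunk * sizeof *part);

    if (rank == 0) {
        all = malloc(N * sizeof *all);
        for (int i = 0; i < N; ++i) all[i] = rand();   /* or read from the input file */
    }

    /* every rank, the root included, gets an equal share */
    MPI_Scatter(all, chunk, MPI_INT, part, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    qsort(part, chunk, sizeof *part, cmp_int);          /* local sort */

    /* collect the sorted runs on the root; they still need a k-way merge */
    MPI_Gather(part, chunk, MPI_INT, all, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* merge the ntasks sorted runs here, e.g. with the priority-queue
           merge sketched earlier, then write the result to the output file */
    }

    free(part);
    free(all);
    MPI_Finalize();
    return 0;
}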
In principle your solution looks very good. I don't completely understand whether, for the larger files, you intend to process them in chunks or as a whole. From my experience I suggest that you assign blocks to the slaves that are as large as possible. This way the rather expensive message-passing operations are executed only rarely.
What I cannot understand in your question is what the overall goal of your program is. Is it your intention to sort the complete input files in parallel? If this is the case you will need some sort of merge sort to be applied to the results you receive from the individual processes.

Re-use threads in CUDA

I have a large series of numbers in an array, about 150 MB of numbers, and I need to find consecutive sequences of numbers; the sequences might be from 3 to 160 numbers long. So to keep it simple, I decided that each thread should start at ThreadID = CellID.
So thread0 looks at cell0, and if the number in cell0 matches my sequence, then thread0 moves on to cell1 and so on; if the number does not match, the thread is stopped. I do that for my 20000 threads.
That works out fine, but I wanted to know how to reuse threads, because the array in which I'm looking for the series of numbers is much bigger.
So should I divide my array into smaller arrays, load them into shared memory, and loop over the number of smaller arrays (eventually padding the last one)? Or should I keep the big array in global memory and have my threads go from ThreadID = cellID to ThreadID = cellID + 20000, and so on? Or is there a better way to go through it?
To clarify: at the moment I use 20 000 threads, one array of numbers in global memory (150 MB), and a sequence of numbers in shared memory (e.g. 1,2,3,4,5), represented as an array. Thread0 starts at cell0 and checks whether cell0 in global memory is equal to cell0 in shared memory; if yes, thread0 compares cell1 in global memory to cell1 in shared memory, and so on until there is a full match.
If the numbers in the two cells (global and shared memory) are not equal, that thread is simply discarded, since most of the numbers in the global-memory array will not match the first number of my sequence. I thought it was a good idea to use one thread to match Cell_N in GM against Cell_N in ShM and overlap the threads, and this technique allows coalesced memory access the first time, since every thread from 0 to 19 999 accesses contiguous memory.
But what I would like to know is: what would be the best way to re-use the threads that have been discarded, or the threads that have finished matching, so that I can match the entire 150 MB array instead of just (20000 numbers + (length of sequence - 1))?
"What would be the best way to re-use the threads that have been discarded, or the threads that have finished matching, so that I can match the entire 150 MB array instead of just (20000 numbers + (length of sequence - 1))?"
You can re-use threads in a fashion similar to the canonical CUDA reduction sample (using the final implementation as a reference).
int idx = threadIdx.x + blockDim.x * blockIdx.x;   // globally unique thread index
while (idx < DSIZE) {                              // DSIZE = total number of cells to scan
    perform_sequence_matching(idx);                // attempt a match starting at this cell
    idx += gridDim.x * blockDim.x;                 // stride by the total thread count in the grid
}
In this way, with an arbitrary number of threads in your grid, you can cover an arbitrary problem size (DSIZE).
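For instance, the launch could look like the sketch below. The grid dimensions and the kernel name match_kernel (taking a device pointer and DSIZE, with the grid-stride loop above as its body) are only illustrative assumptions, roughly matching the 20 000 threads mentioned in the question.

// Hypothetical launch: a fixed-size grid whose grid-stride loop sweeps the
// whole array regardless of DSIZE.
int threadsPerBlock = 256;
int numBlocks = 80;                                    // 80 * 256 = 20480 threads
match_kernel<<<numBlocks, threadsPerBlock>>>(d_data, DSIZE);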

CUDA threads appending variable amounts of data to common array

My application takes millions of input records, each 8 bytes, and hashes each one into two or more output bins. That is, each input key K creates a small number of pairs (B1,K), (B2,K), ... The number of output bins per key is not known until the key is processed. It's usually 2 but could occasionally be 10 or more.
All those output pairs need to be eventually stored in one array since all the keys in each bin will later be processed together. How to do this efficiently?
Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow. Another obvious method would be to init a hash table as an array of pointers to some sort of storage per bin. That looks slower.
I'm thinking of pre-reserving 2 pairs per input record in a block shared array, then grabbing more space as needed (i.e., a reimplementation of the STL vector reserve operation), then having the last thread in each block copying the block shared array to global memory.
However I'm not looking forward to implementing that. Help? Thanks.
Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow.
You could reserve chunks of the global array instead of one entry at a time. In other words, you could have a large array and let each thread start with 10 possible output entries. If a thread overflows, it requests the next available chunk from the global array. If you're worried about the speed of a single atomic counter, you could use 10 atomic counters for 10 portions of the array and distribute the accesses. If one gets full, find another one.
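A rough CUDA sketch of that chunked reservation, with all names illustrative: out is the big global pair array, next_chunk a global counter zeroed by the host, and a thread grabs CHUNK slots with one atomicAdd and fills them locally.

#define CHUNK 10                                     // output slots reserved per request

__device__ unsigned int next_chunk;                  // set to 0 by the host before the launch

__device__ void emit_pair(ulonglong2 *out,           // big global output array
                          unsigned int *base,        // per-thread: start of my chunk
                          unsigned int *used,        // per-thread: slots already filled
                          unsigned long long bin,
                          unsigned long long key)
{
    if (*used == CHUNK) {                            // chunk full (or none reserved yet)
        *base = atomicAdd(&next_chunk, 1u) * CHUNK;  // reserve the next CHUNK slots at once
        *used = 0;
    }
    out[*base + (*used)++] = make_ulonglong2(bin, key);
}

// In the kernel, each thread initialises its state as
//     unsigned int base = 0, used = CHUNK;
// so that its first emit_pair() call reserves a fresh chunk.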
I'm also considering processing the data twice: the 1st time just to determine the number of output records for each input record. Then allocate just enough space and finally process all the data again.
This is another valid method. The bottleneck is calculating the offset of each thread into the global array once you have the total number of results for each thread. I haven't figured out a reasonable parallel way to do that.
The last option I can think of would be to allocate a large array, distribute it based on blocks, and use a shared atomic int (which would help with slow global atomics). If you run out of space, mark that the block didn't finish, and mark where it left off. On your next iteration, complete the work that hasn't been finished.
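A sketch of that per-block variant, again with illustrative names: each block owns a slice of the output array of BLOCK_SLOTS entries and counts into it with a shared-memory atomic; an overflow flag records unfinished work for a follow-up pass.

#define BLOCK_SLOTS 4096                               // output slots owned by each block

__global__ void append_kernel(const unsigned long long *keys, int n,
                              ulonglong2 *out, int *overflowed)
{
    __shared__ unsigned int count;                     // pairs written by this block so far
    if (threadIdx.x == 0) count = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        unsigned long long bin = 0, key = keys[i];     // placeholder (bin, key) pair
        unsigned int slot = atomicAdd(&count, 1u);     // shared-memory atomic, not global
        if (slot < BLOCK_SLOTS)
            out[blockIdx.x * BLOCK_SLOTS + slot] = make_ulonglong2(bin, key);
        else
            overflowed[blockIdx.x] = 1;                // mark this block for another pass
    }
}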
The downside of the distributed portions of global memory is, as talonmies said, that you need a gather or compaction step afterwards to make the results dense.
Good luck!

Matrix Multiplication multiprocess in C

I want to do matrix multiplication using multiple processes via fork and shared memory, with each process computing one row for small matrices. For larger matrices it is not possible to create a process for each row, so each process should compute a block of rows determined by the size. For example, up to 10 rows it should calculate one row per process, and after that, say for 20 rows, one process should calculate 4 rows each. I am unable to program this, since I can no longer simply take the number of rows as the number of processes. Suppose I keep the number of processes constant, say 8; then each block would have N/8 rows, but then the size of the matrix would have to be a multiple of 8, and the number of processes should be variable. Suppose the machine has 6 CPUs: can I take the number of processes to be a constant, i.e. 6? What would be the right approach, and how should I write it?
Here's some example code which demonstrates matrix mult. in pthreads. I found it almost instantly in a search engine. It shows a method for doing what you describe.
http://www.cs.arizona.edu/classes/cs422/spring13/examples/matmult-dyn.c
You'll probably need to do some fine tuning of it to determine what is the best approach.
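If you do want to stay with fork and shared memory rather than pthreads, here is a minimal sketch of the row-block partitioning. The matrix size, process count and mmap-based shared region are assumptions for illustration (not taken from the linked example), and error checking is omitted; note that the lo/hi computation splits the rows into nearly equal blocks even when N is not a multiple of the number of processes.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define N 100            /* matrix dimension (illustrative) */
#define NPROC 6          /* fixed number of worker processes, e.g. one per CPU */

int main(void)
{
    /* A, B and C live in one anonymous shared mapping so all children see them */
    double (*A)[N] = mmap(NULL, 3 * N * N * sizeof(double),
                          PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    double (*B)[N] = A + N;
    double (*C)[N] = B + N;

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) { A[i][j] = i + j; B[i][j] = i - j; }

    /* split the N rows into NPROC nearly equal blocks; works for any N */
    for (int p = 0; p < NPROC; ++p) {
        int lo = p * N / NPROC, hi = (p + 1) * N / NPROC;
        if (fork() == 0) {                 /* child computes rows [lo, hi) */
            for (int i = lo; i < hi; ++i)
                for (int j = 0; j < N; ++j) {
                    double s = 0.0;
                    for (int k = 0; k < N; ++k) s += A[i][k] * B[k][j];
                    C[i][j] = s;
                }
            _exit(0);
        }
    }
    for (int p = 0; p < NPROC; ++p) wait(NULL);   /* parent waits for all children */

    printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}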
You probably should also read this article:
http://aristeia.com/TalkNotes/PDXCodeCamp2010.pdf
