I'd like to improve the efficiency of a code which includes updates to every value of an array which is identical on all processors run with MPI. The basic structure I have now is to memcpy chunks of the data into a local array on each processor, operate on those, and Allgatherv (have to use "v" because the size of local blocks isn't strictly identical).
In C this would look something like:
/* counts gives the parallelization, counts[RANK] is the local memory size */
/* offsets gives the index in the global array to the local processors */
memcpy (&local_memory[0], &total_vector[offsets[RANK]], counts[RANK] * sizeof (double));
for (i = 0; i < counts[RANK]; i++)
    local_memory[i] = new_value;
MPI_Allgatherv (&local_memory[0], counts[RANK], MPI_DOUBLE, &total_vector[0], counts, offsets, MPI_DOUBLE, MPI_COMM_WORLD);
As it turns out, this isn't very efficient. In fact, it's really freaking slow, so bad that for most system sizes I'm interested in the parallelization doesn't lead to any increase in speed.
I suppose an alternative to this would be to update just the local chunks of the global vector on each processor and then broadcast the correct chunk of memory from the correct task to all other tasks. While this avoids the explicit memory handling, the communication cost of the broadcast has to be pretty high. It's effectively all-to-all.
EDIT: I just went and tried this solution, where you have to loop over the number of tasks and execute that number of broadcast statements. This method is even worse.
Anyone have a better solution?
The algorithm you describe is "all to all." Each rank updates part of a larger array, and all ranks must sync that array from time to time.
If the updates happen at controlled points in the program flow, a Gather/Scatter pattern might be beneficial. All ranks send their update to "rank 0", and rank 0 sends the updated array to everyone else. Depending on the array size, number of ranks, interconnect between the ranks, etc., this pattern may offer less overhead than the Allgatherv.
My problem is about computing sums over several arrays of the same length. For example, I have a float array of total length M*N (100 * 2000), and I would like to get M (100) sums, one for each run of N (2000) floats. I found two ways to do this. One is to call a cuBLAS function such as cublasSasum in a for loop over M. The other is a self-written kernel that adds the numbers in a loop. My question is about the speed of these two approaches and how to choose between them.
For the cuBLAS method, no matter how big N is (4000 to 2E6), the time taken depends mainly on M, the number of loop iterations.
For the self-written kernel, the speed varies a lot with N. In detail, if N is small (below 5000), it runs much faster than the cuBLAS way. The time then increases as N increases:
N | 4000  | 10000 | 40000  | 80000  | 1E6    | 2E6
t | 254ms | 422ms | 1365ms | 4361ms | 5399ms | 10635ms
If N is big enough, it runs much slower than the cuBLAS way. My question is: how can I predict from M and N which way I should use? My code might be used on different GPU devices. Must I run a parameter sweep on every GPU device and then "guess" which to choose, or can I infer the choice from the GPU device information?
Besides, for the kernel approach, I also have trouble deciding the blockSize and gridSize. I must note here that what I care about more is speed, not efficiency, because memory is limited. For example, suppose I have 8 GB of memory and my data format is 4-byte float. If N is 1E5, then M is at most 2E4, which is smaller than the MaxGridSize. So I have the two launch options shown below. I found that a bigger gridSize is always better, and I don't know the reason. Is it about the number of registers used per thread? But I don't think this kernel needs many registers per thread.
Any suggestion or information would be appreciated. Thank you.
cuBLAS way:
for (int j = 0; j < M; j++)
    cublasStatus = cublasSasum(cublasHandle, N, d_in + N*j, 1, d_out + j);
Self-written kernel way:
__global__ void getSum(int M, int N, float* in, float* out)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < M) {
        float tmp = 0;
        for (int ii = 0; ii < N; ii++) {
            tmp += *(in + N*i + ii);
        }
        out[i] = tmp;
    }
}
Bigger gridSize is faster. I don't know the reason.
getSum<<<M,1>>>(M, N, d_in, d_out); //faster
getSum<<<1,M>>>(M, N, d_in, d_out);
This is the result of a blockSize-versus-time parameter sweep, with M = 1E4 and N = 1E5.
cudaEventRecord(start, 0);
//blockSize = 1:1024;
int gridSize = (M + blockSize - 1) / blockSize;
getSum<<<gridSize1,blockSize1>>>...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
It seems I should choose a relatively small blockSize, like 10 to 200. I would just like to know why full occupancy (blockSize 1024) is slower. I am posting here to ask about possible reasons: number of registers? Latency?
Using cuBLAS is generally a very good idea and should be preferred if there is a dedicated function doing what you want directly, especially for large datasets. That being said, your timings are very bad for a GPU kernel working on such a small dataset. Let us understand why.
Bigger gridSize is faster. I don't know the reason.
getSum<<<M,1>>>(M, N, d_in, d_out);
getSum<<<1,M>>>(M, N, d_in, d_out);
The syntax for calling a CUDA kernel is kernel<<<numBlocks, threadsPerBlock>>>. Thus the first line submits a kernel with M blocks of 1 thread. Don't do that: this is very inefficient. Indeed, the CUDA programming manual says:
The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors. [...]
The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. [...]
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.
As a result, the first call creates M blocks of 1 thread, wasting 31 of the 32 CUDA cores available to each warp. It means that you will likely reach only about 3% of the peak performance of your GPU...
The second call creates one block of M threads. Because M is not a multiple of 32, a few CUDA cores are wasted. Moreover, it uses only 1 SM out of the many available on your GPU because you have only one block. Modern GPUs have dozens of SMs (my GTX-1660S has 22 SMs). This means that you will use only a tiny fraction of your GPU's capability (a few %). Not to mention that the memory access pattern is not contiguous, which slows down the computation even more...
If you want to use your GPU much more efficiently, you need to provide more parallelism and waste fewer resources. You can start by writing a kernel working on a 2D grid performing a reduction using atomics. This is not perfect, but much better than your initial code. You should also take care to read memory contiguously (threads sharing the same warp should read/write a contiguous block of memory).
Please read the CUDA manual or tutorials carefully before writing CUDA code. They describe all of this very well and accurately.
UPDATE:
Based on the new information, the problem you are experiencing with the blockSize is likely due to the strided memory accesses in the kernel (more specifically the N*i). Strided memory access patterns are slow and generally get slower as the stride gets bigger. In your kernel, each thread accesses a different block of memory. GPUs (and actually most hardware computing units) are optimized for accessing contiguous chunks of data, as previously said. If you want to solve this problem and get faster results, you need to work on the other dimension in parallel (so not M but N).
Furthermore, the BLAS calls are inefficient because each iteration of the loop on the CPU launches a kernel on the GPU. Launching a kernel introduces quite a big overhead (typically from a few microseconds up to ~100 us). Thus doing this in a loop that runs tens of thousands of times will be very slow.
I am a newbie, trying to edit a program. I have an MPI program that divides an array into subsets; the master sends the subsets to the slaves, they do a quicksort and then return the sorted numbers to the master so it can write them to a file.
What I am trying to do is make the quicksort happen even quicker. My idea is to have the master divide the array and send subsets to the slaves while keeping one for itself. Then it divides them again into new subsets (for example, if we have the numbers from 1 to 100 in the array, the new subsets should be 1 to 25, 26 to 50, 51 to 75 and 76 to 100), keeps the first subset (1 to 25) for itself, sends the second (26 to 50) to the first slave, the third (51 to 75) to the second slave, and so on. The slaves should do the same. Then each one should perform a quicksort, and the slaves should return the sorted numbers to the master. I am hoping that the sort will be faster this way. The problem is that, as I said, I am a newbie, and I need help with ideas, advice and even code so I can achieve my goal.
For this answer I am going to stick with the assumption that this should be done with Quicksort, and that the data is read on a single process. Just keep in mind that there are many sophisticated parallel sorting techniques.
Your idea of separating the numbers by subsets is problematic, because it makes assumptions about the shape of the data. For non-uniformly distributed data sets it won't even help to know the minimum and maximum. It is better to simply send out an equal number of elements to each process, let them sort, and merge the data afterwards.
For the merge you start with ntasks sorted sub-lists and want to end up with a single one. A naive merge would repeatedly look for the minimal element in each sub-list, remove that and append it to the final list. This needs ntasks * N comparisons, N swaps and N * 2 memory. You can optimize the comparisons to log2(ntasks) * N by doing an actual merge sort, but that also needs log2(ntasks) * N swaps. You can further refine that by keeping the sub-lists (or pointers to their first element) in a priority queue, which should give you log2(ntasks) * N comparisons and N swaps.
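As a rough illustration of the priority-queue variant, here is a minimal sketch in plain C (no MPI), assuming the sorted sub-lists have already been collected on one process. The names kway_merge, heap_item and sift_down are made up for this example and do not come from the question's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One heap entry: the current head of one sorted sub-list. */
typedef struct {
    int value;    /* smallest not-yet-merged element of the sub-list */
    size_t list;  /* index of the sub-list it came from */
    size_t pos;   /* position of that element inside the sub-list */
} heap_item;

/* Restore the min-heap property downwards from index i. */
static void sift_down(heap_item *heap, size_t size, size_t i)
{
    for (;;) {
        size_t smallest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < size && heap[l].value < heap[smallest].value) smallest = l;
        if (r < size && heap[r].value < heap[smallest].value) smallest = r;
        if (smallest == i) return;
        heap_item tmp = heap[i];
        heap[i] = heap[smallest];
        heap[smallest] = tmp;
        i = smallest;
    }
}

/* Merge k sorted sub-lists into out (which must hold the total length).
   Returns the number of elements written, or 0 if allocation fails. */
size_t kway_merge(const int *const *lists, const size_t *lens, size_t k, int *out)
{
    assert(lists != NULL && lens != NULL && out != NULL);
    heap_item *heap = malloc((k ? k : 1) * sizeof *heap);
    size_t size = 0, n = 0;
    if (heap == NULL) return 0;

    /* Seed the heap with the first element of every non-empty sub-list. */
    for (size_t j = 0; j < k; j++)
        if (lens[j] > 0) heap[size++] = (heap_item){ lists[j][0], j, 0 };
    for (size_t i = size / 2; i-- > 0; ) sift_down(heap, size, i);

    while (size > 0) {
        heap_item *top = &heap[0];
        out[n++] = top->value;                 /* one write per element */
        if (top->pos + 1 < lens[top->list]) {  /* advance within the same sub-list */
            top->pos++;
            top->value = lists[top->list][top->pos];
        } else {                               /* sub-list exhausted: shrink the heap */
            heap[0] = heap[--size];
        }
        sift_down(heap, size, 0);              /* O(log k) comparisons */
    }
    free(heap);
    return n;
}
```

Each output element costs one heap adjustment, so the work is on the order of log2(ntasks) comparisons per element, with each element copied once.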
About the usage of MPI:
Do not use MPI_Isend & MPI_Wait right after each other. In this case use MPI_Send instead. Use the immediate variants only if you can actually do something useful between the MPI_Isend and MPI_Wait.
Use collective operations whenever possible. To distribute data from the root to all slaves, use MPI_Scatter or MPI_Scatterv. The first requires all ranks to receive the same number of elements, which can also be achieved by padding. To collect data from the slaves in the master, use MPI_Gather or MPI_Gatherv [1]. Collectives are easier to get right, because they describe the high-level operation. Their implementation is usually highly optimized.
To receive an unknown-size message, you can also send the message directly and use MPI_Probe at the receiver side to determine the size. You are even allowed to MPI_Recv with a buffer that is larger than the sent buffer, if you know an upper bound.
[1] You could also consider the merge step as a reduction and parallelize the necessary computation for that.
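If the element count does not divide evenly and you go with the "v" variants instead of padding, you need per-rank counts and displacements. The following is a small sketch of one way to compute them in plain C; block_partition is a hypothetical helper name, and the even-split policy (the first n % ntasks ranks get one extra element) is just an assumption, not something the question prescribes.

```c
#include <assert.h>

/* Hypothetical helper: split n elements over ntasks ranks as evenly as possible.
   counts[r]  = number of elements assigned to rank r
   offsets[r] = index of rank r's first element in the global array */
void block_partition(int n, int ntasks, int *counts, int *offsets)
{
    assert(n >= 0 && ntasks > 0);
    int base = n / ntasks;   /* every rank gets at least this many */
    int extra = n % ntasks;  /* the first 'extra' ranks get one more */
    int next = 0;
    for (int r = 0; r < ntasks; r++) {
        counts[r] = base + (r < extra ? 1 : 0);
        offsets[r] = next;
        next += counts[r];
    }
}
```

The two arrays can then be passed as the counts and displacements arguments of MPI_Scatterv and MPI_Gatherv.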
In principle your solution looks very good. I don't understand completely if for the larger files you are intending to process them in chunks or as a whole. From my experience I suggest that you assign as large as possible blocks to the slaves. This way the rather expensive message passing operations are executed only very seldom.
What I cannot understand in your question is what the overall goal of your program is. Is it your intention to sort the complete input files in parallel? If this is the case you will need some sort of merge sort to be applied to the results you receive from the individual processes.
Suppose I have an array of 1,000,000 elements, and a number of worker threads each manipulating data in this array. The worker threads might be updating already populated elements with new data, but each operation is limited to a single array element, and is independent of the values of any other element.
Using a single mutex to protect the entire array would clearly result in high contention. On the other extreme, I could create an array of mutexes that is the same length as the original array, and for each element array[i] I would lock mutex[i] while operating on it. Assuming an even distribution of data, this would mostly eliminate lock contention, at the cost of a lot of memory.
I think a more reasonable solution would be to have an array of n mutexes (where 1 < n < 1000000). Then for each element array[i] I would lock mutex[i % n] while operating on it. If n is sufficiently large, I can still minimize contention.
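A minimal sketch of this striping scheme with pthreads, with the sizes scaled down and all names (striped_init, striped_add, striped_get, N_ELEMENTS, N_LOCKS) invented for illustration only:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define N_ELEMENTS 1000  /* assumption: scaled down from 1,000,000 */
#define N_LOCKS    16    /* assumption: n, the number of stripes */

static long data[N_ELEMENTS];
static pthread_mutex_t locks[N_LOCKS];

/* Initialise every stripe lock; returns 0 on success. */
int striped_init(void)
{
    for (size_t j = 0; j < N_LOCKS; j++)
        if (pthread_mutex_init(&locks[j], NULL) != 0) return -1;
    return 0;
}

/* Update element i while holding only the lock of its stripe (i % N_LOCKS). */
void striped_add(size_t i, long delta)
{
    assert(i < N_ELEMENTS);
    pthread_mutex_t *m = &locks[i % N_LOCKS];
    pthread_mutex_lock(m);
    data[i] += delta;
    pthread_mutex_unlock(m);
}

/* Read element i under the same stripe lock. */
long striped_get(size_t i)
{
    assert(i < N_ELEMENTS);
    pthread_mutex_t *m = &locks[i % N_LOCKS];
    pthread_mutex_lock(m);
    long v = data[i];
    pthread_mutex_unlock(m);
    return v;
}
```

Two threads only contend when their indices map to the same stripe, so elements 3 and 19 share a lock here while 3 and 4 do not.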
So my question is, is there a performance penalty to using a large (e.g. >= 1000000) number of mutexes in this manner, beyond increased memory usage? If so, how many mutexes can you reasonably use before you start to see degradation?
I'm sure the answer to this is somewhat platform specific; I'm using pthreads on Linux. I'm also working on setting up my own benchmarks, but the scale of data that I'm working on makes that time consuming, so some initial guidance would be appreciated.
That was the initial question. For those asking for more detailed information regarding the problem, I have 4 multiple GB binary data files describing somewhere in the neighborhood of half a billion events that are being analyzed. The array in question is actually the array of pointers backing a very large chained hash table. We read the four data files into the hash table, possibly aggregating them together if they share certain characteristics. The existing implementation has 4 threads, each reading one file and inserting records from that file into the hash table. The hash table has 997 locks and 997*9973 = ~10,000,000 pointers. When inserting an element with hash h, I first lock mutex[h % 997] before inserting or modifying the element in bucket[h % 9943081]. This works all right, and as far as I can tell, we haven't had too many issues with contention, but there is a performance bottleneck in that we're only using 4 cores of a 16 core machine. (And even fewer as we go along since the files generally aren't all the same size.) Once all of the data has been read into memory, then we analyze it, which uses new threads and a new locking strategy tuned to the different workload.
I'm attempting to improve the performance of the data load stage by switching to a thread pool. In the new model, I still have one thread for each file which simply reads the file in ~1MB chunks and passes each chunk to a worker thread in the pool to parse and insert. The performance gain so far has been minimal, and the profiling that I did seemed to indicate that the time spent locking and unlocking the array was the likely culprit. The locking is built into the hash table implementation we are using, but it does allow specifying the number of locks to use independently of the size of the table. I'm hoping to speed things up without changing the hash table implementation itself.
(A very partial & possibly indirect answer to your question.)
I once took a huge performance hit trying this (on CentOS) when raising the number of locks from a prime of ~1K to a prime of ~1M. While I never fully understood the reason, I eventually figured out (or just convinced myself) that it's the wrong question.
Suppose you have an array of length M, with n workers. Furthermore, you use a hash function to protect the M elements with m < M locks (e.g., by some random grouping). Then, using the Square Approximation to the Birthday Paradox, the chance of a collision between two workers - p - is given by:
p ≈ n^2 / (2m)
It follows that the number of mutexes you need, m, does not depend on M at all - it is a function of p and n only.
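To put numbers on this, here is a tiny sketch that evaluates the approximation in both directions. The function names are made up for the example, and the formula is only the rough estimate quoted above, so it is meaningful for small p only.

```c
#include <assert.h>

/* Approximate collision probability for n workers and m locks: p ~ n^2 / (2m). */
double collision_probability(double n_workers, double m_locks)
{
    assert(m_locks > 0.0);
    return (n_workers * n_workers) / (2.0 * m_locks);
}

/* Same formula solved for m: m ~ n^2 / (2p). */
double locks_needed(double n_workers, double p_target)
{
    assert(p_target > 0.0);
    return (n_workers * n_workers) / (2.0 * p_target);
}
```

For example, with 16 workers and a target collision chance of about 1%, this estimate gives roughly 12,800 locks, whatever the value of M.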
Under Linux there is no cost other than the memory associated with more mutexes.
However, remember that the memory used by your mutexes must be included in your working set - and if your working set size exceeds the relevant cache size, you'll see a significant performance drop. This means that you don't want an excessively sized mutex array.
As Ami Tavory points out, the contention depends on the number of mutexes and number of threads, not the number of data elements protected - so there's no reason to link the number of mutexes to the number of data elements (with the obvious proviso that it never makes sense to have more mutexes than elements).
In the general scenario, I would advise
Simply locking the whole array (simple, very often "good enough" if your application is mostly doing "other stuff" besides accessing the array)
... or ...
Implementing a read/write lock on the entire array (assuming reads equal or exceed writes)
Apparently your scenario doesn't match either case.
Q: Have you considered implementing some kind of a "write queue"?
Worst case, you'd only need one mutex. Best case, you might even be able to use a lock-less mechanism to manage your queue. Look here for some ideas that might be applicable: https://msdn.microsoft.com/en-us/library/windows/desktop/ee418650%28v=vs.85%29.aspx
I have an int array[100] and I want 5 threads to calculate the sum of all array elements.
Each thread iterates through 20 elements within its dedicated range and writes the sum into a global sum variable.
Is a mutex necessary here? There is no synchronization needed since all threads are reading from independent sources.
for(i=offset; i<offset+range; i++){
// not used pthread_mutex_lock(&mutex);
sum += array[i];
// not used pthread_mutex_unlock(&mutex);
}
Can this lead to unpredictable behavior or does the OS actually handle this?
Is it advisable to leave out the mutex in this case? I've noticed that those algorithms run a lot faster without it.
Yes, you need synchronization, because all threads are modifying sum at the same time. Here's an example:
You have an array of 4 elements [a1, a2, a3, a4], 2 threads t1 and t2, and sum. To begin, let's say t1 gets the value a1 and adds it to sum. But this is not an atomic operation: t1 copies the current value of sum (which is 0) to its local space, let's call it t1_s, adds a1 to it, and then writes sum = t1_s. At the same time t2 does the same: it reads the value of sum (which is 0, because t1 has not completed its operation) into t2_s, adds a3, and writes the result to sum. So sum ends up holding a3 instead of a1 + a3. This is called a data race.
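The interleaving can be replayed deterministically in a few lines of C. This is only a single-threaded simulation of the unlucky schedule (no real threads are involved), and lost_update_demo is a name made up for the sketch.

```c
#include <assert.h>

/* Replays the schedule described above step by step and returns the final sum. */
int lost_update_demo(int a1, int a3)
{
    int sum = 0;
    int t1_s = sum;  /* t1 reads sum (0) into its local copy */
    int t2_s = sum;  /* t2 reads sum (still 0) before t1 has written back */
    t1_s += a1;      /* t1 adds a1 locally */
    t2_s += a3;      /* t2 adds a3 locally */
    sum = t1_s;      /* t1 writes back: sum == a1 */
    sum = t2_s;      /* t2 writes back: sum == a3, t1's update is lost */
    return sum;
}
```

With a1 = 10 and a3 = 32 the function returns 32, not the 42 that a correctly synchronized program would produce.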
There are multiple solutions to this:
You can use a mutex as you already did in your code, but as you mentioned it can be slow, since mutex locks are expensive and all other threads are waiting for it.
Create an array (with one slot per thread) to calculate local sums for all threads, and then do the final reduction over this array in one thread. No synchronization is needed.
Without an array, calculate a local sum_local in each thread and at the end add all these sums to the shared variable sum using a mutex. I guess it will be faster (however, this needs to be checked).
However, as @gavinb mentioned, all of this makes sense only for larger amounts of data.
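A minimal sketch of the second option (one partial-sum slot per thread, then a final reduction in the calling thread), assuming pthreads. The names sum_task, sum_worker, parallel_sum and MAX_THREADS are illustrative and not taken from the question.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16  /* assumption: upper bound for this sketch */

typedef struct {
    const int *data;  /* start of this thread's slice */
    size_t count;     /* number of elements in the slice */
    long partial;     /* written only by the owning thread */
} sum_task;

static void *sum_worker(void *arg)
{
    sum_task *t = arg;
    long local = 0;   /* accumulate locally, write the slot once at the end */
    for (size_t i = 0; i < t->count; i++) local += t->data[i];
    t->partial = local;
    return NULL;
}

/* Sum n ints using nthreads workers; no mutex is involved. */
long parallel_sum(const int *array, size_t n, size_t nthreads)
{
    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    pthread_t tids[MAX_THREADS];
    sum_task tasks[MAX_THREADS];
    int started[MAX_THREADS];
    size_t chunk = n / nthreads, rest = n % nthreads, begin = 0;
    long total = 0;

    for (size_t t = 0; t < nthreads; t++) {
        tasks[t].data = array + begin;
        tasks[t].count = chunk + (t < rest ? 1 : 0);
        tasks[t].partial = 0;
        begin += tasks[t].count;
        started[t] = (pthread_create(&tids[t], NULL, sum_worker, &tasks[t]) == 0);
        if (!started[t]) sum_worker(&tasks[t]);  /* fallback: sum this slice inline */
    }
    for (size_t t = 0; t < nthreads; t++) {
        if (started[t]) pthread_join(tids[t], NULL);  /* join before reading the slot */
        total += tasks[t].partial;                    /* final reduction in one thread */
    }
    return total;
}
```

The only synchronization left is pthread_join, which guarantees each partial result is complete before the calling thread reads it.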
I have an int array[100] and I want 5 threads to calculate the sum of all array elements. Each thread iterates through 20 elements within its dedicated range and writes the sum into a global sum variable.
First of all, it's worth pointing out that the overhead of this many threads processing this small amount of data would probably not be an advantage. There is a cost to creating threads, serialising access, and waiting for them to finish. With a dataset this small, a well-optimised sequential algorithm is probably faster. It would be an interesting exercise to measure the speedup with varying numbers of threads.
Is a mutex necessary here? There is no synchronization needed since all threads are reading from independent sources.
Yes - the reading of the array variable is independent, however updating the sum variable is not, so you would need a mutex to serialise access to sum, according to your description above.
However, this is a very inefficient way of calculating the sum, as each thread will be competing (and waiting, hence wasting time) for access to increment sum. If you calculate intermediate sums for each subset (as @Werkov also mentioned), then wait for them to complete and add the intermediate sums to create the final sum, there will be no contention reading or writing, so you wouldn't need a mutex and each thread could run as quickly as possible. The limiting factor on performance would then likely be memory access pattern and cache behaviour.
Can this lead to unpredictable behavior or does the OS actually handle this?
Yes, definitely. The OS will not handle this for you as it cannot predict how/when you will access different parts of memory, and for what reason. Shared data must be protected between threads whenever any one of them may be writing to the data. So you would almost certainly get the wrong result as threads trip over each other updating sum.
Is it advisable to leave out the mutex in this case? I've noticed that those algorithms run a lot faster without it.
No, definitely not. It might run faster, but it will almost certainly not give you the correct result!
Leaving out the mutex works in cases where it is possible to partition the data in such a way that there aren't dependencies (i.e. reads/writes) across partitions. In your example, there is a dependency on the sum variable, so a mutex is necessary. However, you can have a partial sum accumulator for each thread and then only sum these sub-results, without the need for a mutex.
Of course, you needn't do this by hand. There are various implementations of this; for instance, see OpenMP's parallel for and reduction.
So I am learning about parallel programming and am writing a program to calculate a global sum of a list of numbers. The list is split up into several sublists (depending on how many cores I have), and the lists are individually summed in parallel. After each core has its own sum, I use MPI_Reduce to send the values back to other cores, until they eventually make it back to root. Rather than just sending their values back to root directly (O(n)), we send them back to other cores in parallel (O(log(n))), like this image illustrates: http://imgur.com/rL2O3Tr
So, everything is working fine until line 54. I think I may be misunderstanding MPI_Reduce. I was under the impression MPI_Reduce simply took a value in one thread, and a value in another thread (destination thread), and executed an operation on the value, and then stored it in the same spot in the second thread. This is what I want at least. I want to take my_sum from the sending thread, and add it to the my_sum in the receiving thread. Can you use MPI_Reduce on the same addresses in different threads? They both have the same name.
Furthermore, I want to generate a binary tree representation like this: http://imgur.com/cz6iFxl
Where S02 means that the sum was sent to thread 2, and R03 means that the sum was received by thread 3. For this I am creating an array of structs for each step in the sums (log(n) steps). Each step occurs on lines 59 - 95; each iteration of the while loop is one step. Lines 64-74 are where the thread is sending its sum to the destination thread, and recording the information in the array of structs.
I think I may be using MPI_Send the wrong way. I am using it like this:
MPI_Send(srInfo, 1, MPI_INT, root, 0, MPI_COMM_WORLD);
Where srInfo is an array of structs, so just a pointer to the first struct (right?). Will this not work because the memory is not shared?
Sorry I am very new to parallel programming, and just need help understanding this, thanks.
You might be misunderstanding what MPI_REDUCE is supposed to do at a higher level. Is there a reason that you really need to divide up your reduction manually? Usually, the MPI collectives are going to be better at optimizing for large-scale communicators than you will be able to do on your own. I'd suggest just using the MPI_REDUCE function to do the reduction for all ranks.
So your code will do something like this:
Divide up the work among all of your ranks somehow (could be reading from a file, being sent from some "root" process to all of the others, etc.).
Each rank sums up its own values.
Each rank enters into an MPI_REDUCE with its own value. This would look something like:
MPI_Reduce(&myval, &sum, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
That should automatically do all of the summation for you in what is usually some sort of tree fashion.
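To make "tree fashion" concrete, here is a small simulation in plain C (no MPI calls) of a binary-tree sum over an array holding one value per rank. It is only an illustration: MPI implementations are free to pick their own algorithm, and tree_reduce_sum is a name invented for this sketch.

```c
#include <assert.h>

/* Simulates a binary-tree sum over nranks values (one per "rank").
   In each step with stride 1, 2, 4, ..., every rank r that is a multiple of
   2*stride adds the value of rank r + stride, if that rank exists.
   The total ends up in vals[0]; the return value is the number of steps. */
int tree_reduce_sum(long *vals, int nranks)
{
    assert(vals != 0 && nranks >= 1);
    int steps = 0;
    for (int stride = 1; stride < nranks; stride *= 2) {
        for (int r = 0; r + stride < nranks; r += 2 * stride)
            vals[r] += vals[r + stride];  /* rank r "receives" from r + stride */
        steps++;
    }
    return steps;
}
```

With 8 ranks the total reaches rank 0 after 3 steps, which is the O(log(n)) behaviour described in the question, without any hand-written sends and receives.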