I have a parallelized algorithm that can output a random number from 1 to 1000.
My objective is to compute, for N executions of the algorithm, how many times each number is chosen.
So, for instance, I run the algorithm N/100 times on each of 100 threads, and the final result is an array of 1000 ints holding the number of occurrences of each number.
Is there a way to parallelize this intelligently? For instance, if I only use one global array, I will have to lock it every time I want to write to it, which will make my algorithm run almost as if there were no parallelization. On the other hand, I can't just give each thread its own array of 1000 counters, only to have them be about 1% filled and have to merge them all at the end.
This appears to be histogramming. If you want to do it quickly, use a library such as CUB or Thrust.
For cases where there are a small number of bins, one approach is to have each thread operate on its own set of bins, for a segment of the input. Then do a parallel reduction on each bin. If you are clever about the storage organization of your bins, the parallel reduction amounts to summation of matrix columns:
Bins:       1    2    3    4   ...  1000
Thread   1
         2
         3
         .
         .
       100
In the above example, each thread takes a segment of the input, and operates on one row of the partial sums matrix.
When all threads are finished with their segments, then sum the columns of the matrix, which can be done very efficiently and quickly with a simple for-loop kernel.
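As a concrete illustration of that layout on the CPU side, here is a minimal sketch in C with OpenMP (illustrative names only; it assumes the generated numbers are already available in a samples array, and each thread fills its own row of partial counts before the columns are summed):

#include <omp.h>
#include <stdlib.h>

#define NBINS 1000

void histogram(const int *samples, long n, long counts[NBINS])
{
    int nthreads = omp_get_max_threads();
    /* one row of NBINS partial counts per thread */
    long *partial = calloc((size_t)nthreads * NBINS, sizeof *partial);

    #pragma omp parallel
    {
        long *row = partial + (size_t)omp_get_thread_num() * NBINS;
        #pragma omp for
        for (long i = 0; i < n; i++)
            row[samples[i] - 1]++;            /* values are 1..1000 */
    }

    /* "sum the columns": combine the per-thread rows into the final counts */
    for (int b = 0; b < NBINS; b++) {
        long sum = 0;
        for (int t = 0; t < nthreads; t++)
            sum += partial[(size_t)t * NBINS + b];
        counts[b] = sum;
    }
    free(partial);
}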
There are a couple of things you can do.
If you want to be as portable as possible, you could have one lock for each index.
If this is being run on a Windows system, I would suggest InterlockedIncrement.
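For reference, here is a portable sketch of the per-bin counter idea using C11 atomics, which play the same role that InterlockedIncrement plays on Windows (illustrative names, not a complete program):

#include <stdatomic.h>

#define NBINS 1000

/* one atomic counter per possible value; static storage starts at zero */
static atomic_long counts[NBINS];

void record(int value)                /* value is in 1..1000 */
{
    atomic_fetch_add_explicit(&counts[value - 1], 1, memory_order_relaxed);
}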
This is a follow-up to this question.
I have an array D(:,:,:) of size NxMxM. Typically, for the problem that I am considering now, it is M=400 and N=600000 (I reshaped the array in order to give the biggest size to the first entry).
Therefore, for each value l of the first entry, D(l,:,:) is an MxM matrix in a certain basis. I need to perform a change of components of this matrix using a basis set vec(:,:), of size MxM, so as to obtain the matrices D_mod(l,:,:).
I think that the easiest way to do it is with:
D_mod = 0.0d0
do b = 1, M
   do a = 1, M
      do nu = 1, M
         do mu = 1, M
            D_mod(:,mu,nu) = D_mod(:,mu,nu) + vec(mu,a)*vec(nu,b)*D(:,a,b)
         end do
      end do
   end do
end do
Is there a way to improve the speed of this calculation (also using LAPACK/BLAS libraries)?
I was considering this approach: reshaping D into an N x M^2 matrix D_re; doing the tensor product vec(:,:) x vec(:,:) and reshaping it in order to obtain an M^2 x M^2 matrix vecsq_re(:,:) (this motivates this question); finally, doing the matrix product of these two matrices with zgemm. However, I am not sure this is a good strategy.
EDIT
I am sorry, I wrote the question too quickly and too late. The size can be up to 600000, yes, but I usually adopt strategies to reduce it by a factor of 10 at least. The code is supposed to run on nodes with 100 GB of memory.
As @IanBush has said, your D array is enormous, and you're likely to need some kind of high-memory machine or cluster of machines to process it all at once. However, you probably don't need to process it all at once.
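To put a rough number on "enormous" (assuming double-complex elements, since zgemm is mentioned, so 16 bytes per element):

N \times M \times M \times 16\,\mathrm{B} = 600000 \times 400 \times 400 \times 16\,\mathrm{B} \approx 1.5\,\mathrm{TB}

and D_mod is the same size again, against the 100 GB nodes mentioned in the edit; even with N reduced by a factor of 10, each array is still around 150 GB.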
Before we get to that, let's first imagine you don't have any memory issues. For the operation you have described, D looks like an array of N matrices, each of size M*M. You say you have "reshaped the array in order to give the biggest size to the first entry", but for this problem this is the exact opposite of what you want. Fortran is a column-major language, so iterating across the first index of an array is much faster than iterating across the last. In practice, this means that an example triple-loop like
do i = 1, N
   do j = 1, M
      do k = 1, M
         D(i,j,k) = D(i,j,k) + 1
      enddo
   enddo
enddo
will run much slower [1] than the re-ordered triple-loop
do k = 1, M
   do j = 1, M
      do i = 1, N
         D(i,j,k) = D(i,j,k) + 1
      enddo
   enddo
enddo
and so you can immediately speed everything up by transposing D and D_mod from N*M*M arrays to M*M*N arrays and rearranging your loops. You can also speed everything up by replacing your manually-written matrix multiplication with matmul (or BLAS/LAPACK), to give
do i = 1, N
   D_mod(:,:,i) = matmul(matmul(vec, D(:,:,i)), transpose(vec))
enddo
Now that you're doing matrix multiplication one matrix at a time, you can also find a solution for your memory usage problems: instead of loading everything into memory and trying to do everything at once, just load one D matrix at a time into an M*M array, calculate the corresponding entry of D_mod, and write it out to disk before loading the next matrix.
[1] if your compiler doesn't just optimise the loop order.
I am trying to parallelize a radix sort in C using POSIX threads. The special requirement is that the radix sort needs to work on floating-point numbers. Currently, the code runs sequentially, but I have no idea how to parallelize it. Can anyone help me with this? Any help is appreciated.
Radix sorts are pretty hard to parallelize efficiently on CPUs. There are two parts in a radix sort: the creation of the histogram and the bucket filling.
To create a histogram in parallel, you can fill a local histogram in each thread and then perform a (tree-based) reduction of the local histograms to build a global one. This strategy scales well as long as the histograms are small relative to the data chunks processed by each thread. An alternative way to parallelize this step is to use atomic adds to fill a shared histogram directly. This last method scales poorly when threads' write accesses conflict (which often happens with small histograms and many threads). Note that in both solutions the input array is evenly distributed between the threads.
Regarding the bucket-filling part, one solution is to use atomic adds to fill the buckets: one atomic counter per bucket is needed so that each thread can push items into the buckets safely. This solution only scales when threads do not often access the same bucket (bucket conflicts). It is not great in general, because the scalability of the algorithm becomes strongly dependent on the content of the input array (sequential in the worst case). There are solutions that reduce the conflicts between threads (better scalability) at the expense of more work (slower with few threads). One is to fill the buckets from both sides: threads with an even ID fill the buckets in ascending order while threads with an odd ID fill them in descending order. Note that it is important to take false sharing into account to maximize performance.
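Here is a rough sketch of that atomic bucket-filling step using C11 atomics (illustrative names; it assumes the cursor array has already been initialized with each bucket's starting offset from the global histogram, and that the usual float-to-sortable-unsigned key mapping has already been applied):

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

enum { RADIX = 256 };

/* Each thread calls this for its own contiguous chunk of the input.
   cursor[d] must start at bucket d's offset (exclusive prefix sum of the
   global histogram). The order of items inside a bucket then depends on
   how the threads interleave. */
void scatter_chunk(const uint32_t *keys, size_t chunk_len, int shift,
                   atomic_size_t cursor[RADIX], uint32_t *out)
{
    for (size_t i = 0; i < chunk_len; i++) {
        unsigned d = (unsigned)(keys[i] >> shift) & (RADIX - 1u);
        /* claim the next free slot in bucket d */
        size_t pos = atomic_fetch_add_explicit(&cursor[d], 1,
                                               memory_order_relaxed);
        out[pos] = keys[i];
    }
}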
A simple way to parallelize radix sort for all but the first pass is to use a most significant digit (MSD) pass to split up the array into bins, each of which can then be sorted concurrently. This approach relies on having somewhat uniform distribution of values, at least in terms of the most significant digit, so that the bins are reasonably equal in size.
For example, using a digit size of 8 bits (base 256), use a MSD pass to split up the array into 256 bins. Assuming there are t threads, then sort t bins at a time, using least significant digit first radix sort.
For larger arrays, it may help to use a larger initial digit size to split up the array into a larger number of bins, with the goal of getting t bins to fit in cache.
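A minimal sketch of this MSD-split-then-sort-the-bins idea in C with OpenMP (illustrative names; the keys are assumed to have already been mapped from float to sortable unsigned integers, and qsort stands in for the per-bin LSD passes to keep the sketch short):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { RADIX = 256 };

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

void msd_parallel_sort(uint32_t *keys, size_t n, uint32_t *tmp)
{
    size_t count[RADIX] = {0}, start[RADIX], cur[RADIX];

    for (size_t i = 0; i < n; i++)                /* MSD histogram */
        count[keys[i] >> 24]++;

    size_t off = 0;                               /* bin start offsets */
    for (int b = 0; b < RADIX; b++) {
        start[b] = cur[b] = off;
        off += count[b];
    }

    for (size_t i = 0; i < n; i++)                /* scatter into bins */
        tmp[cur[keys[i] >> 24]++] = keys[i];

    #pragma omp parallel for schedule(dynamic)    /* sort the 256 bins concurrently */
    for (int b = 0; b < RADIX; b++)
        qsort(tmp + start[b], count[b], sizeof *tmp, cmp_u32);

    memcpy(keys, tmp, n * sizeof *keys);
}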
Link to a non-parallelized radix sort that uses MSD for first pass, then LSD for next 3 passes. The loop at the end of RadixSort() to sort the 256 bins could be parallelized:
Radix Sort Optimization
For the first pass, you could use the parallel method in Jerome Richard's answer, but depending on the data pattern, it may not help much, due to cache and memory conflicts.
Is there a math formula that I can put in my Windows calculator to calculate the time cost of my nested loop?
From what I read, I saw things like O(n) but I don't know how to put that in a calculator.
I have a 300-level nested loop; each loop level goes from 0 to 1000.
So it is like:
for i as integer = 0 to 1000
{
    // 300 more nested loops, each goes from 0 to 1000
}
I am assuming that 1 complete loop will take 1 second (i.e. the 300 levels from the topmost loop to the innermost loop will take 1 second).
How can I find out how many seconds it will take to complete the whole loop operation?
Thank you
If you have M nested loops each with N steps, the total number of times the innermost code will run is N^M. Consequently, if this code takes t seconds to execute, the total running time will be no less than N^M * t.
In your case, N = 1001 and M = 301. Thus
N^M = 1351006135283321356815039655729366684349803831988739053240801914465955329955010318785683390381893021480038259097571556568393033624125663039020474816807139123124687688667110651554536455554983313531622053666601142890485454586020409971220427023079449603217442510854172465669657551198374621035716253537483681789962979899381794117593366167602159102878741666110186140811771157661856975727227011774938173689257851684515033630838203428990519981303994460521486651131205651440789014004132287034501678194895276766533222238644463714676717122486815519675003623218747516074833762258750872602504763945523812443905022112877989765178585281722001229086777672022301235387486256207147702409003911671750616133700186863997717468027217550738860605995255072363983914048955786591110292838269781981812167453230775137941595147968127169264749956845265384891251656562329411331097495063637387550896404766837508698223985774995150301001.
If, in addition, we assume that the innermost code is really fast, like t = 1 µs, the total running time will be no less than N^M * t = 1.35... × 10^897 s.
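Since an ordinary calculator cannot hold N^M itself, the practical calculator formula is to work with logarithms and recover only the order of magnitude:

\log_{10}\!\left(1001^{301}\right) = 301 \cdot \log_{10}(1001) \approx 903.13
\quad\Rightarrow\quad 1001^{301} \approx 1.35 \times 10^{903}

With t = 1 µs per innermost iteration, that is about 1.35 × 10^897 seconds, or very roughly 4 × 10^889 years.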
When the universe comes to its end, your code will be at almost exactly 0.0000000000000000000000000000 % completion, assuming the hardware hasn't faulted by then and the human civilization hasn't come to an end (thus making it difficult for you to find the electrical power you need to run your hardware).
It seems to me that comb sort should also run in sub-quadratic time, just like shell sort, because comb sort relates to bubble sort the way shell sort relates to insertion sort: shell sort sorts the array over a sequence of gaps using insertion sort, and comb sort sorts the array over a sequence of gaps using bubble sort. So what is the running time of comb sort?
(This question has been unanswered for a while, so I'm converting my comment into an answer.)
Although there are similarities between shell sort and comb sort, the average-case runtime of comb sort is O(n^2). Proving this is a bit tricky, and the technique that I've seen used to prove it is the incompressibility method, an information-theoretic technique involving Kolmogorov complexity.
Hope this helps!
With what sequence of increments?
If the increments are chosen to be: the set of all numbers of the form (2^p * 3^q), that are less than N, then, yes, the running time is better than quadratic (it's proportional to N times the square of the logarithm of N). With that set of increments, Combsort performs exactly the same exchanges as a Shellsort using the same increments (the "Pratt sequence"). But that's not what people usually have in mind when they're talking about Combsort.
In theory...
With increments that are decreasing geometrically (e.g. on each pass over the input the increment is, say, about 80% of the previous increment), which is what people usually mean when they talk about Combsort... yes, asymptotically, it is quadratic in both the worst-case and the average case. But...
In practice...
So long as the increments are relatively prime and the ratio between one increment and the next is sensible (80% is fine), n has to be astronomically large before the average running time will be much more than n log(n). I've sorted hundreds of millions of records at a time with Combsort, and I've only ever seen quadratic running times when I've deliberately engineered them by constructing "killer inputs". In practice, with relatively prime increments (and a ratio between adjacent increments of 1.25:1), even for millions of records, Combsort requires, on average, about 3 times as many comparisons as a mergesort and typically takes between 2 and 3 times as long to run.
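For reference, here is a textbook Combsort sketch in C using the geometric gap shrink discussed above (each gap is 80% of the previous one, i.e. a 1.25:1 ratio); it is only an illustration, not code taken from the answer:

#include <stddef.h>

void comb_sort(int *a, size_t n)
{
    size_t gap = n;
    int swapped = 1;

    while (gap > 1 || swapped) {
        gap = (gap * 4) / 5;                  /* shrink by ~1.25:1 per pass */
        if (gap == 0)
            gap = 1;
        /* (the answer above also recommends keeping successive gaps
           relatively prime; this sketch does not bother) */
        swapped = 0;
        for (size_t i = 0; i + gap < n; i++) {
            if (a[i] > a[i + gap]) {          /* gapped bubble-sort exchange */
                int t = a[i];
                a[i] = a[i + gap];
                a[i + gap] = t;
                swapped = 1;
            }
        }
    }
}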
I'm trying to find the average value of an array of floating point values using multiple threads on a single machine. I'm not concerned with the size of the array or memory constraints (assume a moderately sized array, large enough to warrant multiple threads). In particular, I'm looking for the most efficient scheduling algorithm. It seems to me that a static block approach would be the most efficient.
So, given that I have x machine cores, it would seem reasonable to chunk the array into array.size/x values and have each core sum the results for their respective array chunk. Then, the summed results from each core are added and the final result is this value divided by the total number of array elements (note: in the case of the # of array elements not being exactly divisible by x, I am aware of the optimization to distribute the elements as evenly as possible across the threads).
The array will obviously be shared between threads, but since there are no writes involved, I won't need to involve any locking mechanisms or worry about synchronization issues.
My question is: is this actually the most efficient approach for this problem?
In contrast, for example, consider the static interleaved approach. In this case, if I had four cores (threads), then thread one would operate on array elements 0, 4, 8, 12... whereas thread two would operate on elements 1, 5, 9, 13... This would seem worse since each core would be continually getting cache misses, whereas the static block approach means that each core operates on successive values and takes advantage of data locality. Some tests that I've run seem to back this up.
So, can anyone point out a better approach than static block, or confirm that this is most likely the best approach?
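For concreteness, here is a sketch of the static-block scheme described above (written in C with pthreads rather than Java; the partitioning arithmetic is the point, and the names and the thread-count cap are illustrative):

#include <pthread.h>
#include <stddef.h>

typedef struct {
    const double *data;
    size_t begin, end;        /* this thread's contiguous block [begin, end) */
    double partial;           /* this thread's partial sum */
} chunk_t;

static void *sum_chunk(void *arg)
{
    chunk_t *c = arg;
    double s = 0.0;
    for (size_t i = c->begin; i < c->end; i++)
        s += c->data[i];
    c->partial = s;
    return NULL;
}

double parallel_average(const double *data, size_t n, int nthreads)
{
    pthread_t tid[64];        /* sketch assumes nthreads <= 64 */
    chunk_t chunk[64];
    size_t base = n / nthreads, extra = n % nthreads, pos = 0;

    for (int t = 0; t < nthreads; t++) {
        size_t len = base + (t < (int)extra ? 1 : 0);   /* spread the remainder */
        chunk[t] = (chunk_t){ data, pos, pos + len, 0.0 };
        pos += len;
        pthread_create(&tid[t], NULL, sum_chunk, &chunk[t]);
    }

    double total = 0.0;
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tid[t], NULL);
        total += chunk[t].partial;
    }
    return total / (double)n;
}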
Edit:
I'm using Java and Linux (Ubuntu). I'm trying not to think too much about the language/platform involved, and just look at the problem abstractly from a scheduling point of view that involves manually assigning workload to multiple threads. But I understand that the language and platform are important factors.
Edit-2:
Here are some timings (nanoseconds / 1000, i.e. microseconds) for varying array sizes (arrays of doubles).
Sequential timings used a single java thread.
The others implemented their respective scheduling strategies using all available (4) cores working in parallel.
1,000,000 elements:
Sequential:          5765, 1642, 1565, 1485, 1444, 1511, 1446, 1448, 1465, 1443
Static Block:        15857, 4571, 1489, 1529, 1547, 1496, 1445, 1415, 1452, 1661
Static Interleaved:  9692, 4578, 3071, 7204, 5312, 2298, 4518, 2427, 1874, 1900

50,000,000 elements:
Sequential:          73757, 69280, 70255, 78510, 74520, 69001, 69593, 69586, 69399, 69665
Static Block:        62827, 52705, 55393, 53843, 57408, 56276, 56083, 57366, 57081, 57787
Static Interleaved:  179592, 306106, 239443, 145630, 171871, 303050, 233730, 141827, 162240, 292421
Your system doesn't seem to have the memory bandwidth to take advantage of 4 threads on this problem. Doing floating point addition of elements is just not enough work to keep the CPU busy at the rate memory can deliver data... your 4 cores are sharing the same memory controller/DRAM... and are waiting on memory. You probably will see the same speedup if you use 2 threads instead of 4.
Interleaving is a bad idea, as you said and as your tests confirmed: why waste precious memory bandwidth bringing data into a core and then only use one fourth of it? If you are lucky and the threads run somewhat in sync, you will get some reuse of data in the Level 2 or Level 3 cache, but you will still bring data into the L1 cache and use only a fraction of it.
Update: when adding 50 million elements, one concern is loss of precision. Log base 2 of 50 million is about 26 bits, while double-precision floating point has 53 effective bits of mantissa (52 explicit and 1 implied). The best case is when all elements have similar exponents (similar magnitudes). Things get worse if the numbers in the array span a large range of exponents; the worst case is when the range is large and the numbers are sorted in descending order of magnitude. The precision of your final average can be improved by sorting the array and adding in ascending order, as sketched below. See this SO question for more insight into the precision problem when adding a large number of items: Find the average within variable number of doubles.
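A minimal sketch of that last suggestion, sorting by magnitude and accumulating in ascending order (plain C, illustrative names, and sequential on purpose since it is about accuracy rather than speed):

#include <math.h>
#include <stdlib.h>

static int cmp_abs(const void *a, const void *b)
{
    double x = fabs(*(const double *)a), y = fabs(*(const double *)b);
    return (x > y) - (x < y);
}

double careful_average(double *a, size_t n)
{
    qsort(a, n, sizeof *a, cmp_abs);     /* smallest magnitudes first */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum / (double)n;
}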