MPI - finding the maximum in a matrix column - C

I have sequential code that looks for the maximum value in matrix columns. Because this matrix could be as large as 5000 x 5000, I am thinking of speeding it up with MPI. I don't know how to achieve this yet, but I looked up the functions MPI_Scatter for distributing items from the columns (maybe block mapping) and MPI_Gather for collecting the max values from all processes (in my case at most 3 processes) and then comparing them... Do you think this could reduce the computing time? If so, can someone give me a kick-off?

Is all you want to do to find the maximum entry in the matrix (or within a part of the matrix)?
If so, the easiest way is probably to split the matrix up among the processes, search for the maximum value in the part each process is assigned, and then combine the results using MPI_Allreduce, which can take a variable that holds a different value in each process and distribute the maximum of those values to all processes.
Whether you are dealing with a whole matrix or just a column, this technique can always be applied; you just have to think about a good way of splitting the work among the processes.
Of course this will speed up your computation only from a certain matrix size upwards. If you are dealing with a 10 x 10 matrix and want to split it among 3 processes, I assume the overhead of MPI is larger than the gain from parallelization. :)
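The split-then-reduce pattern the answer describes can be sketched as follows. This is plain Python simulating the ranks in a single process, purely to show the decomposition; real MPI code would use MPI_Scatter to distribute the row blocks and MPI_Allreduce with the MPI_MAX operation to combine the local maxima.

```python
# Pure-Python simulation of the scatter / local-max / allreduce pattern.
# In real MPI code each "rank" is a separate process: MPI_Scatter hands
# out the blocks and MPI_Allreduce(..., MPI_MAX, ...) combines the
# local maxima so every rank ends up with the global maximum.

def split_rows(matrix, nprocs):
    """Block mapping: split the matrix into roughly equal row blocks."""
    n = len(matrix)
    chunk = (n + nprocs - 1) // nprocs  # ceiling division
    return [matrix[i:i + chunk] for i in range(0, n, chunk)]

def parallel_max(matrix, nprocs=3):
    blocks = split_rows(matrix, nprocs)
    # Each rank finds the maximum of its own block (the parallel work)...
    local_maxima = [max(max(row) for row in block) for block in blocks]
    # ...then the MPI_MAX reduction combines the per-rank results.
    return max(local_maxima)
```

Per-column maxima work the same way: scatter blocks of columns instead of rows and reduce each column's local maximum.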

Related

Is there a way to perform 2D convolutions with strides using the Accelerate library in Swift?

I am trying to perform a specific downsampling process. It is described by the following pseudocode.
//Let V be an input image with dimension of M by N (row by column)
//Let U be the destination image of size floor((M+1)/2) by floor((N+1)/2)
//The floor function is to emphasize the rounding for the even dimensions
//U and V are part of a wrapper class of Pixel_FFFF vImageBuffer
for i in 0 ..< U.size.rows {
    for j in 0 ..< U.size.columns {
        U[i,j] = V[(i * 2), (j * 2)]
    }
}
The process basically takes the pixel values at every other location along both dimensions. The resulting image will be approximately half the size of the original in each dimension.
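For reference, the gather the pseudocode describes is just a stride-2 slice in both dimensions; a minimal Python sketch (ignoring the 4-channel Pixel_FFFF interleaving that the Swift code has to deal with):

```python
def downsample_2x(V):
    """U[i][j] = V[2*i][2*j]: keep every other pixel in both dimensions.
    For an M x N input the result is floor((M+1)/2) x floor((N+1)/2),
    which handles both even and odd sizes."""
    return [row[::2] for row in V[::2]]
```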
On a one-time call, the process is relatively fast running by itself. However, it becomes a bottleneck when the code is called numerous times inside a bigger algorithm. Therefore, I am trying to optimize it. Since I use Accelerate in my app, I would like to be able to adapt this process in a similar spirit.
Attempts
First, this process can be easily done by a 2D convolution using the 1x1 kernel [1] with a stride [2,2]. Hence, I considered the function vImageConvolve_ARGBFFFF. However, I couldn't find a way to specify the stride. This function would be the best solution, since it takes care of the image Pixel_FFFF structure.
Second, I noticed that this is merely transferring data from one array to another. So I thought the vDSP_vgathr function would be a good solution. However, I hit a wall: the result of vectorizing a vImageBuffer is the interleaved structure A,R,G,B,A,R,G,B,..., where each component is 4 bytes. vDSP_vgathr transfers every 4 bytes to the destination array using a specified indexing vector. I could use a linear indexing formula to build such a vector, but, considering both even and odd dimensions, generating the indexing vector would be as inefficient as the original solution; it would require loops.
Also, neither of the vDSP 2D convolution functions fits this problem.
Are there any other functions in Accelerate that I might have overlooked? I saw that there is a stride option in the vDSP 1D convolution functions. Does someone know an efficient way to translate a strided 2D convolution into a 1D convolution?

Large matrices with a certain structure: how can one define where memory allocation is not needed?

Is there a way to create a 3D array for which only certain elements are defined, while the rest does not take up memory?
Context: I am running Monte-Carlo simulations in which I want to solve 10^5 matrices. All of these matrices have a majority of elements that are zero, for which I wouldn't need to spend 8 bytes of memory per element. These elements are the same for all matrices. For simplicity, I have combined all of these matrices into a 3D array, but if my matrices start to become too large, I encounter memory issues (at dimensions of 100 x 100 x 100,000, the array already takes up 8 GB of memory).
One workaround would be to store every matrix element with its 10^6 iterations in a vector, that way, no additional information needs to be stored. The inconvenience is that then I would need to work with more than 50 different vectors, and I prefer working with arrays.
Is there any way to tell R that some matrix elements don't need information?
I have been thinking that defining a new class could help for this, but since I have just discovered classes, I am not sure what all the options are. Do you think this could be a good approach? Are there specific things I should keep in mind?
I also know that there are packages made to deal with memory problems, but that did not seem like the quickest solution in terms of human and computation effort for this specific problem.
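In R itself the Matrix package's sparse classes (e.g. sparseMatrix) are the standard tool for this kind of problem. The underlying idea, storing the shared sparsity pattern once plus one value vector per matrix, can be sketched like this (Python used here for concreteness; the class and method names are illustrative):

```python
class SharedPatternMatrices:
    """Many matrices sharing one sparsity pattern: the pattern (the list
    of (row, col) positions that may be nonzero) is stored once, and each
    matrix stores only its values at those positions. Structural zeros
    cost no memory at all."""

    def __init__(self, positions):
        self.positions = list(positions)
        self.index = {p: k for k, p in enumerate(self.positions)}
        self.values = []  # one value list per matrix

    def add_matrix(self, values):
        """Append one matrix, given only its nonzero-position values."""
        assert len(values) == len(self.positions)
        self.values.append(list(values))

    def get(self, m, i, j):
        """Element (i, j) of matrix m; unlisted positions read as 0.0."""
        k = self.index.get((i, j))
        return 0.0 if k is None else self.values[m][k]
```

This is essentially the "one vector per nonzero element" workaround from the question, wrapped in a single object so you don't have to juggle 50 separate vectors by hand.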

Find the most frequent number in an array, with limited memory

How to find the most frequent number in an array? The array can be extremely large, for example 2GB and we only have limited memory, say 100MB.
I'm thinking about an external sort, i.e. sorting and then counting duplicate numbers that end up next to each other. Or a hashmap. But I don't know what to do with the limited memory, and I'm not even sure whether an external sort is a good idea for this.
In the worst case, all your numbers are distinct except for one number that appears twice, and there's no way to detect this in main memory unless you have the two duplicate numbers loaded at the same time, which is unlikely without sorting if your total data size is much larger than main memory. In that case, asymptotically the best thing to do is sort the numbers in batches, save the batches to disk as files, and then do the merge step of a merge sort, reading all the sorted files into memory a few lines at a time and writing the merged sorted list to a new file. Then you go through the aggregate sorted file in order, count how many times you see each number, and keep track of which number has occurred the most times.
If you can assume that the most frequent number has a frequency of 50% or higher, then you can do much better: you can solve the problem with constant extra memory in a single pass over the list. Start by initializing the most common value (MCV) to the first number and a counter N to 1. Then go through the list. If the next number is the MCV, increase N by one; otherwise decrease N by one. If N is 0 and the next number differs from the MCV, set MCV to the new number and set N to 1. It is easy to prove that this terminates with the most common value stored in MCV.
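The single-pass scheme described above is the Boyer-Moore majority vote, and it is short enough to sketch directly; note the result is only guaranteed correct when a true majority exists:

```python
def majority_candidate(stream):
    """One pass, constant extra memory (Boyer-Moore majority vote).
    If some value occupies strictly more than half the stream, it is
    returned; otherwise the result is only a candidate and must be
    verified with a second counting pass."""
    mcv, n = None, 0
    for x in stream:
        if n == 0:
            mcv, n = x, 1      # adopt a new candidate
        elif x == mcv:
            n += 1             # candidate reinforced
        else:
            n -= 1             # candidate cancelled by a different value
    return mcv
```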
Depending on what the hypotheses are, an even better way of doing it might be using the MJRTY algorithm:
http://www.cs.utexas.edu/~moore/best-ideas/mjrty/
Or its generalization:
http://www.cs.yale.edu/homes/el327/datamining2011aFiles/ASimpleAlgorithmForFindingFrequentElementsInStreamsAndBags.pdf
The idea is that with exactly two variables (a counter and a value store) you can determine, if there exists a majority element (one appearing strictly more than 50% of the time), what that element is. The generalization requires k+1 counters and value stores to find the elements appearing more than 100/k % of the time.
Because these are only candidates for the majority (if there are k-majority elements, those are they; but if there are none, these are just elements that happen to be there by chance), a second pass over the data lets you get the exact counts of the candidates and determine which one, if any, is a majority element.
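A minimal sketch of the generalization (Misra-Gries style; conventions for k vary between write-ups, and this sketch keeps at most k-1 counters, which guarantees that anything appearing more than n/k times survives as a candidate):

```python
def frequent_candidates(stream, k):
    """Misra-Gries sketch with at most k - 1 counters. Every element
    appearing more than n/k times in a stream of length n is guaranteed
    to be among the returned candidates (false positives are possible,
    so a second exact-counting pass decides)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # All counter slots taken: decrement every counter and
            # drop the ones that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)
```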
This is extremely fast and memory efficient.
There are a few other optimizations, but with 4 KB of memory you should be able to find the majority element of 2 GB of data with good probability, depending on the type of data you have.
Assumptions:
An integer is 4 bytes.
There are fewer than (100 MB / 4 B) = (104857600 / 4) = 26214400 distinct integers in the 2 GB array, and every number maps into the index range 0-26214399.
Let's do the histogram.
Make buckets in our 100 MB space: an integer array called histogram that can store up to 26214400 counters, all initially set to 0.
Iterate once through the 2 GB array. When you read x, do histogram[x]++.
Find the maximum in the histogram, iterating through it once. If the maximum is histogram[i], then i is the most frequent number.
The bottleneck is step 2, iterating through 2 GB array, but we do it only once.
If the second assumption doesn't hold (i.e. there are more than 26214400 distinct integers):
Make a histogram for the numbers with indices from 0 to 26214399 and keep its most frequent number. Then make a histogram for the numbers with indices from 26214400 to 52428799, and keep the more frequent of its winner and the previous one. And so on.
In the worst case, with 2^32 distinct numbers, it will do (2^32 / 26214400 + 1) = 164 iterations over that 2 GB array.
In general, it will do (NUMBER_OF_DISTINCT_NUMBERS / 26214400 + 1) iterations.
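The windowed-histogram scheme above can be sketched like this; `read_array` stands in for streaming the 2 GB file from disk, and the numbers are kept tiny so the sketch is self-contained (in the real setting `buckets` would be ~26 million):

```python
def most_frequent(read_array, value_limit, buckets):
    """Histogram the value range [0, value_limit) in windows of
    `buckets` counters, one full pass over the data per window, and
    keep the best (value, count) seen across all windows."""
    best_value, best_count = None, -1
    for lo in range(0, value_limit, buckets):
        hist = [0] * buckets
        for x in read_array():          # one pass over the big array
            if lo <= x < lo + buckets:  # only count this window's values
                hist[x - lo] += 1
        i = max(range(buckets), key=hist.__getitem__)
        if hist[i] > best_count:
            best_value, best_count = lo + i, hist[i]
    return best_value, best_count
```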
Assuming 4-byte integers, you can fit (100 MB / 4 B) = 25M counters into the available memory.
Read through your big file, counting each occurrence of any number in the range 0 ... 25M-1. Use a big array to accumulate the counts.
Find the number which occurs most frequently, store the number and its frequency and clear the array.
Read through the big file again, repeating the counting process for numbers in the range 25M ... 50M-1.
Find the number which occurs most frequently in the new array. Compare it with the number/frequency you stored in the previous step, keep the pair with the higher frequency, and clear the array.
Lather, rinse, repeat.
ETA: If we can assume that there is one single answer (no two different numbers with the same highest frequency), then you can discard all the numbers of a particular range if its array shows a tie; otherwise the problem of storing the winner for each range becomes more complex.
If you have limited memory but a reasonable amount of processing power and super fast speed isn't an issue, depending on your dataset you could try something like:
Iterate through the array counting the numbers from 1 to 1000 and keep the one with the biggest count. Then count 1001 to 2000, and keep the bigger of that batch's winner and the first batch's. Repeat until all numbers have been counted.
I'm sure there are many optimisations for this based on the specifics of your dataset.

How to know if an array is sorted?

I already read this post, but the answer didn't satisfy me: Check if Array is sorted in Log(N).
Imagine I have a seriously big array of over 1,000,000 doubles (positive and/or negative) and I want to know whether the array is "sorted", trying to avoid the maximum number of comparisons, because comparing doubles and floats takes too much time. Is it possible to use statistics for this? And if so:
Is it well regarded by real programmers?
Should I take samples?
How many samples should I take?
Should they be random, or in a sequence?
What % error is permitted to say "the array is sorted"?
Thanks.
That depends on your requirements. If you can say that 100 random samples out of 1,000,000 are enough to assume it's sorted, then so it is. But to be absolutely sure, you will always have to go through every single entry. Only you can answer this question, since only you know how certain you need to be about it being sorted.
This is a classic probability problem taught in high school. Consider this question:
In a batch of 8,000 clocks, 7% are defective. A random sample of 10 (without replacement) is selected from the 8,000 and tested. If at least one is defective, the entire batch will be rejected. What is the probability that the batch will be rejected?
So you can take a number of random samples from your large array and see if it's sorted, but you must note that you need to know the probability that the sample is out of order. Since you don't have that information, a probabilistic approach wouldn't work efficiently here.
(However, you can check 50% of the array and naively conclude that there is a 50% chance that it is sorted correctly.)
If you run a divide-and-conquer algorithm using multiprocessing (real parallelism, so only on multi-core CPUs) you can check whether an array is sorted in Log(N) time, assuming enough processors.
If you have GPU multiprocessing you can approach Log(N) quite easily, since modern graphics cards are able to run a few thousand threads in parallel.
Your question 5 is the question that you need to answer to determine the other answers. To ensure the array is perfectly sorted you must go through every element, because any one of them could be the one out of place.
The maximum number of comparisons to decide whether the array is sorted is N-1, because there are N-1 adjacent number pairs to compare. But for simplicity, we'll say N as it does not matter if we look at N or N+1 numbers.
Furthermore, it is unimportant where you start, so let's just start at the beginning.
Comparison #1 (A[0] vs. A[1]). If it fails, the array is unsorted. If it succeeds, good.
As we only compare, we can reduce each pair of neighbors to whether the left one is smaller or equal (1) or not (0). So we can treat the array as a sequence of 0's and 1's indicating whether each pair of adjacent numbers is in order.
To calculate the error rate, or the probability, we have to look at all combinations of this 0/1 sequence.
I would look at it like this: there are 2^n combinations of pair orderings, of which only one is sorted (all entries are 1, indicating that each A[i] is less than or equal to A[i+1]).
Now this seems to be simple: initially the error is 1/2^n. After the first comparison, half of the possible combinations (all of them unsorted) are eliminated, so the error rate should grow like 1/2^n + 1/2^(n-1).
I'm not a mathematician, but it should be quite easy to calculate how many comparisons are needed to reach a given error rate (find x such that ERROR >= 1/2^n + 1/2^(n-1) + ... + 1/2^(n-x)).
Sorry for the confusing English. I come from Germany.
Since any single element can be the one that is out of line, you have to run through all of them; hence your algorithm has runtime O(n).
If your understanding of "sorted" is less strict, you need to specify what exactly you mean by "sorted". Usually, "sorted" means that adjacent elements satisfy a less-than or less-than-or-equal condition.
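The exhaustive O(n) check is a single pass over the adjacent pairs:

```python
def is_sorted(a):
    """True iff every adjacent pair satisfies a[i] <= a[i + 1]."""
    return all(a[i] <= a[i + 1] for i in range(len(a) - 1))
```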
Like everyone else says, the only way to be 100% sure that it is sorted is to run through every single element, which is O(N).
However, it seems to me that if you're so worried about it being sorted, then maybe having it sorted to begin with is more important than the array elements being stored in a contiguous portion of memory?
What I'm getting at is, you could use a map whose elements by definition follow a strict weak ordering. In other words, the elements in a map are always sorted. You could also use a set to achieve the same effect.
For example, std::map<int,double> collection; would allow you to almost use it like an array: collection[0] = 3.0; std::cout << collection[0] << std::endl;. There are differences, of course, but if the sorting is so important, then an array is the wrong choice for storing the data.
The old-fashioned way: print it out and see if it's in order. Really, if your sort is wrong you would probably see it quickly; it's unlikely that you would see only a few misorders when sorting 100+ things. Whenever I deal with this, either my whole thing is completely off or it works.
As an example that you probably should not use, but that demonstrates sample size: a statistically valid sample can give you a reasonable estimate of sortedness. If you want to be 95% certain everything is sorted, you can do that by creating a list of truly random points to sample, perhaps ~1500.
Essentially this is completely pointless if the values being out of order in one single place will break subsequent algorithms or data requirements.
If this is a problem, preprocess the list before your code runs, or use a really fast sort package in your code. Most sort packages also have a validation mode that simply tells you whether the list meets your sort criteria or not. Other suggestions, like parallelizing your check with threads, are great ideas.
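If you still want the sampling estimate, it can be sketched by checking randomly chosen adjacent pairs; a failure proves the array is unsorted, while success only means no violation was found in the sample:

```python
import random

def probably_sorted(a, samples=1500, seed=None):
    """Check `samples` randomly chosen adjacent pairs. Any out-of-order
    pair proves the array is unsorted; finding none only fails to
    reject the hypothesis that the array is sorted."""
    rng = random.Random(seed)
    for _ in range(samples):
        i = rng.randrange(len(a) - 1)
        if a[i] > a[i + 1]:
            return False
    return True
```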

How to implement a huge matrix in C

I'm writing a program for a numerical simulation in C. Part of the simulation are spatially fixed nodes that have some float value to each other node. It is like a directed graph. However, if two nodes are too far away, (farther than some cut-off length a) this value is 0.
To represent all these "correlations" or float values, I tried to use a 2D array, but since I have 100,000 and more nodes, that would correspond to 40 GB of memory or so.
Now I am trying to think of different solutions to this problem. I don't want to save all these values to the hard disk, and I also don't want to calculate them on the fly. One idea was some sort of sparse matrix, like the ones one can use in Matlab.
Do you have any other ideas, how to store these values?
I am new to C, so please don't expect too much experience.
Thanks and best regards,
Jan Oliver
The average number of nodes within the cutoff distance of a given node determines your memory requirement and tells you whether you need to page to disk. The solution taking the least memory is probably a hash table that maps a pair of nodes to a distance. Since the distance is the same in each direction, you only need to enter it into the hash table once per pair: put the two node numbers in numerical order and then combine them to form a hash key. You could use the POSIX hsearch/hcreate/hdestroy functions for the hash table, although they are less than ideal.
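The order-the-pair trick for the hash key can be sketched like this (Python used for brevity; a C version would build the same canonical key for hsearch or a hand-rolled table):

```python
def pair_key(a, b):
    """Canonical key: (a, b) and (b, a) map to the same dictionary key."""
    return (a, b) if a <= b else (b, a)

class SparseCorrelations:
    """Store only the pairs within the cutoff; every absent pair reads
    as 0.0, so far-apart nodes cost no memory."""
    def __init__(self):
        self.table = {}

    def set(self, a, b, value):
        self.table[pair_key(a, b)] = value

    def get(self, a, b):
        return self.table.get(pair_key(a, b), 0.0)
```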
A sparse matrix approach sounds ideal for this. The Wikipedia article on sparse matrices discusses several approaches to implementation.
A sparse adjacency matrix is one idea, or you could use an adjacency list, allowing you to store only the edges that are closer than your cutoff value.
You could also hold a list for each node containing the other nodes this node is related to. You would then have an overall number of list entries of 2*k, where k is the number of non-zero values in the virtual matrix.
Implementing the whole system as a combination of hashes/sets/maps is still expected to be acceptable with regard to speed/performance compared to a "real" matrix allowing random access.
edit: This solution is one possible form of an implementation of a sparse matrix. (See also Jim Balter's note below. Thank you, Jim.)
You should indeed use sparse matrices if possible. In scipy we have support for sparse matrices, so you can play in Python, although to be honest sparse support still has rough edges.
If you have access to Matlab, it will definitely be better ATM.
Without using sparse matrices, you could think about using memmap-based arrays so that you don't need 40 GB of RAM, but it will still be slow, and it only really makes sense if you have a low degree of sparsity (say, if 10-20% of your 100000x100000 matrix has items in it, then full arrays will actually be faster and maybe even take less space than sparse matrices).
