MPI partition a matrix into smaller matrices - c

The situation is this:
I have a 4x4 matrix that I have to partition into "blocks" (i.e. smaller matrices) and distribute to the "slave" processes. More specifically, suppose the total number of processes is 4 (1 master, 3 slaves, and all of them take part in the computation), which means partitioning the 4x4 matrix into four 2x2 matrices. However, I would like to avoid allocating separate 2x2 "buffers" for this.
The question is: is there any "clever", more "painless" way to manage this?
PS: I have to solve this problem http://www.cas.usf.edu/~cconnor/parallel/2dheat/2dheat.html, which means that a Cartesian communicator will be created.

This is essentially how (plain) MPI works. The 2×2 matrices constitute a distributed data structure. Together, they comprise the actual 4×4 matrix. You could of course also use four 1×4 or 4×1 matrices, which has some advantages (easier programming) and disadvantages (more communication needed when scaling up).
In actual problems, such as a 2D heat equation, you often need to consider the halo around each local matrix. This halo is then exchanged at each simulation step.
Note that the code you linked uses full-sized matrices on each worker rank. This is a simplification, but it wastes resources and is thus not scalable.
MPI gives you some help managing this distributed data, for instance via Cartesian communicators, or one-sided communication for easier halo exchange, but essentially you have to manage the distributed data structure yourself.
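Below is a minimal sketch (not the code from the linked page) of how a derived datatype lets rank 0 send 2×2 blocks straight out of the 4×4 matrix, so no extra packing buffers are needed on the sending side. The block-to-rank mapping, the master keeping block (0,0), and a fixed process count of 4 are assumptions of the example; each rank still owns its local 2×2 block, which is exactly the distributed data structure described above.

    /* Sketch: rank 0 scatters 2x2 blocks of a 4x4 row-major matrix using
     * an MPI derived datatype, so nothing is packed into staging buffers
     * on the sender side. Assumes exactly 4 processes. */
    #include <mpi.h>
    #include <stdio.h>

    #define N 4   /* full matrix edge */
    #define B 2   /* block edge       */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);        /* assumed to be 4 here */

        /* B rows of B doubles each, separated by a full row of N doubles */
        MPI_Datatype blocktype;
        MPI_Type_vector(B, B, N, MPI_DOUBLE, &blocktype);
        MPI_Type_commit(&blocktype);

        double local[B][B];                          /* every rank's own block */

        if (rank == 0) {
            double A[N][N];
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    A[i][j] = i * N + j;

            /* block (br, bc) goes to rank br*2 + bc; rank 0 keeps block (0,0) */
            for (int r = 1; r < size; r++) {
                int br = r / 2, bc = r % 2;
                MPI_Send(&A[br * B][bc * B], 1, blocktype, r, 0, MPI_COMM_WORLD);
            }
            for (int i = 0; i < B; i++)
                for (int j = 0; j < B; j++)
                    local[i][j] = A[i][j];
        } else {
            /* the vector type describes B*B doubles, so a plain 2x2 buffer matches */
            MPI_Recv(local, B * B, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        printf("rank %d holds the block starting with %g\n", rank, local[0][0]);

        MPI_Type_free(&blocktype);
        MPI_Finalize();
        return 0;
    }

The same derived type also works with MPI_Scatterv (after resizing its extent), which scales better than point-to-point sends when the process grid grows.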
There are parallel paradigms that provide higher-level abstractions of distributed data structures, but even an overview would IMHO be too broad for this format. Many of them are related to the Partitioned Global Address Space (PGAS) concept. Implementations range from new languages and language extensions (Coarray Fortran) to libraries and frameworks. Some use MPI internally.

Related

Why is the array of complex numbers declared row-wise in FFTW?

The FFTW manual states (page 4):
The data is an array of type fftw_complex, which is by default a double[2] composed of the real (in[i][0]) and imaginary (in[i][1]) parts of a complex number.
For doing FFT of a time series of n samples, the size of the matrix becomes n rows and 2 columns. If we wish to do element-by-element manipulation, isn't accessing in[i][0] for different values of i slow for large values of n since C stores 2D arrays in row-major format?
The real and imaginary parts are stored consecutively in memory (assuming a little-endian layout where byte 0 of R0 is at the smallest address):
I(n-1),R(n-1) | ... | I1,R1 | I0,R0
That means it's possible to copy element i into place while accessing a single cache line (usually 64 bytes today), as the real and imaginary parts are adjacent. If you stored the 2D array in Fortran (column-major) order and wanted to assign to one element, you would immediately touch memory on two different cache lines, as the two parts would be stored N*sizeof(double) locations apart in memory (see row- and column-major order).
Now if your processing were, for some reason, operating on just the real parts in one thread and the imaginary parts separately in another, then yes, it would be more efficient to store them in column-major order, or even as separate parallel arrays. In general, though, data is stored close together because it is used together.
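As a minimal sketch of that layout (assuming FFTW 3; the array length is arbitrary), note how an element-wise pass touches in[i][0] and in[i][1] together, i.e. within one cache line per element:

    /* fftw_complex is double[2]: in[i][0] is the real part, in[i][1] the
     * imaginary part, stored adjacently. Build with: cc file.c -lfftw3 */
    #include <fftw3.h>

    int main(void)
    {
        const int n = 1024;
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

        for (int i = 0; i < n; i++) {   /* both parts of element i are adjacent */
            in[i][0] = (double)i;       /* real part      */
            in[i][1] = 0.0;             /* imaginary part */
        }

        fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_execute(p);

        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }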
All arrays in C are really single-dimensional arrays of bytes, unless you store an array of pointers to arrays, which is usually done for things like strings of varying lengths.
Sometimes in matrix calculations it's actually faster to first transpose one array, because of the access pattern of matrix multiplication. It's involved, but if you want the real nitty-gritty details, search for Ulrich Drepper's article about memory at LWN.net, which shows an example that benefits from this technique (section 5, IIRC).
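A rough sketch of that transpose trick in C (the fixed size and the function name are just for illustration; this is not the example from the article): the naive inner loop would read B with a stride of N doubles, while transposing B first makes both operands of the inner product walk memory sequentially.

    /* C = A * B for N x N row-major matrices, with B transposed up front
     * so the innermost loop reads both operands contiguously. */
    #include <stdlib.h>

    #define N 512

    static void matmul_transposed(const double *A, const double *B, double *C)
    {
        double *Bt = malloc((size_t)N * N * sizeof *Bt);

        /* Bt[j][k] = B[k][j]: the strided access is paid once here,
         * then each row of Bt is reused N times below */
        for (size_t k = 0; k < N; k++)
            for (size_t j = 0; j < N; j++)
                Bt[j * N + k] = B[k * N + j];

        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < N; k++)           /* both rows walk   */
                    sum += A[i * N + k] * Bt[j * N + k]; /* memory in order  */
                C[i * N + j] = sum;
            }

        free(Bt);
    }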
Very often, scientific numeric libraries have worked in column-major order, because Fortran compatibility was more important than using the arrays in the natural way. Most languages prefer row-major order, as it's generally more convenient, for instance when you store fixed-length strings in a table.

Automated sparse matrices in Fortran

I know that Intel Fortran has libraries with functions and subroutines for working with sparse matrices, but I'm wondering if there is also some sort of data type or automated method for creating the sparse matrices in the first place.
BACKGROUND: I have a program that uses some 3 & 4 dimensional arrays that can be very large in the first 2 dimensions (~10k to ~100k elements in each dimension, maybe more). In the first 2 dimensions, each array is mostly (95% or so) populated with zeroes. To make the program friendly to machines with a "normal" amount of RAM available, I'd like to convert to sparse matrices. The manner in which the current conventional arrays are handled & updated throughout the code is pretty dependent on the code application, so I'm looking for a way to convert to sparse matrix storage without significant modification to the code. Basically, I'm lazy, and I don't want to revise the entire memory management implementation or write an entire new module where my arrays live and are managed. Is there a library or something else for Fortran that would implement a data type or something so that I can use sparse matrix storage without re-engineering each array and how it is handled? Thanks for the help. Cheers.
There are many different sparse formats and many different libraries for handling sparse matrices in Fortran (e.g. SPARSKIT, PETSc, ...). However, none of them can offer the compact array-handling formalism that is available in Fortran for intrinsic dense arrays (especially the subarray notation). So you'll have to touch your code in several places when you change it to use sparse matrices.

Per-thread hashtable-like data structure implementation in CUDA

Short version of my question:
I have a CUDA program where each thread needs to store numbers in different "bins", and I identify each of these bins by an integer. For a typical run of my program, each CUDA thread might only store numbers in 100 out of millions of bins, so I'd like to know if there is a data structure other than an array that would allow me to hold this data. Each thread would have its own copy of this structure. If I were programming in Python, I would just use a dictionary where the bin numbers are the keys, for example mydict[0] = 1.0, mydict[2327632] = 3.0, and then at the end of the run I would look at the keys and do something with them (and ignore the bins where no numbers are stored in them since they aren't in the dictionary). I tried implementing a hash table for every thread in my cuda program and it killed performance.
Long version:
I have a CUDA Monte Carlo simulation which simulates the transport of particles through a voxelized (simple volume elements) geometry. The particles deposit energy during their transport and this energy is tallied on a voxel-per-voxel basis. The voxels are represented as a linearized 3D grid which is quite large, around 180^3 elements. Each CUDA thread transports 1-100 particles and I usually try to maximize the number of threads that I spawn my kernel with. (Currently, I use 384*512 threads). The energy deposited in a given voxel is added to the linearized 3d grid which resides in global memory through atomicAdd.
I'm running into some problems with a part of my simulation which involves calculating uncertainties in my simulation. For a given particle, I have to keep track of where (which voxel indices) it deposits energy, and how much energy for a given voxel, so that I can square this number at the end of the particle transport before moving on to a new particle. Since I assign each thread one (or a few) particle, this information has to be stored at a per-thread scope. The reason I only run into this problem with uncertainty calculation is that energy deposition can just be done as an atomic operation to a global variable every time a thread has to deposit energy, but uncertainty calculation has to be done at the end of a particle's transport, so I have to somehow have each thread keep track of the "history" of their assigned particles.
My first idea was to implement a hash table whose key would be the linearized voxel index, and value would be energy deposited, and I would just square every element in that hash table and add it to a global uncertainty grid after a particle is done transporting. I tried to implement uthash but it destroyed the performance of my code. I'm guessing it caused a huge amount of thread divergence.
I could simply use two dynamic arrays where one stores the voxel index and the other would store the energy deposited for that voxel, but I am thinking that it would also be very bad for performance. I'm hoping that there is a data structure that I don't know about which would lend itself well to being used in a CUDA program. I also tried to include many details in case I am completely wrong in my approach to the problem.
Thank you
Your question is a bit heavy on jargon. If you can distill out the science and leave just the computer science, you might get more answers.
There have been CUDA hash tables implemented. The work at that link will be included in the 2.0 release of the CUDPP library. It is already working in the SVN trunk of CUDPP, if you would like to try it.
That said, if you really only need per-thread storage, and not shared storage, you might be able to do something much simpler, like some per-thread scratch space (in shared or global memory) or a local array.

What Haskell representation is recommended for 2D, unboxed pixel arrays with millions of pixels?

I want to tackle some image-processing problems in Haskell. I'm working with both bitonal (bitmap) and color images with millions of pixels. I have a number of questions:
On what basis should I choose between Vector.Unboxed and UArray? They are both unboxed arrays, but the Vector abstraction seems heavily advertised, particularly around loop fusion. Is Vector always better? If not, when should I use which representation?
For color images I will wish to store triples of 16-bit integers or triples of single-precision floating-point numbers. For this purpose, is either Vector or UArray easier to use? More performant?
For bitonal images I will need to store only 1 bit per pixel. Is there a predefined datatype that can help me here by packing multiple pixels into a word, or am I on my own?
Finally, my arrays are two-dimensional. I suppose I could deal with the extra indirection imposed by a representation as "array of arrays" (or vector of vectors), but I'd prefer an abstraction that has index-mapping support. Can anyone recommend anything from a standard library or from Hackage?
I am a functional programmer and have no need for mutation :-)
For multi-dimensional arrays, the current best option in Haskell, in my view, is repa.
Repa provides high performance, regular, multi-dimensional, shape polymorphic parallel arrays. All numeric data is stored unboxed. Functions written with the Repa combinators are automatically parallel provided you supply +RTS -Nwhatever on the command line when running the program.
Recently, it has been used for some image processing problems:
Real time edge detection
Efficient Parallel Stencil Convolution in Haskell
I've started writing a tutorial on the use of repa, which is a good place to start if you already know Haskell arrays, or the vector library. The key stepping stone is the use of shape types instead of simple index types, to address multidimensional indices (and even stencils).
The repa-io package includes support for reading and writing .bmp image files, though support for more formats is needed.
Addressing your specific questions, here is some discussion:
On what basis should I choose between Vector.Unboxed and UArray?
They have approximately the same underlying representation; the primary difference is the breadth of the API for working with vectors: Vector has almost all the operations you'd normally associate with lists (with a fusion-driven optimization framework), while UArray has almost no API.
For color images I will wish to store triples of 16-bit integers or triples of single-precision floating-point numbers.
UArray has better support for multi-dimensional data, as it can use arbitrary data types for indexing. While this is possible in Vector (by writing an instance of UA for your element type), it isn't the primary goal of Vector -- instead, this is where Repa steps in, making it very easy to use custom data types stored in an efficient manner, thanks to the shape indexing.
In Repa, your triple of shorts would have the type:
Array DIM3 Word16
That is, a 3D array of Word16s.
For bitonal images I will need to store only 1 bit per pixel.
UArrays pack Bools as bits; Vector uses the instance for Bool, which does not do bit packing, instead using a representation based on Word8. However, it is easy to write a bit-packing implementation for vectors -- here is one, from the (obsolete) uvector library. Under the hood, Repa uses Vectors, so I think it inherits that library's representation choices.
Is there a predefined datatype that can help me here by packing multiple pixels into a word
You can use the existing instances for any of the libraries, for different word types, but you may need to write a few helpers using Data.Bits to roll and unroll packed data.
Finally, my arrays are two-dimensional
UArray and Repa support efficient multi-dimensional arrays. Repa also has a rich interface for doing so. Vector on its own does not.
Notable mentions:
hmatrix, a custom array type with extensive bindings to linear algebra packages. It should ideally be bound to use the vector or repa types.
ix-shapeable, getting more flexible indexing from regular arrays
chalkboard, Andy Gill's library for manipulating 2D images
codec-image-devil, read and write various image formats to UArray
I once reviewed the features of Haskell array libraries that matter to me and compiled a comparison table (only a spreadsheet: direct link). So I'll try to answer.
On what basis should I choose between Vector.Unboxed and UArray? They are both unboxed arrays, but the Vector abstraction seems heavily advertised, particularly around loop fusion. Is Vector always better? If not, when should I use which representation?
UArray may be preferred over Vector if one needs two-dimensional or multi-dimensional arrays. But Vector has a nicer API for manipulating, well, vectors. In general, Vector is not well suited for simulating multi-dimensional arrays.
Vector.Unboxed cannot be used with parallel strategies. I suspect that UArray cannot be used either, but at least it is very easy to switch from UArray to a boxed Array and see if the parallelization benefits outweigh the boxing costs.
For color images I will wish to store triples of 16-bit integers or triples of single-precision floating-point numbers. For this purpose, is either Vector or UArray easier to use? More performant?
I tried using Arrays to represent images (though I needed only grayscale images). For color images I used the Codec-Image-DevIL library to read/write images (bindings to the DevIL library); for grayscale images I used the pgm library (pure Haskell).
My major problem with Array was that it provides only random-access storage; it doesn't provide many means of building Array algorithms, nor does it come with ready-to-use libraries of array routines (it doesn't interface with linear algebra libraries, and doesn't let you express convolutions, FFTs or other transforms).
Almost every time a new Array has to be built from an existing one, an intermediate list of values has to be constructed (as in the matrix multiplication from the Gentle Introduction). The cost of array construction often outweighs the benefit of faster random access, to the point that a list-based representation is faster in some of my use cases.
STUArray could have helped me, but I didn't like fighting with cryptic type errors and the effort necessary to write polymorphic code with STUArray.
So the problem with Arrays is that they are not well suited for numerical computations. hmatrix's Data.Packed.Vector and Data.Packed.Matrix are better in this respect, because they come along with a solid matrix library (attention: GPL license). Performance-wise, on matrix multiplication, hmatrix was sufficiently fast (only slightly slower than Octave), but very memory-hungry (it consumed several times more than Python/SciPy).
There is also the blas library for matrices, but it doesn't build on GHC 7.
I don't have much experience with Repa yet, and I don't understand the repa code well. From what I see, it has a very limited range of ready-to-use matrix and array algorithms written on top of it, but at least it is possible to express the important algorithms by means of the library. For example, there are already routines for matrix multiplication and for convolution in repa-algorithms. Unfortunately, it seems that convolution is currently limited to 7×7 kernels (not enough for me, but it should suffice for many uses).
I didn't try the Haskell OpenCV bindings. They should be fast, because OpenCV is really fast, but I am not sure if the bindings are complete and good enough to be usable. Also, OpenCV by its nature is very imperative, full of destructive updates. I suppose it's hard to design a nice and efficient functional interface on top of it. If one goes the OpenCV way, one is likely to use the OpenCV image representation everywhere and use OpenCV routines to manipulate it.
For bitonal images I will need to store only 1 bit per pixel. Is there a predefined datatype that can help me here by packing multiple pixels into a word, or am I on my own?
As far as I know, unboxed arrays of Bools take care of packing and unpacking bit vectors. I remember looking at the implementation of arrays of Bools in other libraries, and I didn't see this elsewhere.
Finally, my arrays are two-dimensional. I suppose I could deal with the extra indirection imposed by a representation as "array of arrays" (or vector of vectors), but I'd prefer an abstraction that has index-mapping support. Can anyone recommend anything from a standard library or from Hackage?
Apart from Vector (and simple lists), all the other array libraries are capable of representing two-dimensional arrays or matrices. I suppose they avoid unnecessary indirection.
Although this doesn't exactly answer your question and isn't really even Haskell as such, I would recommend taking a look at the CV or CV-combinators libraries on Hackage. They bind many of the rather useful image-processing and vision operators from the OpenCV library and make working with machine vision problems much faster.
It would be rather great if someone figured out how repa or some such array library could be used directly with OpenCV.
Here is a new Haskell image-processing library that can handle all of the tasks in question and much more. Currently it uses the Repa and Vector packages for its underlying representations, and consequently inherits fusion, parallel computation, mutation and most of the other goodies that come with those libraries. It provides an easy-to-use interface that is natural for image manipulation:
2D indexing and unboxed pixels with arbitrary precision (Double, Float, Word16, etc.)
all essential functions like map, fold, zipWith, traverse ...
support for various color spaces: RGB, HSI, gray scale, Bi-tonal, Complex, etc.
common image processing functionality:
Binary morphology
Convolution
Interpolation
Fourier transform
Histogram plotting
etc.
Ability to treat pixels and images as regular numbers.
Reading and writing common image formats through the JuicyPixels library
Most importantly, it is a pure Haskell library, so it does not depend on any external programs. It is also highly extensible: new color spaces and image representations can be introduced.
One thing it does not do is pack multiple binary pixels into a Word; instead it uses a Word per binary pixel. Maybe in the future...

Matrix operations in CUDA

What is the best way to organize matrix operations in CUDA (in terms of performance)?
For example, I want to calculate C * C^(-1) * B^T + C, where C and B are matrices.
Should I write separate functions for multiplication, transposition and so on or write one function for the whole expression?
Which way is the fastest?
I'd recommend you use the CUBLAS library. It's normally much faster and more reliable than anything you could write on your own. In addition, its API is similar to the BLAS library, which is the standard library for numerical linear algebra.
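As a minimal host-side sketch of that approach (assuming the cuBLAS v2 API; the matrix contents and sizes are made up, and the C^(-1) factor is omitted since it needs extra factorization routines, e.g. from cuSOLVER): a single cublasDgemm call can fuse the multiplication, the transpose and the final addition by pre-loading the output with C and using beta = 1. Note that cuBLAS expects column-major storage.

    /* D = A * B^T + D, with D pre-loaded with C, in one cuBLAS call.
     * Build with: nvcc file.c -lcublas  (or cc ... -lcublas -lcudart) */
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        const int n = 4;                          /* small n just for the sketch */
        const size_t bytes = (size_t)n * n * sizeof(double);

        double hA[16], hB[16], hC[16];
        for (int i = 0; i < 16; i++) { hA[i] = i; hB[i] = 1.0; hC[i] = 0.5; }

        double *dA, *dB, *dD;
        cudaMalloc((void **)&dA, bytes);
        cudaMalloc((void **)&dB, bytes);
        cudaMalloc((void **)&dD, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dD, hC, bytes, cudaMemcpyHostToDevice);   /* D starts as C */

        cublasHandle_t handle;
        cublasCreate(&handle);

        const double alpha = 1.0, beta = 1.0;
        /* D = alpha * A * B^T + beta * D  (all n x n, column-major, ld = n) */
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                    n, n, n, &alpha, dA, n, dB, n, &beta, dD, n);

        cudaMemcpy(hC, dD, bytes, cudaMemcpyDeviceToHost);
        printf("D[0][0] = %g\n", hC[0]);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dD);
        return 0;
    }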
I think the answer depends heavily on the size of your matrices.
If you can fit a matrix in shared memory, I would probably use a single block to compute it and keep everything inside a single kernel (probably a bigger one, where this computation is only a part of it). Hopefully, if you have more matrices and you need to compute the above equation several times, you can do them in parallel, utilising all of the GPU's computing power.
However, if your matrices are much bigger, you will want more blocks to compute them (check the matrix multiplication example in the CUDA manual). You need a guarantee that the multiplication is finished by all blocks before you proceed with the next part of your equation, and in that case you will need a kernel call for each of your operations.
