Broadcasting across tensors in PyTorch

I am using PyTorch as an array-processing language (not for the traditional deep learning purposes), and I am wondering what the canonical way is to do "batching" parallelism.
For example, suppose I want to compute the SVDs of the two-dimensional layers of a 3-d tensor (using torch.svd(), say), and I want to get back a tuple of stacked U's, stacked S's, and stacked V's.
Presumably, through the magic of SIMD parallelism, this should be doable in roughly the same time as a single layer's SVD (on GPU), but how do I program it?

PyTorch is a high-level software library with lots of Python wrappers around highly optimized compiled code. A function or operator either supports batched data or it doesn't.
There is no way around this other than writing your own C/C++/CUDA code and invoking it from Python.
Luckily, most functions support batch processing (including torch.svd(), as pointed out by jodag), and it can be assumed that the developers (or the compiler) paid attention to data parallelism in the implementation. I recommend stacking your tensors wherever you can; it usually leads to significant speedups.
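For instance, a minimal sketch of the batched SVD from the question (the shapes in the comments assume torch.svd()'s default some=True):
import torch

A = torch.randn(4, 5, 3)                  # 4 independent 5x3 layers stacked along the batch dimension
U, S, V = torch.svd(A)                    # one call, factorized layer by layer on CPU or GPU
# U: (4, 5, 3), S: (4, 3), V: (4, 3, 3)
A_rec = U @ torch.diag_embed(S) @ V.transpose(-2, -1)   # reconstructs A up to numerical error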
Note that, by convention, the batch dimension is the first dimension of a tensor. PyTorch supports broadcasting for common operators like +, -, *, / as documented here. Because of possible ambiguities, you are sometimes required to reshape your data to make clear what you want. For example, if you want to add a batch of scalars to a batch of vectors, you need to do something like:
a = torch.zeros(2, 2)
b = torch.arange(2)
a + b.view(2, 1) # or b.reshape(2, 1)
# tensor([[0., 0.],
#         [1., 1.]])
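The same idea extends to batched data: for example, scaling each layer of a batch by its own per-layer scalar (a small sketch):
batch = torch.ones(4, 5, 3)              # 4 layers of shape (5, 3)
scales = torch.arange(4.)                # one scalar per layer
scaled = batch * scales.view(4, 1, 1)    # broadcasts each scalar over the last two dimensions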

Related

C vectorization: Is it possible to do elementwise operations on arrays like Python vectorization?

I am moving from Python to C in the hope of a faster implementation, and I am trying to learn vectorization in C equivalent to Python vectorization. For example, assume that we have a binary array Input_Binary_Array; if I want to multiply each element at index i by 2**i and then sum all non-zero terms, in Python vectorization we do the following:
case 1 : Value = (2. ** (np.nonzero(Input_Binary_Array)[0] + 1)).sum()
Or if we do slicing and elementwise addition/subtraction/multiplication, we do the following:
case 2 : Array_opr = Input_Binary_Array[size:] * 2**size - Input_Binary_Array[:-size]
C is a powerful low-level language, so a simple for/while loop is quite fast, but I am not sure whether there are equivalent vectorized constructs like in Python.
So my question is: is there explicit vectorization syntax in C for:
1. multiplying all elements of an array by a constant number (scalar)
2. elementwise addition, subtraction, division for 2 given arrays of the same size
3. slicing, summing, cumulative summing
Or is a simple for/while loop the only fast option for doing the above operations the way Python vectorization does (cases 1, 2)?
The answer is to either use a library to achieve those things, or write one. The C language by itself is fairly minimalist; that's part of its appeal. Some libraries out there include Intel MKL, and there's GSL, which has that along with a huge number of other functions, and more.
Now, with that said, I would recommend that if moving from Python to C is your plan, moving from Python to C++ is the better one. The reason I say this is that C++ already has a lot of the tools you would be looking for to build what you like syntactically.
Specifically, you want to look at C++ std::vector, iterators, ranges and lambda expressions, all within C++20 and working rather well. I was able to make my own iterator on my own sort of weird collection and then have Linq style functions tacked onto it, with Linq semantics...
So I could say
mycollection<int> myvector = { 1, 2, 4, 5 };
Something like that anyway - the initializer expression rules I forget sometimes.
auto v = mycollection
.where( []( auto& itm ) { return itm > 3; } )
.sum( []( auto& itm ) { return itm; } );
and get more or less what I expect.
Since you control the iterator down to every single detail you could ever want (and the std framework already thought of many), you can make it go as fast as you need, use multiple cores and so on.
In fact, I think the STL implementations from MS and maybe GCC both actually have swap-in parallel algorithms where you just use them.
So C is good, but consider C++, if you are going that "C like" route. Because that's the only way you'll get the performance you want with the syntax you need.
Iterators basically let you wrap the concept of a for loop as an object.
So my question is: is there explicit vectorization syntax in C for:
1. multiplying all elements of an array by a constant number (scalar)
The C language itself does not have a syntax for expressing that with a single simple statement. One would ordinarily write a loop to perform the multiplication element by element, or possibly find or write a library that handles it.
Note also that as far as I am aware, the Python language does not have this either. In particular, the product of a Python list and an integer n is not scalar multiplication of the list elements, but rather a list with n times as many elements. Some of your Python examples look like they may be using Numpy, which can provide for that sort of thing, but that's a third-party package, analogous to a library in C.
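To illustrate the difference:
[1, 2, 3] * 2              # [1, 2, 3, 1, 2, 3] -- repetition of the list, not scalar multiplication

import numpy as np
np.array([1, 2, 3]) * 2    # array([2, 4, 6])   -- elementwise, but provided by the third-party NumPy package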
2. elementwise addition, subtraction, division for 2 given arrays of the same size
Same as above. This is not built into Python either, at least not as the effect of any built-in operator on objects of a built-in type.
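For example, with plain lists:
[1, 2] + [3, 4]            # [1, 2, 3, 4] -- concatenation, not elementwise addition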
3. slicing, summing, cumulative summing
C has limited array slicing, in that you can access contiguous subarrays of an array more or less as arrays in their own right. The term "slice" is not typically used, however.
C does not have built-in sum() functions for arrays.
Or is a simple for/while loop the only fast option for doing the above operations the way Python vectorization does (cases 1, 2)?
There are lots and lots of third-party C libraries, including some good ones for linear algebra. The C language and standard library do not have such features built in. One would typically choose among writing explicit loops, writing a custom library, and integrating a third-party library, based on how much your code relies on such operations, whether you need or can benefit from customization to your particular cases, and whether the operations need to be highly optimized.

Efficient way to perform tensor products in Fortran

I need to perform some tensor products and contractions on some large arrays in Fortran. Sometimes they are vectors or matrices and sometimes some of the objects involved are 3-arrays or 4-arrays.
Of course, it is very easy to write a subroutine achieving this with some nested loops, and that's just what I've done. But I have to call this subroutine with all its loops a lot of times for very large arrays, and I was just wondering whether there is some optimized function or subroutine implemented in Fortran which I could benefit from.
Last time I looked (about a year ago) I did not find a high-performance general-purpose tensor product library in Fortran. I think one of the reasons for this might be Fortran's cumbersome way of resizing arrays, which is a constant requirement when dealing with tensors.
If you only need multiplication you might be able to get away with using your own code. However if you need high performance, or more general operations, I would highly recommend writing a C interface and using one of the excellent C++ libraries out there, which are probably already optimized for your type of application:
Physics:
http://itensor.org/
Machine learning:
https://github.com/tensorflow/tensorflow
These are only examples. For a more complete listing see:
Tensor multiplication library

Basic GPU application, integer calculations

Long story short, I have done several prototypes of interactive software. I use pygame now (a Python SDL wrapper) and everything is done on the CPU. I am starting to port it to C now, and at the same time I am searching for existing ways to use some GPU power to relieve the CPU of redundant operations. However, I cannot find a good "guideline" for which exact technology/tools I should pick in my situation. I have just read a plethora of docs, and it drains my mental powers very fast. I am not sure if it is possible at all, so I'm puzzled.
Here I've made a very rough sketch of my typical application skeleton that I develop, but rewritten to use the GPU (note, I have almost zero practical knowledge of GPU programming). What is important is that data types and functionality must be exactly preserved. Here it is:
So F(A,R,P) is some custom function, for example element substitution, repetition, etc. The function is presumably constant over the program's lifetime, and the result's shape is generally not equal to A's shape, so it is not an in-place calculation; results are simply generated by my functions. Examples of F: repeat rows and columns of A; substitute values with values from substitution tables; compose some tiles into a single array; any math function on A's values, etc. As said, all of this can easily be done on the CPU, but the app must be really smooth. BTW, in pure Python it became simply unusable after adding several visual features, which are based on numpy arrays. Cython helps to make fast custom functions, but then the source code is already kind of a salad.
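For concreteness, here is how a couple of the F examples above might look in the current NumPy prototype (shapes and names here are made up for illustration):
import numpy as np

A = np.arange(16, dtype=np.uint32).reshape(4, 4)
repeated = np.repeat(np.repeat(A, 2, axis=0), 2, axis=1)   # repeat rows and columns of A
table = np.arange(256, dtype=np.uint32)[::-1]              # a substitution table
substituted = table[A]                                     # substitute values via the table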
Question:
Does this scheme reflect some (standard) technology/dev tools?
Is CUDA what I am looking for? If yes, some links/examples which coincide with my application structure would be great.
I realise this is a big question, so I will give more details if it helps.
Update
Here is a concrete example of two typical calculations for my prototype of a bitmap editor. The editor works with indexes, and the data includes layers with corresponding bit masks. I can determine the size of the layers; the masks are the same size as the layers, and, say, all layers are the same size (1024^2 pixels = 4 MB for 32-bit values). And my palette has, say, 1024 elements (4 kilobytes for the 32 bpp format).
Consider I want to do two things now:
Step 1. I want to flatten all layers into one. Say A1 is the default layer (background) and layers 'A2' and 'A3' have masks 'm2' and 'm3'. In Python I'd write:
from numpy import logical_not
...
Result = (A1 * logical_not(m2) + A2 * m2) * logical_not(m3) + A3 * m3
Since the data is independent, I believe it must give a speedup proportional to the number of parallel blocks.
Step 2. Now I have an array and want to 'colorize' it with some palette, so it will be my lookup table. As I see it now, there is a problem with simultaneous reads of a lookup table element.
But my idea is, probably one can just duplicate the palette for all blocks, so each block can read its own palette? Like this:
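For reference, the Step 2 lookup itself is just fancy indexing in the NumPy prototype (the names below are made up for illustration):
import numpy as np

palette = np.arange(1024, dtype=np.uint32)              # 1024-entry lookup table (4 KB)
layer = np.random.randint(0, 1024, (1024, 1024))        # layer of palette indices
colorized = palette[layer]                              # per-pixel lookup, shape (1024, 1024)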
When your code is highly parallel (i.e. there are small or no data dependencies between stages of processing), then you can go for CUDA (more fine-grained control over synching) or OpenCL (a very similar AND portable OpenGL-like API to interface with the GPU for kernel processing). Most of the acceleration work we do happens in OpenCL, which has excellent interop with both OpenGL and DirectX, but we also have the same setup working with CUDA. One big difference between CUDA and OpenCL is that in CUDA you can compile kernels once and delay-load (and/or link) them in your app, whereas in OpenCL the compiler plays nice with the OpenCL driver stack to ensure the kernel is compiled when the app starts.
One alternative that is often overlooked if you're using Microsoft Visual Studio is C++ AMP, a C++-syntax-friendly and intuitive API for those who do not want to dig into the logic twists and turns of the OpenCL/CUDA APIs. The big advantage here is that the code also works if you do not have a GPU in the system, but then you do not have as many options to tweak performance. Still, in a lot of cases, this is a fast and efficient way to write proof-of-concept code and re-implement bits and parts in CUDA or OpenCL later.
OpenMP and Threading Building Blocks are only good alternatives when you have synching issues and lots of data dependencies. Native threading using worker threads is also a viable solution, but only if you have a good idea of how synch points can be set up between the different processes in such a way that threads do not starve each other out when fighting for priority. This is a lot harder to get right, and tools such as Parallel Studio are a must. But then, so is NVidia Nsight if you're writing GPU code.
Appendix:
A new platform called Quasar (http://quasar.ugent.be/blog/) is being developed that enables you to write your math problems in a syntax that is very similar to Matlab, but with full support for C/C++/C# or Java integration, and cross-compiles (LLVM, CLANG) your "kernel" code to any underlying hardware configuration. It generates CUDA PTX files, or runs on OpenCL, or even on your CPU using TBB, or a mixture of them. Using a few monikers, you can decorate the algorithm so that the underlying compiler can infer types (you can also explicitly use strict typing), so you can leave the type-heavy stuff entirely up to the compiler. To be fair, at the time of writing, the system is still a w.i.p. and the first OpenCL-compiled programs are just being tested, but the most important benefit is fast prototyping with almost identical performance compared to optimized CUDA.
What you want to do is send values really fast to the GPU using high-frequency dispatch and then display the result of a function, which is basically texture lookups and some parameters.
I would say this problem will only be worth solving on the GPU if two conditions are met:
The size of A[] is optimized to make the transfer times irrelevant (look at http://blog.theincredibleholk.org/blog/2012/11/29/a-look-at-gpu-memory-transfer/).
The lookup table is not too big and/or the lookup values are organized in a way that the cache can be maximally utilized; in general, random lookups on the GPU can be slow. Ideally, you can pre-load the R[] values into a shared memory buffer for each element of the A[] buffer.
If you can answer both of those questions positively, then and only then consider having a go at using the GPU for your problem; otherwise those 2 factors will overpower the computational speed-up that the GPU can provide you with.
Another thing you can look at is overlapping the transfer and compute times as much as you can, to hide as much as possible of the slow CPU->GPU transfer rates.
Regarding your F(A, R, P) function, you need to make sure that you do not need to know the value of F(A, R, P)[0] in order to know what the value of F(A, R, P)[1] is, because if you do, then you need to rewrite F(A, R, P) to get around this issue, using some parallelization technique. If you have a limited number of F() functions then this can be solved by writing a parallel version of each F() function for the GPU to use, but if F() is user-defined then your problem becomes a bit trickier.
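A quick way to see the difference, in NumPy terms: an elementwise map has no dependency between output elements, while a cumulative sum does.
import numpy as np

a = np.arange(8)
independent = a * 2        # out[i] depends only on a[i]  -> trivially parallel
dependent = np.cumsum(a)   # out[i] depends on out[i-1]   -> needs a parallel scan, not a plain map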
I hope this is enough information to have an informed guess towards whether you should or not use a GPU to solve your problem.
EDIT
Having read your edit, I would say yes. The palette could fit in shared memory (see "GPU shared memory size is very small - what can I do about it?"), which is very fast. If you have more than one palette, you could fit 16 KB (the size of shared memory on most cards) / 4 KB per palette = 4 palettes per block of threads.
One last warning: integer operations are not the fastest on the GPU; after you have implemented your algorithm and it is working, consider switching to floating point as a cheap optimization if necessary.
There is not much difference between OpenCL/CUDA so choose which works better for you. Just remember that CUDA will limit you to the NVidia GPUs.
If I understand your problem correctly, the kernel (the function executed on the GPU) should be simple. It should follow this pseudocode:
kernel main(shared A, shared outA, const struct R, const struct P, const int maxOut, const int sizeA)
    int index := getIndex()        // get offset in input array
    if(index >= sizeA) return      // guard: the GPU often works better when the number of threads is 2^n
    int outIndex := index*maxOut   // offset in output array
    outA[outIndex] := F(A[index], R, P)
end
Function F should be inlined, and you can use switch or if for different functions. Since the output size of F is not known in advance, you have to use more memory. Each kernel instance must know the positions for correct memory writes and reads, so there has to be some maximum size (if there is none, then this is all useless and you have to use the CPU!). If the differing sizes are sparse, then I would compute those few on the CPU after getting the array back to RAM, while filling outA with zeros or indication values.
Sizes of arrays are obviously length(A) * maxOut = length(outA).
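In NumPy terms, the fixed-slot output layout the pseudocode assumes looks roughly like this (F_cpu and the value of maxOut are placeholders, not from the question):
import numpy as np

def F_cpu(x):                    # stand-in for F(A[index], R, P): returns a variable-length result
    return np.repeat(x, x % 3 + 1)

A = np.arange(10)
maxOut = 3
outA = np.zeros(len(A) * maxOut, dtype=A.dtype)        # length(A) * maxOut = length(outA)
for i, x in enumerate(A):
    out = F_cpu(x)[:maxOut]                            # each input element gets a fixed-size slot
    outA[i * maxOut : i * maxOut + len(out)] = out     # unused slots stay zero (indication values)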
I forgot to mention that if the execution of F is not the same in most cases (i.e. the same source code path), the GPU will serialize it. GPU multiprocessors have a few cores connected to the same instruction cache, so the hardware will have to serialize code that is not the same for all cores! OpenMP or threads are a better choice for this kind of problem!

What Haskell representation is recommended for 2D, unboxed pixel arrays with millions of pixels?

I want to tackle some image-processing problems in Haskell. I'm working with both bitonal (bitmap) and color images with millions of pixels. I have a number of questions:
On what basis should I choose between Vector.Unboxed and UArray? They are both unboxed arrays, but the Vector abstraction seems heavily advertised, particularly around loop fusion. Is Vector always better? If not, when should I use which representation?
For color images I will wish to store triples of 16-bit integers or triples of single-precision floating-point numbers. For this purpose, is either Vector or UArray easier to use? More performant?
For bitonal images I will need to store only 1 bit per pixel. Is there a predefined datatype that can help me here by packing multiple pixels into a word, or am I on my own?
Finally, my arrays are two-dimensional. I suppose I could deal with the extra indirection imposed by a representation as "array of arrays" (or vector of vectors), but I'd prefer an abstraction that has index-mapping support. Can anyone recommend anything from a standard library or from Hackage?
I am a functional programmer and have no need for mutation :-)
For multi-dimensional arrays, the current best option in Haskell, in my view, is repa.
Repa provides high performance, regular, multi-dimensional, shape polymorphic parallel arrays. All numeric data is stored unboxed. Functions written with the Repa combinators are automatically parallel provided you supply +RTS -Nwhatever on the command line when running the program.
Recently, it has been used for some image processing problems:
Real time edge detection
Efficient Parallel Stencil Convolution in Haskell
I've started writing a tutorial on the use of repa, which is a good place to start if you already know Haskell arrays, or the vector library. The key stepping stone is the use of shape types instead of simple index types, to address multidimensional indices (and even stencils).
The repa-io package includes support for reading and writing .bmp image files, though support for more formats is needed.
Addressing your specific questions:
On what basis should I choose between Vector.Unboxed and UArray?
They have approximately the same underlying representation; however, the primary difference is the breadth of the API for working with vectors: Vector has almost all the operations you'd normally associate with lists (with a fusion-driven optimization framework), while UArray has almost no API.
For color images I will wish to store triples of 16-bit integers or triples of single-precision floating-point numbers.
UArray has better support for multi-dimensional data, as it can use arbitrary data types for indexing. While this is possible in Vector (by writing an instance of UA for your element type), it isn't the primary goal of Vector -- instead, this is where Repa steps in, making it very easy to use custom data types stored in an efficient manner, thanks to the shape indexing.
In Repa, your triple of shorts would have the type:
Array DIM3 Word16
That is, a 3D array of Word16s.
For bitonal images I will need to store only 1 bit per pixel.
UArrays pack Bools as bits; Vector uses the instance for Bool, which does not do bit packing, instead using a representation based on Word8. However, it is easy to write a bit-packing implementation for vectors -- here is one, from the (obsolete) uvector library. Under the hood, Repa uses Vectors, so I think it inherits that library's representation choices.
Is there a predefined datatype that can help me here by packing multiple pixels into a word
You can use the existing instances for any of the libraries, for different word types, but you may need to write a few helpers using Data.Bits to roll and unroll packed data.
Finally, my arrays are two-dimensional
UArray and Repa support efficient multi-dimensional arrays. Repa also has a rich interface for doing so. Vector on its own does not.
Notable mentions:
hmatrix, a custom array type with extensive bindings to linear algebra packages. Should be bound to use the vector or repa types.
ix-shapeable, getting more flexible indexing from regular arrays
chalkboard, Andy Gill's library for manipulating 2D images
codec-image-devil, read and write various image formats to UArray
I once reviewed the features of Haskell array libraries that matter to me, and compiled a comparison table (spreadsheet only: direct link). So I'll try to answer.
On what basis should I choose between Vector.Unboxed and UArray? They are both unboxed arrays, but the Vector abstraction seems heavily advertised, particularly around loop fusion. Is Vector always better? If not, when should I use which representation?
UArray may be preferred over Vector if one needs two-dimensional or multi-dimensional arrays, but Vector has a nicer API for manipulating, well, vectors. In general, Vector is not well suited to simulating multi-dimensional arrays.
Vector.Unboxed cannot be used with parallel strategies. I suspect that UArray cannot be used either, but at least it is very easy to switch from UArray to the boxed Array and see if the parallelization benefits outweigh the boxing costs.
For color images I will wish to store triples of 16-bit integers or triples of single-precision floating-point numbers. For this purpose, is either Vector or UArray easier to use? More performant?
I tried using Arrays to represent images (though I needed only grayscale images). For color images I used the Codec-Image-DevIL library to read/write images (bindings to the DevIL library); for grayscale images I used the pgm library (pure Haskell).
My major problem with Array was that it provides only random-access storage, but it doesn't provide many means of building Array algorithms, nor does it come with ready-to-use libraries of array routines (it doesn't interface with linear algebra libs, and doesn't allow expressing convolutions, FFTs and other transforms).
Almost every time a new Array has to be built from an existing one, an intermediate list of values has to be constructed (as in the matrix multiplication from the Gentle Introduction). The cost of array construction often outweighs the benefits of faster random access, to the point that a list-based representation is faster in some of my use cases.
STUArray could have helped me, but I didn't like fighting with cryptic type errors and the effort necessary to write polymorphic code with STUArray.
So the problem with Arrays is that they are not well suited for numerical computations. Hmatrix' Data.Packed.Vector and Data.Packed.Matrix are better in this respect, because they come along with a solid matrix library (attention: GPL license). Performance-wise, on matrix multiplication, hmatrix was sufficiently fast (only slightly slower than Octave), but very memory-hungry (consumed several times more than Python/SciPy).
There is also the blas library for matrices, but it doesn't build on GHC 7.
I don't have much experience with Repa yet, and I don't understand Repa code well. From what I see, it has a very limited range of ready-to-use matrix and array algorithms written on top of it, but at least it is possible to express important algorithms by means of the library. For example, there are already routines for matrix multiplication and for convolution in repa-algorithms. Unfortunately, it seems that convolution is currently limited to 7×7 kernels (it's not enough for me, but should suffice for many uses).
I didn't try the Haskell OpenCV bindings. They should be fast, because OpenCV is really fast, but I am not sure if the bindings are complete and good enough to be usable. Also, OpenCV by its nature is very imperative, full of destructive updates. I suppose it's hard to design a nice and efficient functional interface on top of it. If one goes the OpenCV way, one is likely to use the OpenCV image representation everywhere and use OpenCV routines to manipulate images.
For bitonal images I will need to store only 1 bit per pixel. Is there a predefined datatype that can help me here by packing multiple pixels into a word, or am I on my own?
As far as I know, unboxed arrays of Bools take care of packing and unpacking bit vectors. I remember looking at the implementations of arrays of Bools in other libraries, and didn't see this elsewhere.
Finally, my arrays are two-dimensional. I suppose I could deal with the extra indirection imposed by a representation as "array of arrays" (or vector of vectors), but I'd prefer an abstraction that has index-mapping support. Can anyone recommend anything from a standard library or from Hackage?
Apart from Vector (and simple lists), all the other array libraries are capable of representing two-dimensional arrays or matrices. I suppose they avoid unnecessary indirection.
Although this doesn't exactly answer your question and isn't really even Haskell as such, I would recommend taking a look at the CV or CV-combinators libraries on Hackage. They bind many of the rather useful image processing and vision operators from the OpenCV library and make working with machine vision problems much faster.
It would be rather great if someone figured out how repa or some such array library could be used directly with OpenCV.
Here is a new Haskell Image Processing library that can handle all of the tasks in question and much more. Currently it uses the Repa and Vector packages for its underlying representations, and consequently inherits fusion, parallel computation, mutation and most of the other goodies that come with those libraries. It provides an easy-to-use interface that is natural for image manipulation:
2D indexing and unboxed pixels with arbitrary precision (Double, Float, Word16, etc..)
all essential functions like map, fold, zipWith, traverse ...
support for various color spaces: RGB, HSI, gray scale, Bi-tonal, Complex, etc.
common image processing functionality:
Binary morphology
Convolution
Interpolation
Fourier transform
Histogram plotting
etc.
Ability to treat pixels and images as regular numbers.
Reading and writing common image formats through JuicyPixels library
Most importantly, it is a pure Haskell library, so it does not depend on any external programs. It is also highly extendable, new color spaces and image representations can be introduced.
One thing that it does not do is pack multiple binary pixels into a Word; instead it uses a Word per binary pixel. Maybe in the future...

Which haskell array implementation to use? AKA what are the pros and cons of each

What do I need? [an unordered list]
VERY easy parallelization
support for map, filter etc.
ability to perform array based computations efficiently, like A=B+C, sort of like matlab arrays.
Generation of SIMD code. I guess this is out of the question in the near future for anything, but hey, I can ask :)
support for matrices should be there at a minimum, higher dimensions are lower priority right now.
ability to get a pointer to it and create one from a C pointer.
Support from other libraries. IE, bindings to popular C math packages, i/o to disk or images if the arrays are 2D
What do I see?
Array package in haskell-platform. It's the blessed one and can do parallel
Data.Vector. Has loop fusion, but not in platform, so its maturity is unknown to me.
repa package, contributed by the DPH team, but doesn't work well with any stable ghc today.
Lots of variation in the level of support for array implementations. For instance, there doesn't seem to be an easy way to dump a 2D vector to a image file. IOW, the haskell community apparently hasn't settled on an array implementation.
So please, help me choose.
EDIT: A=B+C refers to elementwise addition, not list concatenation.
Correct, the community hasn't settled on a good array implementation. I think it would be a good Haskell Prime submission to put forward the Vector API and remove Data.Array.
Vector is very mature! It has:
VERY easy parallelization
support for map, filter etc.
performs array based computations efficiently, like A=B+C (but I'm not in tune with how matlab does it)
vector creation from a pointer via Vector.Storable
It does not:
have enough support from other libraries. IE, bindings to popular C math packages
support matrices, but you can have vectors of vectors. If you build some vector-based matrix operations then perhaps you could upload to hackage as vector-matrix.
Generate SIMD code.
NOTE: You can turn bytestrings into vectors of whatever, so if you have an image as a bytestring then, via Vector.Storable, you might be able to do what you want with the image as a vector.
(I am not allowed to comment)
rpg: Does hmatrix accept Data.Vector? It has a Data.Packed.Vector but are they the same?
Yes. The last version of hmatrix uses by default Data.Vector.Storable for 1D vectors (previously it was optional). The dependency on vector is not shown in Hackage, probably because it is in a configuration flag.
For LAPACK compatibility matrices are not Vector or Vector t, but they can be easily converted (e.g.: Data.Vector.fromList . toRows).
If you want bindings to popular C libraries, the best options are probably hmatrix and blas. Blas is just a binding to a BLAS library, whereas hmatrix provides some higher-level operations. There are also many libraries built upon hmatrix offering further functionality. If you're doing any sort of matrix work, that's what I would start with.
The vector package is also a good choice; it's stable and provides excellent performance. The Data.Vector.Storable types are represented as C arrays, so it's trivial to interface from them to other C libraries. The biggest drawback is that there's no matrix support, so you'd have to do that yourself.
As for exporting to an image format, most haskell image libraries seem to use ByteStrings. You could either convert to a ByteString, or bind to a C library that does what you want. If you find a Haskell library that does what you want, it should be easy enough to convert hmatrix data to the proper format.
