dgesvd seems to be equivalent to LAPACKE_dgesvd with the layout LAPACK_COL_MAJOR, but looking at the examples for dgesvd and LAPACKE_dgesvd, there is an extra step in the dgesvd example where an optimal workspace is queried and allocated.
Is it correct to assume that this step is to figure out if the input matrix is COL_MAJOR or ROW_MAJOR?
Is it correct to assume that once the optimal workspace is figured out, 'dgesvd' internally calls LAPACKE_dgesvd with the appropriate layout?
If I already know the matrix layout to be COL_MAJOR, is using LAPACKE_dgesvd better (faster/less expensive) than dgesvd?
We have two functions here which refer to two different interfaces:
i. dgesvd: calls the Fortran interface
ii. LAPACKE_dgesvd: calls the C interface
For details, see this.
No, it is not correct. As you will notice, in the first call to dgesvd the value of lwork is set to -1, which, as documented here, is used only to query the optimal size of the work array (returned in its first element). So if you already know the required size of lwork you don't need to call it twice. The input matrix has to be column-major for dgesvd, as that is the default for Fortran. Also, there is no way to detect from the data whether a matrix is row-major or column-major.
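For reference, a minimal sketch of that two-call pattern against the Fortran interface (the symbol name dgesvd_ and the plain int arguments are assumptions that depend on your LAPACK/MKL build, so treat this as an illustration rather than a drop-in example):

#include <cstdio>
#include <vector>

// Fortran LAPACK symbol; with MKL you can instead include mkl.h and call dgesvd(...)
extern "C" void dgesvd_(const char *jobu, const char *jobvt,
                        const int *m, const int *n, double *a, const int *lda,
                        double *s, double *u, const int *ldu,
                        double *vt, const int *ldvt,
                        double *work, const int *lwork, int *info);

int main() {
    int m = 3, n = 2, lda = 3, ldu = 3, ldvt = 2, info = 0, lwork = -1;
    double a[6] = {1, 2, 3, 4, 5, 6};   // 3x2 matrix, column-major as Fortran expects
    double s[2], u[9], vt[4], wkopt;

    // 1st call: lwork = -1 only queries the optimal workspace size (returned in wkopt)
    dgesvd_("A", "A", &m, &n, a, &lda, s, u, &ldu, vt, &ldvt, &wkopt, &lwork, &info);

    // 2nd call: allocate that much workspace and actually compute the SVD
    lwork = static_cast<int>(wkopt);
    std::vector<double> work(lwork);
    dgesvd_("A", "A", &m, &n, a, &lda, s, u, &ldu, vt, &ldvt, work.data(), &lwork, &info);

    std::printf("info = %d, singular values: %g %g\n", info, s[0], s[1]);
}

If you already know a sufficiently large lwork you can skip the first call, and LAPACKE_dgesvd with LAPACK_COL_MAJOR hides this workspace bookkeeping entirely.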
As for the second question: no, it's not true. dgesvd is the Fortran interface, which was implemented first; it does not call LAPACKE_dgesvd internally.
As for the third question: this would depend on the compiler and optimization. If the matrix is small it probably wouldn't matter. Personally, if the data is column-major I would use the Fortran interface.
For matrix layout information, see this. Here is the technical paper for the C interface.
I was trying to write a library for linear algebra operations in Haskell. In order to be able to define safe operations for matrices and vectors I wanted to encode their dimensions in their types. After some research I found that using DataKinds one is able to do that, similar to the way it's done here. For example:
data Vector (n :: Nat) a
dot :: Num a => Vector n a -> Vector n a -> a
In the aforementioned article, as well as in some libraries, the size of the vectors is a phantom type and the vector type itself is a wrapper around an Array. In trying to figure out whether there is an array type with its size at the type level in the standard library, I started wondering about the underlying representation of arrays. From what I could gather from this commentary on GHC memory layout, arrays need to store their size on the heap, so a 3-dimensional vector would need to take up one more word than necessary. Of course we could use the following definition:
data Vector3 a = Vector3 a a a
which might be fine if we only care about 3D geometry, but it doesn't allow for vectors of arbitrary size and also it makes indexing awkward.
So, my question is this. Wouldn't it be useful and a potential memory optimization to have an array type with statically known size in the standard library? As far as I understand, the only thing it would need is a different info table, which would store the size, instead of the size being stored at each heap object. Also, the compiler could choose between Array and SmallArray automatically.
Wouldn't it be useful and a potential memory optimization to have an array type with statically known size in the standard library?
Sure. I suspect if you wrote up your use case carefully and implemented this, GHC HQ would accept a patch. You might want to do the writeup first and double-check that they're into it to avoid wasting time on a patch they won't accept, though; I certainly don't speak for them.
Also, the compiler could choose between Array and SmallArray automatically.
I'm not an expert here, but I kinda doubt this. Usually supporting polymorphism means you need a uniform representation.
The problem
I'm working on implementing and refining an optimization algorithm with some fairly large arrays (from tens of millions of floats and up) and using mainly Intel MKL in C (not C++, at least not so far) to squeeze out every possible bit of performance. Now I've run into a silly problem - I have a parameter that sets maxima and minima for subsets of a set of (tens of millions of) coefficients. Actually applying these maxima and minima using MKL functions is easy - I can create equally-sized vectors with the limits for every element and use V?Fmax and V?Fmin to apply them. But I also need to account for this clipping in my error metric, which requires me to count the number of elements that fall outside these constraints.
However, I can't find an MKL function that allows me to do things like counting the number of elements that fulfill some condition, the way you can create and sum logical arrays with e.g. NumPy in Python or in MATLAB. Irritatingly, when I try to google this question, I only get answers relating to Python and R.
Obviously I can just write a loop that increments a counter for each element that fulfills one of the conditions, but if there is an already optimized implementation that allows me to achieve this, I would much prefer that just owing to the size of my arrays.
Does anyone know of a clever way to achieve this robustly and very efficiently using Intel MKL (maybe with the statistics toolbox or some creative use of elementary functions?), a similarly optimized library that does this, or a highly optimized way to hand-code this? I've been racking my brain trying to come up with some out-of-the box method, but I'm coming up empty.
Note that it's necessary for me to be able to do this in C, that it's not viable for me to shift this task to my Python frontend, and that it is indeed necessary for me to code this particular subprogram in C in the first place.
Thanks!
If you were using C++, std::count_if from the algorithms library with an execution policy of std::execution::par_unseq may parallelize and vectorize the count. On Linux at least, it typically uses Intel TBB to do this.
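A minimal sketch of that approach (C++17; the scalar lo/hi bounds are a simplification of the per-element limit vectors in the question, and on GCC/libstdc++ you typically need to link TBB, e.g. -ltbb):

#include <algorithm>
#include <cstddef>
#include <execution>
#include <vector>

// counts how many coefficients fall outside [lo, hi]; the execution policy
// allows the library to parallelize and vectorize the scan
std::size_t count_outside(const std::vector<double>& x, double lo, double hi) {
    return static_cast<std::size_t>(
        std::count_if(std::execution::par_unseq, x.begin(), x.end(),
                      [lo, hi](double v) { return v < lo || v > hi; }));
}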
It's not likely to be as easy in C. Because C doesn't have concepts like templates, callables or lambdas, the only way to specialize a generic (library-provided) count() function would be to pass a function pointer as a callback (like qsort() does). Unless the compiler manages to devirtualize and inline the callback, you can't vectorize at all, leaving you with (possibly thread-parallelized) scalar code. OTOH, if you use for example gcc vector intrinsics (my favourite!), you get vectorization but not parallelization. You could try to combine the approaches, but I'd say get over yourself and use C++.
However, if you only need vectorization, you can almost certainly just write sequential code and have the compiler autovectorize, unless the predicate for what should be counted is poorly written, or your compiler is braindamaged.
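A minimal sketch of such a sequential counting loop, with per-element bounds as in the question (function and variable names are made up; note the bitwise | rather than ||, explained below):

#include <stddef.h>

// counts elements of x that violate their per-element bounds;
// plain sequential code that an optimizing compiler can autovectorize
size_t count_clipped(const double *x, const double *lo, const double *hi, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        count += (x[i] < lo[i]) | (x[i] > hi[i]);
    return count;
}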
For example, gcc vectorizes a loop like the one sketched above on x86 if at least SSE4 instructions are available (-msse4). With AVX[2/512] (-mavx / -mavx2 / -mavx512f) you get wider vectors that process more elements at once. In general, if you're compiling on the same hardware you will be running the program on, I'd recommend letting gcc autodetect the optimal instruction set extensions (-march=native).
Note that in such code the conditions should not use the short-circuiting OR (||), because then the read from the max-vector is semantically forbidden whenever the comparison with the min-vector is already true for the current element, which severely hinders vectorization (though AVX-512 masking could potentially vectorize it, with a somewhat catastrophic slowdown).
I'm pretty sure gcc is not nearly optimal in the code it generates for AVX-512, since it could do the OR directly in the k-regs (mask registers) with kor[b/w/d/q], but maybe somebody with more experience in AVX-512 (*cough* Peter Cordes *cough*) could weigh in on that.
MKL doesn't provide such functions, but you may want to check another performance library, IPP, which contains a set of threshold functions that could be useful for your case. Please refer to the IPP Developer Reference for more details: https://software.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top/volume-1-signal-and-data-processing/essential-functions/conversion-functions/threshold.html
My question may seem primitive or dumb because I've just switched to C.
I have been working with MATLAB for several years and I've learned that any computation should be vectorized in MATLAB and I should avoid any for loop to get an acceptable performance.
It seems that if I want to add two vectors, or multiply matrices, or do any other matrix computation, I should use a for loop.
I would appreciate it if you could let me know whether or not there is any way to do the computations in a vectorized sense, e.g. reading all elements of a vector using only one command and adding those elements to another vector using one command.
Thanks
MATLAB suggests you avoid for loops because most of the operations available on vectors and matrices are already implemented in its API and ready to be used. They are probably optimized and work directly on the underlying data instead of working at the MATLAB language level; a sort of opaque implementation, I guess.
Even MATLAB uses for loops underneath to implement most of its magic (or delegates them to highly specialized assembly instructions or through CUDA to the GPU).
What you are asking is not directly possible: you will need to use loops to work on vectors and matrices. In practice you would look for a library that lets you do most of the work without writing the for loops yourself, by using already-defined functions that wrap them.
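For illustration, such a wrapper is nothing more than a loop hidden behind a function (the name here is made up):

// element-wise vector addition: out[i] = a[i] + b[i]
void vec_add(const double *a, const double *b, double *out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}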
As was already mentioned, it is not possible to hide the for loops. However, I doubt that the code MATLAB produces is in any way faster than the one produced by a C compiler. If you compile your C code with -O3, the compiler will try to use every hardware feature your computer has available, such as SIMD extensions and multiple issue. Moreover, if your code is good, doesn't cause too many pipeline stalls and uses the cache well, it will be really fast.
But I think what you are really after are libraries: search for LAPACK or BLAS, they might be exactly what you need.
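For example, assuming a CBLAS header is available (cblas.h from OpenBLAS or ATLAS, or mkl.h with Intel MKL), adding one vector to another element-wise is a single call to the standard BLAS routine daxpy:

#include <cstdio>
#include <cblas.h>   // from OpenBLAS or ATLAS; with Intel MKL include mkl.h instead

int main() {
    double x[4] = {1, 2, 3, 4};
    double y[4] = {10, 20, 30, 40};
    // y = 1.0 * x + y, i.e. element-wise vector addition in one call
    cblas_daxpy(4, 1.0, x, 1, y, 1);
    for (int i = 0; i < 4; ++i)
        std::printf("%g ", y[i]);
    std::printf("\n");
}

The loop still exists, of course, but it runs inside a heavily optimized library rather than in your own code.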
In C there is no way to perform operations in a vectorized way. You can use structures and functions to abstract away the details of the operations, but in the end you will always be using for loops to process your data.
As for speed, C is a compiled language and you will not get a performance hit from using for loops in C. C has the benefit (compared to MATLAB) that it does not hide anything from you, so you can always see where your time is being used. On the downside you will notice that things MATLAB makes trivial (svd, cholesky, inv, cond, imread, etc.) are challenging in C.
I am seeking advice on how to incorporate C or C++ code into my R code to speed up a MCMC program, using a Metropolis-Hastings algorithm. I am using an MCMC approach to model the likelihood, given various covariates, that an individual will be assigned a particular rank in a social status hierarchy by a 3rd party (the judge): each judge (approx 80, across 4 villages) was asked to rank a group of individuals (approx 80, across 4 villages) based on their assessment of each individual's social status. Therefore, for each judge I have a vector of ranks corresponding to their judgement of each individual's position in the hierarchy.
To model this I assume that, when assigning ranks, judges are basing their decisions on the relative value of some latent measure of an individual's utility, u. Given this, it can then be assumed that a vector of ranks, r, produced by a given judge is a function of an unobserved vector, u, describing the utility of the individuals being ranked, where the individual with the kth highest value of u will be assigned the kth rank. I model u, using the covariates of interest, as a multivariate normally distributed variable and then determine the likelihood of the observed ranks, given the distribution of u generated by the model.
In addition to estimating the effect of, at most, 5 covariates, I also estimate hyperparameters describing variance between judges and items. Therefore, for every iteration of the chain I estimate a multivariate normal density approximately 8-10 times. As a result, 5000 iterations can take up to 14 hours. Obviously, I need to run it for much more than 5000 runs and so I need a means for dramatically speeding up the process. Given this, my questions are as follows:
(i) Am I right to assume that the best speed gains will be had by running some, if not all of my chain in C or C++?
(ii) assuming the answer to question 1 is yes, how do I go about this? For example, is there a way for me to retain all my R functions, but simply do the looping in C or C++: i.e. can I call my R functions from C and then do looping?
(iii) I guess what I really want to know is how best to approach the incorporation of C or C++ code into my program.
First make sure your slow R version is correct. Debugging R code might be easier than debugging C code. Done that? Great. You now have correct code you can compare against.
Next, find out what is taking the time. Use Rprof to run your code and see what is taking the time. I did this for some code I inherited once, and discovered it was spending 90% of the time in the t() function. This was because the programmer had a matrix, A, and was doing t(A) in a zillion places. I did one tA=t(A) at the start, and replaced every t(A) with tA. Massive speedup for no effort. Profile your code first.
Now, you've found your bottleneck. Is it code you can speed up in R? Is it a loop that you can vectorise? Do that. Check your results against your gold standard correct code. Always. Yes, I know it's hard to compare algorithms that rely on random numbers, so set the seeds the same and try again.
Still not fast enough? Okay, now maybe you need to rewrite parts (the lowest level parts, generally, and those that were taking the most time in the profiling) in C or C++ or Fortran, or if you are really going for it, in GPU code.
Again, really check the code is giving the same answers as the correct R code. Really check it. If at this stage you find any bugs anywhere in the general method, fix them in what you thought was the correct R code and in your latest version, and rerun all your tests. Build lots of automatic tests. Run them often.
Read up about code refactoring. It's called refactoring because if you tell your boss you are rewriting your code, he or she will say 'why didn't you write it correctly first time?'. If you say you are refactoring your code, they'll say "hmmm... good". THIS ACTUALLY HAPPENS.
As others have said, Rcpp is made of win.
A complete example using R, C++ and Rcpp is provided by this blog post, which was inspired by this post on Darren Wilkinson's blog (and he has more follow-ups). The example is also included with recent releases of Rcpp in a directory RcppGibbs and should get you going.
I have a blog post which discusses exactly this topic which I suggest you take a look at:
http://darrenjw.wordpress.com/2011/07/31/faster-gibbs-sampling-mcmc-from-within-r/
(this post is more relevant than the post of mine that Dirk refers to).
I think the best method currently to integrate C or C++ is the Rcpp package of Dirk Eddelbuettel. You can find a lot of information at his website. There is also a talk at Google, available on YouTube, that might be interesting.
Check out this project:
https://github.com/armstrtw/rcppbugs
Also, here is a link to the R/Fin 2012 talk:
https://github.com/downloads/armstrtw/rcppbugs/rcppbugs.pdf
I would suggest benchmarking each step of the MCMC sampler and identifying the bottleneck. If you put each full conditional or M-H step into a function, you can use the R compiler package, which might give you a 5%-10% speed gain. The next step is to use Rcpp.
I think it would be really nice to have a general-purpose Rcpp function which generates just one single draw using the M-H algorithm given a likelihood function (a rough sketch of that idea follows below).
However, with Rcpp some things become difficult if you only know the R language: non-standard random distributions (especially truncated ones) and using arrays. You have to think more like a C programmer there.
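As a rough, hedged sketch of that single-draw idea (the function name, the random-walk proposal and the proposalSd parameter are all made up; calling back into an R log-likelihood from C++ keeps much of the R overhead, so in practice you would usually move the likelihood itself into C++ as well):

#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;

// One random-walk Metropolis-Hastings update of the parameter vector `current`,
// given an R function `logLik` returning the log-likelihood (plus log-prior).
// [[Rcpp::export]]
NumericVector mhDraw(NumericVector current, Function logLik, double proposalSd) {
    NumericVector proposal = clone(current);
    for (int j = 0; j < proposal.size(); ++j)
        proposal[j] += R::rnorm(0.0, proposalSd);        // symmetric proposal
    double logRatio = as<double>(logLik(proposal)) - as<double>(logLik(current));
    if (std::log(R::runif(0.0, 1.0)) < logRatio)         // accept with prob min(1, ratio)
        return proposal;
    return current;                                      // reject: keep the current state
}

After compiling with Rcpp::sourceCpp, you would call something like theta <- mhDraw(theta, myLogLik, 0.1) inside your existing R loop.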
The multivariate normal is actually a big issue in R. dmvnorm is very inefficient and slow. dmnorm is faster, but in some models it gave me NaNs sooner than dmvnorm did.
Neither takes an array of covariance matrices, so it is impossible to vectorize the code in many instances. As long as you have a common covariance and common means, however, you can vectorize, which is the R-ish strategy to speed up (and which is the opposite of what you would do in C).
What do I need? [an unordered list]
VERY easy parallelization
support for map, filter etc.
ability to perform array based computations efficiently, like A=B+C, sort of like matlab arrays.
Generation of SIMD code. I guess this is out of the question in the near future for anything, but hey, I can ask :)
support for matrices should be there at a minimum, higher dimensions are lower priority right now.
ability to get a pointer to it and create one from a C pointer.
Support from other libraries. IE, bindings to popular C math packages, i/o to disk or images if the arrays are 2D
What do I see?
Array package in haskell-platform. It's the blessed one and can do parallel
Data.Vector. Has loop fusion, but not in platform, so its maturity is unknown to me.
repa package, contributed by the DPH team, but doesn't work well with any stable ghc today.
Lots of variation in the level of support for array implementations. For instance, there doesn't seem to be an easy way to dump a 2D vector to an image file. IOW, the Haskell community apparently hasn't settled on an array implementation.
So please, help me choose.
EDIT: A=B+C refers to element-wise addition, not list concatenation.
Correct, the community hasn't settled on a good array implementation. I think it would be a good Haskell Prime submission to put forward the Vector API and remove Data.Array.
Vector is very mature! It has:
VERY easy parallelization
support for map, filter etc.
performs array based computations efficiently, like A=B+C (but I'm not in tune with how matlab does it)
vector creation from a pointer via Vector.Storable
It does not:
have enough support from other libraries. IE, bindings to popular C math packages
support matrices, but you can have vectors of vectors. If you build some vector-based matrix operations then perhaps you could upload to hackage as vector-matrix.
Generate SIMD code.
NOTE: You can turn bytestrings into vectors of whatever, so if you have an image as a bytestring then, via Vector.Storable, you might be able to do what you want with the image as a vector.
(I am not allowed to comment)
rpg: Does hmatrix accept Data.Vector? It has a Data.Packed.Vector but are they the same?
Yes. The latest version of hmatrix uses Data.Vector.Storable by default for 1D vectors (previously this was optional). The dependency on vector is not shown on Hackage, probably because it is behind a configuration flag.
For LAPACK compatibility matrices are not Vector or Vector t, but they can be easily converted (e.g.: Data.Vector.fromList . toRows).
If you want bindings to popular C libraries, the best options are probably hmatrix and blas. Blas is just a binding to a BLAS library, whereas hmatrix provides some higher-level operations. There are also many libraries built upon hmatrix offering further functionality. If you're doing any sort of matrix work, that's what I would start with.
The vector package is also a good choice; it's stable and provides excellent performance. The Data.Vector.Storable types are represented as C arrays, so it's trivial to interface from them to other C libraries. The biggest drawback is that there's no matrix support, so you'd have to do that yourself.
As for exporting to an image format, most Haskell image libraries seem to use ByteStrings. You could either convert to a ByteString, or bind to a C library that does what you want. If you find a Haskell library that does what you want, it should be easy enough to convert hmatrix data to the proper format.