How to do transpose for tptrs in blas? - c

How to do transpose for tptrs in blas?
I want to solve:
XA = B
But it seems that tptrs only lets me solve:
AX = B
Or, using the 'transpose' flag, in tptrs:
A'X = B
which, rearranging is:
(A'X)' = B'
X'A = B'
So, I can use it to solve XA = B, but I have to first transpose B manually myself, and then, again, transpose the answer. Am I missing some trick to avoid having to do the transpose?

TPTRS isn't a BLAS routine; it's an LAPACK routine.
If A is relatively small compared to B and X, then a good option to unpack it into a "normal" triangular matrix and use the BLAS routine TRSM which takes a "side" argument allowing you to specify XA = B. If A is mxm and B is nxm, the unpacking adds m^2 operations which will be a small amount of overhead compared to the O(nm^2) operations to do the solve.

Related

Output arrayfun into trid dimension of the matrix in MATLAB

Assume that I have a matrix A = rand(n,m). I want to compute matrix B with size n x n x m, where B(:,:,i) = A(:,i)*A(:,i)';
The code that can produce this is quite simple:
A = rand(n,m); B = zeros(n,n,m);
for i=1:m
B(:,:,i) = A(:,i)*A(:,i)'
end
However, I am concerned about speed and would like to ask you help to tell me how to implement it without using loops. Very likely that I need to use either bsxfun, arrayfun or rowfun, but I am not sure.
All answers are appreciated.
I don't have MATLAB at hand right now, but I think this code should produce the same result as your loop:
A1 = reshape(A,n,1,m);
A2 = reshape(A,1,n,m);
B = bsxfun(#times,A1,A2);
If you have a newer version of MATLAB, you don't need bsxfun any more, you can just write
B = A1 .* A2;
On older versions this last line will give an error message.
Whether any of this is faster than your loop depends also on the version of MATLAB. Newer MATLAB versions are not slow any more with loops. I think the loop is more readable, it's worth using more readable code, or at least keep the loop in a comment to clarify what the vectorized code does.
arrayfun and bsxfun does not speed up the calculations in my attempt as below:
clc;close all;
clear all;
m=300;n=400;
A = rand(n,m); B = zeros(n,n,m);
tic
for i=1:m
B(:,:,i) = A(:,i)*A(:,i)';
end
t1=toc
C = reshape(cell2mat(arrayfun(#(k) bsxfun(#times, A(:,k), A(:,k)'), ...
1:m, 'UniformOutput',false)),n,n,m);
%C=reshape(C,n,n,m);
t2=toc-t1
% t1 =0.3079
% t2 =0.5112

Tensorflow: analogue for numpy.take?

Is there analogue for numpy.take?
I want to form N+1-dimensional array from N-dimensional array, more precisely from array with shape (B, H, W, C) I want to make (B, H, W, X, C) array.
I suppose that for my case there is solution even without such general operation. But I'm really unsure that if I will write code with multiple intermediate operations and tensors (shifting, repeating and so on) TF will be able to optimize it and remove unnecessary operations. Moreover I suppose that such code will be unclean and just awful.
I want to add dimension with shifted values. I.e. for (H,W)->(H,W,3) dimensions case indices must be
[
[[0,0], #[0,-1], may be padding with zeros but for now pad with edge value
[0,0],
[0,1]],
[[0,0],
[0,1]
[0,2]]
...
[[1,0],
[1,0],
[1,1]],
[[1,0],
[1,1],
[1,2]],
...
]
I thought about tf.scatter_nd (https://www.tensorflow.org/api_docs/python/tf/scatter_nd) but for now I don't understand how to use it. If I understand correctly, I can't use indices with shapes larger than shapes of update array (i.e. I can't use indices with shape (3,4,5,3) and update with shape (3,4,3) or even (3,4,1,3). If it's so then this operation seems useless until I make intermediate array with shape that I need to form in result.
UPD: may be I'm wrong and tensors operations (shifting, tiling and so on) is more appropriate and efficient solution.
But in any case I think that analogue for np.take will be useful.
The closest function in tensorflow to np.take are tf.gather and tf.gather_nd.
tf.gather_nd is more general than tf.gather (and np.take) as it can slices through several dimensions at once.
A noticeable restriction of tf.gather[_nd] compared to np.take is that they slice through the first dimensions of the tensor only -- you can't slice through inner dimensions. When you want to slice through an arbitrary dimension (as in your case), you need to transpose the array to put the slice dimensions first, gather, then transpose back.
Exemplary code for tf.gather replacing np.take:
import numpy as np
a = np.array([5, 7, 42])
b = np.random.randint(0, 3, (2, 3, 4))
c = a[b]
result_numpy = np.take(a, b)
print(a, b, c, result_numpy)
import tensorflow as tf
a = tf.convert_to_tensor(a)
b = tf.convert_to_tensor(b)
# c = a[b] # does not work
result_tf = tf.gather(a, b)
print(a, b, result_tf)
assert(np.array_equal(result_numpy, result_tf.numpy()))

Efficiently calculating weighted distance in MATLAB

Several posts exist about efficiently calculating pairwise distances in MATLAB. These posts tend to concern quickly calculating euclidean distance between large numbers of points.
I need to create a function which quickly calculates the pairwise differences between smaller numbers of points (typically less than 1000 pairs). Within the grander scheme of the program i am writing, this function will be executed many thousands of times, so even small gains in efficiency are important. The function needs to be flexible in two ways:
On any given call, the distance metric can be euclidean OR city-block.
The dimensions of the data are weighted.
As far as i can tell, no solution to this particular problem has been posted. The statstics toolbox offers pdist and pdist2, which accept many different distance functions, but not weighting. I have seen extensions of these functions that allow for weighting, but these extensions do not allow users to select different distance functions.
Ideally, i would like to avoid using functions from the statistics toolbox (i am not certain the user of the function will have access to those toolboxes).
I have written two functions to accomplish this task. The first uses tricky calls to repmat and permute, and the second simply uses for-loops.
function [D] = pairdist1(A, B, wts, distancemetric)
% get some information about the data
numA = size(A,1);
numB = size(B,1);
if strcmp(distancemetric,'cityblock')
r=1;
elseif strcmp(distancemetric,'euclidean')
r=2;
else error('Function only accepts "cityblock" and "euclidean" distance')
end
% format weights for multiplication
wts = repmat(wts,[numA,1,numB]);
% get featural differences between A and B pairs
A = repmat(A,[1 1 numB]);
B = repmat(permute(B,[3,2,1]),[numA,1,1]);
differences = abs(A-B).^r;
% weigh difference values before combining them
differences = differences.*wts;
differences = differences.^(1/r);
% combine features to get distance
D = permute(sum(differences,2),[1,3,2]);
end
AND:
function [D] = pairdist2(A, B, wts, distancemetric)
% get some information about the data
numA = size(A,1);
numB = size(B,1);
if strcmp(distancemetric,'cityblock')
r=1;
elseif strcmp(distancemetric,'euclidean')
r=2;
else error('Function only accepts "cityblock" and "euclidean" distance')
end
% use for-loops to generate differences
D = zeros(numA,numB);
for i=1:numA
for j=1:numB
differences = abs(A(i,:) - B(j,:)).^(1/r);
differences = differences.*wts;
differences = differences.^(1/r);
D(i,j) = sum(differences,2);
end
end
end
Here are the performance tests:
A = rand(10,3);
B = rand(80,3);
wts = [0.1 0.5 0.4];
distancemetric = 'cityblock';
tic
D1 = pairdist1(A,B,wts,distancemetric);
toc
tic
D2 = pairdist2(A,B,wts,distancemetric);
toc
Elapsed time is 0.000238 seconds.
Elapsed time is 0.005350 seconds.
Its clear that the repmat-and-permute version works much more quickly than the double-for-loop version, at least for smaller datasets. But i also know that calls to repmat often slow things down, however. So I am wondering if anyone in the SO community has any advice to offer to improve the efficiency of either function!
EDIT
#Luis Mendo offered a nice cleanup of the repmat-and-permute function using bsxfun. I compared his function with my original on datasets of varying size:
As the data become larger, the bsxfun version becomes the clear winner!
EDIT #2
I have finished writing the function and it is available on github [link]. I ended up finding a pretty good vectorized method for computing euclidean distance [link], so i use that method in the euclidean case, and i took #Divakar's advice for city-block. It is still not as fast as pdist2, but its must faster than either of the approaches i laid out earlier in this post, and easily accepts weightings.
You can replace repmat by bsxfun. Doing so avoids explicit repetition, therefore it's more memory-efficient, and probably faster:
function D = pairdist1(A, B, wts, distancemetric)
if strcmp(distancemetric,'cityblock')
r=1;
elseif strcmp(distancemetric,'euclidean')
r=2;
else
error('Function only accepts "cityblock" and "euclidean" distance')
end
differences = abs(bsxfun(#minus, A, permute(B, [3 2 1]))).^r;
differences = bsxfun(#times, differences, wts).^(1/r);
D = permute(sum(differences,2),[1,3,2]);
end
For r = 1 ("cityblock" case), you can use bsxfun to get elementwise subtractions and then use matrix-multiplication, which must speed up things. The implementation would look something like this -
%// Calculate absolute elementiwse subtractions
absm = abs(bsxfun(#minus,permute(A,[1 3 2]),permute(B,[3 1 2])));
%// Perform matrix multiplications with the given weights and reshape
D = reshape(reshape(absm,[],size(A,2))*wts(:),size(A,1),[]);

How to improve the execution time of this function?

Suppose that f(x,y) is a bivariate function as follows:
function [ f ] = f(x,y)
UN=(g)1.6*(1-acos(g)/pi)-0.8;
f= 1+UN(cos(0.5*pi*x+y));
end
How to improve execution time for function F(N) with the following code:
function [VAL] = F(N)
x=0:4/N:4;
y=0:2*pi/1000:2*pi;
VAL=zeros(N+1,3);
for i = 1:N+1
val = zeros(1,N+1);
for j = 1:N+1
val(j) = trapz(y,f(0,y).*f(x(i),y).*f(x(j),y))/2/pi;
end
val = fftshift(fft(val))/N;
l = (length(val)+1)/2;
VAL(i,:)= val(l-1:l+1);
end
VAL = fftshift(fft(VAL,[],1),1)/N;
L = (size(VAL,1)+1)/2;
VAL = VAL(L-1:L+1,:);
end
Note that N=2^p where p>10, so please consider the memory limitations while optimizing the code using ndgrid, arrayfun, etc.
FYI: The code intends to find the central 3-by-3 submatrix of the fftn of
fun=#(a,b) trapz(y,f(0,y).*f(a,y).*f(b,y))/2/pi;
where a,b are in [0,4]. The key idea is that we can save memory using the code above specially when N is very large. But the execution time is still an issue because of nested loops. See the figure below for N=2^2:
This is not a full answer, but some possibly helpful hints:
0) The trivial: Are you sure you need numerics? Can't you do the computation analytically?
1) Do not use function handles:
function [ f ] = f(x,y)
f= 1+1.6*(1-acos(cos(0.5*pi*x+y))/pi)-0.8
end
2) Simplify analytically: acos(cos(x)) is the same as abs(mod(x + pi, 2 * pi) - pi), which should compute slightly faster. Or, instead of sampling and then numerically integrating, first integrate analytically and sample the result.
3) The FFT is a very efficient algorithm to compute the full DFT, but you don't need the full DFT. Since you only want the central 3 x 3 coefficients, it might be more efficient to directly apply the DFT definition and evaluate the formula only for those coefficients that you want. That should be both fast and memory-efficient.
4) If you repeatedly do this computation, it might be helpful to precompute DFT coefficients. Here, dftmtx from the Signal Processing toolbox can assist.
5) To get rid of the loops, think about the problem not in the form of computation instructions, but a single matrix operation. If you consider your input N x N matrix as a vector with N² elements, and your output 3 x 3 matrix as a 9-element vector, then the whole operation you apply (numerical integration via trapz and DFT via fft) appears to be a simple linear transform, which it should be possible to express as an N² x 9 matrix.

Labview: element-wise array multiplication operations

Does there exist a function similar to that of numpy's * operator for two arrays to multiply their elements in an element-wise manner, returning an array of the similar type?
For example:
#Lets define:
a = [0,1,2,3]
b = [1,2,3,4]
d = [[1,2] , [3,4], [5,6]]
e = [3,4,5]
#I want:
a * 2 == [2*0, 1*2, 2*2, 2*3]
a * b == [0*1, 1*2, 2*3, 3*4]
d * e == [[1*3, 2*3], [3*4, 4*4], [5*5, 6*5]]
d * d == [[1*1, 2*2], [3*3, 4*4], [5*5, 6*6]]
Note how * IS NOT regular matrix multiplication it is element-wise multiplication.
My current best solution is to write some c code, which does this, and import a compiled dll.
There must exist a better solution.
EDIT:
Using LabVIEW 2011 - Needs to be fast.
The first two multiplications can be done by using the 'multiply' primitive. Make sure the arrays in the second case are of the same length.
For the third multipllication you can use a for loop (with auto-indexing). This is needed because you need to instruct LabVIEW what the basic index is.
The last multiplication can (again) be done using the multiply primitive.
My result is different (opposite) from the previous posters. I generated a 4x1000 array of random numbers (magnitude 1000) which I multiplied by a 4x4 array of integers (1,2,3,4,...). I did this 100,000 times using the matrix multiplication VI and also using for loops to perform the operation on the arrays. I'm seeing times on the order of 0.328s for the matrix VIs and 0.051s for the for loops. Using a compiled DLL may be faster than Labview, but this does not seem to be true for the built-in functions.
This is certainly not what I expected, but it is consistent over many cycles. The VI is standard execution thread. All data types are set before the timed operations - no coercion takes place in the loops. The operations are performed separately, staged by a flat sequence structure, as is the time measurement. Parallelism is turned off.

Resources