In SSE, if I have a 128-bit register containing 4 floats, i.e.
A = a b c d ('a','b','c','d' are floats and 'A' is a 128-bit SSE register)
and
B = e f g h
then if I want
C = a e b f
I can simply do:
C = _mm_unpacklo_ps(A,B);
Similarly if I want
D = c g d h
I can do:
D = _mm_unpackhi_ps(A,B);
If I have an AVX register containing doubles, is it possible to do the same with a single instruction?
Based on how these intrinsics work, I know that none of _mm256_unpacklo_pd(), _mm256_shuffle_pd(), _mm256_permute2f128_pd() or _mm256_blend_pd() can do it on its own (the unpack and shuffle variants operate within each 128-bit lane). Is there any instruction apart from these that I can use, or do I have to use a combination of the above instructions?
One way that I can think of is the following:
A1 = _mm256_unpacklo_pd(A,B);
A2 = _mm256_unpackhi_pd(A,B);
C = _mm256_permute2f128_pd(A1,A2,0x20);
D = _mm256_permute2f128_pd(A1,A2,0x31);
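To sanity-check the lane semantics, here is a small pure-Python model of the three intrinsics involved (a sketch of their documented behavior, not real SIMD code); letters stand in for the doubles:

```python
def unpacklo_pd(a, b):
    # interleave element 0 of each 128-bit lane (elements 0 and 2)
    return [a[0], b[0], a[2], b[2]]

def unpackhi_pd(a, b):
    # interleave element 1 of each 128-bit lane (elements 1 and 3)
    return [a[1], b[1], a[3], b[3]]

def permute2f128_pd(a, b, imm):
    # each 4-bit nibble of imm selects one 128-bit lane of the result
    lanes = {0: a[0:2], 1: a[2:4], 2: b[0:2], 3: b[2:4]}
    return lanes[imm & 3] + lanes[(imm >> 4) & 3]

A = list("abcd")
B = list("efgh")
A1 = unpacklo_pd(A, B)              # ['a', 'e', 'c', 'g']
A2 = unpackhi_pd(A, B)              # ['b', 'f', 'd', 'h']
C = permute2f128_pd(A1, A2, 0x20)   # ['a', 'e', 'b', 'f']
D = permute2f128_pd(A1, A2, 0x31)   # ['c', 'g', 'd', 'h']
print(C, D)
```

The intermediate results show why two instructions are needed: the unpacks interleave within each lane, and the cross-lane permute then regroups the halves.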
If anyone has a better solution, please do post below.
Related
Assume that I have a matrix A = rand(n,m). I want to compute matrix B with size n x n x m, where B(:,:,i) = A(:,i)*A(:,i)';
The code that can produce this is quite simple:
A = rand(n,m); B = zeros(n,n,m);
for i=1:m
B(:,:,i) = A(:,i)*A(:,i)';
end
However, I am concerned about speed and would like to ask for your help in implementing it without loops. Very likely I need to use either bsxfun, arrayfun or rowfun, but I am not sure.
All answers are appreciated.
I don't have MATLAB at hand right now, but I think this code should produce the same result as your loop:
A1 = reshape(A,n,1,m);
A2 = reshape(A,1,n,m);
B = bsxfun(@times,A1,A2);
If you have a newer version of MATLAB, you don't need bsxfun any more, you can just write
B = A1 .* A2;
On older versions this last line will give an error message.
Whether any of this is faster than your loop also depends on the MATLAB version: newer versions are no longer slow with loops. I think the loop is more readable; it's worth using the more readable code, or at least keeping the loop in a comment to clarify what the vectorized code does.
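For readers coming from NumPy, the same reshape-and-multiply trick is just broadcasting (a sketch; the sizes n, m here are arbitrary stand-ins):

```python
import numpy as np

n, m = 4, 3
A = np.random.rand(n, m)

# loop version: B[:, :, i] = outer(A[:, i], A[:, i])
B_loop = np.zeros((n, n, m))
for i in range(m):
    B_loop[:, :, i] = np.outer(A[:, i], A[:, i])

# vectorized version, mirroring reshape + bsxfun(@times, ...)
B_vec = A.reshape(n, 1, m) * A.reshape(1, n, m)

print(np.allclose(B_loop, B_vec))  # True
```

The (n,1,m) and (1,n,m) shapes expand against each other exactly as bsxfun's singleton dimensions do.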
arrayfun and bsxfun do not speed up the calculation in my attempt below:
clc;close all;
clear all;
m=300;n=400;
A = rand(n,m); B = zeros(n,n,m);
tic
for i=1:m
B(:,:,i) = A(:,i)*A(:,i)';
end
t1=toc
C = reshape(cell2mat(arrayfun(@(k) bsxfun(@times, A(:,k), A(:,k)'), ...
1:m, 'UniformOutput',false)),n,n,m);
%C=reshape(C,n,n,m);
t2=toc-t1
% t1 =0.3079
% t2 =0.5112
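For comparison, the same computation can also be written as a single einsum call in NumPy (a sketch; the sizes here are kept small so the result fits comfortably in memory):

```python
import numpy as np

n, m = 40, 30
A = np.random.rand(n, m)

# B[i, j, k] = A[i, k] * A[j, k], i.e. B(:, :, k) = A(:, k) * A(:, k)'
B = np.einsum('ik,jk->ijk', A, A)

k = 0
print(np.allclose(B[:, :, k], np.outer(A[:, k], A[:, k])))  # True
```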
I have a vector u and a number t with Unitful units, and I want du to have the units of typeof(oneunit(u)/oneunit(t)). I want to find a single generic line of code which constructs an SArray or MArray output which matches the input. There are a few cases which I have tried:
Obviously copy(u) doesn't match the units.
u/oneunit(t) and u./oneunit(t) both create SArrays, even when u <: MArray.
similar always creates a mutable type, so it always creates an MArray.
Do I need to directly use the constructor (which would be a pain because it would add an odd branch to otherwise generic code, but it's fine if that's the answer)?
Edit
An example showing that a simple convert does not work with MArrays:
u = @MArray [1u"g",2u"g",3u"g"]
t = 1u"s"
convert(typeof(u),u/t)
DimensionError: g and 1.0 g s^-1 are not dimensionally compatible.
While similar is hopeless:
u = @SArray [1u"g",2u"g",3u"g"]
similar(u)
3-element MVector{3,Quantity{Int64, Dimensions:{𝐌}, Units:{g}}}:
72559480 g
581132080 g
29791 g
How about:
static_similar(s, v) =
( isimmutable(s) ? StaticArrays.mutable_similar_type :
StaticArrays.default_similar_type)(eltype(v),
Size(s), StaticArrays.length_val(s) )(v)
Giving:
julia> u = @MArray [1u"g",2u"g",3u"g"];
julia> s = @SArray [1u"g",2u"g",3u"g"];
julia> static_similar(u,u./oneunit(t))
3-element SVector{3,Quantity{Float64, Dimensions:{𝐌 𝐓^-1}, Units:{g s^-1}}}:
1.0 g s^-1
2.0 g s^-1
3.0 g s^-1
julia> static_similar(s,s./oneunit(t))
3-element MVector{3,Quantity{Float64, Dimensions:{𝐌 𝐓^-1}, Units:{g s^-1}}}:
1.0 g s^-1
2.0 g s^-1
3.0 g s^-1
The relevant functions are defined in StaticArrays/src/abstractarray.jl. Especially, note comment: https://github.com/JuliaArrays/StaticArrays.jl/blob/715fefe58bef7ef1d9b2e693d3468d4fd585e11f/src/abstractarray.jl#L60-L62
Great question!
Basically, as a static arrays user I always use SArray and a functional programming approach with them. (When I need to manage memory I can use Array{SArray{...}} or whatever, and replace elements of the outer Array.)
Probably not the answer you are looking for, but I'd tend to chill out about the fact that operations return SArray and just learn to replace SArrays in their entirety. In most cases this is faster than fiddling with MArray, because LLVM naturally uses SIMD instructions for stack variables, while heap-allocated MArray operations do not.
Was it your expectation that operations like division would preserve the ability (or not) to mutate?
EDIT: yes, using the constructor or convert is totally a viable approach.
I'm currently dealing with the two-phase locking (2PL) protocol, considering the following schedule S:
S = R_3 D R_1 A W_2 A W_2 C R_3 B W_3 B R_1 B
Where R = Read, W = Write, {A, B, C} = objects and {1,2,3} = transactions.
Now I shall show that 2PL can't be used for S. But I actually don't see why; I would set the locks (L) / unlocks (U) like this:
L_3 D R_3 D U_3 D L_1 A R_1 A U_1 A L_2 C W_2 C U_2 C L_3 B R_3 B W_3 B U_3 B R_1 B
So I used at most one lock/unlock per object per transaction. What am I doing wrong here?
I finally found out why I was wrong. It is correct that a transaction can perform at most one lock/unlock per object. But I was forgetting that after a transaction performs its first unlock (no matter on which object), it is not allowed to acquire any lock again, even on another object that it has never locked before. In S, transaction 1 must unlock A before W_2 A, yet it still needs to lock B later for its final R_1 B, so no valid 2PL locking exists.
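That condition can be checked mechanically. Below is a rough feasibility test of my own (a sketch, not a standard algorithm): it assumes one lock/unlock per object per transaction, treats any pair of operations on the same object with at least one write as conflicting, and requires every lock acquisition within a transaction to precede every release:

```python
# Schedule S from the question, as (operation, transaction, object)
S = [("R", 3, "D"), ("R", 1, "A"), ("W", 2, "A"), ("W", 2, "C"),
     ("R", 3, "B"), ("W", 3, "B"), ("R", 1, "B")]

def conflicts(op1, op2):
    # two operations on the same object conflict if either is a write
    return "W" in (op1, op2)

def check_2pl(schedule):
    verdict = {}
    for t in {tx for _, tx, _ in schedule}:
        lock_after = {}     # per object: index t must lock strictly after
        unlock_before = {}  # per object: index t must unlock strictly before
        for i, (op, tx, obj) in enumerate(schedule, start=1):
            if tx != t:
                continue
            lock_after.setdefault(obj, 0)
            unlock_before.setdefault(obj, len(schedule) + 1)
            for j, (op2, tx2, obj2) in enumerate(schedule, start=1):
                if tx2 == t or obj2 != obj or not conflicts(op, op2):
                    continue
                if j < i:   # t may acquire obj's lock only after this access
                    lock_after[obj] = max(lock_after[obj], j)
                else:       # t must release obj's lock before this access
                    unlock_before[obj] = min(unlock_before[obj], j)
        # two phases: latest forced acquisition < earliest forced release
        verdict[t] = max(lock_after.values()) < min(unlock_before.values())
    return verdict

print(check_2pl(S))  # transaction 1 admits no valid 2PL locking
```

For transaction 1, the lock on B is forced after position 6 (W_3 B) while the unlock of A is forced before position 3 (W_2 A), which is exactly the violation.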
How to do transpose for tptrs in blas?
I want to solve:
XA = B
But it seems that tptrs only lets me solve:
AX = B
Or, using the 'transpose' flag, in tptrs:
A'X = B
which, rearranging is:
(A'X)' = B'
X'A = B'
So, I can use it to solve XA = B, but I have to first transpose B manually myself, and then, again, transpose the answer. Am I missing some trick to avoid having to do the transpose?
TPTRS isn't a BLAS routine; it's an LAPACK routine.
If A is relatively small compared to B and X, then a good option is to unpack it into a "normal" triangular matrix and use the BLAS routine TRSM, which takes a "side" argument allowing you to specify XA = B directly. If A is m x m and B is n x m, the unpacking adds m^2 operations, which is a small amount of overhead compared to the O(nm^2) operations of the solve.
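In NumPy terms, the suggestion looks like this (a sketch: unpack the packed upper-triangular A into full storage, then do a right-side solve; np.linalg.solve stands in for a TRSM call with side='R', and the sizes and values are mine):

```python
import numpy as np

def unpack_upper(ap, m):
    # expand LAPACK column-major packed upper-triangular storage
    # into a full m x m matrix
    A = np.zeros((m, m))
    k = 0
    for j in range(m):
        for i in range(j + 1):
            A[i, j] = ap[k]
            k += 1
    return A

m, n = 3, 4
ap = np.array([2.0, 1.0, 3.0, 0.5, -1.0, 4.0])  # packed upper 3x3
A = unpack_upper(ap, m)

B = np.random.rand(n, m)
# XA = B  <=>  A' X' = B'; a right-side TRSM does this in one call,
# here a generic solver stands in for it
X = np.linalg.solve(A.T, B.T).T

print(np.allclose(X @ A, B))  # True
```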
Take this simple example:
a = [1 2i];
x = zeros(1,length(a));
for n=1:length(a)
x(n) = isreal(a(n));
end
In an attempt to vectorize the code, I tried:
y = arrayfun(@isreal,a);
But the results are not the same:
x =
1 0
y =
0 0
What am I doing wrong?
This certainly appears to be a bug, but here's a workaround:
>> y = arrayfun(@(x) isreal(x(1)),a)
ans =
1 0
Why does this work? I'm not totally sure, but it appears that when you perform an indexing operation on the variable before calling ISREAL it removes the "complex" attribute from the array element if the imaginary component is zero. Try this in the Command Window:
>> a = [1 2i]; % A complex array
>> b = a(1); % Indexing element 1 removes the complex attribute...
>> c = complex(a(1)); % ...but we can put that attribute back
>> whos
Name Size Bytes Class Attributes
a 1x2 32 double complex
b 1x1 8 double % Not complex
c 1x1 16 double complex % Still complex
Apparently, ARRAYFUN must internally maintain the "complex" attribute of the array elements it passes to ISREAL, thus treating them all as being complex numbers even if the imaginary component is zero.
It might help to know that MATLAB stores the real/complex parts of a matrix separately. Try the following:
>> format debug
>> a = [1 2i];
>> disp(a)
Structure address = 17bbc5b0
m = 1
n = 2
pr = 1c6f18a0
pi = 1c6f0420
1.0000 0 + 2.0000i
where pr is a pointer to the memory block containing the real parts of all values, and pi is a pointer to the imaginary parts of all values in the matrix. Since all elements are stored together, in this case they all have an imaginary part.
Now compare these two approaches:
>> arrayfun(@(x)disp(x),a)
Structure address = 17bbcff8
m = 1
n = 1
pr = 1bb8a8d0
pi = 1bb874d0
1
Structure address = 17c19aa8
m = 1
n = 1
pr = 1c17b5d0
pi = 1c176470
0 + 2.0000i
versus
>> for n=1:2, disp(a(n)), end
Structure address = 17bbc930
m = 1
n = 1
pr = 1bb874d0
pi = 0
1
Structure address = 17bbd180
m = 1
n = 1
pr = 1bb874d0
pi = 1bb88310
0 + 2.0000i
So it seems that when you access a(1) in the for loop, the value returned (in the ans variable) has a zero imaginary part (null pi), and thus is considered real.
On the other hand, ARRAYFUN seems to be directly accessing the values of the matrix (without returning them in the ANS variable), so it sees both the pr and pi pointers, which are non-null, and thus all elements are considered non-real.
Please keep in mind this is just my interpretation, and I could be mistaken...
Answering really late on this one... The MATLAB function ISREAL operates in a rather counter-intuitive way for many purposes. It tells you whether a given array, taken as a whole, has no complex part at all - it tells you about the storage, not really anything about the values in the array. It's a bit like the ISSPARSE function in that regard. So, for example
isreal(complex(1)) % returns FALSE
What you'll find in MATLAB is that certain operations automatically trim any all-zero imaginary parts. So, for example
x = complex(1);
isreal(x); % FALSE, we just forced there to be an imaginary part
isreal(x(1)); % TRUE - indexing realised it could drop the zero imaginary part
isreal(x(:)); % FALSE - "(:)" indexing is just a reshape, not real indexing
In short, MATLAB really needs a function which answers the question "does this value have zero imaginary part", in an elementwise way on an array.
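For contrast, NumPy's np.isreal answers exactly that elementwise question, checking whether each value's imaginary part is zero regardless of how the array is stored:

```python
import numpy as np

a = np.array([1, 2j])
print(np.isreal(a))        # [ True False] - elementwise on values
print(np.isreal(1 + 0j))   # True - zero imaginary part counts as real
```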