In order to maximize speed I am trying to vectorize the following (i.e. to write it so that the compiler can vectorize it as it sees fit):
integer :: i,j
real :: a(4),b(4,5),c(4,5)
!... setting values to a and b...
do i=1,5
do j=1,4
c(j,i)=b(j,i)/a(j)
end do
end do
I have tried the following
c=b/a
but that doesn't work:
error #6366: The shapes of the array expressions do not conform.
My thought was that since you can do a/i (array / scalar), it might also be possible to do (2D array / 1D array). To begin with, the dimensions of b and c were (5,4) and I thought the problem was that the arrays need to conform to the smaller-rank variable on the leading dimensions, but that didn't seem to be the case. So: is this at all possible, or do I have to stick with the do loops? (Of course I would also be satisfied with vectorizing just the inner loop.)
I would be very happy for any comments or ideas on this.
(I am using ifort 16 on windows)
In case you haven't already got your answer, the seemingly non-vectorized code looks like this:
!Non-vectorized
do i=1,5
do j=1,4
c(j,i) = b(j,i) / a(j)
enddo
enddo
and the seemingly vectorized version like this:
!Vectorized
do i=1,5
c(:,i) = b(:,i) / a(:)
enddo
But the Intel compiler vectorizes both of them. To check whether a given loop has been vectorized, use the flag -qopt-report-phase=vec. This generates a vectorization report for the compiled program and is a neat way of confirming what the compiler actually did with each loop.
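For example, an invocation along these lines generates the kind of report shown below (Linux-style -q spellings; on Windows ifort the equivalent /Q forms such as /Qopt-report-phase:vec are used, and the report is typically written to a .optrpt file):
ifort -c -O2 -qopt-report=2 -qopt-report-phase=vec vect_div.f90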
The generated vectorization report of the above code is as shown:
.... Beginning stuff...
LOOP BEGIN at vect_div.f90(11,5)
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at vect_div.f90(12,5)
remark #15300: LOOP WAS VECTORIZED
LOOP END
LOOP END
LOOP BEGIN at vect_div.f90(18,5)
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at vect_div.f90(19,7)
remark #15300: LOOP WAS VECTORIZED
LOOP END
LOOP END
Here, (11,5), (12,5), etc. are the row and column numbers in my .f90 source file where the do keyword (or the array assignment) appears. As you can see, the outer loops are not vectorized while the inner ones are, and both versions are vectorized with no noticeable difference between them.
More detailed reports can be generated by increasing the n value in the ifort flag -qopt-report=n.
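As an aside, the whole-array form the question originally tried (c = b/a) can be made to conform by replicating a along a second dimension with SPREAD. A minimal sketch with the shapes from the question:
program conforming_division
implicit none
real :: a(4), b(4,5), c(4,5)
a = 2.0
b = 1.0
! spread replicates a along a new dim=2, so the right-hand side has shape (4,5)
c = b / spread(a, dim=2, ncopies=size(b, dim=2))
print *, c(1,:)
end program conforming_division
Whether this form vectorizes as well as the explicit loops is worth checking in the optimization report.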
Let's imagine that we have an array A of dimensions (i,j). I've read in a number of places that - as Fortran is column-major - loops should be coded as:
do j=1,n
do i=1,n
! operate on A(i,j)
enddo
enddo
At the same time, we should minimize loop overhead, such that, if n>m, then:
do j=1,m
do i=1,n
! ...
enddo
enddo
Is more efficient than the other way around. As a consequence of these two statements, if we want to define a matrix with dimensions (2,3) or (3,2), we should go for the second option. Am I right? I have not seen this statement anywhere and I was just wondering if I am missing something. Thank you
Having a large loop count is better for vectorization because the peel loop and the remainder (tail) loop become negligible when the trip count is large. (https://en.wikipedia.org/wiki/Loop_splitting)
But more importantly: you should have as the 1st index the dimension over which you do stride-1 accesses, and put the random accesses in the other dimensions. This matters much more than the loop count.
Something else that is important is whether your 1st dimension can be a multiple of the vector size. For example, AVX instructions operate on 256-bit vectors, which correspond to 4 double-precision floats. If your 1st dimension is a multiple of 4 and your array is 32-byte (256-bit) aligned, all your columns will be aligned and you will be able to exploit the full vectorization potential.
In your example, it could be better to declare the matrix as (4,2) instead of (3,2)!
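As an illustration, a minimal sketch of such a padded declaration (the ALIGN directive is Intel-specific, and the name work is just an example):
program padded_example
implicit none
! leading dimension padded from 3 to 4; the 32-byte alignment directive is Intel-specific
real(8) :: work(4,2)
!dir$ attributes align : 32 :: work
integer :: i, j
work = 0.0d0            ! the padding row stays zero
do j = 1, 2
  do i = 1, 3           ! only the first 3 rows carry real data
    work(i,j) = real(i + 10*j, 8)
  end do
end do
print *, work(1:3,:)
end program padded_example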
I have a (3x5) matrix and I want to get its reduced row echelon form.
I want to implement it in C, so I first implemented it in Matlab as follows:
[L,U]=lu(a);
[m,n]=size(U);
disp('convert elements in major diagonal to 1')
for s=1:m
U(s,:)=U(s,:)/U(s,s);
end
for j=m:-1:2
for i=j-1:-1:1
U(i,:)=U(i,:)-U(j,:)*(U(i,j)/U(j,j));
end
end
The above code and rref function gave the same result.
When converting this code to C, I successfully implemented the LU decomposition and the conversion of the major-diagonal elements to 1, but when I implemented these nested loops
for j=m:-1:2
for i=j-1:-1:1
U(i,:)=U(i,:)-U(j,:)*(U(i,j)/U(j,j));
end
end
as follows:
for(j=m-1; j>0; j--){
    for(i=j-1; i>=0; i--){
        for(k=0; k<n; k++){
            U[i*n+k] = U[i*n+k] - (U[j*n+k] * (U[i*n+j]/U[j*n+j]));
        }
    }
}
I got a wrong result. How can I correct it, please?
Take a closer look at your inner loop: as soon as you reach k=j you overwrite the element U[i*n+j] (i.e. U(i,j)) and then use this updated value in all following iterations. Your Matlab code uses the old value because the operation is vectorized over the whole row. If you calculate the factor (U[i*n+j]/U[j*n+j]) outside the inner loop (before the k loop), it should be fine.
I need the index numbers of the cells which fulfil the following conditions:
Q(i)<=5 and V(i)/=1
(size(Q)==size(V)). I wrote something like this:
program test
implicit none
integer, allocatable, dimension(:):: R
integer Q(5)
integer V(5)
integer counter,limit,i
counter=0
limit=5
V=(/0,0,1,0,0/)
Q=(/5,10,2,7,2/)
do i=1,5
if((Q(i)<=5).AND.(V(i)/=1)) then
counter=counter+1
end if
end do
allocate(R(counter))
counter=1
do i=1,5
if((Q(i)<=5).AND.(V(i)/=1)) then
R(counter)=i
counter=counter+1
end if
end do
deallocate(R)
end program test
but I don't think it is very efficient. Is there a better solution to this problem?
I can remove one loop by writing
program test
implicit none
integer, allocatable, dimension(:):: R
integer Q(5)
integer V(5)
integer counter,limit,i
counter=0
limit=5
V=(/0,0,1,0,0/)
Q=(/5,10,2,7,2/)
V=-V+1
allocate(R((count(V*Q<=5)-count(V*Q==0))))
counter=1
do i=1,size(Q)
if((Q(i)<=5).AND.(V(i)==1)) then
R(counter)=i
counter=counter+1
end if
end do
end program test
The question is very close to being a duplicate but explaining why would be a cumbersome comment.
Answers to that question take advantage of a common idiom:
PACK((/(i,i=1,SIZE(mask))/), mask)
This returns an array of 1-based indexes corresponding to .TRUE. elements of the logical array mask. For that question mask was the result of arr.gt.min but mask can be any rank-1 logical array.
Here, mask could well be Q.le.5.and.V.ne.1 (noting Q and V are the same length).
In Fortran 95 (which is why I'm using (/../) and .ne.) one doesn't have access to the modern feature of automatic array allocation, so a manual allocation will be required. Something like
logical mask(5)
mask = Q.le.5.and.V.ne.1
ALLOCATE(R(COUNT(mask)))
R = PACK((/(i,i=1,5)/),mask)
As an incentive to use a modern compiler, with Fortran 2003 compliance enabled, this is the same as
R = PACK((/(i,i=1,5)/), Q.le.5.and.V.ne.1)
(with appropriate other declarations, etc.)
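Put together as a minimal, self-contained sketch using the values from the question (and relying on Fortran 2003 automatic allocation on assignment):
program pack_indices
implicit none
integer :: Q(5), V(5), i
integer, allocatable :: R(:)
Q = (/ 5, 10, 2, 7, 2 /)
V = (/ 0, 0, 1, 0, 0 /)
! R receives the 1-based indices where Q<=5 and V/=1; for this data that is 1 and 5
R = PACK((/(i, i=1, SIZE(Q))/), Q .le. 5 .and. V .ne. 1)
print *, R
end program pack_indices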
When considering doing this creation in a subroutine it is exceptionally important to think about array bounds if using non-1-based indexing or subarrays. See my answer in the linked question for details.
Here is an array A of length N whose values are a permutation of 1 to N (no duplicates).
I want to get the array B which satisfies B[A[i]] = i for i in [1,N].
e.g.
for A=[4,2,1,3], I want to get
B=[3,2,4,1]
I've written Fortran code with OpenMP, shown below; array A is given by another procedure. For N = 1024^3 (~10^9) it takes about 40 seconds, and assigning more threads does little to help (it takes a similar time for OMP_NUM_THREADS=1, 4 or 16). It seems OpenMP does not work well for very large N, although it works well for N=10^7.
I wonder whether there is a better algorithm for the assignment to B, or a way to make the OpenMP parallelisation effective.
the code:
subroutine fill_inverse_array(leng, A, B)
use omp_lib
implicit none
integer*4, intent(in) :: leng
integer*4, intent(in) :: A(leng)
integer*4, intent(out) :: B(leng)
integer*4 :: i
!$omp parallel do private(i) firstprivate(leng) shared(A, B)
do i=1,leng
B(A(i))=i
enddo
!$omp end parallel do
end subroutine
It's a slow day here so I ran some tests. I managed to squeeze out a useful increase in speed by rewriting the expression inside the loop from B(A(i))=i to B(i) = A(A(i)). (Strictly speaking the two are not equivalent for an arbitrary permutation; B(i)=A(A(i)) applies A twice rather than inverting it, so treat the comparison mainly as gather versus scatter.) I think the rewrite has a positive impact on performance because it is a little more cache-friendly: the writes to B become sequential and only the reads into A are indirect.
I used the following code to test various alternatives:
A = random_permutation(length)
CALL system_clock(st1)
B = A(A)
CALL system_clock(nd1)
CALL system_clock(st2)
DO i = 1, length
B(i) = A(A(i))
END DO
CALL system_clock(nd2)
CALL system_clock(st3)
!$omp parallel do shared(A,B,length) private(i)
DO i = 1, length
B(i) = A(A(i))
END DO
!$omp end parallel do
CALL system_clock(nd3)
CALL system_clock(st4)
DO i = 1, length
B(A(i)) = i
END DO
CALL system_clock(nd4)
CALL system_clock(st5)
!$omp parallel do shared(A,B,length) private(i)
DO i = 1, length
B(A(i)) = i
END DO
!$omp end parallel do
CALL system_clock(nd5)
As you can see, there are 5 timed sections in this code. The first is a simple one-line revision of your original code, to provide a baseline. This is followed by an unparallelised and then a parallelised version of your loop, rewritten along the lines I outlined above. Sections 4 and 5 reproduce your original order of operations, first unparallelised, then parallelised.
Over a series of four runs I got the following average times. In all cases I was using arrays of 10**9 elements and 8 threads. I tinkered a little and found that using 16 threads (hyperthreading) gave very little improvement, but that 8 was a definite improvement over fewer. The average timings were:
Sec 1: 34.5s
Sec 2: 32.1s
Sec 3: 6.4s
Sec 4: 31.5s
Sec 5: 8.6s
Make of those numbers what you will. As noted above, I suspect that my version is marginally faster than your version because it makes better use of cache.
I'm using Intel Fortran 14.0.1.139 on a 64-bit Windows 7 machine with 10GB RAM. I used the '/O2' option for compiler optimisation.
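For reference, a compile line of this shape enables both OpenMP and optimisation with Windows ifort (the source file name is just a placeholder):
ifort /Qopenmp /O2 invert_permutation.f90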
Does there exist a function similar to numpy's * operator that multiplies two arrays element-wise, returning an array of the same type?
For example:
#Lets define:
a = [0,1,2,3]
b = [1,2,3,4]
d = [[1,2] , [3,4], [5,6]]
e = [3,4,5]
#I want:
a * 2 == [2*0, 1*2, 2*2, 2*3]
a * b == [0*1, 1*2, 2*3, 3*4]
d * e == [[1*3, 2*3], [3*4, 4*4], [5*5, 6*5]]
d * d == [[1*1, 2*2], [3*3, 4*4], [5*5, 6*6]]
Note that * IS NOT regular matrix multiplication; it is element-wise multiplication.
My current best solution is to write some C code which does this and import it as a compiled DLL.
There must be a better solution.
EDIT:
Using LabVIEW 2011 - Needs to be fast.
The first two multiplications can be done by using the 'multiply' primitive. Make sure the arrays in the second case are of the same length.
For the third multiplication you can use a for loop (with auto-indexing). This is needed because you need to instruct LabVIEW what the basic index is.
The last multiplication can (again) be done using the multiply primitive.
My result is different (the opposite) from the previous poster's. I generated a 4x1000 array of random numbers (magnitude 1000), which I multiplied by a 4x4 array of integers (1,2,3,4,...). I did this 100,000 times using the matrix multiplication VI and also using for loops to perform the operation on the arrays. I'm seeing times on the order of 0.328 s for the matrix VIs and 0.051 s for the for loops. Using a compiled DLL may be faster than LabVIEW, but this does not seem to be true for the built-in functions.
This is certainly not what I expected, but it is consistent over many cycles. The VI runs in the standard execution thread. All data types are set before the timed operations; no coercion takes place in the loops. The operations are performed separately, staged by a flat sequence structure, as is the time measurement. Parallelism is turned off.